# HMAC optimization - slight performance improvements through precomputation

### Background

I’ve been working on one project called `orion`

, where I’ve
implemented the HMAC algorithm. That project is written in Rust and depends on
the `std`

library, which means that I’m unable to run that code on embedded devices.

After a couple of times where I had been reviewing my own code, I started to notice some
patterns in the HMAC implementation that seemed to be repetitive. So I set out
to implement HMAC again, without the repetitive patterns and something that could be
run on embedded devices. This implementation is called `rigel`

and is based on HMAC-SHA512, but the
tweaks can be applied to HMAC with any SHA variant.

I set the following goals for my project:

- Keep as much data as possible in a single array to avoid unnecessary allocations
- Achieve better, or at least the same performance as the current alternatives available
(The only alternatives I found that supported embedded devices are RustCrypto and
*ring*. I consider`rust-crypto`

abandoned.) - Have as few dependencies as possible

### The improved implementation

HMAC is defined as follows (RFC 2104):

```
H(K XOR opad, H(K XOR ipad, text))
```

where:

`ipad = 0x36`

`opad = 0x5C`

We know that the blocksize for the hash function `H`

(SHA512 in this case) is 128 and the output size is 64. `K`

is padded against a length of blocksize, so `K XOR opad`

would result in an array with length 128. Since the output size of the hash function `H`

is 64 `H(K XOR ipad, text)`

would result in an array with length 64. So an array of length 192 should be able to hold everything.

As the RFC mentions, `K`

should be padded with zeroes if the length of `K`

is less than the blocksize. The padding of zeroes can be skipped:

```
fn pad_key_to_ipad(key: &[u8]) -> [u8; 192] {
let mut padded_key = [0x36; 192];
if key.len() > 128 {
padded_key[..64].copy_from_slice(&sha2::Sha512::digest(&key));
for itm in padded_key.iter_mut().take(64) {
*itm ^= 0x36;
}
else {
for idx in 0..key.len() {
padded_key[idx] ^= key[idx];
}
}
padded_key
}
```

You can start out with an array of `0x36 * 192`

. This means that if the length of `K`

is less than the blocksize, you can simply iterate through the key and XOR each value with the corresponding index value of the `padded_key`

array. This yields the result of `K XOR ipad`

. If the length of `K`

is greater than the blocksize, simply hash `K`

using `H`

, copy the result into the first 64 bytes of the array and XOR this with the `ipad`

.

The padding of zeroes and XORing this with the `ipad`

has been precomputed, because naturally `0x00 XOR 0x36 = 0x36`

.

Next we need to append the message before hashing, so the following is placed in the main function, using the function `pad_key_to_ipad()`

from the key-padding steps before:

```
let mut buffer: [u8; 192] = pad_key_to_ipad(key);
// First 128 bytes is the ipad
hash_ipad.input(&buffer[..128]);
hash_ipad.input(message);
buffer[128..].copy_from_slice(&hash_ipad.result());
```

So the last 64 bytes of the array `buffer`

are now the result of `H(K XOR ipad, text)`

. The last step would be to compute the result of `K XOR opad`

. Since the XOR function satisfies the associativity property, this can be done rather easily.

The definition of associative property from Wikipedia is as follows:

Formally, a binary operation ∗ on a set S is called associative if it satisfies the associative law: (x ∗ y) ∗ z = x ∗ (y ∗ z) for all x, y, z in S.

Remember that the first 128 bytes of the `buffer`

array are the result of `K XOR ipad`

. Let’s call these 128 bytes `ires`

. To retrieve the key `K`

from `ires`

, we could inverse: `ires XOR ipad`

and then compute the `opad`

by `K XOR opad`

. These two operations are needless however, since it can be done in one operation: `ires XOR 0x6A`

, where `0x6A = (ipad XOR opad)`

. This is possible because of the associative property: `(ires XOR ipad) XOR opad = ires XOR (ipad XOR opad)`

:

```
// Make first 128 bytes the opad
for idx in buffer.iter_mut().take(128) {
*idx ^= 0x6A;
}
```

Now we can simply hash the content of the `buffer`

array to compute the MAC.

#### Streaming API - Update [09-08-2018]

I got a very good comment about using this approach along with a streaming API. I have updated the repository to include such an implementation and updated the benchmarks accordingly.

#### Dropping another 64 bytes of allocation with no performance loss - Update [10-08-2018]

I noticed that the extra 64 bytes of allocation for the `ires`

was excessive. Previously the `ires`

had been stored in the `buffer`

array. Instead of this, the SHA512 hashing struct responsible for retrieving the MAC, can be updated with the `opad`

and `ires`

sequentially. Because of this, you only need an array of `128`

bytes to store the necessary data. Before these last 64 bytes had been dropped, the benchmarks for `rigel`

were:

```
test rigel_one_shot ... bench: 2,094 ns/iter (+/- 182)
test rigel_stream ... bench: 2,174 ns/iter (+/- 121)
```

Now they are:

```
test rigel_one_shot ... bench: 2,093 ns/iter (+/- 42)
test rigel_stream ... bench: 2,161 ns/iter (+/- 58)
```

The repository and performance benchmarks have been updated.

### Performance

The benchmarks are listed in the project repository:

```
test RustCrypto ... bench: 2,727 ns/iter (+/- 91)
test rigel_one_shot ... bench: 2,093 ns/iter (+/- 42)
test rigel_stream ... bench: 2,161 ns/iter (+/- 58)
test ring ... bench: 3,357 ns/iter (+/- 96)
```

This was benchmarked on a MacBook Air 1,6 GHz Intel Core i5, 4GB.

As you can see there are slight performance improvements compared to `RustCrypto`

and some more against *ring*.

In regards to dependencies, I used one ‘crate’ providing the SHA2 primitive and another ‘crate’ that implemented constant-time comparison.

### More

All code for this implementation can be found here.