# HMAC optimization - slight performance improvements through precomputation

### Background

I’ve been working on one project called `orion`

, where I’ve
implemented the HMAC algorithm. That project is written in Rust and depends on
the `std`

library, which means that I’m unable to run that code on embedded devices.

After a couple of times where I had been reviewing my own code, I started to notice some
patterns in the HMAC implementation that seemed to be repetitive. So I set out
to implement HMAC again, without the repetitive patterns and something that could be
run on embedded devices. This implementation is called `rigel`

and is based on HMAC-SHA512, but the
tweaks can be applied to HMAC with any SHA variant.

I set the following goals for my project:

- Keep as much data as possible in a single array to avoid unnecessary allocations
- Achieve better, or at least the same performance as the current alternatives available
(The only alternatives I found that supported embedded devices are RustCrypto and
*ring*. I consider`rust-crypto`

abandoned.) - Have as few dependencies as possible

### The improved implementation

HMAC is defined as follows (RFC 2104):

```
H(K XOR opad, H(K XOR ipad, text))
```

where:

`ipad = 0x36`

`opad = 0x5C`

We know that the blocksize for the hash function `H`

(SHA512 in this case) is 128 and the output size is 64. `K`

is padded against a length of blocksize, so `K XOR opad`

would result in an array with length 128. Since the output size of the hash function `H`

is 64 `H(K XOR ipad, text)`

would result in an array with length 64. So an array of length 192 should be able to hold everything.

As the RFC mentions, `K`

should be padded with zeroes if the length of `K`

is less than the blocksize. The padding of zeroes can be skipped:

```
fn pad_key_to_ipad(key: &[u8]) -> [u8; 192] {
let mut padded_key = [0x36; 192];
if key.len() > 128 {
padded_key[..64].copy_from_slice(&sha2::Sha512::digest(&key));
for itm in padded_key.iter_mut().take(64) {
*itm ^= 0x36;
}
else {
for idx in 0..key.len() {
padded_key[idx] ^= key[idx];
}
}
padded_key
}
```

You can start out with an array of `0x36 * 192`

. This means that if the length of `K`

is less than the blocksize, you can simply iterate through the key and XOR each value with the corresponding index value of the `padded_key`

array. This yields the result of `K XOR ipad`

. If the length of `K`

is greater than the blocksize, simply hash `K`

using `H`

, copy the result into the first 64 bytes of the array and XOR this with the `ipad`

.

The padding of zeroes and XORing this with the `ipad`

has been precomputed, because naturally `0x00 XOR 0x36 = 0x36`

.

Next we need to append the message before hashing, so the following is placed in the main function, using the function `pad_key_to_ipad()`

from the key-padding steps before:

```
let mut buffer: [u8; 192] = pad_key_to_ipad(key);
// First 128 bytes is the ipad
hash_ipad.input(&buffer[..128]);
hash_ipad.input(message);
buffer[128..].copy_from_slice(&hash_ipad.result());
```

So the last 64 bytes of the array `buffer`

are now the result of `H(K XOR ipad, text)`

. The last step would be to compute the result of `K XOR opad`

. Since the XOR function satisfies the associativity property, this can be done rather easily.

The definition of associative property from Wikipedia is as follows:

Formally, a binary operation ∗ on a set S is called associative if it satisfies the associative law: (x ∗ y) ∗ z = x ∗ (y ∗ z) for all x, y, z in S.

Remember that the first 128 bytes of the `buffer`

array are the result of `K XOR ipad`

. Let’s call these 128 bytes `ires`

. To retrieve the key `K`

from `ires`

, we could inverse: `ires XOR ipad`

and then compute the `opad`

by `K XOR opad`

. These two operations are needless however, since it can be done in one operation: `ires XOR 0x6A`

, where `0x6A = (ipad XOR opad)`

. This is possible because of the associative property: `(ires XOR ipad) XOR opad = ires XOR (ipad XOR opad)`

:

```
// Make first 128 bytes the opad
for idx in buffer.iter_mut().take(128) {
*idx ^= 0x6A;
}
```

Now we can simply hash the content of the `buffer`

array to compute the MAC.

#### Streaming API - Update [09-08-2018]

I got a very good comment about using this approach along with a streaming API. I have updated the repository to include such an implementation and updated the benchmarks accordingly.

#### Dropping another 64 bytes of allocation with no performance loss - Update [10-08-2018]

I noticed that the extra 64 bytes of allocation for the `ires`

was excessive. Previously the `ires`

had been stored in the `buffer`

array. Instead of this, the SHA512 hashing struct responsible for retrieving the MAC can be updated with the `opad`

and `ires`

sequentially. Because of this, you only need an array of `128`

bytes to store the necessary data. Before these last 64 bytes had been dropped, the benchmarks for `rigel`

were:

```
test rigel_one_shot ... bench: 2,094 ns/iter (+/- 182)
test rigel_stream ... bench: 2,174 ns/iter (+/- 121)
```

Now they are:

```
test rigel_one_shot ... bench: 2,093 ns/iter (+/- 42)
test rigel_stream ... bench: 2,161 ns/iter (+/- 58)
```

The repository and performance benchmarks have been updated.

### Performance - Update [31-01-2019]

Some time has passed since this blogpost was written. Since then, I’ve integrated some of the key setup ideas into RustCrypto, which made it into the `0.6.3`

release of the hmac crate.

*ring* has also updated its HMAC key setup process based on some of the ideas from this blogpost. These changes made it into the `0.14.4`

release of *ring*.

My other project orion has also grown quite a bit, so that is also included in the new benchmarks for comparison.

The old benchmarks (from 09-08-2018) were as follows:

```
test RustCrypto ... bench: 2,727 ns/iter (+/- 91)
test rigel_one_shot ... bench: 2,093 ns/iter (+/- 42)
test rigel_stream ... bench: 2,161 ns/iter (+/- 58)
test ring ... bench: 3,357 ns/iter (+/- 96)
```

This was benchmarked on a MacBook Air 1,6 GHz Intel Core i5, 4GB.

The new benchmarks (from 31-01-2019) are as follows:

```
test RustCrypto ... bench: 2,168 ns/iter (+/- 141)
test orion ... bench: 2,207 ns/iter (+/- 52)
test rigel_one_shot ... bench: 2,077 ns/iter (+/- 53)
test rigel_stream ... bench: 2,127 ns/iter (+/- 36)
test ring ... bench: 1,463 ns/iter (+/- 37)
```

This was benchmarked on a MacBook Air 1,6 GHz Intel Core i5, 4GB.

*ring* seems to have gained a big performance improvement with its key setup process, adopting some of the ideas in this blogpost. However, this isn’t to say that adopting them has resulted in the above performance improvements alone. Much could have changed since the last benchmarks.

There are some things that should be noted about these benchmarks and the current implementation of rigel. The first is, they are focused on benchmarking the key setup steps of HMAC. It does process data in the benches (once), but it doesn’t cover use cases such as re-using keys for different input. RustCrypto, *ring*, and orion will outperform rigel on that matter.

The second thing is, even though “embedded” has been mentioned once or twice in this post, rigel has never been tested/benchmarked on actual embedded devices. The only meaning behind me mentioning embedded was focusing on not using needless allocations and supporting `no_std`

.

### More

All code for this implementation can be found here.