[aesgcm] improve portable gf128 performance #1340

Open
robinhundt wants to merge 1 commit into cryspen:main from robinhundt:robin/aes-gcm-portable-perf-improvement

@robinhundt

Hi there,
While looking at the aes-gcm implementation, I noticed that the performance of the portable gf128 implementation could be improved. This PR replaces the previous gf128 multiplication with an optimized carry-less 128 x 128 -> 256 bit multiplication, followed by the usual reduction modulo the irreducible polynomial (the same reduce as in platform/x64/gf128_core.rs). This improves the throughput of the portable aes-gcm implementation by roughly 45-60%: about 57-60% for AES-128 and 45-47% for AES-256 (see benchmarks).
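As a rough illustration of the 128 x 128 -> 256 bit step, here is a minimal sketch that builds it from three 64 x 64 -> 128 bit carry-less products using Karatsuba. The names `clmul64_naive` and `clmul128` are hypothetical, the bit-by-bit 64-bit product is only a stand-in (it is not constant-time and not what this PR uses), and whether the PR itself uses Karatsuba or four schoolbook products is not stated here. The reduction step is left out.

```rust
/// Naive bit-by-bit carry-less 64 x 64 -> 128 bit multiplication.
/// Stand-in for illustration only: NOT constant-time, not the PR's code.
pub fn clmul64_naive(x: u64, y: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (y >> i) & 1 == 1 {
            // In GF(2)[X], addition is XOR, so no carries propagate.
            acc ^= (x as u128) << i;
        }
    }
    acc
}

/// Carry-less 128 x 128 -> 256 bit multiplication via Karatsuba.
/// Returns the `(high, low)` 128-bit halves of the 256-bit product.
pub fn clmul128(a: u128, b: u128) -> (u128, u128) {
    // Split both operands into 64-bit halves: a = a1 * X^64 + a0.
    let (a0, a1) = (a as u64, (a >> 64) as u64);
    let (b0, b1) = (b as u64, (b >> 64) as u64);

    let lo = clmul64_naive(a0, b0);
    let hi = clmul64_naive(a1, b1);
    // (a0 + a1)(b0 + b1) + lo + hi = a0*b1 + a1*b0 over GF(2).
    let mid = clmul64_naive(a0 ^ a1, b0 ^ b1) ^ lo ^ hi;

    // The middle term is shifted by 64 bits and straddles both halves.
    (hi ^ (mid >> 64), lo ^ (mid << 64))
}
```

For example, `clmul128(3, 3)` yields `(0, 5)`, since (X + 1)^2 = X^2 + 1 over GF(2).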

The carry-less multiplication is based on the algorithm used in BearSSL and adapted from the RustCrypto implementation. Unlike those implementations, I make use of Rust's u128 support, which lets me skip the Rev trick described in the blog post. I originally wrote this code for my cryprot-core library.
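To make the idea concrete, here is a minimal sketch of a carry-less 64 x 64 -> 128 bit multiplication built from ordinary integer multiplications on operands with "holes" (runs of zero bits that absorb carries), in the spirit of the BearSSL technique. It is not the code from this PR: the function names are hypothetical, and the stride of 5 bits (25 multiplications) is an assumption chosen so that the carry argument is easy to check for a full 128-bit product. The PR's actual masks and multiplication count may differ. Like the PR, it is only constant-time where the underlying integer multiplication is.

```rust
/// Mask selecting every bit position p < 128 with p % 5 == r.
const fn stride5_mask(r: u32) -> u128 {
    let mut m = 0u128;
    let mut p = r;
    while p < 128 {
        m |= 1u128 << p;
        p += 5;
    }
    m
}

const MASKS: [u128; 5] = [
    stride5_mask(0),
    stride5_mask(1),
    stride5_mask(2),
    stride5_mask(3),
    stride5_mask(4),
];

/// Carry-less 64 x 64 -> 128 bit multiplication using integer multiplies.
/// Sketch only; stride and structure are assumptions, not the PR's code.
pub fn clmul64(x: u64, y: u64) -> u128 {
    // Split each operand into 5 words whose set bits are 5 positions apart.
    let xs = MASKS.map(|m| (x as u128) & m);
    let ys = MASKS.map(|m| (y as u128) & m);

    let mut out = 0u128;
    for r in 0..5 {
        let mut z = 0u128;
        for a in 0..5 {
            let b = (r + 5 - a) % 5;
            // At most 13 partial products land on any one position, so the
            // integer sum fits in the 5-bit slot and never reaches the next
            // data bit. Both factors are < 2^64, so the product cannot overflow.
            z ^= xs[a] * ys[b];
        }
        // Keep only the positions congruent to r mod 5; their low bit is the
        // parity of the count, i.e. the GF(2) coefficient.
        out |= z & MASKS[r];
    }
    out
}

/// Bit-by-bit reference, used only to cross-check `clmul64`.
pub fn clmul64_ref(x: u64, y: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        if (y >> i) & 1 == 1 {
            acc ^= (x as u128) << i;
        }
    }
    acc
}
```

Because the full 128-bit product comes out of the u128 multiplications directly, no bit-reversed second pass is needed to recover the high half.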

Limitations

Constant-time: As the clmul64 implementation uses u128 multiplication, it will not be constant-time on targets where this multiplication is not constant-time. Because it requires a constant-time 64 x 64 -> 128 bit multiplication (e.g. MULX on x86), I'm doubtful whether it provides a practical performance benefit on the targets that offer one, as those targets will often have PCLMULQDQ (or an equivalent) available anyway. There might be some ARM targets which have constant-time MUL/UMULH instructions but lack the crypto extensions and don't support PMULL. More information about the problem of constant-time multiplications is available here.

32-bit targets: I'm unsure how this implementation would fare on 32-bit targets compared to e.g. the optimized one in RustCrypto/universal-hashes. Some quick napkin math and godbolt experimentation suggest that my implementation should compile to fewer MUL instructions, but I haven't investigated this properly.

Comparison to bearssl/RustCrypto implementation: The prior works use T x T -> T carry-less multiplications (for T = 32 or T = 64). I believe my version is faster on at least some targets, but I have not suitably benchmarked this.

Verification of this code: As far as I can tell, the current gf128 implementation is not formally verified. I'm not sure whether this optimized version would be harder to formally verify.

From my current understanding of libcrux, I would not recommend merging this PR as is. The limitations around constant-time guarantees in particular warrant a closer look. But it shows that there is significant potential for performance improvement in the portable aes-gcm implementation.

Benchmarks

Baseline:

aes gcm 128/libcrux/128 
                        time:   [8.2022 µs 8.2176 µs 8.2385 µs]
                        thrpt:  [14.817 MiB/s 14.855 MiB/s 14.883 MiB/s]
Found 15 outliers among 100 measurements (15.00%)
  6 (6.00%) high mild
  9 (9.00%) high severe
aes gcm 128/libcrux/1 KB
                        time:   [49.087 µs 49.178 µs 49.299 µs]
                        thrpt:  [19.809 MiB/s 19.858 MiB/s 19.895 MiB/s]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
Benchmarking aes gcm 128/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 49.3s, or reduce sample count to 10.
aes gcm 128/libcrux/10 MB
                        time:   [484.10 ms 484.49 ms 484.96 ms]
                        thrpt:  [20.620 MiB/s 20.640 MiB/s 20.657 MiB/s]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

aes gcm 256/libcrux/128 
                        time:   [9.8674 µs 9.8712 µs 9.8758 µs]
                        thrpt:  [12.361 MiB/s 12.366 MiB/s 12.371 MiB/s]
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe
aes gcm 256/libcrux/1 KB
                        time:   [60.014 µs 60.038 µs 60.066 µs]
                        thrpt:  [16.258 MiB/s 16.266 MiB/s 16.272 MiB/s]
Found 6 outliers among 100 measurements (6.00%)
  4 (4.00%) high mild
  2 (2.00%) high severe
Benchmarking aes gcm 256/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 60.1s, or reduce sample count to 10.
aes gcm 256/libcrux/10 MB
                        time:   [594.81 ms 595.70 ms 596.76 ms]
                        thrpt:  [16.757 MiB/s 16.787 MiB/s 16.812 MiB/s]
Found 14 outliers among 100 measurements (14.00%)
  5 (5.00%) high mild
  9 (9.00%) high severe

This PR:

aes gcm 128/libcrux/128 
                        time:   [5.1399 µs 5.1433 µs 5.1476 µs]
                        thrpt:  [23.714 MiB/s 23.734 MiB/s 23.750 MiB/s]
                 change:
                        time:   [−38.278% −37.510% −36.890%] (p = 0.00 < 0.05)
                        thrpt:  [+58.453% +60.025% +62.017%]
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
aes gcm 128/libcrux/1 KB
                        time:   [30.780 µs 30.796 µs 30.814 µs]
                        thrpt:  [31.693 MiB/s 31.711 MiB/s 31.727 MiB/s]
                 change:
                        time:   [−37.294% −36.720% −35.852%] (p = 0.00 < 0.05)
                        thrpt:  [+55.890% +58.029% +59.474%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  1 (1.00%) low mild
  6 (6.00%) high mild
  7 (7.00%) high severe
Benchmarking aes gcm 128/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 31.5s, or reduce sample count to 10.
aes gcm 128/libcrux/10 MB
                        time:   [307.24 ms 307.56 ms 307.95 ms]
                        thrpt:  [32.473 MiB/s 32.514 MiB/s 32.548 MiB/s]
                 change:
                        time:   [−36.611% −36.519% −36.431%] (p = 0.00 < 0.05)
                        thrpt:  [+57.310% +57.527% +57.757%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

aes gcm 256/libcrux/128 
                        time:   [6.7777 µs 6.8141 µs 6.8582 µs]
                        thrpt:  [17.799 MiB/s 17.914 MiB/s 18.011 MiB/s]
                 change:
                        time:   [−31.512% −31.242% −30.940%] (p = 0.00 < 0.05)
                        thrpt:  [+44.802% +45.437% +46.012%]
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  8 (8.00%) high mild
  6 (6.00%) high severe
aes gcm 256/libcrux/1 KB
                        time:   [40.781 µs 40.835 µs 40.930 µs]
                        thrpt:  [23.860 MiB/s 23.915 MiB/s 23.946 MiB/s]
                 change:
                        time:   [−32.025% −31.895% −31.745%] (p = 0.00 < 0.05)
                        thrpt:  [+46.509% +46.833% +47.113%]
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  4 (4.00%) high mild
  7 (7.00%) high severe
Benchmarking aes gcm 256/libcrux/10 MB: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 41.3s, or reduce sample count to 10.
aes gcm 256/libcrux/10 MB
                        time:   [405.76 ms 406.16 ms 406.64 ms]
                        thrpt:  [24.592 MiB/s 24.621 MiB/s 24.645 MiB/s]
                 change:
                        time:   [−31.954% −31.819% −31.686%] (p = 0.00 < 0.05)
                        thrpt:  [+46.383% +46.669% +46.960%]
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
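
As a sanity check on how the time and throughput figures above relate, the following small sketch recomputes two of them. It assumes the `128` benchmark label means a 128-byte input, which is consistent with the reported numbers; the helper names are hypothetical.

```rust
/// Throughput in MiB/s for `bytes` processed in `micros` microseconds.
pub fn mib_per_sec(bytes: f64, micros: f64) -> f64 {
    bytes / (micros * 1e-6) / (1024.0 * 1024.0)
}

/// Relative throughput change implied by a relative time change.
/// Both are fractions, e.g. -0.375 for a 37.5% reduction in time.
pub fn thrpt_change(time_change: f64) -> f64 {
    1.0 / (1.0 + time_change) - 1.0
}
```

With the baseline median of 8.2176 µs for 128 bytes, `mib_per_sec(128.0, 8.2176)` is about 14.855, and the reported time change of −37.510% corresponds to `thrpt_change(-0.37510)`, about +60.03%.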

@robinhundt robinhundt requested a review from a team as a code owner February 19, 2026 17:59
@jschneider-bensch
Collaborator

Thank you, that looks very interesting!
I'll make sure to check it out as soon as time allows.

@jschneider-bensch added the waiting-on-review label (Status: Awaiting review from the assignee but also interested parties.) Feb 24, 2026