[aesgcm] improve portable gf128 performance#1340
Open
robinhundt wants to merge 1 commit intocryspen:mainfrom
Open
[aesgcm] improve portable gf128 performance#1340robinhundt wants to merge 1 commit intocryspen:mainfrom
robinhundt wants to merge 1 commit intocryspen:mainfrom
Conversation
Collaborator
|
Thank you, that looks very interesting! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Hi there,
while looking at the aes-gcm implementation I noticed that the performance of the portable gf128 implementation could be improved. This PR replaces the previous gf128 multiplication with an optimized carry-less 128 x 128 -> 256 bit multiplication followed by the usual reduction by the irreducible polynomial (same reduce as in
platform/x64/gf128_core.rs). This improves the throughput of the portable aes-gcm implementation by 50-60% (see benchmarks).The implementation of the carry-less multiplication is based on the algorithm used in bearssl and adapted from the RustCrypto implementation. Contrary to these implementations, I'm making use of Rust's u128 support, allowing me to skip the
Revtrick described in the blog post. Originally, I wrote this code for my cryprot-core library.Limitations
Constant-time: As the
clmul64implementation usesu128multiplication, it will not be constant-time on targets where this multiplication is not constant-time. Especially since this implementation would require constant-time 64 x 64 -> 128 bit multiplications (e.g. MULX on x86), I'm doubtful whether this implementation would provide a performance benefit, as those targets will often havePCLMULQDQ(or equivalent) available. There might be some ARM targets which have CTMUL/UMULHinstructions but don't have the crypto extensions and don't support PMULL. More information about the problem of constant-time multiplications is available here.32-bit targets: I'm unsure how this implementation would fare on 32-bit targets compared to e.g. the optimized one in RustCrypto/universal-hashes. Some quick napkin math and godbolt experimentation suggests that my implementation should compile to less MUL instructions, but I haven't investigated this properly.
Comparison to bearssl/RustCrypto implementation: The prior works use T x T -> T carry-less multiplications (for T = 32 or T = 64). I believe my version is faster on at least some targets, but I have not suitably benchmarked this.
Verification of this code: As far as I can tell, the current gf128 implementation is not formally verified. I'm not sure whether this optimized version would be harder to formally verify.
From my current understanding of libcrux, I would not recommend merging this PR as is. Especially the limitations around constant-time guarantees would definitely warrant a closer look. But it shows that there is a significant potential for performance improvement in the portable aes-gcm implementation.
Benchmarks
Baseline:
This PR: