[mldsa] Drop code size in various places for ML-DSA. #208
Open
jadephilipoom wants to merge 7 commits into master
These were not actually necessary. Saves 252 bytes of code size in ML-DSA-87 and improves signing performance slightly (0.02%-0.07% depending on the parameters). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
Since the addresses are always the same, we can save some code size and pointer registers by putting the loads inside the (i)ntt routines themselves. Saves 396 bytes of code size at a performance cost below 0.03% for all parameters (the slight slowdown comes from places that had previously loaded the pointer only once for several ntts). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
Use an early-exit flag to avoid lengthening the time to rejection for a bad signature. Saves a full 2000B of code size for ML-DSA-87 signing, at a performance cost of 1-5% for bookkeeping around the loop. Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
By trial and error, determined which .repts in poly.s have a negligible impact on performance and changed them all to use loopi (leaving some hot loops or loops with branches as .repts). Altogether saves 444B of IMEM for a 0.3%-0.6% performance penalty. Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
Use the register increment features of bn.lid/bn.sid instead of filling up GPRs with sequential constants. Saves 88B of code size, relieves some register pressure, and slightly improves performance (0.04% to 0.11% speedup across operations and parameters). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
Leverage the newly reduced register pressure to make (i)ntt exclusively clobber the t0-t5 registers instead of s0-s12, so there's no need to push/pop them. Saves 384B of code size and slightly improves performance (0.4% to 1.1% speedup across operations and parameters). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
Improves sign/verify performance by 6-9% across parameter sets (keygen is unaffected) and saves 204B of code size. Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
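As a rough illustration of the early-exit flag described in the third commit message above, here is a minimal Python model of the control flow (not the OTBN assembly itself). The commit message does not say which checks were rolled into the loop, so the sketch assumes a sequence of per-polynomial norm checks. The name `norms_ok` and the `checked` counter are hypothetical and exist only for illustration.

```python
def norms_ok(polys, bound):
    """Model of a rolled-up check loop with an early-exit flag.

    Assumption: every loop iteration is entered, so once a check fails
    the flag makes the remaining iterations skip their work instead of
    running the full check.
    """
    reject = False  # early-exit flag, set on the first failed check
    checked = 0     # number of polynomials actually examined
    for poly in polys:
        if reject:
            continue  # flag set: skip the expensive work for this iteration
        checked += 1
        # A coefficient at or above the bound fails the check.
        if any(abs(c) >= bound for c in poly):
            reject = True
    return (not reject, checked)
```

In this model a failing input stops doing real work at the first bad polynomial, so rejection does not take longer just because the checks now live in a loop. The flag test on each iteration is the kind of bookkeeping cost the commit message refers to.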
I went through the ML-DSA code to look for places where there might be slack to reduce code size and found some. Overall code size decreases (measured by the `_imem_end` symbol in the first ML-DSA-87 test):

Overall performance change (same 3 pseudorandom tests for both versions; tl;dr: significant speedups for sign/verify and very slight slowdowns for keygen):

Most of the speedup is from the last commit, which vectorizes `poly_chknorm`.

Changing out some `.rept`s with `K` and `L` iterations for `loopi`s should also make it easier to combine all these routines into a single binary, which we should do for code size reasons soon (because the `loopi`s can be straightforwardly replaced with `loop`s dependent on a runtime parameter). Almost all of the code size for each of keygen, sign, and verify is now for the shared polynomial and (i)ntt libraries, so we should be able to get one ~16KiB binary with all 9 operation/parameter set combinations available.
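The single-binary idea can be sketched in Python as a model of the control flow only, not as OTBN code. A loop with a build-time iteration count (like `.rept` or `loopi`) needs one copy per parameter set, while a loop whose count comes from a runtime value (like `loop`) can be shared. The `(K, L)` pairs are the ML-DSA matrix dimensions from FIPS 204. The function names and the vector-addition body are hypothetical stand-ins for whatever routine is being looped.

```python
# (K, L) matrix dimensions for each ML-DSA parameter set (FIPS 204).
PARAMS = {"ML-DSA-44": (4, 4), "ML-DSA-65": (6, 5), "ML-DSA-87": (8, 7)}


def add_vec_l_87(a, b):
    """Specialized copy: the count 7 (L for ML-DSA-87) is fixed at build
    time, analogous to a .rept or loopi with an immediate count."""
    return [[x + y for x, y in zip(a[i], b[i])] for i in range(7)]


def add_vec(a, b, count):
    """Shared copy: the count is supplied at run time, analogous to a
    loop instruction that reads its iteration count from a register."""
    return [[x + y for x, y in zip(a[i], b[i])] for i in range(count)]


# Usage: the shared routine covers every parameter set with one body.
k, l = PARAMS["ML-DSA-87"]
a = [[i, i + 1] for i in range(l)]
b = [[10, 20] for _ in range(l)]
same = add_vec(a, b, l) == add_vec_l_87(a, b)
```

With the count as a runtime argument, the same body serves ML-DSA-44 and ML-DSA-65 as well. That is why replacing build-time counts with runtime ones is a prerequisite for the combined binary.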