[mldsa] Drop code size in various places for ML-DSA. #208

Open
jadephilipoom wants to merge 7 commits into master from jadep/mldsa-code-size-improvements

jadephilipoom commented Feb 27, 2026

I went through the ML-DSA code to look for places where there might be slack to reduce code size and found some. Overall code size decreases (measured by the _imem_end symbol in the first ML-DSA-87 test):

keygen: 14700 -> 13692 (-1008 bytes,  -6.9%)
sign:   19164 -> 15492 (-3672 bytes, -19.2%)
verify: 15832 -> 14140 (-1692 bytes, -10.7%)
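
For anyone re-measuring after further changes, the deltas in the table above can be recomputed with a few lines of Python. This is only a convenience sketch: the sizes are copied from the table, and the helper name `size_change` is made up for illustration.

```python
# Code sizes in bytes (value of _imem_end), copied from the table above.
SIZES = {
    "keygen": (14700, 13692),
    "sign": (19164, 15492),
    "verify": (15832, 14140),
}


def size_change(before, after):
    """Return (delta in bytes, delta as a percentage of the old size)."""
    delta = after - before
    return delta, round(100.0 * delta / before, 1)


for name, (before, after) in SIZES.items():
    delta, pct = size_change(before, after)
    print(f"{name}: {before} -> {after} ({delta:+d} bytes, {pct:+.1f}%)")
```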

Overall performance change (the same 3 pseudorandom tests for both versions; tl;dr: significant speedups for sign/verify and very slight slowdowns for keygen):

--- mldsa44_keypair ---
Average cycles: 90797 -> 90829 (+32,+0.04%)
Median cycles:  89868 -> 89900 (+32,+0.04%)
--- mldsa44_sign ---
Average cycles: 212042 -> 189422 (-22620,-10.67%)
Median cycles:  177196 -> 156765 (-20431,-11.53%)
--- mldsa44_verify ---
Average cycles: 80540 -> 73833 (-6707,-8.33%)
Median cycles:  80616 -> 73909 (-6707,-8.32%)
--- mldsa65_keypair ---
Average cycles: 153958 -> 154024 (+66,+0.04%)
Median cycles:  154655 -> 154721 (+66,+0.04%)
--- mldsa65_sign ---
Average cycles: 739341 -> 698155 (-41186,-5.57%)
Median cycles:  615763 -> 582420 (-33343,-5.41%)
--- mldsa65_verify ---
Average cycles: 123124 -> 114695 (-8429,-6.85%)
Median cycles:  124268 -> 115839 (-8429,-6.78%)
--- mldsa87_keypair ---
Average cycles: 212071 -> 212197 (+126,+0.06%)
Median cycles:  211261 -> 211387 (+126,+0.06%)
--- mldsa87_sign ---
Average cycles: 916245 -> 857147 (-59098,-6.45%)
Median cycles:  573485 -> 525900 (-47585,-8.30%)
--- mldsa87_verify ---
Average cycles: 192086 -> 180335 (-11751,-6.12%)
Median cycles:  192493 -> 180742 (-11751,-6.10%)

Most of the speedup is from the last commit, which vectorizes poly_chknorm.
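
For readers unfamiliar with the routine: in the ML-DSA reference code, poly_chknorm reports whether any coefficient of a polynomial has absolute value at or above a given bound. The Python sketch below contrasts a per-coefficient check with one that handles a group of coefficients per step. It is a model of the idea only, not the assembly from this PR: the function names, the lane count of 8 (eight 32-bit coefficients per 256-bit word), and the flag accumulation are assumptions made for illustration.

```python
N = 256    # coefficients per ML-DSA polynomial
LANES = 8  # assumption: eight 32-bit coefficients per 256-bit word


def chknorm_scalar(coeffs, bound):
    """Return True if any centered coefficient has |c| >= bound."""
    for c in coeffs:
        if abs(c) >= bound:
            return True
    return False


def chknorm_lanes(coeffs, bound):
    """Same check, but LANES coefficients per step and no branch per coefficient."""
    reject = 0
    for i in range(0, len(coeffs), LANES):
        word = coeffs[i:i + LANES]
        # Stand-in for one vector comparison: a 0/1 mask per lane,
        # folded into a single sticky reject flag.
        mask = [int(abs(c) >= bound) for c in word]
        reject |= max(mask)
    return bool(reject)
```

Both functions return the same answer for the same input; the second does one comparison step per group of coefficients rather than one per coefficient, which is the kind of saving a vectorized check is after.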

Replacing some .repts that have K and L iterations with loopi instructions should also make it easier to combine all these routines into a single binary, which we should do soon for code size reasons (a loopi can be straightforwardly replaced with a loop whose iteration count depends on a runtime parameter). Almost all of the code size for each of keygen, sign, and verify is now in the shared polynomial and (i)ntt libraries, so we should be able to get one ~16KiB binary with all 9 operation/parameter set combinations available.
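
A rough model of why this trade saves IMEM: .rept is an assembler directive, so the body is emitted once per iteration, while a hardware loop emits the body once plus a single loop instruction. The Python sketch below assumes 4-byte instructions and a hypothetical 10-instruction body; the function names are invented for illustration, and any setup instructions a loop might need are ignored.

```python
INSTR_BYTES = 4  # assumption: every instruction occupies 32 bits of IMEM


def rept_bytes(body_instrs, iterations):
    """.rept duplicates the body, so size scales with the iteration count."""
    return INSTR_BYTES * body_instrs * iterations


def loop_bytes(body_instrs):
    """A hardware loop stores the body once plus one loop instruction."""
    return INSTR_BYTES * (body_instrs + 1)


def savings(body_instrs, iterations):
    """Bytes saved by replacing a .rept with a loop over the same body."""
    return rept_bytes(body_instrs, iterations) - loop_bytes(body_instrs)


# Hypothetical 10-instruction body repeated K = 8 times (K for ML-DSA-87).
print(savings(10, 8))  # 320 - 44 = 276 bytes
```

The same model shows why the looped form suits a combined binary: the looped size does not depend on the iteration count, so one copy of the code could serve every parameter set once the count comes from a register instead of an immediate.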

These were not actually necessary. Saves 252 bytes of code size in ML-DSA-87
and improves signing performance slightly (0.02%-0.07% depending on the
parameters).

Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Since the addresses are always the same, we can save some code size and
register pointers by putting the loads inside the (i)ntt routines themselves.
Saves 396 bytes of code size at a performance cost below 0.03% for all
parameters (the slight slowdown comes from places that had previously only
loaded the pointer once for several ntts).

Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Use an early-exit flag to avoid lengthening the time to rejection for a bad
signature. Saves a full 2000B of code size for ML-DSA-87 signing, at a
performance cost of 1-5% for bookkeeping around the loop.

Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

By trial and error, determined which .repts in poly.s have a negligible impact
on performance and changed them all to use loopi (leaving some hot loops or
loops with branches as .repts). Altogether saves 444B of IMEM for a 0.3%-0.6%
performance penalty.

Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Use the register increment features of bn.lid/bn.sid instead of filling up GPRs
with sequential constants. Saves 88B of code size, relieves some register
pressure, and slightly improves performance (0.04% to 0.11% speedup across
operations and parameters).

Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Leverage the newly reduced register pressure to make (i)ntt exclusively clobber
the t0-t5 registers instead of s0-s12, so there's no need to push/pop them.
Saves 384B of code size and slightly improves performance (0.4% to 1.1% speedup
across operations and parameters).

Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Improves sign/verify performance by 6-9% across parameter sets (keygen is
unaffected) and saves 204B of code size.

Signed-off-by: Jade Philipoom <jadep@zerorisc.com>