[mldsa] Drop code size in various places for ML-DSA. by jadephilipoom · Pull Request #208 · zerorisc/expo

jadephilipoom · 2026-02-27T16:26:14Z

I went through the ML-DSA code to look for places where there might be slack to reduce code size and found some. Overall code size decreases (measured by the _imem_end symbol in the first ML-DSA-87 test):

keygen: 14700 -> 13692 (-1008 bytes,  -6.9%)
sign:   19164 -> 15492 (-3672 bytes, -19.2%)
verify: 15832 -> 14140 (-1692 bytes, -10.7%)
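
The deltas and percentages above can be re-derived from the raw `_imem_end` values. A minimal check in Python follows; the helper name `size_delta` is made up for illustration and is not part of the repository:

```python
def size_delta(before, after):
    """Return (byte change, percent change rounded to one decimal place)."""
    change = after - before
    return change, round(100.0 * change / before, 1)


# Byte counts copied from the table above (first ML-DSA-87 test).
sizes = {
    "keygen": (14700, 13692),
    "sign": (19164, 15492),
    "verify": (15832, 14140),
}

for op, (before, after) in sizes.items():
    change, pct = size_delta(before, after)
    print(f"{op}: {before} -> {after} ({change:+d} bytes, {pct:+.1f}%)")
```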

Overall performance change (same 3 pseudorandom tests for both versions, tl;dr significant speedups for sign/verify and very slight slowdowns for keygen):

--- mldsa44_keypair ---
Average cycles: 90797 -> 90829 (+32,+0.04%)
Median cycles:  89868 -> 89900 (+32,+0.04%)
--- mldsa44_sign ---
Average cycles: 212042 -> 189422 (-22620,-10.67%)
Median cycles:  177196 -> 156765 (-20431,-11.53%)
--- mldsa44_verify ---
Average cycles: 80540 -> 73833 (-6707,-8.33%)
Median cycles:  80616 -> 73909 (-6707,-8.32%)
--- mldsa65_keypair ---
Average cycles: 153958 -> 154024 (+66,+0.04%)
Median cycles:  154655 -> 154721 (+66,+0.04%)
--- mldsa65_sign ---
Average cycles: 739341 -> 698155 (-41186,-5.57%)
Median cycles:  615763 -> 582420 (-33343,-5.41%)
--- mldsa65_verify ---
Average cycles: 123124 -> 114695 (-8429,-6.85%)
Median cycles:  124268 -> 115839 (-8429,-6.78%)
--- mldsa87_keypair ---
Average cycles: 212071 -> 212197 (+126,+0.06%)
Median cycles:  211261 -> 211387 (+126,+0.06%)
--- mldsa87_sign ---
Average cycles: 916245 -> 857147 (-59098,-6.45%)
Median cycles:  573485 -> 525900 (-47585,-8.30%)
--- mldsa87_verify ---
Average cycles: 192086 -> 180335 (-11751,-6.12%)
Median cycles:  192493 -> 180742 (-11751,-6.10%)

Most of the speedup is from the last commit, which vectorizes poly_chknorm.

Replacing some .repts that run for K or L iterations with loopi should also make it easier to combine all these routines into a single binary, which we should do soon for code size reasons: each loopi can be straightforwardly replaced with a loop that depends on a runtime parameter. Almost all of the code size for each of keygen, sign, and verify is now in the shared polynomial and (i)ntt libraries, so we should be able to get one ~16KiB binary with all 9 operation/parameter set combinations available.
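
To make the code size argument concrete, here is a rough model in Python. It assumes 4-byte instructions and a hypothetical 6-instruction loop body; the body length and the function names are illustrative, not measurements from this PR. `.rept` is an assembler directive that emits the body once per repetition, while `loopi` costs one instruction plus a single copy of the body:

```python
INSN_BYTES = 4  # assumption: every instruction is encoded in 32 bits


def rept_bytes(body_insns, reps):
    """Code size when the assembler unrolls the body `reps` times."""
    return INSN_BYTES * body_insns * reps


def loopi_bytes(body_insns):
    """Code size of one loopi instruction plus one copy of the body."""
    return INSN_BYTES * (body_insns + 1)


# (K, L) matrix dimensions per parameter set, as given in FIPS 204.
PARAMS = {"ML-DSA-44": (4, 4), "ML-DSA-65": (6, 5), "ML-DSA-87": (8, 7)}

BODY = 6  # hypothetical body length in instructions

for name, (k, _l) in PARAMS.items():
    print(f"{name}: .rept K = {rept_bytes(BODY, k)} B, loopi = {loopi_bytes(BODY)} B")
```

In this model the rolled size does not depend on K or L, so the same loop body could serve every parameter set once the iteration count comes from a register rather than an immediate.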

The removed push/pops were not actually necessary. Saves 252 bytes of code size in ML-DSA-87 and improves signing performance slightly (0.02%-0.07% depending on the parameters). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Since the addresses are always the same, we can save some code size and register pointers by putting the loads inside the (i)ntt routines themselves. Saves 396 bytes of code size at a performance cost below 0.03% for all parameters (the slight slowdown comes from places that previously loaded the pointer only once for several ntts). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Use an early-exit flag to avoid lengthening the time to rejection for a bad signature. Saves a full 2000B of code size for ML-DSA-87 signing, at a performance cost of 1-5% for bookkeeping around the loop. Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
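
The commit message does not show the code, so the following is only one possible reading of "early-exit flag", sketched in Python. It assumes a loop with a fixed trip count (as with a hardware loop) whose expensive body is skipped once a failure has been recorded. The function name and the norm check used as the body are placeholders:

```python
def check_all_fixed_trip(polys, bound):
    """Run a fixed number of iterations, skipping work after the first failure.

    Returns (failed, checks_run) so the effect of the flag is visible.
    """
    failed = False
    checks_run = 0
    for poly in polys:  # trip count is fixed up front
        if failed:
            continue  # flag set: skip the expensive body
        checks_run += 1
        if any(abs(c) >= bound for c in poly):
            failed = True  # remember the failure for later iterations
    return failed, checks_run


print(check_all_fixed_trip([[1, 2], [100, 0], [3, 4]], 10))
```

The flag test on every iteration is the kind of bookkeeping that would account for a small per-loop overhead while keeping the time to rejection short.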

By trial and error, determined which .repts in poly.s have a negligible impact on performance and changed them all to use loopi (leaving some hot loops or loops with branches as .repts). Altogether saves 444B of IMEM for a 0.3%-0.6% performance penalty. Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Use the register increment features of bn.lid/bn.sid instead of filling up GPRs with sequential constants. Saves 88B of code size, relieves some register pressure, and slightly improves performance (0.04% to 0.11% speedup across operations and parameters). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Leverage the newly reduced register pressure to make (i)ntt exclusively clobber the t0-t5 registers instead of s0-s12, so there's no need to push/pop them. Saves 384B of code size and slightly improves performance (0.4% to 1.1% speedup across operations and parameters). Signed-off-by: Jade Philipoom <jadep@zerorisc.com>

Vectorizing poly_chknorm improves sign/verify performance by 6-9% across parameter sets (keygen is unaffected) and saves 204B of code size. Signed-off-by: Jade Philipoom <jadep@zerorisc.com>
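
As a rough illustration of what vectorizing a norm check means, the Python sketch below compares a coefficient-by-coefficient check with one that handles a group of coefficients per step. It is not the assembly from this commit. The lane count of 8 (eight 32-bit coefficients per 256-bit register), the signed centered coefficient representation, and both function names are assumptions:

```python
LANES = 8  # assumption: eight 32-bit coefficients fit in one 256-bit register


def chknorm_scalar(coeffs, bound):
    """Return True if any coefficient has absolute value >= bound."""
    for c in coeffs:
        if abs(c) >= bound:
            return True
    return False


def chknorm_lanes(coeffs, bound):
    """Same check, LANES coefficients per step and one flag update per group."""
    violation = False
    for i in range(0, len(coeffs), LANES):
        group = coeffs[i:i + LANES]
        # Lane-wise compare; vector hardware would do this for all lanes at once.
        mask = [abs(c) >= bound for c in group]
        violation |= any(mask)
    return violation


poly = [0] * 256
poly[200] = -5
print(chknorm_scalar(poly, 5), chknorm_lanes(poly, 5))
```

Both functions return the same answer; the grouped version performs one flag update per group instead of one branch per coefficient.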

jadephilipoom added 7 commits February 27, 2026 16:41

[mldsa] Remove some unnecessary push/pops.

317ab9d


[mldsa] Vectorize poly_chknorm.

cc9ac52


jadephilipoom requested review from phamhnh and pqcfox February 27, 2026 16:26

jadephilipoom mentioned this pull request Mar 2, 2026

[mldsa] Remove stack indirection from all operations. #210

Draft

jadephilipoom wants to merge 7 commits into master from jadep/mldsa-code-size-improvements
