Refactor squeeze/absorb to use mld_keccakf1600 primitives #436

bremoran · 2025-09-03T12:56:23Z

Squeeze & Absorb currently use plain xor & extract in C. This is fine for 64-bit scalar Keccak, but it doesn't work well for 32-bit architectures, since performance demands bit interleaving. As a result, keccak_absorb and keccak_squeeze should use the existing mld_keccakf1600_xor_bytes and mld_keccakf1600_extract_bytes.

This PR refactors Squeeze & Absorb to use mld_keccakf1600_xor_bytes and mld_keccakf1600_extract_bytes instead of ad-hoc C code for accessing the state.

mkannwischer

Thanks @bremoran. I support the refactoring of fips202.c.
The changes in keccakf1600.c seem not needed for the time being.

The CBMC proofs need to be adjusted (basically you need to add the newly called function to the USE_FUNCTION_CONTRACT= make variable in the respective failing proofs). I can take care of this after addressed my comment above.

mldsa/fips202/keccakf1600.c

github-actions

Mac Mini (M1, 2020) benchmarks (opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`49562` cycles	`50485` cycles	`0.98`
`ML-DSA-44 sign`	`198858` cycles	`222428` cycles	`0.89`
`ML-DSA-44 verify`	`58672` cycles	`72829` cycles	`0.81`
`ML-DSA-65 keypair`	`86329` cycles	`87391` cycles	`0.99`
`ML-DSA-65 sign`	`324629` cycles	`356394` cycles	`0.91`
`ML-DSA-65 verify`	`93809` cycles	`112773` cycles	`0.83`
`ML-DSA-87 keypair`	`139251` cycles	`140140` cycles	`0.99`
`ML-DSA-87 sign`	`395320` cycles	`425766` cycles	`0.93`
`ML-DSA-87 verify`	`148970` cycles	`173236` cycles	`0.86`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Mac Mini (M1, 2020) benchmarks (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`115116` cycles	`116059` cycles	`0.99`
`ML-DSA-44 sign`	`430512` cycles	`454902` cycles	`0.95`
`ML-DSA-44 verify`	`122178` cycles	`136834` cycles	`0.89`
`ML-DSA-65 keypair`	`196935` cycles	`198013` cycles	`0.99`
`ML-DSA-65 sign`	`700849` cycles	`733554` cycles	`0.96`
`ML-DSA-65 verify`	`197607` cycles	`216827` cycles	`0.91`
`ML-DSA-87 keypair`	`324939` cycles	`335358` cycles	`0.97`
`ML-DSA-87 sign`	`884167` cycles	`915279` cycles	`0.97`
`ML-DSA-87 verify`	`328670` cycles	`353531` cycles	`0.93`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`118642` cycles	`120291` cycles	`0.99`
`ML-DSA-44 sign`	`470944` cycles	`487575` cycles	`0.97`
`ML-DSA-44 verify`	`136176` cycles	`145607` cycles	`0.94`
`ML-DSA-65 keypair`	`204732` cycles	`207407` cycles	`0.99`
`ML-DSA-65 sign`	`781331` cycles	`803126` cycles	`0.97`
`ML-DSA-65 verify`	`219238` cycles	`231885` cycles	`0.95`
`ML-DSA-87 keypair`	`334153` cycles	`336321` cycles	`0.99`
`ML-DSA-87 sign`	`960834` cycles	`982698` cycles	`0.98`
`ML-DSA-87 verify`	`354097` cycles	`369410` cycles	`0.96`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`213163` cycles	`214921` cycles	`0.99`
`ML-DSA-44 sign`	`781109` cycles	`796938` cycles	`0.98`
`ML-DSA-44 verify`	`229936` cycles	`239879` cycles	`0.96`
`ML-DSA-65 keypair`	`381392` cycles	`383546` cycles	`0.99`
`ML-DSA-65 sign`	`1284323` cycles	`1316119` cycles	`0.98`
`ML-DSA-65 verify`	`372595` cycles	`384800` cycles	`0.97`
`ML-DSA-87 keypair`	`608888` cycles	`611048` cycles	`1.00`
`ML-DSA-87 sign`	`1642534` cycles	`1662747` cycles	`0.99`
`ML-DSA-87 verify`	`621151` cycles	`637375` cycles	`0.97`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Intel Xeon 4th gen (c7i)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`36212` cycles	`37336` cycles	`0.97`
`ML-DSA-44 sign`	`163085` cycles	`169656` cycles	`0.96`
`ML-DSA-44 verify`	`45930` cycles	`50105` cycles	`0.92`
`ML-DSA-65 keypair`	`65187` cycles	`65461` cycles	`1.00`
`ML-DSA-65 sign`	`269644` cycles	`279581` cycles	`0.96`
`ML-DSA-65 verify`	`73502` cycles	`78137` cycles	`0.94`
`ML-DSA-87 keypair`	`99000` cycles	`101115` cycles	`0.98`
`ML-DSA-87 sign`	`317827` cycles	`328206` cycles	`0.97`
`ML-DSA-87 verify`	`109541` cycles	`116497` cycles	`0.94`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Intel Xeon 3rd gen (c6i)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`59095` cycles	`60983` cycles	`0.97`
`ML-DSA-44 sign`	`255045` cycles	`265066` cycles	`0.96`
`ML-DSA-44 verify`	`74279` cycles	`80645` cycles	`0.92`
`ML-DSA-65 keypair`	`104923` cycles	`108431` cycles	`0.97`
`ML-DSA-65 sign`	`423767` cycles	`444769` cycles	`0.95`
`ML-DSA-65 verify`	`119732` cycles	`129801` cycles	`0.92`
`ML-DSA-87 keypair`	`160932` cycles	`163435` cycles	`0.98`
`ML-DSA-87 sign`	`499287` cycles	`511616` cycles	`0.98`
`ML-DSA-87 verify`	`179998` cycles	`189081` cycles	`0.95`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`94596` cycles	`96372` cycles	`0.98`
`ML-DSA-44 sign`	`342989` cycles	`351483` cycles	`0.98`
`ML-DSA-44 verify`	`101490` cycles	`105691` cycles	`0.96`
`ML-DSA-65 keypair`	`163526` cycles	`163710` cycles	`1.00`
`ML-DSA-65 sign`	`570845` cycles	`576859` cycles	`0.99`
`ML-DSA-65 verify`	`163518` cycles	`170256` cycles	`0.96`
`ML-DSA-87 keypair`	`269205` cycles	`273579` cycles	`0.98`
`ML-DSA-87 sign`	`722380` cycles	`735275` cycles	`0.98`
`ML-DSA-87 verify`	`271184` cycles	`280114` cycles	`0.97`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

AMD EPYC 3rd gen (c6a)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`72744` cycles	`74177` cycles	`0.98`
`ML-DSA-44 sign`	`257248` cycles	`268537` cycles	`0.96`
`ML-DSA-44 verify`	`82419` cycles	`89027` cycles	`0.93`
`ML-DSA-65 keypair`	`124718` cycles	`127704` cycles	`0.98`
`ML-DSA-65 sign`	`418393` cycles	`437619` cycles	`0.96`
`ML-DSA-65 verify`	`133899` cycles	`143019` cycles	`0.94`
`ML-DSA-87 keypair`	`215781` cycles	`212177` cycles	`1.02`
`ML-DSA-87 sign`	`537519` cycles	`542602` cycles	`0.99`
`ML-DSA-87 verify`	`223941` cycles	`230379` cycles	`0.97`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

AMD EPYC 4th gen (c7a)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`46361` cycles	`45310` cycles	`1.02`
`ML-DSA-44 sign`	`197761` cycles	`198143` cycles	`1.00`
`ML-DSA-44 verify`	`59051` cycles	`62984` cycles	`0.94`
`ML-DSA-65 keypair`	`76750` cycles	`76871` cycles	`1.00`
`ML-DSA-65 sign`	`306116` cycles	`325750` cycles	`0.94`
`ML-DSA-65 verify`	`87186` cycles	`96266` cycles	`0.91`
`ML-DSA-87 keypair`	`114996` cycles	`116674` cycles	`0.99`
`ML-DSA-87 sign`	`354465` cycles	`366154` cycles	`0.97`
`ML-DSA-87 verify`	`129885` cycles	`139952` cycles	`0.93`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`157050` cycles	`159577` cycles	`0.98`
`ML-DSA-44 sign`	`563354` cycles	`574753` cycles	`0.98`
`ML-DSA-44 verify`	`169123` cycles	`175203` cycles	`0.97`
`ML-DSA-65 keypair`	`269997` cycles	`272720` cycles	`0.99`
`ML-DSA-65 sign`	`929518` cycles	`945985` cycles	`0.98`
`ML-DSA-65 verify`	`274528` cycles	`283415` cycles	`0.97`
`ML-DSA-87 keypair`	`449825` cycles	`453069` cycles	`0.99`
`ML-DSA-87 sign`	`1176593` cycles	`1192316` cycles	`0.99`
`ML-DSA-87 verify`	`459136` cycles	`469703` cycles	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Graviton4

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`71522` cycles	`73300` cycles	`0.98`
`ML-DSA-44 sign`	`273874` cycles	`282486` cycles	`0.97`
`ML-DSA-44 verify`	`82240` cycles	`87271` cycles	`0.94`
`ML-DSA-65 keypair`	`126531` cycles	`129003` cycles	`0.98`
`ML-DSA-65 sign`	`449807` cycles	`461243` cycles	`0.98`
`ML-DSA-65 verify`	`133375` cycles	`139667` cycles	`0.95`
`ML-DSA-87 keypair`	`205265` cycles	`208392` cycles	`0.98`
`ML-DSA-87 sign`	`553986` cycles	`565901` cycles	`0.98`
`ML-DSA-87 verify`	`215759` cycles	`223269` cycles	`0.97`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`135508` cycles	`136345` cycles	`0.99`
`ML-DSA-44 sign`	`542398` cycles	`554700` cycles	`0.98`
`ML-DSA-44 verify`	`148663` cycles	`155242` cycles	`0.96`
`ML-DSA-65 keypair`	`227809` cycles	`227981` cycles	`1.00`
`ML-DSA-65 sign`	`874643` cycles	`893854` cycles	`0.98`
`ML-DSA-65 verify`	`236078` cycles	`243954` cycles	`0.97`
`ML-DSA-87 keypair`	`374966` cycles	`376878` cycles	`0.99`
`ML-DSA-87 sign`	`1098058` cycles	`1115644` cycles	`0.98`
`ML-DSA-87 verify`	`386905` cycles	`398895` cycles	`0.97`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`120986` cycles	`121632` cycles	`0.99`
`ML-DSA-44 sign`	`451932` cycles	`461721` cycles	`0.98`
`ML-DSA-44 verify`	`131661` cycles	`137157` cycles	`0.96`
`ML-DSA-65 keypair`	`205657` cycles	`205812` cycles	`1.00`
`ML-DSA-65 sign`	`740072` cycles	`752983` cycles	`0.98`
`ML-DSA-65 verify`	`211066` cycles	`216831` cycles	`0.97`
`ML-DSA-87 keypair`	`339147` cycles	`339530` cycles	`1.00`
`ML-DSA-87 sign`	`942332` cycles	`956240` cycles	`0.99`
`ML-DSA-87 verify`	`348162` cycles	`357463` cycles	`0.97`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Graviton2

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`119130` cycles	`120799` cycles	`0.99`
`ML-DSA-44 sign`	`472589` cycles	`488399` cycles	`0.97`
`ML-DSA-44 verify`	`136811` cycles	`145989` cycles	`0.94`
`ML-DSA-65 keypair`	`204918` cycles	`208066` cycles	`0.98`
`ML-DSA-65 sign`	`781470` cycles	`805186` cycles	`0.97`
`ML-DSA-65 verify`	`219277` cycles	`232525` cycles	`0.94`
`ML-DSA-87 keypair`	`334609` cycles	`336894` cycles	`0.99`
`ML-DSA-87 sign`	`962569` cycles	`985538` cycles	`0.98`
`ML-DSA-87 verify`	`354138` cycles	`370175` cycles	`0.96`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Graviton3

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`76608` cycles	`77954` cycles	`0.98`
`ML-DSA-44 sign`	`293501` cycles	`304386` cycles	`0.96`
`ML-DSA-44 verify`	`89487` cycles	`95538` cycles	`0.94`
`ML-DSA-65 keypair`	`133497` cycles	`135025` cycles	`0.99`
`ML-DSA-65 sign`	`482326` cycles	`496830` cycles	`0.97`
`ML-DSA-65 verify`	`143934` cycles	`151258` cycles	`0.95`
`ML-DSA-87 keypair`	`215860` cycles	`217588` cycles	`0.99`
`ML-DSA-87 sign`	`594391` cycles	`607209` cycles	`0.98`
`ML-DSA-87 verify`	`230110` cycles	`239670` cycles	`0.96`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Graviton4 (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`133082` cycles	`134660` cycles	`0.99`
`ML-DSA-44 sign`	`498260` cycles	`506939` cycles	`0.98`
`ML-DSA-44 verify`	`144896` cycles	`149574` cycles	`0.97`
`ML-DSA-65 keypair`	`226122` cycles	`228568` cycles	`0.99`
`ML-DSA-65 sign`	`813917` cycles	`824817` cycles	`0.99`
`ML-DSA-65 verify`	`231233` cycles	`237103` cycles	`0.98`
`ML-DSA-87 keypair`	`374601` cycles	`377328` cycles	`0.99`
`ML-DSA-87 sign`	`1019872` cycles	`1030998` cycles	`0.99`
`ML-DSA-87 verify`	`383530` cycles	`390919` cycles	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Graviton3 (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`138666` cycles	`139731` cycles	`0.99`
`ML-DSA-44 sign`	`495052` cycles	`505692` cycles	`0.98`
`ML-DSA-44 verify`	`148928` cycles	`154387` cycles	`0.96`
`ML-DSA-65 keypair`	`242421` cycles	`245185` cycles	`0.99`
`ML-DSA-65 sign`	`810192` cycles	`823206` cycles	`0.98`
`ML-DSA-65 verify`	`241091` cycles	`248706` cycles	`0.97`
`ML-DSA-87 keypair`	`396138` cycles	`397296` cycles	`1.00`
`ML-DSA-87 sign`	`1031110` cycles	`1044080` cycles	`0.99`
`ML-DSA-87 verify`	`402422` cycles	`411301` cycles	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.

oqs-bot

Graviton2 (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`213471` cycles	`215538` cycles	`0.99`
`ML-DSA-44 sign`	`782770` cycles	`809773` cycles	`0.97`
`ML-DSA-44 verify`	`230256` cycles	`239814` cycles	`0.96`
`ML-DSA-65 keypair`	`381757` cycles	`384029` cycles	`0.99`
`ML-DSA-65 sign`	`1286038` cycles	`1307857` cycles	`0.98`
`ML-DSA-65 verify`	`373026` cycles	`385278` cycles	`0.97`
`ML-DSA-87 keypair`	`609835` cycles	`611946` cycles	`1.00`
`ML-DSA-87 sign`	`1645687` cycles	`1665959` cycles	`0.99`
`ML-DSA-87 verify`	`621822` cycles	`637587` cycles	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`824555` cycles	`828347` cycles	`1.00`
`ML-DSA-44 sign`	`3322292` cycles	`3361635` cycles	`0.99`
`ML-DSA-44 verify`	`910081` cycles	`931593` cycles	`0.98`
`ML-DSA-65 keypair`	`1390153` cycles	`1394523` cycles	`1.00`
`ML-DSA-65 sign`	`5439685` cycles	`5464549` cycles	`1.00`
`ML-DSA-65 verify`	`1455949` cycles	`1479842` cycles	`0.98`
`ML-DSA-87 keypair`	`2307451` cycles	`2308212` cycles	`1.00`
`ML-DSA-87 sign`	`6876862` cycles	`6902480` cycles	`1.00`
`ML-DSA-87 verify`	`2410586` cycles	`2435035` cycles	`0.99`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`293900` cycles	`299048` cycles	`0.98`
`ML-DSA-44 sign`	`1209347` cycles	`1235706` cycles	`0.98`
`ML-DSA-44 verify`	`343802` cycles	`358136` cycles	`0.96`
`ML-DSA-65 keypair`	`497019` cycles	`506696` cycles	`0.98`
`ML-DSA-65 sign`	`1941494` cycles	`2024461` cycles	`0.96`
`ML-DSA-65 verify`	`540122` cycles	`561853` cycles	`0.96`
`ML-DSA-87 keypair`	`850216` cycles	`847773` cycles	`1.00`
`ML-DSA-87 sign`	`2534477` cycles	`2562883` cycles	`0.99`
`ML-DSA-87 verify`	`902552` cycles	`922665` cycles	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`466413` cycles	`469956` cycles	`0.99`
`ML-DSA-44 sign`	`2215549` cycles	`2239780` cycles	`0.99`
`ML-DSA-44 verify`	`550719` cycles	`565745` cycles	`0.97`
`ML-DSA-65 keypair`	`776107` cycles	`782939` cycles	`0.99`
`ML-DSA-65 sign`	`3632934` cycles	`3681139` cycles	`0.99`
`ML-DSA-65 verify`	`850930` cycles	`868526` cycles	`0.98`
`ML-DSA-87 keypair`	`1258536` cycles	`1271149` cycles	`0.99`
`ML-DSA-87 sign`	`4510629` cycles	`4570247` cycles	`0.99`
`ML-DSA-87 verify`	`1370939` cycles	`1397775` cycles	`0.98`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`241156` cycles	`233930` cycles	`1.03`
`ML-DSA-44 sign`	`813433` cycles	`832225` cycles	`0.98`
`ML-DSA-44 verify`	`255088` cycles	`267150` cycles	`0.95`
`ML-DSA-65 keypair`	`406266` cycles	`409016` cycles	`0.99`
`ML-DSA-65 sign`	`1339350` cycles	`1339528` cycles	`1.00`
`ML-DSA-65 verify`	`412574` cycles	`424019` cycles	`0.97`
`ML-DSA-87 keypair`	`691161` cycles	`671470` cycles	`1.03`
`ML-DSA-87 sign`	`1740077` cycles	`1682075` cycles	`1.03`
`ML-DSA-87 verify`	`687866` cycles	`697557` cycles	`0.99`

This comment was automatically generated by workflow using github-action-benchmark.

github-actions

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`306953` cycles	`314364` cycles	`0.98`
`ML-DSA-44 sign`	`1214223` cycles	`1243263` cycles	`0.98`
`ML-DSA-44 verify`	`340116` cycles	`357064` cycles	`0.95`
`ML-DSA-65 keypair`	`557834` cycles	`572673` cycles	`0.97`
`ML-DSA-65 sign`	`1949763` cycles	`2004422` cycles	`0.97`
`ML-DSA-65 verify`	`530903` cycles	`554343` cycles	`0.96`
`ML-DSA-87 keypair`	`857966` cycles	`881801` cycles	`0.97`
`ML-DSA-87 sign`	`2440171` cycles	`2515457` cycles	`0.97`
`ML-DSA-87 verify`	`875816` cycles	`915213` cycles	`0.96`

This comment was automatically generated by workflow using github-action-benchmark.

mkannwischer

Thanks for the updates! Amazing speed ups!

I am happy with these changes now. Before we merge the commit history needs to be cleaned up though.

@hanno-becker, could you also have a brief look - particularly the CBMC proof changes?

mldsa/fips202/fips202.c

hanno-becker · 2025-09-07T04:09:49Z

mldsa/fips202/fips202.c

+  if (pos == r)
+  {
+    mld_keccakf1600_permute(s);
+    pos = 0;
+  }


This is a functional change, isn't it? We now permute when running with pos == r and outlen == 0, while previously, this would have been a no-op.

You're correct. I will fix.

There's also the point that the tests didn't catch this => it needs a test as well.

This is not a correctness issue, but one of consistency I think. Previously, it seems that pos == r was allowed and would lead to permutation and pos = 0 reset only upon the next squeeze. With the change -- and it is still the case after your latest change -- the squeeze logic instead permutes and resets to pos = 0 immediately when the block is full. Either is OK, but I think it should be consistent with finalize which does not permute and reset.

If possible, I would like to keep this uniform and in line with mlkem-native. Can you have a think?

Yes, I will issue a PR to mlkem-native as well, since it is also using an assignment loop instead of an accelerated memset.

github-actions

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite	Current: `bce8611`	Previous: `d0e5eba`	Ratio
`ML-DSA-44 keypair`	`241156` cycles	`233930` cycles	`1.03`
`ML-DSA-87 sign`	`1740077` cycles	`1682075` cycles	`1.03`

This comment was automatically generated by workflow using github-action-benchmark.

hanno-becker · 2025-09-09T03:33:32Z

mldsa/fips202/fips202.c

@@ -199,15 +120,15 @@ static unsigned int keccak_squeeze(uint8_t *out, size_t outlen,
                                   unsigned int pos, unsigned int r)
 __contract__(
  requires((r == SHAKE128_RATE && pos <= SHAKE128_RATE) ||
-    (r == SHAKE256_RATE && pos <= SHAKE256_RATE))
+           (r == SHAKE256_RATE && pos <= SHAKE256_RATE) ||


Why do we restrict r in the first place?

hanno-becker

I support the change per-se, and the speedup is great, thank you @bremoran.

There is one minor issue of consistency, though, with regards to when to permute and reset pos = 0 during squeezing. Previously, we would do this lazily only when we need to squeeze more. Now, we keep pos = r at finalize, but squeeze eagerly permutes and resets to pos = 0. If we can uphold the previous approach (in line with mlkem-native) while still riping the performance benefits, I'd prefer that.

hanno-becker · 2025-09-09T03:45:34Z

mldsa/fips202/fips202.c

+  if ((outlen > 0) && (pos >= r))
+  {
+    mld_keccakf1600_permute(s);
+    pos = 0;
+  }


Would this case not be handled correctly by the first loop iteration already? (xor_bytes being a no-op)

hanno-becker · 2025-09-09T03:45:55Z

mldsa/fips202/fips202.c

+    pos = 0;
+  }
+
+  while (pos + bytes_to_go >= r)


Perhaps the previous behavior can be retained simply by using > here.

Signed-off-by: willieyz <willie.zhao@chelpis.com>

bremoran · 2025-09-09T09:26:24Z

I support the change per-se, and the speedup is great, thank you @bremoran.

There is one minor issue of consistency, though, with regards to when to permute and reset pos = 0 during squeezing. Previously, we would do this lazily only when we need to squeeze more. Now, we keep pos = r at finalize, but squeeze eagerly permutes and resets to pos = 0. If we can uphold the previous approach (in line with mlkem-native) while still riping the performance benefits, I'd prefer that.

Unfortunately mlkem-native and mldsa-native are already quite inconsistent. In fact, the API we're discussing doesn't actually exist in mlkem-native.
mldsa-native provides shake128_squeeze, while mlkem-native provides shake128_squeezeblocks.
Thus, the API used by mldsa-native's shake128_squeeze (keccak_squeeze) is not used by mlkem-native's shake128_squeezeblocks. It uses keccak_squeezeblocks instead.

I'm not sure what consistency you're aiming for here?

It might be worth attempting to realign mldsa-native with mlkem-native if this is the goal, however it will be a much deeper API review than this small PR.

hanno-becker · 2025-09-09T09:58:01Z

I'm not sure what consistency you're aiming for here?

See above: After finalize, pos == r but you don't permute and reset to r = 0 immediately. But the logic in squeeze does that.

bremoran · 2025-09-09T12:56:14Z

I'm not sure what consistency you're aiming for here?

See above: After finalize, pos == r but you don't permute and reset to r = 0 immediately. But the logic in squeeze does that.

I think you may have missed my point. This code does not exist in mlkem-native. That is the source of the "inconsistency." It's rather difficult to be consistent with code that doesn't exist.

The closest code in mlkem-native is:

static void mlk_keccak_squeezeblocks(uint8_t *h, size_t nblocks, uint64_t *s,
                                     uint32_t r)

Note that this does not track pos.

A second implementation is also quite close.

static void mlk_keccak_squeeze_once(uint8_t *h, size_t outlen, uint64_t *s,
                                    uint32_t r)

This also does not track pos.

Of course the mldsa-native code is "inconsistent." There is no equivalent mlkem-native code.

bremoran · 2025-09-09T15:48:47Z

@hanno-becker hopefully this change is more what you had in mind.

Squeeze & Absorb currently use plain xor & extract in C. This is fine for 64-bit scalar Keccak, but it doesn't work well for 32-bit architectures, since performance demands bit interleaving. As a result, keccak_absorb and keccak_squeeze should use the existing mld_keccakf1600_xor_bytes and mld_keccakf1600_extract_bytes. This commit refactors Squeeze & Absorb to use mld_keccakf1600_xor_bytes and mld_keccakf1600_extract_bytes instead of ad-hoc C code for accessing the state. Co-authored-by: Matthias J. Kannwischer <matthias@kannwischer.eu> Signed-off-by: Brendan Moran <brendan.moran@arm.com> Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu> Signed-off-by: Brendan Moran <brendan.moran@arm.com>

bremoran requested a review from a team as a code owner September 3, 2025 12:56

bremoran force-pushed the native-xor branch from 5609412 to 8e2adb4 Compare September 3, 2025 20:50

mkannwischer requested changes Sep 4, 2025

View reviewed changes

mldsa/fips202/keccakf1600.c Outdated Show resolved Hide resolved

bremoran force-pushed the native-xor branch from 6f296d2 to 902a6dc Compare September 5, 2025 14:19

mkannwischer added the benchmark label Sep 5, 2025

github-actions bot reviewed Sep 5, 2025

View reviewed changes

oqs-bot reviewed Sep 5, 2025

View reviewed changes

github-actions bot reviewed Sep 5, 2025

View reviewed changes

mkannwischer approved these changes Sep 6, 2025

View reviewed changes

hanno-becker reviewed Sep 6, 2025

View reviewed changes

mldsa/fips202/fips202.c Outdated Show resolved Hide resolved

hanno-becker reviewed Sep 6, 2025

View reviewed changes

mldsa/fips202/fips202.c Show resolved Hide resolved

hanno-becker reviewed Sep 7, 2025

View reviewed changes

bremoran force-pushed the native-xor branch from bf00fc6 to e81f01f Compare September 8, 2025 11:54

bremoran requested review from mkannwischer and hanno-becker September 8, 2025 11:54

mkannwischer mentioned this pull request Sep 9, 2025

Allow customization of memcpy and memset #441

Open

mkannwischer force-pushed the native-xor branch from a8da8c6 to bce8611 Compare September 9, 2025 01:38

mkannwischer added benchmark and removed benchmark labels Sep 9, 2025

github-actions bot reviewed Sep 9, 2025

View reviewed changes

hanno-becker reviewed Sep 9, 2025

View reviewed changes

hanno-becker requested changes Sep 9, 2025

View reviewed changes

hanno-becker reviewed Sep 9, 2025

View reviewed changes

willieyz added a commit that referenced this pull request Sep 9, 2025

ref: replace memset with mld_memset in PR #436

5fa0bef

Signed-off-by: willieyz <willie.zhao@chelpis.com>

willieyz mentioned this pull request Sep 9, 2025

Allow customization of memcpy and memset #442

Open

bremoran force-pushed the native-xor branch from dada7f1 to 8ac075f Compare September 17, 2025 07:59

bremoran requested a review from hanno-becker September 17, 2025 08:00

Refactor squeeze/absorb to use mld_keccakf1600 primitives #436

Are you sure you want to change the base?

Refactor squeeze/absorb to use mld_keccakf1600 primitives #436

Uh oh!

Conversation

bremoran commented Sep 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mkannwischer left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

github-actions bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Mac Mini (M1, 2020) benchmarks (opt)

Uh oh!

github-actions bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Mac Mini (M1, 2020) benchmarks (no-opt)

Uh oh!

github-actions bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Uh oh!

github-actions bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Intel Xeon 4th gen (c7i)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Intel Xeon 3rd gen (c6i)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Intel Xeon 4th gen (c7i) (no-opt)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

AMD EPYC 3rd gen (c6a)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

AMD EPYC 4th gen (c7a)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Intel Xeon 3rd gen (c6i) (no-opt)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Graviton4

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

AMD EPYC 3rd gen (c6a) (no-opt)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

AMD EPYC 4th gen (c7a) (no-opt)

Uh oh!

oqs-bot left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

bremoran commented Sep 3, 2025 •

edited

Loading

github-actions bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

oqs-bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

github-actions bot left a comment •

edited

Loading

mkannwischer left a comment •

edited

Loading

hanno-becker Sep 9, 2025 •

edited

Loading