Skip to content

Conversation

bremoran
Copy link

@bremoran bremoran commented Sep 3, 2025

Squeeze & Absorb currently use plain xor & extract in C. This is fine for 64-bit scalar Keccak, but it doesn't work well for 32-bit architectures, since performance demands bit interleaving. As a result, keccak_absorb and keccak_squeeze should use the existing mld_keccakf1600_xor_bytes and mld_keccakf1600_extract_bytes.

This PR refactors Squeeze & Absorb to use mld_keccakf1600_xor_bytes and mld_keccakf1600_extract_bytes instead of ad-hoc C code for accessing the state.

@bremoran bremoran requested a review from a team as a code owner September 3, 2025 12:56
Copy link
Contributor

@mkannwischer mkannwischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @bremoran. I support the refactoring of fips202.c.
The changes in keccakf1600.c seem not needed for the time being.

The CBMC proofs need to be adjusted (basically you need to add the newly called function to the USE_FUNCTION_CONTRACT= make variable in the respective failing proofs). I can take care of this after addressed my comment above.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mac Mini (M1, 2020) benchmarks (opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 49562 cycles 50485 cycles 0.98
ML-DSA-44 sign 198858 cycles 222428 cycles 0.89
ML-DSA-44 verify 58672 cycles 72829 cycles 0.81
ML-DSA-65 keypair 86329 cycles 87391 cycles 0.99
ML-DSA-65 sign 324629 cycles 356394 cycles 0.91
ML-DSA-65 verify 93809 cycles 112773 cycles 0.83
ML-DSA-87 keypair 139251 cycles 140140 cycles 0.99
ML-DSA-87 sign 395320 cycles 425766 cycles 0.93
ML-DSA-87 verify 148970 cycles 173236 cycles 0.86

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mac Mini (M1, 2020) benchmarks (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 115116 cycles 116059 cycles 0.99
ML-DSA-44 sign 430512 cycles 454902 cycles 0.95
ML-DSA-44 verify 122178 cycles 136834 cycles 0.89
ML-DSA-65 keypair 196935 cycles 198013 cycles 0.99
ML-DSA-65 sign 700849 cycles 733554 cycles 0.96
ML-DSA-65 verify 197607 cycles 216827 cycles 0.91
ML-DSA-87 keypair 324939 cycles 335358 cycles 0.97
ML-DSA-87 sign 884167 cycles 915279 cycles 0.97
ML-DSA-87 verify 328670 cycles 353531 cycles 0.93

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 118642 cycles 120291 cycles 0.99
ML-DSA-44 sign 470944 cycles 487575 cycles 0.97
ML-DSA-44 verify 136176 cycles 145607 cycles 0.94
ML-DSA-65 keypair 204732 cycles 207407 cycles 0.99
ML-DSA-65 sign 781331 cycles 803126 cycles 0.97
ML-DSA-65 verify 219238 cycles 231885 cycles 0.95
ML-DSA-87 keypair 334153 cycles 336321 cycles 0.99
ML-DSA-87 sign 960834 cycles 982698 cycles 0.98
ML-DSA-87 verify 354097 cycles 369410 cycles 0.96

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A76 (Raspberry Pi 5) benchmarks (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 213163 cycles 214921 cycles 0.99
ML-DSA-44 sign 781109 cycles 796938 cycles 0.98
ML-DSA-44 verify 229936 cycles 239879 cycles 0.96
ML-DSA-65 keypair 381392 cycles 383546 cycles 0.99
ML-DSA-65 sign 1284323 cycles 1316119 cycles 0.98
ML-DSA-65 verify 372595 cycles 384800 cycles 0.97
ML-DSA-87 keypair 608888 cycles 611048 cycles 1.00
ML-DSA-87 sign 1642534 cycles 1662747 cycles 0.99
ML-DSA-87 verify 621151 cycles 637375 cycles 0.97

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 4th gen (c7i)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 36212 cycles 37336 cycles 0.97
ML-DSA-44 sign 163085 cycles 169656 cycles 0.96
ML-DSA-44 verify 45930 cycles 50105 cycles 0.92
ML-DSA-65 keypair 65187 cycles 65461 cycles 1.00
ML-DSA-65 sign 269644 cycles 279581 cycles 0.96
ML-DSA-65 verify 73502 cycles 78137 cycles 0.94
ML-DSA-87 keypair 99000 cycles 101115 cycles 0.98
ML-DSA-87 sign 317827 cycles 328206 cycles 0.97
ML-DSA-87 verify 109541 cycles 116497 cycles 0.94

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 3rd gen (c6i)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 59095 cycles 60983 cycles 0.97
ML-DSA-44 sign 255045 cycles 265066 cycles 0.96
ML-DSA-44 verify 74279 cycles 80645 cycles 0.92
ML-DSA-65 keypair 104923 cycles 108431 cycles 0.97
ML-DSA-65 sign 423767 cycles 444769 cycles 0.95
ML-DSA-65 verify 119732 cycles 129801 cycles 0.92
ML-DSA-87 keypair 160932 cycles 163435 cycles 0.98
ML-DSA-87 sign 499287 cycles 511616 cycles 0.98
ML-DSA-87 verify 179998 cycles 189081 cycles 0.95

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 4th gen (c7i) (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 94596 cycles 96372 cycles 0.98
ML-DSA-44 sign 342989 cycles 351483 cycles 0.98
ML-DSA-44 verify 101490 cycles 105691 cycles 0.96
ML-DSA-65 keypair 163526 cycles 163710 cycles 1.00
ML-DSA-65 sign 570845 cycles 576859 cycles 0.99
ML-DSA-65 verify 163518 cycles 170256 cycles 0.96
ML-DSA-87 keypair 269205 cycles 273579 cycles 0.98
ML-DSA-87 sign 722380 cycles 735275 cycles 0.98
ML-DSA-87 verify 271184 cycles 280114 cycles 0.97

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 3rd gen (c6a)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 72744 cycles 74177 cycles 0.98
ML-DSA-44 sign 257248 cycles 268537 cycles 0.96
ML-DSA-44 verify 82419 cycles 89027 cycles 0.93
ML-DSA-65 keypair 124718 cycles 127704 cycles 0.98
ML-DSA-65 sign 418393 cycles 437619 cycles 0.96
ML-DSA-65 verify 133899 cycles 143019 cycles 0.94
ML-DSA-87 keypair 215781 cycles 212177 cycles 1.02
ML-DSA-87 sign 537519 cycles 542602 cycles 0.99
ML-DSA-87 verify 223941 cycles 230379 cycles 0.97

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 4th gen (c7a)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 46361 cycles 45310 cycles 1.02
ML-DSA-44 sign 197761 cycles 198143 cycles 1.00
ML-DSA-44 verify 59051 cycles 62984 cycles 0.94
ML-DSA-65 keypair 76750 cycles 76871 cycles 1.00
ML-DSA-65 sign 306116 cycles 325750 cycles 0.94
ML-DSA-65 verify 87186 cycles 96266 cycles 0.91
ML-DSA-87 keypair 114996 cycles 116674 cycles 0.99
ML-DSA-87 sign 354465 cycles 366154 cycles 0.97
ML-DSA-87 verify 129885 cycles 139952 cycles 0.93

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Intel Xeon 3rd gen (c6i) (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 157050 cycles 159577 cycles 0.98
ML-DSA-44 sign 563354 cycles 574753 cycles 0.98
ML-DSA-44 verify 169123 cycles 175203 cycles 0.97
ML-DSA-65 keypair 269997 cycles 272720 cycles 0.99
ML-DSA-65 sign 929518 cycles 945985 cycles 0.98
ML-DSA-65 verify 274528 cycles 283415 cycles 0.97
ML-DSA-87 keypair 449825 cycles 453069 cycles 0.99
ML-DSA-87 sign 1176593 cycles 1192316 cycles 0.99
ML-DSA-87 verify 459136 cycles 469703 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton4

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 71522 cycles 73300 cycles 0.98
ML-DSA-44 sign 273874 cycles 282486 cycles 0.97
ML-DSA-44 verify 82240 cycles 87271 cycles 0.94
ML-DSA-65 keypair 126531 cycles 129003 cycles 0.98
ML-DSA-65 sign 449807 cycles 461243 cycles 0.98
ML-DSA-65 verify 133375 cycles 139667 cycles 0.95
ML-DSA-87 keypair 205265 cycles 208392 cycles 0.98
ML-DSA-87 sign 553986 cycles 565901 cycles 0.98
ML-DSA-87 verify 215759 cycles 223269 cycles 0.97

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 3rd gen (c6a) (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 135508 cycles 136345 cycles 0.99
ML-DSA-44 sign 542398 cycles 554700 cycles 0.98
ML-DSA-44 verify 148663 cycles 155242 cycles 0.96
ML-DSA-65 keypair 227809 cycles 227981 cycles 1.00
ML-DSA-65 sign 874643 cycles 893854 cycles 0.98
ML-DSA-65 verify 236078 cycles 243954 cycles 0.97
ML-DSA-87 keypair 374966 cycles 376878 cycles 0.99
ML-DSA-87 sign 1098058 cycles 1115644 cycles 0.98
ML-DSA-87 verify 386905 cycles 398895 cycles 0.97

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AMD EPYC 4th gen (c7a) (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 120986 cycles 121632 cycles 0.99
ML-DSA-44 sign 451932 cycles 461721 cycles 0.98
ML-DSA-44 verify 131661 cycles 137157 cycles 0.96
ML-DSA-65 keypair 205657 cycles 205812 cycles 1.00
ML-DSA-65 sign 740072 cycles 752983 cycles 0.98
ML-DSA-65 verify 211066 cycles 216831 cycles 0.97
ML-DSA-87 keypair 339147 cycles 339530 cycles 1.00
ML-DSA-87 sign 942332 cycles 956240 cycles 0.99
ML-DSA-87 verify 348162 cycles 357463 cycles 0.97

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton2

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 119130 cycles 120799 cycles 0.99
ML-DSA-44 sign 472589 cycles 488399 cycles 0.97
ML-DSA-44 verify 136811 cycles 145989 cycles 0.94
ML-DSA-65 keypair 204918 cycles 208066 cycles 0.98
ML-DSA-65 sign 781470 cycles 805186 cycles 0.97
ML-DSA-65 verify 219277 cycles 232525 cycles 0.94
ML-DSA-87 keypair 334609 cycles 336894 cycles 0.99
ML-DSA-87 sign 962569 cycles 985538 cycles 0.98
ML-DSA-87 verify 354138 cycles 370175 cycles 0.96

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton3

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 76608 cycles 77954 cycles 0.98
ML-DSA-44 sign 293501 cycles 304386 cycles 0.96
ML-DSA-44 verify 89487 cycles 95538 cycles 0.94
ML-DSA-65 keypair 133497 cycles 135025 cycles 0.99
ML-DSA-65 sign 482326 cycles 496830 cycles 0.97
ML-DSA-65 verify 143934 cycles 151258 cycles 0.95
ML-DSA-87 keypair 215860 cycles 217588 cycles 0.99
ML-DSA-87 sign 594391 cycles 607209 cycles 0.98
ML-DSA-87 verify 230110 cycles 239670 cycles 0.96

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton4 (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 133082 cycles 134660 cycles 0.99
ML-DSA-44 sign 498260 cycles 506939 cycles 0.98
ML-DSA-44 verify 144896 cycles 149574 cycles 0.97
ML-DSA-65 keypair 226122 cycles 228568 cycles 0.99
ML-DSA-65 sign 813917 cycles 824817 cycles 0.99
ML-DSA-65 verify 231233 cycles 237103 cycles 0.98
ML-DSA-87 keypair 374601 cycles 377328 cycles 0.99
ML-DSA-87 sign 1019872 cycles 1030998 cycles 0.99
ML-DSA-87 verify 383530 cycles 390919 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton3 (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 138666 cycles 139731 cycles 0.99
ML-DSA-44 sign 495052 cycles 505692 cycles 0.98
ML-DSA-44 verify 148928 cycles 154387 cycles 0.96
ML-DSA-65 keypair 242421 cycles 245185 cycles 0.99
ML-DSA-65 sign 810192 cycles 823206 cycles 0.98
ML-DSA-65 verify 241091 cycles 248706 cycles 0.97
ML-DSA-87 keypair 396138 cycles 397296 cycles 1.00
ML-DSA-87 sign 1031110 cycles 1044080 cycles 0.99
ML-DSA-87 verify 402422 cycles 411301 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@oqs-bot oqs-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Graviton2 (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 213471 cycles 215538 cycles 0.99
ML-DSA-44 sign 782770 cycles 809773 cycles 0.97
ML-DSA-44 verify 230256 cycles 239814 cycles 0.96
ML-DSA-65 keypair 381757 cycles 384029 cycles 0.99
ML-DSA-65 sign 1286038 cycles 1307857 cycles 0.98
ML-DSA-65 verify 373026 cycles 385278 cycles 0.97
ML-DSA-87 keypair 609835 cycles 611946 cycles 1.00
ML-DSA-87 sign 1645687 cycles 1665959 cycles 0.99
ML-DSA-87 verify 621822 cycles 637587 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SpacemiT K1 8 (Banana Pi F3) benchmarks (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 824555 cycles 828347 cycles 1.00
ML-DSA-44 sign 3322292 cycles 3361635 cycles 0.99
ML-DSA-44 verify 910081 cycles 931593 cycles 0.98
ML-DSA-65 keypair 1390153 cycles 1394523 cycles 1.00
ML-DSA-65 sign 5439685 cycles 5464549 cycles 1.00
ML-DSA-65 verify 1455949 cycles 1479842 cycles 0.98
ML-DSA-87 keypair 2307451 cycles 2308212 cycles 1.00
ML-DSA-87 sign 6876862 cycles 6902480 cycles 1.00
ML-DSA-87 verify 2410586 cycles 2435035 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A55 (Snapdragon 888) benchmarks (opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 293900 cycles 299048 cycles 0.98
ML-DSA-44 sign 1209347 cycles 1235706 cycles 0.98
ML-DSA-44 verify 343802 cycles 358136 cycles 0.96
ML-DSA-65 keypair 497019 cycles 506696 cycles 0.98
ML-DSA-65 sign 1941494 cycles 2024461 cycles 0.96
ML-DSA-65 verify 540122 cycles 561853 cycles 0.96
ML-DSA-87 keypair 850216 cycles 847773 cycles 1.00
ML-DSA-87 sign 2534477 cycles 2562883 cycles 0.99
ML-DSA-87 verify 902552 cycles 922665 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 466413 cycles 469956 cycles 0.99
ML-DSA-44 sign 2215549 cycles 2239780 cycles 0.99
ML-DSA-44 verify 550719 cycles 565745 cycles 0.97
ML-DSA-65 keypair 776107 cycles 782939 cycles 0.99
ML-DSA-65 sign 3632934 cycles 3681139 cycles 0.99
ML-DSA-65 verify 850930 cycles 868526 cycles 0.98
ML-DSA-87 keypair 1258536 cycles 1271149 cycles 0.99
ML-DSA-87 sign 4510629 cycles 4570247 cycles 0.99
ML-DSA-87 verify 1370939 cycles 1397775 cycles 0.98

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 241156 cycles 233930 cycles 1.03
ML-DSA-44 sign 813433 cycles 832225 cycles 0.98
ML-DSA-44 verify 255088 cycles 267150 cycles 0.95
ML-DSA-65 keypair 406266 cycles 409016 cycles 0.99
ML-DSA-65 sign 1339350 cycles 1339528 cycles 1.00
ML-DSA-65 verify 412574 cycles 424019 cycles 0.97
ML-DSA-87 keypair 691161 cycles 671470 cycles 1.03
ML-DSA-87 sign 1740077 cycles 1682075 cycles 1.03
ML-DSA-87 verify 687866 cycles 697557 cycles 0.99

This comment was automatically generated by workflow using github-action-benchmark.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 306953 cycles 314364 cycles 0.98
ML-DSA-44 sign 1214223 cycles 1243263 cycles 0.98
ML-DSA-44 verify 340116 cycles 357064 cycles 0.95
ML-DSA-65 keypair 557834 cycles 572673 cycles 0.97
ML-DSA-65 sign 1949763 cycles 2004422 cycles 0.97
ML-DSA-65 verify 530903 cycles 554343 cycles 0.96
ML-DSA-87 keypair 857966 cycles 881801 cycles 0.97
ML-DSA-87 sign 2440171 cycles 2515457 cycles 0.97
ML-DSA-87 verify 875816 cycles 915213 cycles 0.96

This comment was automatically generated by workflow using github-action-benchmark.

Copy link
Contributor

@mkannwischer mkannwischer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! Amazing speed ups!

I am happy with these changes now. Before we merge the commit history needs to be cleaned up though.

@hanno-becker, could you also have a brief look - particularly the CBMC proof changes?

Comment on lines 151 to 146
if (pos == r)
{
mld_keccakf1600_permute(s);
pos = 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a functional change, isn't it? We now permute when running with pos == r and outlen == 0, while previously, this would have been a no-op.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're correct. I will fix.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's also the point that the tests didn't catch this => it needs a test as well.

Copy link
Contributor

@hanno-becker hanno-becker Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a correctness issue, but one of consistency I think. Previously, it seems that pos == r was allowed and would lead to permutation and pos = 0 reset only upon the next squeeze. With the change -- and it is still the case after your latest change -- the squeeze logic instead permutes and resets to pos = 0 immediately when the block is full. Either is OK, but I think it should be consistent with finalize which does not permute and reset.

If possible, I would like to keep this uniform and in line with mlkem-native. Can you have a think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will issue a PR to mlkem-native as well, since it is also using an assignment loop instead of an accelerated memset.

Copy link

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.

Benchmark suite Current: bce8611 Previous: d0e5eba Ratio
ML-DSA-44 keypair 241156 cycles 233930 cycles 1.03
ML-DSA-87 sign 1740077 cycles 1682075 cycles 1.03

This comment was automatically generated by workflow using github-action-benchmark.

@@ -199,15 +120,15 @@ static unsigned int keccak_squeeze(uint8_t *out, size_t outlen,
unsigned int pos, unsigned int r)
__contract__(
requires((r == SHAKE128_RATE && pos <= SHAKE128_RATE) ||
(r == SHAKE256_RATE && pos <= SHAKE256_RATE))
(r == SHAKE256_RATE && pos <= SHAKE256_RATE) ||
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we restrict r in the first place?

Copy link
Contributor

@hanno-becker hanno-becker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I support the change per-se, and the speedup is great, thank you @bremoran.

There is one minor issue of consistency, though, with regards to when to permute and reset pos = 0 during squeezing. Previously, we would do this lazily only when we need to squeeze more. Now, we keep pos = r at finalize, but squeeze eagerly permutes and resets to pos = 0. If we can uphold the previous approach (in line with mlkem-native) while still riping the performance benefits, I'd prefer that.

Comment on lines 142 to 146
if ((outlen > 0) && (pos >= r))
{
mld_keccakf1600_permute(s);
pos = 0;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this case not be handled correctly by the first loop iteration already? (xor_bytes being a no-op)

pos = 0;
}

while (pos + bytes_to_go >= r)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps the previous behavior can be retained simply by using > here.

willieyz added a commit that referenced this pull request Sep 9, 2025
Signed-off-by: willieyz <willie.zhao@chelpis.com>
@bremoran
Copy link
Author

bremoran commented Sep 9, 2025

I support the change per-se, and the speedup is great, thank you @bremoran.

There is one minor issue of consistency, though, with regards to when to permute and reset pos = 0 during squeezing. Previously, we would do this lazily only when we need to squeeze more. Now, we keep pos = r at finalize, but squeeze eagerly permutes and resets to pos = 0. If we can uphold the previous approach (in line with mlkem-native) while still riping the performance benefits, I'd prefer that.

Unfortunately mlkem-native and mldsa-native are already quite inconsistent. In fact, the API we're discussing doesn't actually exist in mlkem-native.
mldsa-native provides shake128_squeeze, while mlkem-native provides shake128_squeezeblocks.
Thus, the API used by mldsa-native's shake128_squeeze (keccak_squeeze) is not used by mlkem-native's shake128_squeezeblocks. It uses keccak_squeezeblocks instead.

I'm not sure what consistency you're aiming for here?

It might be worth attempting to realign mldsa-native with mlkem-native if this is the goal, however it will be a much deeper API review than this small PR.

@hanno-becker
Copy link
Contributor

I'm not sure what consistency you're aiming for here?

See above: After finalize, pos == r but you don't permute and reset to r = 0 immediately. But the logic in squeeze does that.

@bremoran
Copy link
Author

bremoran commented Sep 9, 2025

I'm not sure what consistency you're aiming for here?

See above: After finalize, pos == r but you don't permute and reset to r = 0 immediately. But the logic in squeeze does that.

I think you may have missed my point. This code does not exist in mlkem-native. That is the source of the "inconsistency." It's rather difficult to be consistent with code that doesn't exist.

The closest code in mlkem-native is:

static void mlk_keccak_squeezeblocks(uint8_t *h, size_t nblocks, uint64_t *s,
                                     uint32_t r)

Note that this does not track pos.

A second implementation is also quite close.

static void mlk_keccak_squeeze_once(uint8_t *h, size_t outlen, uint64_t *s,
                                    uint32_t r)

This also does not track pos.

Of course the mldsa-native code is "inconsistent." There is no equivalent mlkem-native code.

@bremoran
Copy link
Author

bremoran commented Sep 9, 2025

@hanno-becker hopefully this change is more what you had in mind.

Squeeze & Absorb currently use plain xor & extract in C. This is fine for
64-bit scalar Keccak, but it doesn't work well for 32-bit architectures,
since performance demands bit interleaving. As a result, keccak_absorb and
keccak_squeeze should use the existing mld_keccakf1600_xor_bytes and
mld_keccakf1600_extract_bytes.

This commit refactors Squeeze & Absorb to use mld_keccakf1600_xor_bytes and mld_keccakf1600_extract_bytes instead of ad-hoc C code for accessing the state.

Co-authored-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
Signed-off-by: Brendan Moran <brendan.moran@arm.com>
Signed-off-by: Matthias J. Kannwischer <matthias@kannwischer.eu>
Signed-off-by: Brendan Moran <brendan.moran@arm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants