scrypt: SSE2/simd128 RoMix data layout optimization #622

eternal-flame-AD · 2025-08-01T03:29:37Z

Pre-arranges the data into 128-bit lanes so that the BlockMix Salsa20 kernel does not have to transpose back and forth on SSE2.

The permute constants are the same as https://github.com/RustCrypto/stream-ciphers/blob/07ee501ac9067abe0679a596aa771a575baec68e/salsa20/src/backends/soft.rs#L54-L57 read column-wise.
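To make the layout concrete, here is a minimal sketch, not the code from this PR. It assumes the linked lines are the four quarter-round index tuples of the Salsa20 column round, (0,4,8,12), (5,9,13,1), (10,14,2,6), (15,3,7,11); the names `LAYOUT`, `to_lanes`, `from_lanes` and `xor_blocks` are hypothetical.

```rust
/// Assumed lane order: the column-round quarter-round index tuples read
/// column-wise. Each row below is one 128-bit lane of four u32 words.
const LAYOUT: [usize; 16] = [
    0, 5, 10, 15, // lane A: the "a" operand of each quarter-round
    4, 9, 14, 3,  // lane B: the "b" operands
    8, 13, 2, 7,  // lane C: the "c" operands
    12, 1, 6, 11, // lane D: the "d" operands
];

/// Permute one 16-word Salsa20 block from natural order into lane order.
fn to_lanes(block: &[u32; 16]) -> [u32; 16] {
    let mut out = [0u32; 16];
    for (dst, &src) in LAYOUT.iter().enumerate() {
        out[dst] = block[src];
    }
    out
}

/// Inverse permutation: lane order back to natural order.
fn from_lanes(lanes: &[u32; 16]) -> [u32; 16] {
    let mut out = [0u32; 16];
    for (src, &dst) in LAYOUT.iter().enumerate() {
        out[dst] = lanes[src];
    }
    out
}

/// Word-wise XOR of two blocks, which does not care about word order.
fn xor_blocks(a: &[u32; 16], b: &[u32; 16]) -> [u32; 16] {
    let mut out = [0u32; 16];
    for i in 0..16 {
        out[i] = a[i] ^ b[i];
    }
    out
}
```

Because XOR and the final feed-forward addition act word by word, they give the same result in either order, so under these assumptions the permutation only has to be applied when data enters and leaves RoMix rather than around every Salsa20 call.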

After:

> cargo bench 
test scrypt_15_8_1 ... bench: 180,070,625.10 ns/iter (+/- 4,549,929.06)
> RUSTFLAGS="-Ctarget-feature=+simd128" cargo bench --target wasm32-wasip1    
test scrypt_15_8_1 ... bench: 118,944,571.20 ns/iter (+/- 3,098,151.70)
> ssh cheap_vps cargo bench
test scrypt_15_8_1 ... bench: 304,886,161.00 ns/iter (+/- 6,625,867.19)

Before:

> cargo bench
test scrypt_15_8_1 ... bench: 230,760,302.00 ns/iter (+/- 8,838,571.54)
> RUSTFLAGS="-Ctarget-feature=+simd128" cargo bench --target wasm32-wasip1    
test scrypt_15_8_1 ... bench: 190,474,545.40 ns/iter (+/- 5,895,216.01)
> ssh cheap_vps cargo bench
test scrypt_15_8_1 ... bench: 409,880,353.40 ns/iter (+/- 17,629,444.54)

Picked from my own performance oriented implementation: https://github.com/eternal-flame-AD/scrypt-opt

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

eternal-flame-AD · 2025-08-01T12:46:27Z

I tested on wasm32-wasip1 (wasmtime), aarch64-unknown-linux-musl (QEMU) and x86_64, plus a unit test to make sure the new kernels yield the same result as the original code (now moved into block_mix::soft). So it should cover all the code paths I added.
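A rough idea of what such an equivalence check could look like is sketched below. This is not the test from the PR: it emulates 128-bit lanes with plain `[u32; 4]` arrays instead of intrinsics, all function names are hypothetical, and the lane order is assumed to be the Salsa20 column-round index tuples read column-wise.

```rust
/// Assumed lane order (see the assumption stated above).
const LAYOUT: [usize; 16] = [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11];

type Lane = [u32; 4];

/// One scalar Salsa20 quarter-round on a naturally ordered state.
fn qr(x: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    x[b] ^= x[a].wrapping_add(x[d]).rotate_left(7);
    x[c] ^= x[b].wrapping_add(x[a]).rotate_left(9);
    x[d] ^= x[c].wrapping_add(x[b]).rotate_left(13);
    x[a] ^= x[d].wrapping_add(x[c]).rotate_left(18);
}

/// Reference Salsa20/8 core (4 double rounds plus feed-forward), natural order.
fn salsa20_8_reference(input: &[u32; 16]) -> [u32; 16] {
    let mut x = *input;
    for _ in 0..4 {
        qr(&mut x, 0, 4, 8, 12);
        qr(&mut x, 5, 9, 13, 1);
        qr(&mut x, 10, 14, 2, 6);
        qr(&mut x, 15, 3, 7, 11);
        qr(&mut x, 0, 1, 2, 3);
        qr(&mut x, 5, 6, 7, 4);
        qr(&mut x, 10, 11, 8, 9);
        qr(&mut x, 15, 12, 13, 14);
    }
    for i in 0..16 {
        x[i] = x[i].wrapping_add(input[i]);
    }
    x
}

/// Rotate the four words of a lane left by `n` positions.
fn rot(v: Lane, n: usize) -> Lane {
    [v[n % 4], v[(n + 1) % 4], v[(n + 2) % 4], v[(n + 3) % 4]]
}

/// Four quarter-rounds at once, one per lane position.
fn qr_lanes(a: &mut Lane, b: &mut Lane, c: &mut Lane, d: &mut Lane) {
    for i in 0..4 {
        b[i] ^= a[i].wrapping_add(d[i]).rotate_left(7);
        c[i] ^= b[i].wrapping_add(a[i]).rotate_left(9);
        d[i] ^= c[i].wrapping_add(b[i]).rotate_left(13);
        a[i] ^= d[i].wrapping_add(c[i]).rotate_left(18);
    }
}

/// Salsa20/8 core whose input and output are both in the assumed lane order.
fn salsa20_8_preshuffled(input: &[u32; 16]) -> [u32; 16] {
    let lane = |k: usize| -> Lane { [input[4 * k], input[4 * k + 1], input[4 * k + 2], input[4 * k + 3]] };
    let (mut a, mut b, mut c, mut d) = (lane(0), lane(1), lane(2), lane(3));
    for _ in 0..4 {
        // Column round: the lanes already hold the right operands.
        qr_lanes(&mut a, &mut b, &mut c, &mut d);
        // Row round: rotate within lanes so row neighbours line up.
        let (mut rb, mut rc, mut rd) = (rot(d, 1), rot(c, 2), rot(b, 3));
        qr_lanes(&mut a, &mut rb, &mut rc, &mut rd);
        // Undo the rotations to return to the lane order.
        d = rot(rb, 3);
        c = rot(rc, 2);
        b = rot(rd, 1);
    }
    let mut out = [0u32; 16];
    for (k, l) in [a, b, c, d].iter().enumerate() {
        for i in 0..4 {
            out[4 * k + i] = l[i].wrapping_add(input[4 * k + i]);
        }
    }
    out
}

/// Permute a naturally ordered block into the assumed lane order.
fn to_preshuffled(block: &[u32; 16]) -> [u32; 16] {
    core::array::from_fn(|i| block[LAYOUT[i]])
}
```

A test would then feed the same block to both functions and compare `salsa20_8_preshuffled(&to_preshuffled(&block))` with `to_preshuffled(&salsa20_8_reference(&block))`. In a real SSE2 kernel the `rot` helper would correspond to a word shuffle such as `_mm_shuffle_epi32`.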

tarcieri · 2025-08-01T12:48:40Z

Sidebar: huh interesting, I wasn't aware of that wasmtime target but it's very cool you can run WASM benchmarks from the CLI like that. I assume tests work too? If so we should make use of that.

eternal-flame-AD · 2025-08-01T12:53:30Z

@tarcieri it should be just:

[target.wasm32-wasip1]
runner = "wasmtime"

Then cargo test and cargo bench just work. Setting CARGO_TARGET_WASM32_WASIP1_RUNNER=wasmtime has the same effect.

You can definitely set it up for CI (in another PR probably).

newpavlov

Note that we usually prefer the module.rs style instead of module/mod.rs.

scrypt/src/block_mix/mod.rs

eternal-flame-AD · 2025-08-01T19:15:06Z

I am benchmarking on real A64 hardware and noticed that something is wrong with the Arm NEON/asimd performance: almost +100% runtime compared to soft on both a Raspberry Pi and my phone.

I can't immediately see what's wrong with it, except that the loads/stores are unaligned (which shouldn't be that bad). I tried some equivalent ARX and load/store sequences, and that didn't fix it either.

The assembly looks correct as well, and there is no obvious reason why it should be that slow. Unless I am missing something, it is probably just a bad interaction with the hardware and we ought to just remove it. Bad luck.

The current A64 assembly: https://gist.github.com/eternal-flame-AD/540d80d33e1ac596740744fe8cd6c18f

If someone has a MacBook with Apple Silicon, the results might be different. I suspect the ASIMD instruction latency (usually 2x-4x the SSE2 equivalent) is too high for deliberately serial algorithms like this.

eternal-flame-AD · 2025-09-12T00:02:50Z

Hi, just checking on the status of this PR. Is there any intention to proceed?

tarcieri · 2025-09-12T00:31:07Z

@eternal-flame-AD it looks interesting, there's just a lot on our plate right now

eternal-flame-AD · 2025-09-12T00:36:02Z

Okay no worries, just making sure it isn't lost.

I will check back in a month if I don't hear back.

tarcieri · 2025-09-12T00:46:27Z

@eternal-flame-AD in the meantime can you rebase the PR/merge master?

tarcieri · 2025-10-28T14:27:34Z

Sorry for the delay in reviewing this. Another, fairly small PR just popped up that uses rayon to implement parallelization, and I was curious whether you had an opinion on how it would interact with this one before I merge it: #733

(this PR is slightly less trivial due to its comparative size and use of intrinsics)

eternal-flame-AD · 2025-10-28T14:54:43Z

Hi, I think algorithmically this should just fit right in, since conceptually it is just loop code motion. But if you are asking for advice on how to do multithreading for scrypt in general, there are a couple of considerations and approaches that might be more optimal:

- Since the PR you mentioned adds the rayon feature to the defaults, it can cause memory allocation to suddenly multiply in downstream code. It might be desirable to keep it out of the defaults unless there is a parallelism limit.
- There is indeed a more efficient way to do P>1 than just spawning one thread per chunk: pipeline the front sequential half of the previous chunk with the random half of the next chunk, and use AVX2 for dual-buffer Salsa20. Whether it is profitable for a general-purpose library depends on your take on the library's complexity limit. You would have to structure the code like this, which is very different from the current rayon approach: https://github.com/eternal-flame-AD/scrypt-opt/blob/main/examples/large_p.rs
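
The memory concern in the first point can be put into numbers with a small sketch. It assumes that every chunk processed concurrently owns its own RoMix scratch array V of N blocks of 128 * r bytes, as defined in RFC 7914; whether the rayon PR allocates exactly this way is not confirmed here, and `concurrent` is a hypothetical knob rather than a parameter of the crate.

```rust
/// Bytes of RoMix scratch space (the V array) for one chunk:
/// N blocks of 128 * r bytes, with N = 2^log_n.
fn romix_scratch_bytes(log_n: u32, r: u64) -> u64 {
    128 * r * (1u64 << log_n)
}

/// Peak scratch memory if `concurrent` chunks run at the same time and each
/// one holds a private V. `concurrent` is an illustrative, hypothetical knob.
fn peak_scratch_bytes(log_n: u32, r: u64, concurrent: u64) -> u64 {
    romix_scratch_bytes(log_n, r) * concurrent
}
```

For the benchmark parameters used in this PR (log_n = 15, r = 8), one chunk needs 32 MiB of scratch space, so four chunks in flight would need 128 MiB under this assumption.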

tarcieri · 2025-10-28T17:07:17Z

@eternal-flame-AD thanks for the input! I went ahead and merged #733 if you don't mind rebasing/merging master.

> pipeline the front sequential half of the previous chunk with the random half of the next chunk and use AVX2 for dual-buffer Salsa20

Oh interesting, I see you have this now: https://github.com/eternal-flame-AD/scrypt-opt/blob/main/src/salsa20/x86_64.rs

Would you be interested in contributing that to https://github.com/rustcrypto/stream-ciphers? It would also be useful for yescrypt

eternal-flame-AD · 2025-10-28T18:25:45Z

Hi, sure, I can take a look this weekend. Just to be clear, do you mean a backend for the existing generic API, or a separate API for this kind of alternate data layout operation?

The former might not be trivially profitable without AVX2 permutation assistance, even for a single buffer, for the reduced-round variants.

tarcieri · 2025-10-28T18:39:56Z

@eternal-flame-AD the main motivation I'd be interested in would be accelerating yescrypt, so if this data layout works for that application (I think it should?) that would be fine.

SIMD support for the generic API would be gravy, provided it actually had an advantage.

eternal-flame-AD · 2025-10-28T19:05:04Z

Thanks, I will prioritize just porting an alternate layout API then.

eternal-flame-AD added 4 commits July 31, 2025 22:17

scrypt: sse2 RoMix optimization

9a017e9

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

use different function names for the preshuffled version

e991eb9

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

wasm32 kernel

c65ad4e

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

fix unused warning

804c622

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

eternal-flame-AD changed the title ~~scrypt: SSE2 RoMix data layout optimization~~ scrypt: SSE2/simd128 RoMix data layout optimization Aug 1, 2025

move arch-dependent impls to block_mix module

bae8e98

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

newpavlov reviewed Aug 1, 2025

eternal-flame-AD added 2 commits August 1, 2025 08:06

apply suggestions

94bef2f

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

neon: use multiple-register load/store

c05a6d4

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

remove neon backend for performance regression

aea4297

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

eternal-flame-AD requested a review from newpavlov September 12, 2025 00:04

Merge branch 'master' into scrypt-sse2

1deeceb

eternal-flame-AD force-pushed the scrypt-sse2 branch from b8a3705 to 1deeceb Compare September 12, 2025 00:49

fixup an inaccurate comment

304da78

eternal-flame-AD force-pushed the scrypt-sse2 branch from 46aec91 to 304da78 Compare September 12, 2025 01:04

Merge branch 'master' into scrypt-sse2

46d6a01

Signed-off-by: eternal-flame-AD <yume@yumechi.jp>

eternal-flame-AD mentioned this pull request Oct 29, 2025

WIP: Alternate data layout Salsa20 for (ye)scrypt RustCrypto/stream-ciphers#473

Draft
