scrypt: SSE2/simd128 RoMix data layout optimization #622
Signed-off-by: eternal-flame-AD <yume@yumechi.jp>
|
I tested on: |
|
Sidebar: huh interesting, I wasn't aware of that wasmtime target but it's very cool you can run WASM benchmarks from the CLI like that. I assume tests work too? If so we should make use of that. |
|
@tarcieri it should be just:

    [target.wasm32-wasip1]
    runner = "wasmtime"

Then you can definitely set it up for CI (in another PR probably). |
Note that we usually prefer the module.rs style instead of module/mod.rs.
|
I am benchmarking on real A64 hardware and noticed something wrong with the Arm NEON/ASIMD performance: almost +100% runtime compared to soft on both a Raspberry Pi and my phone. I can't immediately see what's wrong with it, except that the loads/stores are unaligned (which shouldn't be that bad); trying some equivalent ARX and load/store sequences didn't fix it either. The assembly looks correct as well, with no obvious reason why it should be that slow. Unless I am missing something, it is probably just a bad interaction with the hardware and we ought to just remove it, bad luck. The current A64 assembly: https://gist.github.com/eternal-flame-AD/540d80d33e1ac596740744fe8cd6c18f If someone has a MacBook with Apple Silicon the results might be different. I suspect the ASIMD instruction latency (usually 2x-4x the SSE2 equivalent) is too high for deliberately serial algorithms like this. |
|
Hi, just checking on the status of this PR. Is there any intention to proceed? |
|
@eternal-flame-AD it looks interesting, there's just a lot on our plate right now |
|
Okay, no worries, just making sure it isn't lost. I will check back in a month if I don't hear back. |
|
@eternal-flame-AD in the meantime can you rebase the PR/merge |
b8a3705 to 1deeceb
46aec91 to 304da78
|
Sorry for the delay reviewing this. We just had another PR that popped up which uses (this PR is slightly less trivial due to its comparative size and use of intrinsics) |
|
Hi, I think algorithmically this should fit right in, as conceptually it is just loop code motion. But if you are asking for advice on how to do multithreading for scrypt in general, there are a couple of approaches/considerations that might work better:
|
|
@eternal-flame-AD thanks for the input! I went ahead and merged #733 if you don't mind rebasing/merging master.
Oh interesting, I see you have this now: https://github.com/eternal-flame-AD/scrypt-opt/blob/main/src/salsa20/x86_64.rs Would you be interested in contributing that to https://github.com/rustcrypto/stream-ciphers? It would also be useful for |
|
Hi, sure, I can take a look this weekend. Just to be clear, do you mean a backend for the existing generic API, or a separate API for this kind of alternate data layout operation? The former might not be trivially profitable without AVX2 permutation assist, even for a single buffer, for the reduced-round variants. |
|
@eternal-flame-AD the main motivation I'd be interested in would be accelerating SIMD support for the generic API would be gravy, provided it actually had an advantage. |
|
Thanks, I will prioritize just porting an alternate layout API then. |
Signed-off-by: eternal-flame-AD <yume@yumechi.jp>
Prearrange data into 128-bit lanes so we don't have to transpose back and forth in the BlockMix Salsa20 kernel on SSE2.
The permute constants are the same as https://github.com/RustCrypto/stream-ciphers/blob/07ee501ac9067abe0679a596aa771a575baec68e/salsa20/src/backends/soft.rs#L54-L57 read column-wise.
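To illustrate the idea, here is a minimal scalar sketch in plain Rust (no intrinsics, and not the code in this PR). It assumes the linked lines are the four column-round `quarter_round` calls with the standard Salsa20 indices, so that reading them column-wise gives the lanes A = (0, 5, 10, 15), B = (4, 9, 14, 3), C = (8, 13, 2, 7), D = (12, 1, 6, 11). All names here (`PERM`, `Lane`, `to_lanes`, `from_lanes`, `salsa20_8_std`, `salsa20_8_lanes`) are hypothetical and only serve the illustration.

```rust
/// Assumed lane permutation: column-round indices read column-wise.
const PERM: [usize; 16] = [0, 5, 10, 15, 4, 9, 14, 3, 8, 13, 2, 7, 12, 1, 6, 11];

/// Stand-in for one 128-bit SIMD register holding four 32-bit words.
type Lane = [u32; 4];

/// Scalar Salsa20 quarter round on the standard word order.
fn qr(x: &mut [u32; 16], a: usize, b: usize, c: usize, d: usize) {
    x[b] ^= x[a].wrapping_add(x[d]).rotate_left(7);
    x[c] ^= x[b].wrapping_add(x[a]).rotate_left(9);
    x[d] ^= x[c].wrapping_add(x[b]).rotate_left(13);
    x[a] ^= x[d].wrapping_add(x[c]).rotate_left(18);
}

/// Reference Salsa20/8 core (4 double rounds plus feed-forward add).
fn salsa20_8_std(x: &mut [u32; 16]) {
    let input = *x;
    for _ in 0..4 {
        // Column round.
        qr(x, 0, 4, 8, 12); qr(x, 5, 9, 13, 1); qr(x, 10, 14, 2, 6); qr(x, 15, 3, 7, 11);
        // Row round.
        qr(x, 0, 1, 2, 3); qr(x, 5, 6, 7, 4); qr(x, 10, 11, 8, 9); qr(x, 15, 12, 13, 14);
    }
    for (o, i) in x.iter_mut().zip(input) {
        *o = o.wrapping_add(i);
    }
}

/// Four quarter rounds at once, one per lane element.
fn qr_lanes(a: &mut Lane, b: &mut Lane, c: &mut Lane, d: &mut Lane) {
    for i in 0..4 {
        b[i] ^= a[i].wrapping_add(d[i]).rotate_left(7);
        c[i] ^= b[i].wrapping_add(a[i]).rotate_left(9);
        d[i] ^= c[i].wrapping_add(b[i]).rotate_left(13);
        a[i] ^= d[i].wrapping_add(c[i]).rotate_left(18);
    }
}

/// Salsa20/8 core operating directly on the pre-arranged lanes.
fn salsa20_8_lanes(l: &mut [Lane; 4]) {
    let input = *l;
    {
        let [a, b, c, d] = &mut *l;
        for _ in 0..4 {
            // Column round: lanes are already lined up.
            qr_lanes(a, b, c, d);
            // Row round: rotate within lanes (a shuffle on real SIMD),
            // then B and D swap roles.
            d.rotate_left(1); c.rotate_left(2); b.rotate_left(3);
            qr_lanes(a, d, c, b);
            // Undo the rotations to get back to the column layout.
            d.rotate_left(3); c.rotate_left(2); b.rotate_left(1);
        }
    }
    // Feed-forward is element-wise, so it needs no un-permute.
    for (lane, inp) in l.iter_mut().zip(input) {
        for (o, i) in lane.iter_mut().zip(inp) {
            *o = o.wrapping_add(i);
        }
    }
}

/// Permute once on entry to the lane layout.
fn to_lanes(x: &[u32; 16]) -> [Lane; 4] {
    let mut l = [[0u32; 4]; 4];
    for (i, &src) in PERM.iter().enumerate() {
        l[i / 4][i % 4] = x[src];
    }
    l
}

/// Permute once on exit back to the standard word order.
fn from_lanes(l: &[Lane; 4]) -> [u32; 16] {
    let mut x = [0u32; 16];
    for (i, &dst) in PERM.iter().enumerate() {
        x[dst] = l[i / 4][i % 4];
    }
    x
}
```

Because the XOR and feed-forward steps are element-wise, a buffer converted with `to_lanes` can stay in that layout across many `salsa20_8_lanes` calls and only needs `from_lanes` once at the end, which is the saving this kind of pre-arrangement is after.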
After:
Before:
Picked from my own performance-oriented implementation: https://github.com/eternal-flame-AD/scrypt-opt