prover: better parallelization strategy for poseidon table #62
base: main
Conversation
Do you have an intuition on why:

```rust
pub fn all_poseidon_16_indexes(poseidons_16: &[WitnessPoseidon16]) -> [Vec<F>; 3] {
    let ((addr_a, addr_b), addr_c) = rayon::join(
        || {
            rayon::join(
                || {
                    poseidons_16
                        .par_iter()
                        .map(|p| F::from_usize(p.addr_input_a))
                        .collect()
                },
                || {
                    poseidons_16
                        .par_iter()
                        .map(|p| F::from_usize(p.addr_input_b))
                        .collect()
                },
            )
        },
        || {
            poseidons_16
                .par_iter()
                .map(|p| F::from_usize(p.addr_output))
                .collect()
        },
    );
    [addr_a, addr_b, addr_c]
}
```

would be faster than:

```rust
pub fn all_poseidon_16_indexes(poseidons_16: &[WitnessPoseidon16]) -> [Vec<F>; 3] {
    [
        poseidons_16
            .par_iter()
            .map(|p| F::from_usize(p.addr_input_a))
            .collect::<Vec<_>>(),
        poseidons_16
            .par_iter()
            .map(|p| F::from_usize(p.addr_input_b))
            .collect::<Vec<_>>(),
        poseidons_16
            .par_iter()
            .map(|p| F::from_usize(p.addr_output))
            .collect::<Vec<_>>(),
    ]
}
```

I am not a rayon expert, but my basic understanding tells me it should not improve perf?
In the example you point to,
Sometimes, indicating to the compiler that this part:

```rust
poseidons_16
    .par_iter()
    .map(|p| F::from_usize(p.addr_input_a))
    .collect()
```

can be executed in parallel with this part:

```rust
poseidons_16
    .par_iter()
    .map(|p| F::from_usize(p.addr_input_b))
    .collect()
```

which can also be executed in parallel with:

```rust
poseidons_16
    .par_iter()
    .map(|p| F::from_usize(p.addr_output))
    .collect()
```

can give a better repartition of the work between threads, but this comes from my experience experimenting with these things in other repos. I guess this is not always the absolute truth. If in this case it brings nothing to the performance, then it is not worth merging, since it adds a bit more lines/complexity to the code. Roughly, the proposed strategy is the same for the other functions.
Hmm, ok. I will run some benchmarks when I find some time; making this PR wait should not cause conflict issues, I think.
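For a quick comparison, a throwaway timing harness like this is usually enough before reaching for a full benchmark suite (a hypothetical sketch, not from the PR; the two closures would be replaced by the two `all_poseidon_16_indexes` variants over real witness data):

```rust
use std::time::Instant;

// Hypothetical micro-benchmark helper: run a closure once, print its wall-clock
// time, and return its result so both variants can be compared for equality.
fn timed<T>(label: &str, f: impl FnOnce() -> T) -> T {
    let start = Instant::now();
    let out = f();
    println!("{label}: {} us", start.elapsed().as_micros());
    out
}

fn main() {
    let data: Vec<usize> = (0..1_000_000).collect();
    // Stand-ins for the fork-join and sequential variants; both must agree.
    let a = timed("fork-join variant", || data.iter().copied().collect::<Vec<_>>());
    let b = timed("sequential variant", || data.iter().copied().collect::<Vec<_>>());
    assert_eq!(a, b);
}
```

A single run is only indicative; for a merge decision it would be worth repeating each variant several times (or using a proper harness such as criterion) so pool warm-up and allocator noise do not dominate.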
Yes, no problem, this is an independent file, so we should be able to play with it without any interaction.
Force-pushed from 6283fa4 to 4b0ae0a.
@TomWambsgans Let me clean this description now that I have a clearer view of where this thing is used.
Using this kind of join parallelization gave us some benefits for some specific scenarios in the past.
For example this sort of trick is used here: https://github.com/WizardOfMenlo/whir/blob/22c675807fc9295fef68a11945713dc3e184e1c1/src/ntt/transpose.rs#L148-L161
The AI seemed to validate this idea by inspecting the prover execution (I just ran Claude Code with this proposed idea to validate it).
Let me know if you see any sort of benefit when compared to your traditional benchmarks.