Implement off-heap scoring for OSQ 4, 7, and 8 bit representations #15257
base: main
Conversation
@mccullocht maybe we only do the byte part of the comparisons off-heap, then apply the corrections all on heap? I would assume applying corrections is pretty cheap, and even then, if we did it in bulk, on-heap bulk correction application might be fast enough.
I did move just the vector dot product off-heap, and I'm not planning to do anything clever with the corrections. I'm not sure that would pay off anyway -- you'd have to transpose from a row view to a column view to parallelize that work, and it would be 128-bit on x86, which may not go well. I was assuming that accessing the corrective terms was hurting performance, but larger JFR stacks point at a more mysterious culprit. This PR spends more time in lane reduction (???) and in 128-bit loads of data from the memory segment (probably memory latency). For the latter case it's weird that I don't see anything on the baseline, when I know that copying to the heap should be inducing a similar hit.
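For illustration, here is a minimal sketch of the split being discussed: only the quantized dot product touches the off-heap bytes, and the corrective terms are folded in on heap with a handful of scalar operations. It assumes each component is reconstructed as `lower + width * q_i`, so the exact terms may differ from Lucene's actual OSQ corrections; the record and method names are hypothetical.

```java
// Hypothetical shape of the on-heap correction step, assuming each component
// of a quantized vector is reconstructed as lower + width * q_i. This is a
// sketch of the idea in the comment above, not Lucene's actual OSQ math.
record Corrections(float lower, float width, int sum) {} // sum = sum of quantized components

static float correctedDot(int quantizedDot, int dims, Corrections x, Corrections y) {
  // dot(x, y) ~= sum_i (lx + wx*qx_i) * (ly + wy*qy_i)
  //            = dims*lx*ly + lx*wy*sum(qy) + ly*wx*sum(qx) + wx*wy*sum(qx_i*qy_i)
  // Only the last term needs the off-heap bytes; the rest is cheap scalar math.
  return dims * x.lower() * y.lower()
      + x.lower() * y.width() * y.sum()
      + y.lower() * x.width() * x.sum()
      + x.width() * y.width() * quantizedDot;
}
```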
@mccullocht you might find this interesting: #15272
I plan to try the simplest thing first and just copy the dot product code for byte[] x MemorySegment to see if that yields an improvement, then go from there. I had hoped the JVM would monomorphize these calls, but apparently not.
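For reference, a minimal sketch of what such a dedicated kernel could look like with the Panama Vector API: a byte[] query against a MemorySegment document vector, widening through shorts to ints and reducing lanes only once at the end. The class and method names are made up here, and this is not Lucene's actual implementation.

```java
// Sketch of a dedicated byte[] x MemorySegment dot product (hypothetical names;
// requires --add-modules jdk.incubator.vector). Fixed 64-bit byte lanes widen
// cleanly: 8 bytes -> 8 shorts (128-bit) -> 8 ints (256-bit).
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.IntVector;
import jdk.incubator.vector.ShortVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class OffHeapDot {
  private static final VectorSpecies<Byte> BYTE_64 = ByteVector.SPECIES_64;

  static int dotProduct(byte[] q, MemorySegment doc, long off) {
    IntVector acc = IntVector.zero(IntVector.SPECIES_256);
    int i = 0;
    for (int bound = BYTE_64.loopBound(q.length); i < bound; i += BYTE_64.length()) {
      ByteVector a = ByteVector.fromArray(BYTE_64, q, i);
      ByteVector b = ByteVector.fromMemorySegment(BYTE_64, doc, off + i, ByteOrder.LITTLE_ENDIAN);
      // Widen to 16 bits for the multiply (byte*byte always fits in a short)...
      ShortVector a16 = (ShortVector) a.convertShape(VectorOperators.B2S, ShortVector.SPECIES_128, 0);
      ShortVector b16 = (ShortVector) b.convertShape(VectorOperators.B2S, ShortVector.SPECIES_128, 0);
      // ...then to 32 bits for accumulation, deferring reduceLanes() to the end.
      acc = acc.add(a16.mul(b16).convertShape(VectorOperators.S2I, IntVector.SPECIES_256, 0));
    }
    int res = acc.reduceLanes(VectorOperators.ADD);
    for (; i < q.length; i++) { // scalar tail
      res += q[i] * doc.get(ValueLayout.JAVA_BYTE, off + i);
    }
    return res;
  }
}
```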
Force-pushed from 9e1d76e to e7178bc
OK, I repeated the experiment with a dedicated byte[] x MemorySegment implementation. In the luceneutil benchmarks I'm not suffering from the same inlining/pollution issues. The profiles remain the same, where suddenly …
Partial implementation of #15155
So far this is not any faster than the alternative. On an AMD Ryzen AI Max+ 395 …
DO NOT MERGE
Performance observations: on an AVX-512 host the profiles are quite different. The original path spends most of its time in dotProductBody512, followed by Int512Vector.reduceLanes(). The new path spends much more time in reduceLanes(), but also spends more time loading the input vectors: a 128-bit load from a memory segment instead of from a heap array. This could be memory latency, but in that case, why doesn't the load into the heap array show up in the baseline profile?
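For contrast, a rough sketch of the baseline path that last question refers to: copy the off-heap vector into a reusable on-heap scratch buffer, then run the existing heap-array kernel, so every load in the hot scoring loop hits the heap. The class here is hypothetical; VectorUtil.dotProduct(byte[], byte[]) is Lucene's existing heap-array dot product.

```java
// Hypothetical copy-to-heap baseline: one bulk copy per document vector, then
// the hot loop only touches heap arrays (which is why a similar memory-latency
// hit might be expected to show up here too, just moved into the copy).
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import org.apache.lucene.util.VectorUtil;

final class CopyToHeapScorer {
  private final byte[] scratch; // reused across documents to avoid allocation

  CopyToHeapScorer(int dims) {
    this.scratch = new byte[dims];
  }

  int dotProduct(byte[] query, MemorySegment doc, long off) {
    // Bulk-copy the off-heap bytes onto the heap...
    MemorySegment.copy(doc, ValueLayout.JAVA_BYTE, off, scratch, 0, scratch.length);
    // ...then score with the existing heap-array kernel.
    return VectorUtil.dotProduct(query, scratch);
  }
}
```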