Performance regression in vector computations due to call site pollution #15272
- Pollute VectorScorerBenchmark with on-heap dot products
- Add benchmark function with compiler directives that work around the pollution
We use the same underlying function for a variety of combinations of on-heap and off-heap vectors, which may lead to its call site being polluted and the compiled functions being sub-optimal.
To demonstrate this, I added some type pollution to the benchmark.
Benchmark without pollution:
Benchmark with pollution:
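To make the idea of call site pollution concrete, here is a minimal sketch in plain Java. It is not the benchmark from this PR: `ByteSource`, `HeapSource`, `DirectSource`, and `WrappedSource` are hypothetical stand-ins for the on-heap and off-heap vector representations. The shared `dot` method is called with several receiver types, so the `get` call sites inside its loop see a mixed type profile. How the JIT actually compiles such a site depends on the JVM version and the observed profile.

```java
import java.nio.ByteBuffer;

public class CallSitePollutionDemo {
    /** Hypothetical accessor interface; a stand-in, not Lucene's API. */
    interface ByteSource {
        byte get(int i);
        int length();
    }

    /** On-heap representation backed by a byte[]. */
    static final class HeapSource implements ByteSource {
        private final byte[] data;
        HeapSource(byte[] data) { this.data = data; }
        public byte get(int i) { return data[i]; }
        public int length() { return data.length; }
    }

    /** Off-heap representation backed by a direct ByteBuffer. */
    static final class DirectSource implements ByteSource {
        private final ByteBuffer data;
        DirectSource(byte[] src) {
            data = ByteBuffer.allocateDirect(src.length);
            data.put(src);
            data.flip();
        }
        public byte get(int i) { return data.get(i); }
        public int length() { return data.limit(); }
    }

    /** A third receiver type, backed by a heap ByteBuffer. */
    static final class WrappedSource implements ByteSource {
        private final ByteBuffer data;
        WrappedSource(byte[] src) { data = ByteBuffer.wrap(src); }
        public byte get(int i) { return data.get(i); }
        public int length() { return data.limit(); }
    }

    // Shared kernel: a.get(i) and b.get(i) are the call sites whose
    // type profile collects every receiver class passed in below.
    static int dot(ByteSource a, ByteSource b) {
        int sum = 0;
        for (int i = 0; i < a.length(); i++) {
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4};
        byte[] y = {5, 6, 7, 8};
        ByteSource[] xs = {new HeapSource(x), new DirectSource(x), new WrappedSource(x)};
        ByteSource[] ys = {new HeapSource(y), new DirectSource(y), new WrappedSource(y)};
        long total = 0;
        // Mixing all combinations through the same method pollutes its profile.
        for (int iter = 0; iter < 100_000; iter++) {
            for (ByteSource a : xs) {
                for (ByteSource b : ys) {
                    total += dot(a, b);
                }
            }
        }
        System.out.println("checksum: " + total);
    }
}
```

Calling `dot` with only `HeapSource` arguments in a fresh JVM would give the monomorphic baseline to compare against; the sketch only shows the shape of the problem and makes no claim about the size of the slowdown.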
On printing some JVM inlining internals, I came across entries like:
So I turned off JVM method inlining for those internal classes using some compiler directives, so that their callers are compiled separately into more type-appropriate methods.
Here are the results:
Looks like we can regain most of the performance drop from call site pollution!
Also wanted to note: we fixed a similar JVM inlining issue in #14874, and I undid those changes locally (move away from
i.e. the above PR was net-net positive, but left some more performance to be reclaimed by sidestepping call site pollution (which we hope to do here).
It's annoying that the JVM only allows this kind of control via command-line args rather than annotations that can be placed in code. In fact there are such annotations (@ForceInline and @DontInline), but they are not available to lowly users, only to JVM internal code.
It's possible that wrapping
It might be best to use a script to generate code for the 3 different input cases.
I tried adding a ByteVectorLoader implementation that stores both possible representations and switches between them in the loop (think
With pollution:
Without pollution:
With union vector representation:
I also tried sealing ByteVectorLoader and switching on the class type. On most runs this is faster than the union vector representation, but on some JVM runs it is ~10x slower :/.
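For readers unfamiliar with the "union" idea, here is a rough sketch of what storing both representations in one class could look like. It is an assumption-laden illustration, not the code that was benchmarked: `UnionVector`, `ofHeap`, and `ofDirect` are hypothetical names, and a direct `ByteBuffer` stands in for the off-heap storage. The point is that `dot` always sees a single receiver class, and the choice of representation becomes a field check instead of a virtual call.

```java
import java.nio.ByteBuffer;

public class UnionVectorSketch {
    /** Hypothetical union holder: exactly one of heap/direct is non-null. */
    static final class UnionVector {
        private final byte[] heap;
        private final ByteBuffer direct;
        final int length;

        private UnionVector(byte[] heap, ByteBuffer direct, int length) {
            this.heap = heap;
            this.direct = direct;
            this.length = length;
        }

        /** Wraps an on-heap array without copying. */
        static UnionVector ofHeap(byte[] data) {
            return new UnionVector(data, null, data.length);
        }

        /** Copies the data into a direct (off-heap) buffer. */
        static UnionVector ofDirect(byte[] data) {
            ByteBuffer buf = ByteBuffer.allocateDirect(data.length);
            buf.put(data);
            buf.flip();
            return new UnionVector(null, buf, data.length);
        }

        // Representation is chosen by a null check on a final field,
        // so callers never dispatch on the class of the vector.
        byte get(int i) {
            return heap != null ? heap[i] : direct.get(i);
        }
    }

    /** Single kernel for every on-heap/off-heap combination. */
    static int dot(UnionVector a, UnionVector b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        int sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        byte[] x = {1, 2, 3, 4};
        byte[] y = {5, 6, 7, 8};
        System.out.println(dot(UnionVector.ofHeap(x), UnionVector.ofDirect(y)));
    }
}
```

Whether the extra branch per element is cheaper than a polluted call site is exactly what the numbers in this thread are measuring; the sketch takes no position on that.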
I was trying to figure out why #14863 adversely affected indexing performance of byte quantized vectors in nightly benchmarks (see #14863 (comment)), when it was supposed to speed things up by doing vector computations off-heap -- and may have found something interesting!