Conversation

@kaivalnp (Contributor) commented Oct 1, 2025

I was trying to figure out why #14863 adversely affected indexing performance of byte quantized vectors in nightly benchmarks (see #14863 (comment)), when it was supposed to speed things up by doing vector computations off-heap -- and may have found something interesting!

- Pollute VectorScorerBenchmark with on-heap dot products
- Add benchmark function with compiler directives that work around the pollution
@kaivalnp (Contributor, Author) commented Oct 1, 2025

We use the same underlying function for a variety of combinations of on-heap and off-heap vectors -- which may pollute its call sites and cause the JIT to compile sub-optimal code

To demonstrate this, I added some type pollution to VectorScorerBenchmark
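The kind of pollution described above can be pictured with a minimal, hypothetical sketch (these classes are illustrative stand-ins, not Lucene's actual code; a `ByteBuffer` stands in for the off-heap `MemorySegment` path). Both loader types flow through the same `sum()` call site, so the JIT's type profile there is mixed rather than monomorphic:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of call site pollution: ArrayLoader mirrors the
// on-heap path, BufferLoader stands in for the off-heap path. When both
// implementations pass through sum(), the JIT records a mixed type profile
// at that call site, similar to the dotProductBody entries in the PR.
public class PollutionSketch {

    interface Loader {
        byte get(int i);
        int length();
    }

    static final class ArrayLoader implements Loader {
        private final byte[] data;
        ArrayLoader(byte[] data) { this.data = data; }
        public byte get(int i) { return data[i]; }
        public int length() { return data.length; }
    }

    static final class BufferLoader implements Loader {
        private final ByteBuffer data;
        BufferLoader(byte[] bytes) { this.data = ByteBuffer.wrap(bytes); }
        public byte get(int i) { return data.get(i); }
        public int length() { return data.capacity(); }
    }

    // Shared call site: the type profile here records every Loader subtype
    // that passes through it.
    static int sum(Loader loader) {
        int acc = 0;
        for (int i = 0; i < loader.length(); i++) {
            acc += loader.get(i);
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] v = new byte[1024];
        Arrays.fill(v, (byte) 1);
        // Alternating implementations at the same call site is the
        // "pollution": neither type dominates the profile.
        long total = 0;
        for (int iter = 0; iter < 10_000; iter++) {
            total += sum(new ArrayLoader(v));
            total += sum(new BufferLoader(v));
        }
        System.out.println(total); // 1024 * 2 * 10_000
    }
}
```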

Benchmark without pollution:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  7.543 ± 0.062  ops/us

Benchmark with pollution:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  6.327 ± 0.057  ops/us

@kaivalnp (Contributor, Author) commented Oct 1, 2025

On printing some JVM inlining internals, I came across entries like:

 @ 35   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::length (6 bytes)   inline (hot)
   callee changed to  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::length (13 bytes)   inline (hot)
   callee changed to  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
     \-> TypeProfile (1028/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$MemorySegmentLoader
   callee changed to  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
     \-> TypeProfile (42488/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$ArrayLoader

...so I turned off JVM method inlining for those internal classes using some compiler directives (so that their callers are compiled separately, into more type-appropriate methods)
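For illustration, a directives file along these lines would express "don't inline these loader classes" (syntax per HotSpot's `-XX:CompilerDirectivesFile=directives.json` option; this is a sketch, and the exact directives used for the benchmark may differ):

```
[
  {
    // Apply to every compiled method...
    match: ["*::*"],
    c2: {
      // ...but exclude the loader classes from inlining ("-" prefix), so
      // callers keep a plain call instead of an inlined, profile-polluted body
      inline: ["-org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::*",
               "-org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::*"]
    }
  }
]
```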

Here are the results:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  6.374 ± 0.026  ops/us
VectorScorerBenchmark.binaryDotProductMemSegWithVectorDirectives    1024  thrpt   15  7.401 ± 0.005  ops/us

Looks like we can regain most of the performance drop from call site pollution!

@kaivalnp (Contributor, Author) commented Oct 1, 2025

Long-term, we probably want to rewrite PanamaVectorUtilSupport so that these directives aren't needed?

In the short-term, how can we make users aware of this issue / workaround?
For example, the directives probably need to be added to files like this (for OpenSearch) or this (for Elasticsearch)

@kaivalnp (Contributor, Author) commented Oct 2, 2025

Also wanted to note: we fixed a similar JVM inlining issue in #14874 -- the PanamaVectorUtilSupport$ArrayLoader and PanamaVectorUtilSupport$MemorySegmentLoader classes referenced in this PR were actually added there

I undid those changes locally (moving away from ByteVectorLoader, wrapping byte[] in a MemorySegment, and having all functions work on two MemorySegments), but that degrades performance further:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  1.592 ± 0.065  ops/us

i.e. the above PR was a net positive, but left some performance on the table to be reclaimed by sidestepping call site pollution (which we hope to do here)

@msokolov (Contributor) commented Oct 2, 2025

It's annoying that the JVM only allows this kind of control via command-line args rather than annotations that can be placed in code. In fact there are such annotations (@Inline and @DontInline), but they are not available to lowly users -- only to JVM-internal code.

@mccullocht (Contributor) commented
It's possible that wrapping byte[] in MemorySegment is just moving the inlining issue around. I tried something similar in my PR for off-heap scoring and got similar results.

It might be best to use a script to generate code for the 3 different input cases.

@mccullocht (Contributor) commented
I tried adding a ByteVectorLoader implementation that stores both possible representations and switches between them in the loop (think if (segment != null) { <segment stuff> } else { <array stuff> }). This was not successful.
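A hypothetical sketch of that "union" representation (names are illustrative, not Lucene's; a `ByteBuffer` stands in for `MemorySegment`): one concrete class holds both fields, so the call site stays monomorphic at the cost of a branch inside the hot loop.

```java
import java.nio.ByteBuffer;

// Hypothetical "union" loader: stores both possible representations and
// branches per element. There is only one concrete type, so call sites
// stay monomorphic, but every access pays the segment-vs-array branch.
final class UnionLoader {
    private final byte[] array;       // non-null for the on-heap case
    private final ByteBuffer segment; // non-null for the off-heap case

    static UnionLoader ofArray(byte[] array) {
        return new UnionLoader(array, null);
    }

    static UnionLoader ofSegment(ByteBuffer segment) {
        return new UnionLoader(null, segment);
    }

    private UnionLoader(byte[] array, ByteBuffer segment) {
        this.array = array;
        this.segment = segment;
    }

    byte get(int i) {
        // The branch described above: <segment stuff> vs. <array stuff>
        return (segment != null) ? segment.get(i) : array[i];
    }

    int length() {
        return (segment != null) ? segment.capacity() : array.length;
    }
}
```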

With pollution:

Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg    1024  thrpt   15  25.176 ± 2.890  ops/us

Without pollution:

Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg    1024  thrpt   15  32.198 ± 0.077  ops/us

With union vector representation:

Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg    1024  thrpt   15  28.138 ± 0.108  ops/us

I also tried sealing ByteVectorLoader and switching on the class type. On most runs this is faster than the union vector representation, but on some JVM runs it is ~10x slower :/
