Conversation

@kaivalnp (Contributor) commented Oct 1, 2025

I was trying to figure out why #14863 adversely affected indexing performance of byte quantized vectors in nightly benchmarks (see #14863 (comment)), when it was supposed to speed things up by doing vector computations off-heap -- and may have found something interesting!

- Pollute VectorScorerBenchmark with on-heap dot products
- Add benchmark function with compiler directives that work around the pollution
@kaivalnp (Contributor, Author) commented Oct 1, 2025

We use the same underlying function for a variety of combinations of on-heap and off-heap vectors -- which may pollute its call sites and cause the JIT to compile sub-optimal code

To demonstrate this, I added some type pollution to VectorScorerBenchmark
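The kind of pollution described above can be pictured with a minimal, hypothetical sketch (these classes are illustrative stand-ins, not Lucene's actual code; a `ByteBuffer` stands in for the off-heap `MemorySegment` path). Both loader types flow through the same `sum()` call site, so the JIT's type profile there is mixed rather than monomorphic:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical sketch of call site pollution: ArrayLoader mirrors the
// on-heap path, BufferLoader stands in for the off-heap path. When both
// implementations pass through sum(), the JIT records a mixed type profile
// at that call site, similar to the dotProductBody entries in the PR.
public class PollutionSketch {

    interface Loader {
        byte get(int i);
        int length();
    }

    static final class ArrayLoader implements Loader {
        private final byte[] data;
        ArrayLoader(byte[] data) { this.data = data; }
        public byte get(int i) { return data[i]; }
        public int length() { return data.length; }
    }

    static final class BufferLoader implements Loader {
        private final ByteBuffer data;
        BufferLoader(byte[] bytes) { this.data = ByteBuffer.wrap(bytes); }
        public byte get(int i) { return data.get(i); }
        public int length() { return data.capacity(); }
    }

    // Shared call site: the type profile here records every Loader subtype
    // that passes through it.
    static int sum(Loader loader) {
        int acc = 0;
        for (int i = 0; i < loader.length(); i++) {
            acc += loader.get(i);
        }
        return acc;
    }

    public static void main(String[] args) {
        byte[] v = new byte[1024];
        Arrays.fill(v, (byte) 1);
        // Alternating implementations at the same call site is the
        // "pollution": neither type dominates the profile.
        long total = 0;
        for (int iter = 0; iter < 10_000; iter++) {
            total += sum(new ArrayLoader(v));
            total += sum(new BufferLoader(v));
        }
        System.out.println(total); // 1024 * 2 * 10_000
    }
}
```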

Benchmark without pollution:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  7.543 ± 0.062  ops/us

Benchmark with pollution:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  6.327 ± 0.057  ops/us

@kaivalnp (Contributor, Author) commented Oct 1, 2025

On printing some JVM inlining internals, I came across entries like:

 @ 35   org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::length (6 bytes)   inline (hot)
   callee changed to  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::length (13 bytes)   inline (hot)
   callee changed to  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
     \-> TypeProfile (1028/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$MemorySegmentLoader
   callee changed to  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (250 bytes)
     \-> TypeProfile (42488/43516 counts) = org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport$ArrayLoader

...so I turned off JVM method inlining for those internal classes using some compiler directives (so that their callers are compiled separately, into more type-appropriate methods)
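For illustration, a directives file along these lines would express "don't inline these loader classes" (syntax per HotSpot's `-XX:CompilerDirectivesFile=directives.json` option; this is a sketch, and the exact directives used for the benchmark may differ):

```
[
  {
    // Apply to every compiled method...
    match: ["*::*"],
    c2: {
      // ...but exclude the loader classes from inlining ("-" prefix), so
      // callers keep a plain call instead of an inlined, profile-polluted body
      inline: ["-org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$ArrayLoader::*",
               "-org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader::*"]
    }
  }
]
```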

Here are the results:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  6.374 ± 0.026  ops/us
VectorScorerBenchmark.binaryDotProductMemSegWithVectorDirectives    1024  thrpt   15  7.401 ± 0.005  ops/us

Looks like we can regain most of the performance drop from call site pollution!

@kaivalnp (Contributor, Author) commented Oct 1, 2025

Long-term, we probably want to rewrite PanamaVectorUtilSupport so that these directives aren't needed?

In the short-term, how can we make users aware of this issue / workaround?
For example, the directives probably need to be added to files like this (for OpenSearch) or this (for Elasticsearch)

@kaivalnp (Contributor, Author) commented Oct 2, 2025

Also wanted to note: we fixed a similar JVM inlining issue in #14874 -- the PanamaVectorUtilSupport$ArrayLoader and PanamaVectorUtilSupport$MemorySegmentLoader classes referenced in this PR were actually added there

I undid those changes locally (moving away from ByteVectorLoader, wrapping byte[] in a MemorySegment, and having all functions work on two MemorySegments), but that degrades performance further:

Benchmark                                                         (size)   Mode  Cnt  Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg                        1024  thrpt   15  1.592 ± 0.065  ops/us

i.e. the above PR was a net positive, but left some performance on the table to be reclaimed by sidestepping call site pollution (which we hope to do here)

@msokolov (Contributor) commented Oct 2, 2025

It's annoying that the JVM only allows this kind of control via command-line args rather than annotations that can be placed in code. In fact there are such annotations (@Inline and @DontInline), but they are not available to lowly users -- only to JVM-internal code.

@mccullocht (Contributor) commented
It's possible that wrapping byte[] in MemorySegment is just moving the inlining issue around. I tried something similar in my PR for off-heap scoring and got similar results.

It might be best to use a script to generate code for the 3 different input cases.

@mccullocht (Contributor) commented
I tried adding a ByteVectorLoader implementation that stores both possible representations and switches between them in the loop (think if (segment != null) { <segment stuff> } else { <array stuff> }). This was not successful.
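A hypothetical sketch of that "union" representation (names are illustrative, not Lucene's; a `ByteBuffer` stands in for `MemorySegment`): one concrete class holds both fields, so the call site stays monomorphic at the cost of a branch inside the hot loop.

```java
import java.nio.ByteBuffer;

// Hypothetical "union" loader: stores both possible representations and
// branches per element. There is only one concrete type, so call sites
// stay monomorphic, but every access pays the segment-vs-array branch.
final class UnionLoader {
    private final byte[] array;       // non-null for the on-heap case
    private final ByteBuffer segment; // non-null for the off-heap case

    static UnionLoader ofArray(byte[] array) {
        return new UnionLoader(array, null);
    }

    static UnionLoader ofSegment(ByteBuffer segment) {
        return new UnionLoader(null, segment);
    }

    private UnionLoader(byte[] array, ByteBuffer segment) {
        this.array = array;
        this.segment = segment;
    }

    byte get(int i) {
        // The branch described above: <segment stuff> vs. <array stuff>
        return (segment != null) ? segment.get(i) : array[i];
    }

    int length() {
        return (segment != null) ? segment.capacity() : array.length;
    }
}
```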

With pollution:

Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg    1024  thrpt   15  25.176 ± 2.890  ops/us

Without pollution:

Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg    1024  thrpt   15  32.198 ± 0.077  ops/us

With union vector representation:

Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorScorerBenchmark.binaryDotProductMemSeg    1024  thrpt   15  28.138 ± 0.108  ops/us

I also tried sealing ByteVectorLoader and switching on the class type. On most runs this is faster than the union vector representation, but on some JVM runs it is ~10x slower :/
