Conversation

Contributor

@mccullocht mccullocht commented Sep 29, 2025

Partial implementation of #15155

So far this is no faster than the existing implementation. Benchmarked on an AMD RYZEN AI MAX+ 395:

baseline:
Results:
recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.913        1.635   1.630        0.997  1000000   100     100       32        250     8 bits     6824      0.00      Infinity            0.04             1         3759.67      3677.368      747.681       HNSW

candidate:
Results:
recall  latency(ms)  netCPU  avgCpuCount     nDoc  topK  fanout  maxConn  beamWidth  quantized  visited  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.913        1.671   1.661        0.994  1000000   100     100       32        250     8 bits     6824      0.00      Infinity            0.04             1         3759.67      3677.368      747.681       HNSW

DO NOT MERGE
Performance observations: on an AVX-512 host the profiles are quite different. The original path spends most of its time in dotProductBody512(), followed by Int512Vector.reduceLanes(). The new path spends much more time in reduceLanes(), but also spends more time loading the input vectors -- a 128-bit load from a memory segment instead of from a heap array. This could be memory latency, but in that case why doesn't the load into the heap array show up in the baseline profile?
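For readers following along, the hot loop under discussion has roughly the shape of the scalar sketch below. This is a simplified stand-in, not Lucene's actual code: the real PanamaVectorUtilSupport.dotProductBody512() keeps partial sums in the lanes of an Int512Vector and collapses them with reduceLanes(ADD) once per call, which is the step dominating the candidate profile.

```java
// Simplified scalar analogue of Lucene's uint8 dot product (illustrative only).
// The real dotProductBody512() accumulates 16 int lanes of partial sums in an
// Int512Vector and pays a single reduceLanes(ADD) per call to collapse them.
public final class Uint8DotSketch {
  public static int uint8DotProduct(byte[] a, byte[] b) {
    if (a.length != b.length) throw new IllegalArgumentException("length mismatch");
    int acc = 0;
    for (int i = 0; i < a.length; i++) {
      // Bytes are signed on the JVM; mask to recover the unsigned value.
      acc += (a[i] & 0xFF) * (b[i] & 0xFF);
    }
    return acc;
  }

  public static void main(String[] args) {
    byte[] a = {1, 2, (byte) 200};
    byte[] b = {3, 4, (byte) 250};
    System.out.println(uint8DotProduct(a, b)); // 3 + 8 + 50000 = 50011
  }
}
```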

@benwtrent
Member

@mccullocht maybe we only do the byte part of the comparisons off-heap, then apply the corrections all on-heap? I would assume applying corrections is pretty cheap, but even then, if we did it in bulk, on-heap bulk correction application might be pretty fast?
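A rough sketch of that split, with a deliberately invented correction formula -- the actual corrective terms used by Lucene104ScalarQuantizedVectorScorer differ; queryScale, docScales, and docOffsets here are purely illustrative placeholders. The raw integer dot products would come from the off-heap comparison loop, and the per-document corrections are then applied in one tight on-heap pass:

```java
// Hypothetical sketch of "bulk correction on heap". The correction formula
// (raw * queryScale * docScale + docOffset) is invented for illustration and
// is NOT Lucene's actual scalar-quantization correction.
public final class BulkCorrectionSketch {
  public static void applyCorrections(
      int[] rawDots, float queryScale, float[] docScales, float[] docOffsets, float[] scoresOut) {
    for (int i = 0; i < rawDots.length; i++) {
      // One multiply-add per doc: branch-free and easy for C2 to auto-vectorize.
      scoresOut[i] = rawDots[i] * queryScale * docScales[i] + docOffsets[i];
    }
  }

  public static void main(String[] args) {
    int[] raw = {100, 200};
    float[] out = new float[2];
    applyCorrections(raw, 2f, new float[] {0.5f, 0.25f}, new float[] {1f, 2f}, out);
    System.out.println(out[0] + " " + out[1]); // 101.0 102.0
  }
}
```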

@mccullocht
Contributor Author

I did move just the vector dot product off-heap, and I'm not planning to do anything clever with the corrections. I'm not sure that would pay off anyway -- you'd have to transpose from a row view to a column view to parallelize that work, and on x86 it would be 128-bit, which may not go well.

I was assuming that accessing the corrective terms was hurting performance, but deeper JFR stacks point at a more mysterious culprit. This PR spends more time in lane reduction (???) and in 128-bit loads from the memory segment (probably memory latency). For the latter case, it's weird that I don't see anything on the baseline when copying to the heap should be incurring a similar hit.

baseline:
36.66%        12745         org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
                              at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [Inlined code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [Inlined code]
25.90%        9005          org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
                              at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [JIT compiled code]
8.64%         3005          jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
                              at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.util.VectorUtil#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer#quantizedScore() [Inlined code]
                              at org.apache.lucene.codecs.lucene104.Lucene104ScalarQuantizedVectorScorer$1#score() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]

candidate:
33.93%        11848         jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
                              at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [Inlined code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [Inlined code]
23.97%        8369          jdk.incubator.vector.IntVector#reduceLanesTemplate() [Inlined code]
                              at jdk.incubator.vector.Int512Vector#reduceLanes() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.RandomVectorScorer#bulkScore() [Inlined code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#searchLevel() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.AbstractHnswGraphSearcher#search() [JIT compiled code]
                              at org.apache.lucene.util.hnsw.HnswGraphSearcher#search() [JIT compiled code]
13.33%        4655          jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegmentScopedInternal() [Inlined code]
                              at jdk.internal.misc.ScopedMemoryAccess#loadFromMemorySegment() [Inlined code]
                              at jdk.incubator.vector.ByteVector#fromMemorySegment0Template() [Inlined code]
                              at jdk.incubator.vector.Byte128Vector#fromMemorySegment0() [Inlined code]
                              at jdk.incubator.vector.ByteVector#fromMemorySegment() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport$MemorySegmentLoader#load() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody512() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#dotProductBody() [Inlined code]
                              at org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport#uint8DotProduct() [Inlined code]
                              at org.apache.lucene.internal.vectorization.Lucene104MemorySegmentScalarQuantizedVectorScorer$RandomVectorScorerImpl#score() [JIT compiled code]
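To make the two access patterns above concrete, here is a sketch using a direct ByteBuffer as a stand-in for the MemorySegment (names invented; the real code uses java.lang.foreign and 128-bit vector loads). The baseline shape copies the off-heap vector into a reusable heap scratch array and runs the dot product over two heap arrays; the candidate shape reads the off-heap bytes directly inside the loop. Both touch the same off-heap memory once per score, which is why the missing latency cost in the baseline profile is surprising:

```java
import java.nio.ByteBuffer;

// Sketch of the two access patterns under discussion, with a direct ByteBuffer
// standing in for the MemorySegment. Both variants read the same off-heap
// bytes; they differ only in whether a heap copy happens first.
public final class AccessPatternSketch {
  // Baseline shape: bulk-copy off-heap -> heap scratch, then dot over heap arrays.
  static int dotViaHeapCopy(byte[] query, ByteBuffer offHeapDoc, byte[] scratch) {
    offHeapDoc.get(0, scratch); // this copy is where the latency hit should land
    int acc = 0;
    for (int i = 0; i < query.length; i++) {
      acc += (query[i] & 0xFF) * (scratch[i] & 0xFF);
    }
    return acc;
  }

  // Candidate shape: read the off-heap bytes directly inside the loop.
  static int dotDirect(byte[] query, ByteBuffer offHeapDoc) {
    int acc = 0;
    for (int i = 0; i < query.length; i++) {
      acc += (query[i] & 0xFF) * (offHeapDoc.get(i) & 0xFF);
    }
    return acc;
  }

  public static void main(String[] args) {
    byte[] query = {1, 2, 3};
    ByteBuffer doc = ByteBuffer.allocateDirect(3);
    doc.put(0, new byte[] {4, 5, 6});
    System.out.println(dotViaHeapCopy(query, doc, new byte[3])); // 32
    System.out.println(dotDirect(query, doc)); // 32
  }
}
```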

@benwtrent
Member

@mccullocht you might find this interesting: #15272

@mccullocht
Contributor Author

I plan to try the simplest thing first: just copy the dot product code into a dedicated byte[] x MemorySegment variant and see if that yields an improvement, then go from there. I had hoped the JVM would monomorphize these calls, but apparently not.
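For context on the monomorphization concern, here is a toy illustration (invented names) of the shape of the problem. When a shared dot-product body reads its inputs through a loader abstraction -- one implementation backed by a heap array, one by a MemorySegment, as with PanamaVectorUtilSupport$MemorySegmentLoader -- the load inside the innermost loop becomes a bimorphic virtual call once both implementations are hot, and the JIT may stop inlining it. Specializing a copy of the loop per input type restores a monomorphic, inlinable call site:

```java
// Toy illustration (invented names) of a shared loop over a Loader abstraction.
// With two Loader implementations hot at the same call site, loader.load(i) is
// a bimorphic virtual call in the innermost loop; specializing the loop per
// input type ("just copy the dot product code") makes it monomorphic again.
public final class LoaderSketch {
  interface Loader {
    int load(int i);
  }

  static final class ArrayLoader implements Loader {
    private final byte[] data;
    ArrayLoader(byte[] data) { this.data = data; }
    public int load(int i) { return data[i] & 0xFF; }
  }

  // Stand-in for a MemorySegment-backed loader; backed by a byte[] here so the
  // sketch stays runnable without java.lang.foreign.
  static final class SegmentLoader implements Loader {
    private final byte[] offHeapStandIn;
    SegmentLoader(byte[] data) { this.offHeapStandIn = data; }
    public int load(int i) { return offHeapStandIn[i] & 0xFF; }
  }

  // Shared body: both loader types flow through this single call site.
  static int dot(Loader a, Loader b, int len) {
    int acc = 0;
    for (int i = 0; i < len; i++) {
      acc += a.load(i) * b.load(i);
    }
    return acc;
  }

  public static void main(String[] args) {
    byte[] q = {1, 2, 3};
    byte[] d = {4, 5, 6};
    System.out.println(dot(new ArrayLoader(q), new SegmentLoader(d), 3)); // 32
  }
}
```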

@mccullocht
Contributor Author

Ok, I repeated the experiment with a dedicated byte[] x MemorySegment implementation. In the luceneutil benchmarks I'm not suffering from the same inlining/pollution issues, yet the profiles remain the same: reduceLanes() suddenly becomes very expensive. I haven't tried this on other hardware (e.g. a Mac) yet.
