
Conversation

kaivalnp
Contributor

Description

Off-heap scoring for quantized vectors! Related to #13515

This scorer is in line with Lucene99MemorySegmentFlatVectorsScorer, and will automatically be used with PanamaVectorizationProvider (i.e. when jdk.incubator.vector is added). Note that the computations are already vectorized; what we avoid here is the unnecessary copy to heap..

I added off-heap dot product functions for two compressed 4-bit vectors (i.e. no need to "decompress" them) -- I can try to come up with similar ones for Euclidean if this approach seems fine..
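
For intuition, here's a scalar sketch of the both-packed idea (the actual functions in this PR are Panama-vectorized and read MemorySegments; this standalone byte[] version is only illustrative). Since both vectors use the same packed layout, nibble i of one always pairs with nibble i of the other, so no decompression is needed:

// Illustrative scalar version: each byte packs two 4-bit values.
static int int4DotProductBothPacked(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; i++) {
    total += (a[i] & 0x0F) * (b[i] & 0x0F)              // low nibbles
        + ((a[i] >> 4) & 0x0F) * ((b[i] >> 4) & 0x0F);  // high nibbles
  }
  return total;
}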

github-actions (bot)

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@kaivalnp
Contributor Author

I ran some benchmarks on Cohere vectors (768d) for 7-bit and 4-bit (compressed) quantization..

main without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.860        2.815   2.806        0.997  100000   100      50       64        250     7 bits     44.07       2269.17           46.79             1          373.72       366.592       73.624       HNSW
 0.545        3.193   3.185        0.997  100000   100      50       64        250     4 bits     47.26       2115.95           50.04             1          338.13       329.971       37.003       HNSW

main with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.863        1.904   1.886        0.991  100000   100      50       64        250     7 bits     28.65       3490.65           29.66             1          373.69       366.592       73.624       HNSW
 0.545        1.313   1.305        0.994  100000   100      50       64        250     4 bits     22.86       4373.88           17.84             1          338.13       329.971       37.003       HNSW

This PR without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        2.774   2.765        0.997  100000   100      50       64        250     7 bits     44.60       2242.00           46.71             1          373.73       366.592       73.624       HNSW
 0.545        3.147   3.139        0.997  100000   100      50       64        250     4 bits     47.93       2086.51           50.20             1          338.11       329.971       37.003       HNSW

This PR with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        1.612   1.603        0.994  100000   100      50       64        250     7 bits     22.99       4349.53           24.78             1          373.70       366.592       73.624       HNSW
 0.545        1.277   1.269        0.994  100000   100      50       64        250     4 bits     21.60       4630.49           17.41             1          338.11       329.971       37.003       HNSW

I did see slight fluctuation across runs, but search time was ~10% faster for 7-bit and very slightly faster for 4-bit (compressed). Indexing and force-merge times improved by ~15%.

@kaivalnp
Contributor Author

FYI, I observed a strange phenomenon: if the query vector is heap-backed, like:

this.query = MemorySegment.ofArray(targetBytes); // heap-backed segment wrapping the array

instead of the current off-heap implementation in this PR:

this.query = Arena.ofAuto().allocateFrom(JAVA_BYTE, targetBytes); // native (off-heap) copy of the query

..then we see a performance regression:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        3.043   3.034        0.997  100000   100      50       64        250     7 bits     23.25       4301.82           25.29             1          373.70       366.592       73.624       HNSW
 0.545        2.060   2.049        0.995  100000   100      50       64        250     4 bits     22.19       4506.33           17.99             1          338.17       329.971       37.003       HNSW

Maybe I'm missing something obvious, but I haven't found the root cause yet..

@ChrisHegarty
Contributor

..then we see a performance regression:
...
Maybe I'm missing something obvious, but I haven't found the root cause yet..

yeah. I've seen similar before. You might be hitting a problem with the loop bound not being hoisted. I will try to take a look.
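
Roughly, the pattern to watch for (an illustrative sketch, not the actual Lucene kernel): if the bound is re-evaluated through a call in the loop condition, the JIT may fail to prove it loop-invariant, keeping bounds checks alive and defeating auto-vectorization; hoisting it into a local is the usual fix:

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

static long sum(MemorySegment seg) {
  long total = 0;
  long len = seg.byteSize(); // hoisted once: a local the JIT can prove loop-invariant
  for (long i = 0; i < len; i++) { // re-evaluating seg.byteSize() here can block the optimization
    total += seg.get(ValueLayout.JAVA_BYTE, i);
  }
  return total;
}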

@kaivalnp
Contributor Author

Thanks @ChrisHegarty! I saw that we use a heap-backed MemorySegment while scoring byte vectors -- so I opened #14874 to investigate if we can improve performance by moving to an off-heap query.

github-actions (bot)

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 15, 2025
@kaivalnp
Contributor Author

After this conversation, I re-ran some benchmarks with -XX:CompileCommand=inline,*PanamaVectorUtilSupport.* to force inlining of the dot product functions.

main:

recall  latency(ms)  index time(s)  force merge(s)  quantization  index
 0.530        1.675          75.65           59.97         4 bit  fresh
 0.860        1.844          81.75           71.79         7 bit  fresh
 0.530        1.582              -               -         4 bit  no reindex
 0.860        1.859              -               -         7 bit  no reindex
 0.529        1.682          79.32           62.62         4 bit  reindex
 0.859        1.821         103.78           48.36         7 bit  reindex

This PR:

recall  latency(ms)  index time(s)  force merge(s)  quantization  index       query type
 0.529        2.132         131.86           85.86         4 bit  fresh       MemorySegment.ofArray
 0.858        2.797         133.71           80.81         7 bit  fresh       MemorySegment.ofArray
 0.529        2.081              -               -         4 bit  no reindex  MemorySegment.ofArray
 0.858        1.670              -               -         7 bit  no reindex  MemorySegment.ofArray
 0.529        2.140         130.82           85.08         4 bit  reindex     MemorySegment.ofArray
 0.858        2.883         132.66           83.05         7 bit  reindex     MemorySegment.ofArray
 0.529        1.511         164.22          110.18         4 bit  fresh       Arena.ofAuto().allocateFrom
 0.859        1.728         132.41           81.35         7 bit  fresh       Arena.ofAuto().allocateFrom
 0.529        1.551              -               -         4 bit  no reindex  Arena.ofAuto().allocateFrom
 0.859        1.704              -               -         7 bit  no reindex  Arena.ofAuto().allocateFrom
 0.529        1.574         164.04          112.60         4 bit  reindex     Arena.ofAuto().allocateFrom
 0.859        1.774         135.20           83.72         7 bit  reindex     Arena.ofAuto().allocateFrom

Looks like the changes in this PR have a small benefit on the search side, but slow down indexing by a lot..

@github-actions github-actions bot removed the Stale label Jul 16, 2025
@kaivalnp kaivalnp marked this pull request as draft July 29, 2025 20:01
@kaivalnp kaivalnp force-pushed the off-heap-quantized-scoring branch from 82e1faa to 0167f13 Compare July 29, 2025 20:02
@kaivalnp
Contributor Author

We had some interesting findings in #14874, so I updated this PR to reflect those..

Benchmarks for 768d Cohere vectors (dot product similarity; the 4-bit one is compressed):

main with -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.607   1.606        1.000  100000   100      50       64        250     7 bits     39.43       2536.33           18.73             1          373.37       366.592       73.624       HNSW
 0.542        1.211   1.210        0.999  100000   100      50       64        250     4 bits     34.09       2933.76           15.33             1          337.76       329.971       37.003       HNSW

main without -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.448   1.447        0.999  100000   100      50       64        250     7 bits      0.00      Infinity            0.12             1          373.37       366.592       73.624       HNSW
 0.542        1.075   1.074        0.999  100000   100      50       64        250     4 bits      0.00      Infinity            0.12             1          337.76       329.971       37.003       HNSW

This PR with -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.394   1.392        0.999  100000   100      50       64        250     7 bits     36.29       2755.73           17.35             1          373.36       366.592       73.624       HNSW
 0.541        1.135   1.134        0.999  100000   100      50       64        250     4 bits     34.85       2869.11           15.53             1          337.74       329.971       37.003       HNSW

This PR without -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.329   1.328        0.999  100000   100      50       64        250     7 bits      0.00      Infinity            0.12             1          373.36       366.592       73.624       HNSW
 0.541        1.141   1.139        0.998  100000   100      50       64        250     4 bits      0.00      Infinity            0.13             1          337.74       329.971       37.003       HNSW

We see some speedup in both indexing and search for 7-bit quantization, but not so much for 4-bit.

I wrote specific functions in PanamaVectorUtilSupport to compute the "dot product" between two compressed (i.e. "packed") 4-bit vectors (so no need to copy to heap and decompress), to be used during indexing. These are inspired by another function that assumed one of the vectors is uncompressed (i.e. "unpacked").

The drawback here is that we'll need specific functions for comparing "packed" versions of the query / documents for other similarity functions like "euclidean" as well -- looking for input on whether the gains justify maintaining those functions..

Also tagging people who might be interested in these changes, maybe @benwtrent (since you were exploring something similar in #13497) or @ChrisHegarty?

@kaivalnp
Contributor Author

kaivalnp commented Aug 5, 2025

Thanks for the review @thecoop! I've tried to address your comments above, do let me know if you have more feedback..

@mikemccand
Member

This looks like a compelling optimization? And the "search in the same JVM that did indexing" issue is separately already resolved? Since this is such a risk with Java (hotspot head-fake vulnerability), maybe knnPerfTest.py in nightly should run both ways, so we can spot future hotspot head-fake vulnerabilities?

Were there other pending concerns? I cannot really tell from the discussion...

@kaivalnp
Contributor Author

And the "search in the same JVM that did indexing" issue is separately already resolved?

Yes, that was fixed in #14874

Were there other pending concerns?

One consequence of this change: since we don't want to copy 4-bit values to the Java heap, we need to define functions in PanamaVectorUtilSupport to compare two 4-bit vectors for all similarities, in the combinations unpacked-unpacked, unpacked-packed, and packed-packed (I've added dot product functions in this PR; Euclidean and Cosine still need to be added..)

The question I had was whether the benefits are compelling enough to maintain these functions..
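
For example, a scalar sketch of the unpacked-packed dot product (illustrative; it assumes byte i of the packed vector holds dimension i in its high nibble and dimension i + d/2 in its low nibble -- the actual pairing is defined by the quantized vector format, so treat that layout as an assumption):

static int int4DotProductSinglePacked(byte[] unpacked, byte[] packed) {
  int half = packed.length; // packed holds two dimensions per byte
  int total = 0;
  for (int i = 0; i < half; i++) {
    total += unpacked[i] * ((packed[i] >> 4) & 0x0F)  // dimension i (high nibble, assumed)
        + unpacked[i + half] * (packed[i] & 0x0F);    // dimension i + half (low nibble, assumed)
  }
  return total;
}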

@benwtrent
Member

The question I had was whether the benefits are compelling enough to maintain these functions..

I think the goal for 4-bit is that we just have the "compressed" version only.

@kaivalnp kaivalnp force-pushed the off-heap-quantized-scoring branch from 6ce7f50 to ac04f41 Compare September 10, 2025 06:42
@kaivalnp kaivalnp marked this pull request as ready for review September 10, 2025 06:42
@kaivalnp
Contributor Author

Sorry for the delay here!

I ran the following benchmarks on 768d Cohere vectors for all vector similarities, with 4-bit (compressed) and 7-bit quantization. I needed to run 10k queries for reliable results (I saw some variance with the default 1k queries).

cosine

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.544        3.103   3.102        0.999  200000   100      50       32        200     4 bits     14.45      13842.75             4          670.05       659.943       74.005       HNSW
 0.505        4.499   4.497        1.000  200000   100      50       32        200     7 bits     14.03      14257.20             4          745.36       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.543        2.854   2.852        0.999  200000   100      50       32        200     4 bits     14.57      13724.95             4          670.06       659.943       74.005       HNSW
 0.506        3.978   3.976        0.999  200000   100      50       32        200     7 bits     13.41      14912.02             4          745.09       733.185      147.247       HNSW

dot_product

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.528        3.522   3.520        1.000  200000   100      50       32        200     4 bits     14.03      14258.22             4          674.69       659.943       74.005       HNSW
 0.881        4.303   4.301        1.000  200000   100      50       32        200     7 bits     14.41      13880.21             4          746.41       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.528        3.218   3.217        1.000  200000   100      50       32        200     4 bits     13.60      14706.96             4          674.64       659.943       74.005       HNSW
 0.882        3.915   3.913        1.000  200000   100      50       32        200     7 bits     15.15      13205.68             4          746.44       733.185      147.247       HNSW

euclidean

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.550        7.581   7.579        1.000  200000   100      50       32        200     4 bits     13.09      15284.68             4          667.46       659.943       74.005       HNSW
 0.936        3.938   3.937        1.000  200000   100      50       32        200     7 bits     12.88      15532.77             4          739.76       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.550        2.422   2.420        0.999  200000   100      50       32        200     4 bits     13.27      15070.45             4          667.45       659.943       74.005       HNSW
 0.936        3.666   3.664        0.999  200000   100      50       32        200     7 bits     12.66      15796.54             4          739.73       733.185      147.247       HNSW

mip

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.529        3.537   3.536        1.000  200000   100      50       32        200     4 bits     14.30      13988.95             4          674.69       659.943       74.005       HNSW
 0.882        4.280   4.278        1.000  200000   100      50       32        200     7 bits     14.18      14109.35             4          746.41       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.529        3.332   3.330        0.999  200000   100      50       32        200     4 bits     13.89      14401.96             4          674.65       659.943       74.005       HNSW
 0.882        3.876   3.874        0.999  200000   100      50       32        200     7 bits     13.87      14423.77             4          746.43       733.185      147.247       HNSW

The speedup in vector search time for 4-bit euclidean (~68%) seems amazing: we used to decompress the bits into bytes and use the generic squareDistance function, which did not take into account that the input values lie in the [0, 15] range, where some optimizations become possible.
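
To illustrate why the range matters (a scalar sketch; the real kernel is Panama-vectorized): each squared difference is at most 225, so up to 145 of them fit in a 16-bit accumulator (145 * 225 = 32625 < 32767), letting SIMD lanes stay at 16 bits instead of widening every element to 32 bits:

static int int4SquareDistance(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; ) {
    short block = 0;
    // each (diff * diff) <= 225 for values in [0, 15], so 145 terms cannot overflow a short
    for (int end = Math.min(i + 145, a.length); i < end; i++) {
      int diff = a[i] - b[i];
      block += diff * diff; // compound assignment narrows back to short
    }
    total += block;
  }
  return total;
}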

We see a ~10% speedup in search time for everything else, while indexing is largely unaffected.

Sharing JMH benchmarks (these also check the correctness of the functions):

java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte*" -p size=1024
Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar      1024  thrpt   15   2.378 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15   0.472 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar                1024  thrpt   15   2.378 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.448 ± 0.005  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  16.180 ± 0.082  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.947 ± 0.045  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar          1024  thrpt   15   1.642 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  14.142 ± 0.031  ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar                    1024  thrpt   15   2.463 ± 0.003  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.022 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.340 ± 0.039  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.749 ± 0.055  ops/us

@kaivalnp kaivalnp requested a review from benwtrent September 10, 2025 06:43

@benwtrent benwtrent left a comment


Overall, this looks great. One question on tests.

I think we will get even nicer performance through adding bulk scoring methods :). Though that can be in a later PR I think.

b[i] = (byte) random().nextInt(16); // 4-bit values in [0, 15]
}

assertIntReturningProviders(p -> p.int4DotProduct(a, false, pack(b), true));
@benwtrent
Member

could you add some more tests here to cover:

  • cosine
  • square distance

And their both packed, both unpacked versions?

@kaivalnp
Contributor Author

cosine does not have a separate function in PanamaVectorUtilSupport (it reuses dot product) -- but I've added tests for square distance (and its packed versions)
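
For reference, one possible shape of the pack helper these tests rely on (hypothetical; the actual nibble pairing is defined by the codec, so the layout below is an assumption):

static byte[] pack(byte[] unpacked) {
  int half = unpacked.length / 2;
  byte[] packed = new byte[half];
  for (int i = 0; i < half; i++) {
    // assumed pairing: dimension i in the high nibble, dimension i + half in the low nibble
    packed[i] = (byte) ((unpacked[i] << 4) | unpacked[i + half]);
  }
  return packed;
}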

Kaival Parikh added 2 commits September 10, 2025 18:55
- Also add tests, fix failures
# Conflicts:
#	lucene/CHANGES.txt
@github-actions github-actions bot added this to the 10.4.0 milestone Sep 10, 2025
@kaivalnp
Copy link
Contributor Author

Thanks @benwtrent !

I realized that this PR would give incorrect results if 8-bit quantization was used (added recently in #15148) -- because it used the dotProduct / squareDistance functions, which assume the input bytes are signed

I switched them over to uint8DotProduct and uint8SquareDistance for correctness with 8-bit quantization. I think the previous results with 7-bit quantization still hold, because the signed / unsigned functions produce the same output for 7-bit integers.
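
In scalar form, the unsigned semantics look like this (a sketch; the actual uint8 functions are vectorized):

static int uint8DotProduct(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; i++) {
    // mask to read the stored bytes as unsigned [0, 255]; for 7-bit values the
    // sign bit is never set, so this matches the signed version exactly
    total += (a[i] & 0xFF) * (b[i] & 0xFF);
  }
  return total;
}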

I'll try to run a test with 8-bit quantization too, I realized this PR will implicitly support it :)

@kaivalnp
Copy link
Contributor Author

I'll try to run a test with 8-bit quantization too, I realized this PR will implicitly support it :)

Here are the benchmarks:

cosine

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.584        1.470   1.469        0.999  200000   100      50       32        200     8 bits     10.32      19372.34           14.67             1          747.53       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.583        1.316   1.314        0.998  200000   100      50       32        200     8 bits     10.54      18978.93           13.66             1          747.16       733.185      147.247       HNSW

dot_product

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.858        1.444   1.442        0.999  200000   100      50       32        200     8 bits     10.38      19267.82           14.96             1          747.12       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.858        1.315   1.314        0.999  200000   100      50       32        200     8 bits     11.21      17836.44           14.28             1          747.16       733.185      147.247       HNSW

euclidean

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.933        1.360   1.359        0.999  200000   100      50       32        200     8 bits     14.75      13562.08           34.43             1          740.41       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.935        1.288   1.286        0.998  200000   100      50       32        200     8 bits     10.04      19920.32           12.47             1          740.53       733.185      147.247       HNSW

mip

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.857        1.400   1.399        0.999  200000   100      50       32        200     8 bits     10.70      18686.35           14.36             1          747.11       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.858        1.358   1.357        0.999  200000   100      50       32        200     8 bits     10.47      19109.50           14.30             1          747.15       733.185      147.247       HNSW

Indexing and force-merge are non-trivially faster (30% and 60% respectively) for euclidean; not sure if this is an outlier..
Search is slightly faster (3-10%) across all vector similarities.


@benwtrent benwtrent left a comment


This is great. Thank you!

Square distance improvements are interesting! I was expecting this for int4 as we didn't really have optimized paths there. But wow!

@kaivalnp
Copy link
Contributor Author

Thanks @benwtrent :)

If there's no other comments, could someone help with merging?

@benwtrent benwtrent merged commit 9d8685f into apache:main Sep 16, 2025
8 checks passed
benwtrent pushed a commit that referenced this pull request Sep 16, 2025
@kaivalnp
Contributor Author

Thank you @benwtrent !

@kaivalnp kaivalnp deleted the off-heap-quantized-scoring branch September 16, 2025 17:39
mccullocht added a commit that referenced this pull request Sep 17, 2025
conflict between #14863 and #15169 that wasn't caught in testing before merge.
@benwtrent
Member

@kaivalnp strangely, this added some indexing regression: https://benchmarks.mikemccandless.com/2025.09.16.18.04.08.html

I would expect things to be pretty much the same :/ I haven't dug into it yet.

@kaivalnp
Contributor Author

strangely, this added some indexing regression

I was just about to post this :)
I'll try digging into it soon!

@kaivalnp
Contributor Author

kaivalnp commented Oct 1, 2025

I'll try digging into it soon!

@benwtrent I opened #15272
