
Conversation

kaivalnp
Contributor

Description

Off-heap scoring for quantized vectors! Related to #13515

This scorer is in line with Lucene99MemorySegmentFlatVectorsScorer, and will automatically be used with PanamaVectorizationProvider (i.e. when jdk.incubator.vector is added). Note that the computations are already vectorized; what we avoid here is the unnecessary copy to heap..

I added off-heap dot product functions for two compressed 4-bit vectors (i.e. no need to "decompress" them) -- I can try to come up with similar ones for Euclidean if this approach seems fine..
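
For intuition, here's a scalar sketch of the both-packed idea (the actual functions in this PR are Panama-vectorized and read MemorySegments; this standalone byte[] version is only illustrative). Since both vectors use the same packed layout, nibble i of one always pairs with nibble i of the other, so no decompression is needed:

// Illustrative scalar version: each byte packs two 4-bit values.
static int int4DotProductBothPacked(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; i++) {
    total += (a[i] & 0x0F) * (b[i] & 0x0F)              // low nibbles
        + ((a[i] >> 4) & 0x0F) * ((b[i] >> 4) & 0x0F);  // high nibbles
  }
  return total;
}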

github-actions (bot)

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@kaivalnp
Contributor Author

I ran some benchmarks on Cohere vectors (768d) for 7-bit and 4-bit (compressed) quantization..

main without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.860        2.815   2.806        0.997  100000   100      50       64        250     7 bits     44.07       2269.17           46.79             1          373.72       366.592       73.624       HNSW
 0.545        3.193   3.185        0.997  100000   100      50       64        250     4 bits     47.26       2115.95           50.04             1          338.13       329.971       37.003       HNSW

main with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.863        1.904   1.886        0.991  100000   100      50       64        250     7 bits     28.65       3490.65           29.66             1          373.69       366.592       73.624       HNSW
 0.545        1.313   1.305        0.994  100000   100      50       64        250     4 bits     22.86       4373.88           17.84             1          338.13       329.971       37.003       HNSW

This PR without jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        2.774   2.765        0.997  100000   100      50       64        250     7 bits     44.60       2242.00           46.71             1          373.73       366.592       73.624       HNSW
 0.545        3.147   3.139        0.997  100000   100      50       64        250     4 bits     47.93       2086.51           50.20             1          338.11       329.971       37.003       HNSW

This PR with jdk.incubator.vector:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.861        1.612   1.603        0.994  100000   100      50       64        250     7 bits     22.99       4349.53           24.78             1          373.70       366.592       73.624       HNSW
 0.545        1.277   1.269        0.994  100000   100      50       64        250     4 bits     21.60       4630.49           17.41             1          338.11       329.971       37.003       HNSW

I did see slight fluctuation across runs, but search time was ~10% faster for 7-bit and very slightly faster for 4-bit (compressed). Indexing and force-merge times improved by ~15%.

@kaivalnp
Contributor Author

FYI, I observed a strange phenomenon: if the query vector is heap-backed, like:

this.query = MemorySegment.ofArray(targetBytes); // heap-backed segment wrapping the array

instead of the current off-heap implementation in this PR:

this.query = Arena.ofAuto().allocateFrom(JAVA_BYTE, targetBytes); // native (off-heap) copy of the query

..then we see a performance regression:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        3.043   3.034        0.997  100000   100      50       64        250     7 bits     23.25       4301.82           25.29             1          373.70       366.592       73.624       HNSW
 0.545        2.060   2.049        0.995  100000   100      50       64        250     4 bits     22.19       4506.33           17.99             1          338.17       329.971       37.003       HNSW

Maybe I'm missing something obvious, but I haven't found the root cause yet..

@ChrisHegarty
Contributor

..then we see a performance regression:
...
Maybe I'm missing something obvious, but I haven't found the root cause yet..

yeah. I've seen similar before. You might be hitting a problem with the loop bound not being hoisted. I will try to take a look.
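
Roughly, the pattern to watch for (an illustrative sketch, not the actual Lucene kernel): if the bound is re-evaluated through a call in the loop condition, the JIT may fail to prove it loop-invariant, keeping bounds checks alive and defeating auto-vectorization; hoisting it into a local is the usual fix:

import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

static long sum(MemorySegment seg) {
  long total = 0;
  long len = seg.byteSize(); // hoisted once: a local the JIT can prove loop-invariant
  for (long i = 0; i < len; i++) { // re-evaluating seg.byteSize() here can block the optimization
    total += seg.get(ValueLayout.JAVA_BYTE, i);
  }
  return total;
}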

@kaivalnp
Contributor Author

Thanks @ChrisHegarty! I saw that we use a heap-backed MemorySegment while scoring byte vectors -- so I opened #14874 to investigate if we can improve performance by moving to an off-heap query.

github-actions (bot)

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Jul 15, 2025
@kaivalnp
Contributor Author

After this conversation, I re-ran some benchmarks with -XX:CompileCommand=inline,*PanamaVectorUtilSupport.* to force inlining of the dot product functions.

main:

recall  latency(ms)  index time(s)  force merge(s)  quantization  index
 0.530        1.675          75.65           59.97         4 bit  fresh
 0.860        1.844          81.75           71.79         7 bit  fresh
 0.530        1.582              -               -         4 bit  no reindex
 0.860        1.859              -               -         7 bit  no reindex
 0.529        1.682          79.32           62.62         4 bit  reindex
 0.859        1.821         103.78           48.36         7 bit  reindex

This PR:

recall  latency(ms)  index time(s)  force merge(s)  quantization  index       query type
 0.529        2.132         131.86           85.86         4 bit  fresh       MemorySegment.ofArray
 0.858        2.797         133.71           80.81         7 bit  fresh       MemorySegment.ofArray
 0.529        2.081              -               -         4 bit  no reindex  MemorySegment.ofArray
 0.858        1.670              -               -         7 bit  no reindex  MemorySegment.ofArray
 0.529        2.140         130.82           85.08         4 bit  reindex     MemorySegment.ofArray
 0.858        2.883         132.66           83.05         7 bit  reindex     MemorySegment.ofArray
 0.529        1.511         164.22          110.18         4 bit  fresh       Arena.ofAuto().allocateFrom
 0.859        1.728         132.41           81.35         7 bit  fresh       Arena.ofAuto().allocateFrom
 0.529        1.551              -               -         4 bit  no reindex  Arena.ofAuto().allocateFrom
 0.859        1.704              -               -         7 bit  no reindex  Arena.ofAuto().allocateFrom
 0.529        1.574         164.04          112.60         4 bit  reindex     Arena.ofAuto().allocateFrom
 0.859        1.774         135.20           83.72         7 bit  reindex     Arena.ofAuto().allocateFrom

Looks like the changes in this PR have a small benefit on the search side, but slow down indexing by a lot..

@github-actions github-actions bot removed the Stale label Jul 16, 2025
@kaivalnp kaivalnp marked this pull request as draft July 29, 2025 20:01
@kaivalnp kaivalnp force-pushed the off-heap-quantized-scoring branch from 82e1faa to 0167f13 Compare July 29, 2025 20:02
@kaivalnp
Contributor Author

We had some interesting findings in #14874, so I updated this PR to reflect those..

Benchmarks for 768d Cohere vectors (dot product similarity; the 4-bit one is compressed):

main with -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.607   1.606        1.000  100000   100      50       64        250     7 bits     39.43       2536.33           18.73             1          373.37       366.592       73.624       HNSW
 0.542        1.211   1.210        0.999  100000   100      50       64        250     4 bits     34.09       2933.76           15.33             1          337.76       329.971       37.003       HNSW

main without -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.448   1.447        0.999  100000   100      50       64        250     7 bits      0.00      Infinity            0.12             1          373.37       366.592       73.624       HNSW
 0.542        1.075   1.074        0.999  100000   100      50       64        250     4 bits      0.00      Infinity            0.12             1          337.76       329.971       37.003       HNSW

This PR with -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.394   1.392        0.999  100000   100      50       64        250     7 bits     36.29       2755.73           17.35             1          373.36       366.592       73.624       HNSW
 0.541        1.135   1.134        0.999  100000   100      50       64        250     4 bits     34.85       2869.11           15.53             1          337.74       329.971       37.003       HNSW

This PR without -reindex:

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.862        1.329   1.328        0.999  100000   100      50       64        250     7 bits      0.00      Infinity            0.12             1          373.36       366.592       73.624       HNSW
 0.541        1.141   1.139        0.998  100000   100      50       64        250     4 bits      0.00      Infinity            0.13             1          337.74       329.971       37.003       HNSW

We see some speedup in both indexing and search for 7-bit quantization, but not so much for 4-bit.

I wrote specific functions in PanamaVectorUtilSupport to compute the "dot product" between two compressed (i.e. "packed") 4-bit vectors (so no need to copy to heap and decompress), to be used during indexing. These are inspired by another function that assumed one of the vectors is uncompressed (i.e. "unpacked").

The drawback here is that we'll need specific functions for comparing "packed" versions of the query / documents for other similarity functions like "euclidean" as well -- looking for input on whether the gains justify maintaining those functions..

Also tagging people who might be interested in these changes, maybe @benwtrent (since you were exploring something similar in #13497) or @ChrisHegarty?

@kaivalnp
Contributor Author

kaivalnp commented Aug 5, 2025

Thanks for the review @thecoop! I've tried to address your comments above, do let me know if you have more feedback..

@mikemccand
Member

This looks like a compelling optimization? And the "search in the same JVM that did indexing" issue is separately already resolved? Since this is such a risk with Java (hotspot head-fake vulnerability), maybe knnPerfTest.py in nightly should run both ways, so we can spot future hotspot head-fake vulnerabilities?

Were there other pending concerns? I cannot really tell from the discussion...

@kaivalnp
Contributor Author

And the "search in the same JVM that did indexing" issue is separately already resolved?

Yes, that was fixed in #14874

Were there other pending concerns?

One consequence of this change: since we don't want to copy 4-bit values to the Java heap, we need to define functions in PanamaVectorUtilSupport to compare two 4-bit vectors for all similarities, in the combinations unpacked-unpacked, unpacked-packed, and packed-packed (I've added dot product functions in this PR; Euclidean and Cosine still need to be added..)

The question I had was whether the benefits are compelling enough to maintain these functions..
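
For example, a scalar sketch of the unpacked-packed dot product (illustrative; it assumes byte i of the packed vector holds dimension i in its high nibble and dimension i + d/2 in its low nibble -- the actual pairing is defined by the quantized vector format, so treat that layout as an assumption):

static int int4DotProductSinglePacked(byte[] unpacked, byte[] packed) {
  int half = packed.length; // packed holds two dimensions per byte
  int total = 0;
  for (int i = 0; i < half; i++) {
    total += unpacked[i] * ((packed[i] >> 4) & 0x0F)  // dimension i (high nibble, assumed)
        + unpacked[i + half] * (packed[i] & 0x0F);    // dimension i + half (low nibble, assumed)
  }
  return total;
}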

@benwtrent
Member

The question I had was whether the benefits are compelling enough to maintain these functions..

I think the goal for 4-bit is that we just have the "compressed" version only.

@kaivalnp kaivalnp force-pushed the off-heap-quantized-scoring branch from 6ce7f50 to ac04f41 Compare September 10, 2025 06:42
@kaivalnp kaivalnp marked this pull request as ready for review September 10, 2025 06:42
@kaivalnp
Contributor Author

Sorry for the delay here!

I ran the following benchmarks on 768d Cohere vectors for all vector similarities, with 4-bit (compressed) and 7-bit quantization. I needed to run 10k queries for reliable results (I saw some variance with the default 1k queries).

cosine

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.544        3.103   3.102        0.999  200000   100      50       32        200     4 bits     14.45      13842.75             4          670.05       659.943       74.005       HNSW
 0.505        4.499   4.497        1.000  200000   100      50       32        200     7 bits     14.03      14257.20             4          745.36       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.543        2.854   2.852        0.999  200000   100      50       32        200     4 bits     14.57      13724.95             4          670.06       659.943       74.005       HNSW
 0.506        3.978   3.976        0.999  200000   100      50       32        200     7 bits     13.41      14912.02             4          745.09       733.185      147.247       HNSW

dot_product

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.528        3.522   3.520        1.000  200000   100      50       32        200     4 bits     14.03      14258.22             4          674.69       659.943       74.005       HNSW
 0.881        4.303   4.301        1.000  200000   100      50       32        200     7 bits     14.41      13880.21             4          746.41       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.528        3.218   3.217        1.000  200000   100      50       32        200     4 bits     13.60      14706.96             4          674.64       659.943       74.005       HNSW
 0.882        3.915   3.913        1.000  200000   100      50       32        200     7 bits     15.15      13205.68             4          746.44       733.185      147.247       HNSW

euclidean

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.550        7.581   7.579        1.000  200000   100      50       32        200     4 bits     13.09      15284.68             4          667.46       659.943       74.005       HNSW
 0.936        3.938   3.937        1.000  200000   100      50       32        200     7 bits     12.88      15532.77             4          739.76       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.550        2.422   2.420        0.999  200000   100      50       32        200     4 bits     13.27      15070.45             4          667.45       659.943       74.005       HNSW
 0.936        3.666   3.664        0.999  200000   100      50       32        200     7 bits     12.66      15796.54             4          739.73       733.185      147.247       HNSW

mip

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.529        3.537   3.536        1.000  200000   100      50       32        200     4 bits     14.30      13988.95             4          674.69       659.943       74.005       HNSW
 0.882        4.280   4.278        1.000  200000   100      50       32        200     7 bits     14.18      14109.35             4          746.41       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.529        3.332   3.330        0.999  200000   100      50       32        200     4 bits     13.89      14401.96             4          674.65       659.943       74.005       HNSW
 0.882        3.876   3.874        0.999  200000   100      50       32        200     7 bits     13.87      14423.77             4          746.43       733.185      147.247       HNSW

The speedup in vector search time for 4-bit euclidean (~68%) seems amazing: we used to decompress the bits into bytes and use the generic squareDistance function, which did not take into account that the input values lie in the [0, 15] range, where some optimizations become possible.
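
To illustrate why the range matters (a scalar sketch; the real kernel is Panama-vectorized): each squared difference is at most 225, so up to 145 of them fit in a 16-bit accumulator (145 * 225 = 32625 < 32767), letting SIMD lanes stay at 16 bits instead of widening every element to 32 bits:

static int int4SquareDistance(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; ) {
    short block = 0;
    // each (diff * diff) <= 225 for values in [0, 15], so 145 terms cannot overflow a short
    for (int end = Math.min(i + 145, a.length); i < end; i++) {
      int diff = a[i] - b[i];
      block += diff * diff; // compound assignment narrows back to short
    }
    total += block;
  }
  return total;
}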

We see a ~10% speedup in search time for everything else, while indexing is largely unaffected.

Sharing JMH benchmarks (these also check the correctness of the functions):

java --module-path lucene/benchmark-jmh/build/benchmarks --module org.apache.lucene.benchmark.jmh "VectorUtilBenchmark.binaryHalfByte*" -p size=1024
Benchmark                                                       (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedScalar      1024  thrpt   15   2.378 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductBothPackedVector      1024  thrpt   15   0.472 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductScalar                1024  thrpt   15   2.378 ± 0.002  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedScalar    1024  thrpt   15   2.448 ± 0.005  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductSinglePackedVector    1024  thrpt   15  16.180 ± 0.082  ops/us
VectorUtilBenchmark.binaryHalfByteDotProductVector                1024  thrpt   15  20.947 ± 0.045  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedScalar          1024  thrpt   15   1.642 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareBothPackedVector          1024  thrpt   15  14.142 ± 0.031  ops/us
VectorUtilBenchmark.binaryHalfByteSquareScalar                    1024  thrpt   15   2.463 ± 0.003  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedScalar        1024  thrpt   15   2.022 ± 0.001  ops/us
VectorUtilBenchmark.binaryHalfByteSquareSinglePackedVector        1024  thrpt   15  16.340 ± 0.039  ops/us
VectorUtilBenchmark.binaryHalfByteSquareVector                    1024  thrpt   15  18.749 ± 0.055  ops/us

@kaivalnp kaivalnp requested a review from benwtrent September 10, 2025 06:43

@benwtrent benwtrent left a comment


Overall, this looks great. One question on tests.

I think we will get even nicer performance through adding bulk scoring methods :). Though that can be in a later PR I think.

b[i] = (byte) random().nextInt(16); // 4-bit values in [0, 15]
}

assertIntReturningProviders(p -> p.int4DotProduct(a, false, pack(b), true));
@benwtrent
Member

could you add some more tests here to cover:

  • cosine
  • square distance

And their both packed, both unpacked versions?

@kaivalnp
Contributor Author

cosine does not have a separate function in PanamaVectorUtilSupport (it reuses dot product) -- but I've added tests for square distance (and its packed versions)
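
For reference, one possible shape of the pack helper these tests rely on (hypothetical; the actual nibble pairing is defined by the codec, so the layout below is an assumption):

static byte[] pack(byte[] unpacked) {
  int half = unpacked.length / 2;
  byte[] packed = new byte[half];
  for (int i = 0; i < half; i++) {
    // assumed pairing: dimension i in the high nibble, dimension i + half in the low nibble
    packed[i] = (byte) ((unpacked[i] << 4) | unpacked[i + half]);
  }
  return packed;
}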

Kaival Parikh added 2 commits September 10, 2025 18:55
- Also add tests, fix failures
# Conflicts:
#	lucene/CHANGES.txt
@github-actions github-actions bot added this to the 10.4.0 milestone Sep 10, 2025
@kaivalnp
Copy link
Contributor Author

Thanks @benwtrent !

I realized that this PR would give incorrect results if 8-bit quantization was used (added recently in #15148) -- because it used the dotProduct / squareDistance functions, which assume the input bytes are signed

I switched them over to uint8DotProduct and uint8SquareDistance for correctness with 8-bit quantization. I think the previous results with 7-bit quantization still hold, because the signed / unsigned functions produce the same output for 7-bit integers.
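
In scalar form, the unsigned semantics look like this (a sketch; the actual uint8 functions are vectorized):

static int uint8DotProduct(byte[] a, byte[] b) {
  int total = 0;
  for (int i = 0; i < a.length; i++) {
    // mask to read the stored bytes as unsigned [0, 255]; for 7-bit values the
    // sign bit is never set, so this matches the signed version exactly
    total += (a[i] & 0xFF) * (b[i] & 0xFF);
  }
  return total;
}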

I'll try to run a test with 8-bit quantization too, I realized this PR will implicitly support it :)

@kaivalnp
Copy link
Contributor Author

I'll try to run a test with 8-bit quantization too, I realized this PR will implicitly support it :)

Here are the benchmarks:

cosine

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.584        1.470   1.469        0.999  200000   100      50       32        200     8 bits     10.32      19372.34           14.67             1          747.53       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.583        1.316   1.314        0.998  200000   100      50       32        200     8 bits     10.54      18978.93           13.66             1          747.16       733.185      147.247       HNSW

dot_product

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.858        1.444   1.442        0.999  200000   100      50       32        200     8 bits     10.38      19267.82           14.96             1          747.12       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.858        1.315   1.314        0.999  200000   100      50       32        200     8 bits     11.21      17836.44           14.28             1          747.16       733.185      147.247       HNSW

euclidean

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.933        1.360   1.359        0.999  200000   100      50       32        200     8 bits     14.75      13562.08           34.43             1          740.41       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.935        1.288   1.286        0.998  200000   100      50       32        200     8 bits     10.04      19920.32           12.47             1          740.53       733.185      147.247       HNSW

mip

main

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.857        1.400   1.399        0.999  200000   100      50       32        200     8 bits     10.70      18686.35           14.36             1          747.11       733.185      147.247       HNSW

This PR

recall  latency(ms)  netCPU  avgCpuCount    nDoc  topK  fanout  maxConn  beamWidth  quantized  index(s)  index_docs/s  force_merge(s)  num_segments  index_size(MB)  vec_disk(MB)  vec_RAM(MB)  indexType
 0.858        1.358   1.357        0.999  200000   100      50       32        200     8 bits     10.47      19109.50           14.30             1          747.15       733.185      147.247       HNSW

Indexing and force-merge are non-trivially faster (30% and 60% respectively) for euclidean; not sure if this is an outlier..
Search is slightly faster (3-10%) across all vector similarities.


@benwtrent benwtrent left a comment


This is great. Thank you!

Square distance improvements are interesting! I was expecting this for int4 as we didn't really have optimized paths there. But wow!

@kaivalnp
Copy link
Contributor Author

Thanks @benwtrent :)

If there's no other comments, could someone help with merging?

@benwtrent benwtrent merged commit 9d8685f into apache:main Sep 16, 2025
8 checks passed
benwtrent pushed a commit that referenced this pull request Sep 16, 2025
@kaivalnp
Contributor Author

Thank you @benwtrent !

@kaivalnp kaivalnp deleted the off-heap-quantized-scoring branch September 16, 2025 17:39
mccullocht added a commit that referenced this pull request Sep 17, 2025
conflict between #14863 and #15169 that wasn't caught in testing before merge.
@benwtrent
Member

@kaivalnp strangely, this added some indexing regression: https://benchmarks.mikemccandless.com/2025.09.16.18.04.08.html

I would expect things to be pretty much the same :/ I haven't dug into it yet.

@kaivalnp
Contributor Author

strangely, this added some indexing regression

I was just about to post this :)
I'll try digging into it soon!

@kaivalnp
Contributor Author

kaivalnp commented Oct 1, 2025

I'll try digging into it soon!

@benwtrent I opened #15272
