Dynamic IVF Index Implementation #223
base: main
Conversation
added in the cluster. Added comprehensive tests for IVF
Pull request overview
This PR introduces Dynamic IVF (Inverted File) Index support for SVS, enabling efficient vector search with dynamic insert and delete operations. The implementation includes comprehensive test coverage, memory optimizations through blocked storage, flexible thread configuration, and full Python bindings. Key enhancements include a new train_only parameter for k-means clustering that decouples centroid training from cluster assignment, and get_distance API for computing distances between queries and indexed vectors.
Key Changes
- Dynamic IVF index implementation with insert/delete support via `BlockedData` storage
- Enhanced k-means clustering with `train_only` mode and improved training data selection
- New `get_distance` API for distance computation in both static and dynamic IVF indices
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/svs/index/ivf/kmeans.cpp | Adds comprehensive test coverage for k-means train_only functionality, edge cases, reproducibility, and parameter variations |
| tests/svs/index/ivf/hierarchical_kmeans.cpp | Extends hierarchical k-means tests with train_only verification, level1 cluster configurations, and comparison tests |
| tests/svs/index/ivf/dynamic_ivf.cpp | New test suite for dynamic IVF operations including add/delete, search, compaction, threading, and get_distance |
| tests/svs/index/ivf/common.cpp | Adds utility function tests and verifies train_only + cluster_assignment workflow |
| tests/integration/ivf/index_search.cpp | Integration test for get_distance functionality in static IVF |
| tests/integration/ivf/index_build.cpp | Integration test for train_only build workflow |
| tests/CMakeLists.txt | Registers dynamic_ivf.cpp test file |
| include/svs/orchestrators/ivf.h | Adds get_distance API to IVFInterface |
| include/svs/orchestrators/dynamic_ivf.h | New orchestrator for dynamic IVF with mutable operations |
| include/svs/index/ivf/kmeans.h | Implements train_only parameter and improved training data selection |
| include/svs/index/ivf/index.h | Adds ID mapping and get_distance support to static IVF |
| include/svs/index/ivf/hierarchical_kmeans.h | Implements train_only parameter for hierarchical k-means |
| include/svs/index/ivf/extensions.h | Adds get_distance_ext CPO for extensible distance computation |
| include/svs/index/ivf/dynamic_ivf.h | Core dynamic IVF implementation with BlockedData storage |
| include/svs/index/ivf/common.h | Enhanced compute_matmul validation, improved type handling, and new cluster_assignment utility |
| include/svs/index/ivf/clustering.h | Adds cluster indexing operator for DenseClusteredDataset |
| examples/python/example_ivf_dynamic.py | Python example demonstrating dynamic IVF usage |
| examples/python/example_ivf.py | Python example for static IVF clustering and assembly |
| bindings/python/tests/test_ivf.py | Adds get_distance test coverage |
| bindings/python/src/python_bindings.cpp | Registers dynamic_ivf Python bindings |
| bindings/python/src/ivf.cpp | Updates clustering to support Float16/BFloat16 variants |
| bindings/python/src/dynamic_ivf.cpp | Python bindings for dynamic IVF operations |
| bindings/python/include/svs/python/ivf.h | Adds get_distance to Python IVF interface |
| bindings/python/include/svs/python/dynamic_ivf.h | Header for dynamic IVF Python bindings |
| bindings/python/CMakeLists.txt | Registers dynamic_ivf.cpp source file |
include/svs/index/ivf/common.h (Outdated)
```cpp
void compute_matmul(
    const T* data, const T* centroids, float* results, size_t m, size_t n, size_t k
) {
    // Validate parameters to avoid Intel MKL errors
```
Copilot AI · Dec 11, 2025
The early return for zero dimensions is correct, but the comment could be more specific about what would happen without this check. Consider clarifying that MKL functions may have undefined behavior or produce errors with zero dimensions.
Suggested change:

```diff
- // Validate parameters to avoid Intel MKL errors
+ // Early return for zero dimensions.
+ // Calling Intel MKL functions (e.g., cblas_sgemm) with zero dimensions may result in undefined behavior or runtime errors.
+ // This check ensures we avoid such cases.
```
include/svs/index/ivf/kmeans.h (Outdated)
```cpp
// or training_fraction of data, whichever is larger.
// This ensures we have enough training data even for small datasets
```
Copilot AI · Dec 11, 2025
The logic for min_training_data calculates the minimum between num_centroids * MIN_TRAINING_SAMPLE_MULTIPLIER and data.size(), which caps the minimum at the dataset size. However, the comment says "Use at least MIN_TRAINING_SAMPLE_MULTIPLIER times the number of centroids, or training_fraction of data, whichever is larger," which doesn't accurately describe the interaction with the std::min operation. The comment should clarify that the minimum is capped at the dataset size.
Suggested change:

```diff
- // or training_fraction of data, whichever is larger.
- // This ensures we have enough training data even for small datasets
+ // but no more than the dataset size. (The minimum is capped at data.size().)
+ // This ensures we have enough training data even for small datasets,
+ // without exceeding the available data.
```
```cpp
if (num_training_data < num_clusters || num_training_data > data.size()) {
    // Use at least MIN_TRAINING_SAMPLE_MULTIPLIER times the number of clusters,
    // or training_fraction of data, whichever is larger.
    // This ensures we have enough training data even for small datasets
```
Copilot AI · Dec 11, 2025
Similar to the kmeans.h comment, the documentation doesn't fully describe the std::min operation that caps min_training_data at data.size(). The comment should be more precise about how the minimum training data is calculated with respect to the dataset size constraint.
Suggested change:

```diff
- // This ensures we have enough training data even for small datasets
+ // The minimum is capped at data.size() using std::min, so we never use more training data than available.
```
tests/svs/index/ivf/kmeans.cpp (Outdated)
```cpp
// Verify train_only doesn't take significantly longer (allow some variance)
// In practice, train_only should be faster, but we allow it to be up to 50% longer due
// to variance
CATCH_REQUIRE(train_only_duration.count() <= normal_duration.count() * 1.5);
```
Copilot AI · Dec 11, 2025
Performance tests comparing execution times can be flaky, especially with a 50% tolerance that allows train_only mode to be slower despite the comment stating it should be faster. Consider either removing this assertion or documenting why such a wide tolerance is acceptable for a feature that's expected to improve performance.
Suggested change:

```diff
- // Verify train_only doesn't take significantly longer (allow some variance)
- // In practice, train_only should be faster, but we allow it to be up to 50% longer due
- // to variance
- CATCH_REQUIRE(train_only_duration.count() <= normal_duration.count() * 1.5);
+ // Note: We do not assert on performance here, as wall-clock timing is unreliable in CI.
+ // In practice, train_only should be faster, but this is best verified with dedicated benchmarks.
```
ahuber21 left a comment
Just a few nitpicks here and there. Great contribution overall!
Is there anything I can/should do to independently validate correctness?
I didn't deep-dive into the IVF part itself.
```cpp
constexpr size_t max_int = static_cast<size_t>(std::numeric_limits<int>::max());
if (m > max_int || n > max_int || k > max_int) {
    throw ANNEXCEPTION(
        "Matrix dimensions too large for Intel MKL GEMM: m={}, n={}, k={}", m, n, k
```
To make it more helpful, we should include `max_int` in the error message.
```cpp
if constexpr (std::is_same_v<T, float>) {
    // Cast size_t parameters to int for MKL GEMM functions
    int m_int = static_cast<int>(m);
```
Do the casting outside the `if` block to avoid duplication.
```cpp
template <typename Query> double get_distance(size_t id, const Query& query) const {
    // Lazily initialize ID mapping on first call
    if (id_to_cluster_.empty()) {
        const_cast<IVFIndex*>(this)->initialize_id_mapping();
```
This is not thread-safe. I didn't find the pattern of const_casting `this` for lazy init elsewhere in the lib, so I suggest not introducing it here.
```cpp
auto distance_copy = distance_;
svs::distance::maybe_fix_argument(distance_copy, query);
```
Suggested change:

```diff
- auto distance_copy = distance_;
- svs::distance::maybe_fix_argument(distance_copy, query);
+ if constexpr (Dist::must_fix_argument) {
+     // Fix distance argument if needed (e.g., for cosine similarity)
+     auto distance_copy = distance_;
+     svs::distance::maybe_fix_argument(distance_copy, query);
+     // Call extension for distance computation
+     return svs::index::ivf::extensions::get_distance_ext(
+         cluster_, distance_copy, cluster_id, pos, query
+     );
+ } else {
+     // Call extension for distance computation
+     return svs::index::ivf::extensions::get_distance_ext(
+         cluster_, distance_, cluster_id, pos, query
+     );
+ }
```
This might be premature optimization. Do you think we can gain a few QPS by avoiding the copy if it's not required?
Also: it states in include/svs/index/ivf/common.h that only L2 and MIP are supported. Is it helpful to mention cosine in the comment?
```cpp
// performance. This value was chosen based on empirical testing to avoid excessive memory
// allocation while supporting large batch operations typical in high-throughput
// environments.
const size_t MAX_QUERY_BATCH_SIZE = 10000;
```
Change to constexpr?
Suggested change:

```cpp
constexpr size_t MAX_QUERY_BATCH_SIZE = 10000;
```
```cpp
size_t size() const { return data_.size(); }

// Support for dynamic operations - SimpleData already has resize()
```
Suggested change:

```diff
- // Support for dynamic operations - SimpleData already has resize()
+ // Support for dynamic operations
```
`data_type` is templated at this point; we shouldn't mention SimpleData.
```cpp
Data& view_cluster() { return data_; }

public:
    Data data_;
```
Better change this to data_type for consistency? Some more examples of using Data instead of data_type exist throughout the file.
```cpp
// Use centroid_assignment to compute assignments for this batch
centroid_assignment(
    const_cast<Points&>(points), // centroid_assignment expects non-const
```
Again, I'm wary of using const_cast. I peeked into centroid_assignment, and I don't see why it can't accept `const Data& data`. Can we change that?
(If the reason is that MKL doesn't accept const pointers, we should const_cast just before dispatching into MKL instead.)
```cpp
// Use a small block size for IVF clusters (1MB instead of 1GB default)
auto blocking_params = data::BlockingParameters{
    .blocksize_bytes = lib::PowerOfTwo(20) // 2^20 = 1MB
```
Why this value?
```diff
  ) {
      return kmeans_clustering_impl<BuildType>(
-         parameters, data, distance, threadpool, integer_type, std::move(logger)
+         parameters, data, distance, threadpool, integer_type, std::move(logger), train_only
```
Maybe split this into two functions? Along the lines of:

```cpp
auto centroids = kmeans_centroids_impl<BuildType>(...);
std::vector<std::vector<uint32_t>> clusters(parameters.num_centroids_);
if (!train_only) {
    clusters = kmeans_clustering_impl<BuildType>(centroids, ...);
}
return std::make_tuple(centroids, std::move(clusters));
```
Thanks @ahuber21 for reviewing this, I will fix your suggestions. Nothing in particular, a high-level look is good for now.
This PR adds support for a Dynamic IVF (Inverted File) Index in SVS, enabling efficient vector search with dynamic insert and delete operations. The implementation includes comprehensive memory optimizations, thread configuration improvements, and full Python bindings.
Key Features
🚀 Dynamic IVF Index
- `BlockedData` with configurable block sizes for efficient memory management

TODOs for IVF: