Dynamic IVF Index Implementation #223
base: main
Conversation
added in the cluster. Added comprehensive tests for IVF
Pull request overview
This PR introduces Dynamic IVF (Inverted File) Index support for SVS, enabling efficient vector search with dynamic insert and delete operations. The implementation includes comprehensive test coverage, memory optimizations through blocked storage, flexible thread configuration, and full Python bindings. Key enhancements include a new train_only parameter for k-means clustering that decouples centroid training from cluster assignment, and get_distance API for computing distances between queries and indexed vectors.
Key Changes
- Dynamic IVF index implementation with insert/delete support via `BlockedData` storage
- Enhanced k-means clustering with `train_only` mode and improved training data selection
- New `get_distance` API for distance computation in both static and dynamic IVF indices
Reviewed changes
Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/svs/index/ivf/kmeans.cpp | Adds comprehensive test coverage for k-means train_only functionality, edge cases, reproducibility, and parameter variations |
| tests/svs/index/ivf/hierarchical_kmeans.cpp | Extends hierarchical k-means tests with train_only verification, level1 cluster configurations, and comparison tests |
| tests/svs/index/ivf/dynamic_ivf.cpp | New test suite for dynamic IVF operations including add/delete, search, compaction, threading, and get_distance |
| tests/svs/index/ivf/common.cpp | Adds utility function tests and verifies train_only + cluster_assignment workflow |
| tests/integration/ivf/index_search.cpp | Integration test for get_distance functionality in static IVF |
| tests/integration/ivf/index_build.cpp | Integration test for train_only build workflow |
| tests/CMakeLists.txt | Registers dynamic_ivf.cpp test file |
| include/svs/orchestrators/ivf.h | Adds get_distance API to IVFInterface |
| include/svs/orchestrators/dynamic_ivf.h | New orchestrator for dynamic IVF with mutable operations |
| include/svs/index/ivf/kmeans.h | Implements train_only parameter and improved training data selection |
| include/svs/index/ivf/index.h | Adds ID mapping and get_distance support to static IVF |
| include/svs/index/ivf/hierarchical_kmeans.h | Implements train_only parameter for hierarchical k-means |
| include/svs/index/ivf/extensions.h | Adds get_distance_ext CPO for extensible distance computation |
| include/svs/index/ivf/dynamic_ivf.h | Core dynamic IVF implementation with BlockedData storage |
| include/svs/index/ivf/common.h | Enhanced compute_matmul validation, improved type handling, and new cluster_assignment utility |
| include/svs/index/ivf/clustering.h | Adds cluster indexing operator for DenseClusteredDataset |
| examples/python/example_ivf_dynamic.py | Python example demonstrating dynamic IVF usage |
| examples/python/example_ivf.py | Python example for static IVF clustering and assembly |
| bindings/python/tests/test_ivf.py | Adds get_distance test coverage |
| bindings/python/src/python_bindings.cpp | Registers dynamic_ivf Python bindings |
| bindings/python/src/ivf.cpp | Updates clustering to support Float16/BFloat16 variants |
| bindings/python/src/dynamic_ivf.cpp | Python bindings for dynamic IVF operations |
| bindings/python/include/svs/python/ivf.h | Adds get_distance to Python IVF interface |
| bindings/python/include/svs/python/dynamic_ivf.h | Header for dynamic IVF Python bindings |
| bindings/python/CMakeLists.txt | Registers dynamic_ivf.cpp source file |
include/svs/index/ivf/common.h (Outdated)
```cpp
void compute_matmul(
    const T* data, const T* centroids, float* results, size_t m, size_t n, size_t k
) {
    // Validate parameters to avoid Intel MKL errors
```
Copilot AI · Dec 11, 2025
The early return for zero dimensions is correct, but the comment could be more specific about what would happen without this check. Consider clarifying that MKL functions may have undefined behavior or produce errors with zero dimensions.
Suggested change:

```diff
- // Validate parameters to avoid Intel MKL errors
+ // Early return for zero dimensions.
+ // Calling Intel MKL functions (e.g., cblas_sgemm) with zero dimensions may result in undefined behavior or runtime errors.
+ // This check ensures we avoid such cases.
```
include/svs/index/ivf/kmeans.h (Outdated)
```cpp
// or training_fraction of data, whichever is larger.
// This ensures we have enough training data even for small datasets
```
Copilot AI · Dec 11, 2025
The logic for min_training_data calculates the minimum between num_centroids * MIN_TRAINING_SAMPLE_MULTIPLIER and data.size(), which caps the minimum at the dataset size. However, the comment says "Use at least MIN_TRAINING_SAMPLE_MULTIPLIER times the number of centroids, or training_fraction of data, whichever is larger," which doesn't accurately describe the interaction with the std::min operation. The comment should clarify that the minimum is capped at the dataset size.
Suggested change:

```diff
- // or training_fraction of data, whichever is larger.
- // This ensures we have enough training data even for small datasets
+ // but no more than the dataset size. (The minimum is capped at data.size().)
+ // This ensures we have enough training data even for small datasets,
+ // without exceeding the available data.
```
```cpp
if (num_training_data < num_clusters || num_training_data > data.size()) {
    // Use at least MIN_TRAINING_SAMPLE_MULTIPLIER times the number of clusters,
    // or training_fraction of data, whichever is larger.
    // This ensures we have enough training data even for small datasets
```
Copilot AI · Dec 11, 2025
Similar to the kmeans.h comment, the documentation doesn't fully describe the std::min operation that caps min_training_data at data.size(). The comment should be more precise about how the minimum training data is calculated with respect to the dataset size constraint.
Suggested change:

```diff
- // This ensures we have enough training data even for small datasets
+ // The minimum is capped at data.size() using std::min, so we never use more training data than available.
```
tests/svs/index/ivf/kmeans.cpp (Outdated)
```cpp
// Verify train_only doesn't take significantly longer (allow some variance)
// In practice, train_only should be faster, but we allow it to be up to 50% longer due
// to variance
CATCH_REQUIRE(train_only_duration.count() <= normal_duration.count() * 1.5);
```
Copilot AI · Dec 11, 2025
Performance tests comparing execution times can be flaky, especially with a 50% tolerance that allows train_only mode to be slower despite the comment stating it should be faster. Consider either removing this assertion or documenting why such a wide tolerance is acceptable for a feature that's expected to improve performance.
Suggested change:

```diff
- // Verify train_only doesn't take significantly longer (allow some variance)
- // In practice, train_only should be faster, but we allow it to be up to 50% longer due
- // to variance
- CATCH_REQUIRE(train_only_duration.count() <= normal_duration.count() * 1.5);
+ // Note: We do not assert on performance here, as wall-clock timing is unreliable in CI.
+ // In practice, train_only should be faster, but this is best verified with dedicated benchmarks.
```
ahuber21 left a comment
Just a few nitpicks here and there. Great contribution overall!
Is there anything I can/should do to independently validate correctness?
I didn't deep-dive into the IVF part itself.
```cpp
constexpr size_t max_int = static_cast<size_t>(std::numeric_limits<int>::max());
if (m > max_int || n > max_int || k > max_int) {
    throw ANNEXCEPTION(
        "Matrix dimensions too large for Intel MKL GEMM: m={}, n={}, k={}", m, n, k
```
To make it more helpful, we should include `max_int` in the error message.
```cpp
if constexpr (std::is_same_v<T, float>) {
    // Cast size_t parameters to int for MKL GEMM functions
    int m_int = static_cast<int>(m);
```
Do the casting outside the `if` block to avoid duplication.
```cpp
template <typename Query> double get_distance(size_t id, const Query& query) const {
    // Lazily initialize ID mapping on first call
    if (id_to_cluster_.empty()) {
        const_cast<IVFIndex*>(this)->initialize_id_mapping();
```
This is not thread-safe. I didn't find the pattern of const_casting `this` for lazy init elsewhere in the lib, so I suggest not introducing it here.
```cpp
auto distance_copy = distance_;
svs::distance::maybe_fix_argument(distance_copy, query);
```
Suggested change:

```diff
- auto distance_copy = distance_;
- svs::distance::maybe_fix_argument(distance_copy, query);
+ if constexpr (Dist::must_fix_argument) {
+     // Fix distance argument if needed (e.g., for cosine similarity)
+     auto distance_copy = distance_;
+     svs::distance::maybe_fix_argument(distance_copy, query);
+     // Call extension for distance computation
+     return svs::index::ivf::extensions::get_distance_ext(
+         cluster_, distance_copy, cluster_id, pos, query
+     );
+ } else {
+     // Call extension for distance computation
+     return svs::index::ivf::extensions::get_distance_ext(
+         cluster_, distance_, cluster_id, pos, query
+     );
+ }
```
This might be premature optimization. Do you think we can gain a few QPS by avoiding the copy if it's not required?
Also: it states in include/svs/index/ivf/common.h that only L2 and MIP are supported. Is it helpful to mention cosine in the comment?
```cpp
// performance. This value was chosen based on empirical testing to avoid excessive memory
// allocation while supporting large batch operations typical in high-throughput
// environments.
const size_t MAX_QUERY_BATCH_SIZE = 10000;
```
Change to constexpr?
Suggested change:

```cpp
constexpr size_t MAX_QUERY_BATCH_SIZE = 10000;
```
```cpp
size_t size() const { return data_.size(); }

// Support for dynamic operations - SimpleData already has resize()
```
Suggested change:

```diff
- // Support for dynamic operations - SimpleData already has resize()
+ // Support for dynamic operations
```
`data_type` is templated at this point; we shouldn't mention SimpleData.
```cpp
Data& view_cluster() { return data_; }

public:
    Data data_;
```
Better change this to data_type for consistency? Some more examples of using Data instead of data_type exist throughout the file.
```cpp
// Use centroid_assignment to compute assignments for this batch
centroid_assignment(
    const_cast<Points&>(points), // centroid_assignment expects non-const
```
Again, I'm wary of using const_cast. I peeked into centroid_assignment, and I don't see why it can't accept `const Data& data`. Can we change that?
(If the reason is that MKL doesn't accept const pointers, we should const_cast just before dispatching into MKL instead.)
```cpp
// Use a small block size for IVF clusters (1MB instead of 1GB default)
auto blocking_params = data::BlockingParameters{
    .blocksize_bytes = lib::PowerOfTwo(20) // 2^20 = 1MB
```
Why this value?
```diff
  ) {
      return kmeans_clustering_impl<BuildType>(
-         parameters, data, distance, threadpool, integer_type, std::move(logger)
+         parameters, data, distance, threadpool, integer_type, std::move(logger), train_only
```
Maybe split this into two functions? Along the lines of:

```cpp
auto centroids = kmeans_centroids_impl<BuildType>(...);
std::vector<std::vector<uint32_t>> clusters(parameters.num_centroids_);
if (!train_only) {
    clusters = kmeans_clustering_impl<BuildType>(centroids, ...);
}
return std::make_tuple(centroids, std::move(clusters));
```
Thanks @ahuber21 for reviewing this, I will fix your suggestions. Nothing in particular, a high-level look is good for now.
This PR adds support for a Dynamic IVF (Inverted File) Index in SVS, enabling efficient vector search with dynamic insert and delete operations. The implementation includes comprehensive memory optimizations, thread configuration improvements, and full Python bindings.
Key Features
🚀 Dynamic IVF Index
- `BlockedData` with configurable block sizes for efficient memory management

TODOs for IVF: