Skip to content

Conversation

@ibhati
Copy link
Member

@ibhati ibhati commented Nov 18, 2025

This PR enables support for Dynamic IVF (Inverted File) Index for SVS, enabling efficient vector search with dynamic insert and delete operations. The implementation includes comprehensive memory optimizations, thread configuration improvements, and full Python bindings.

Key Features

🚀 Dynamic IVF Index

  • Dynamic Operations: Support for adding and deleting vectors after index construction
  • Blocked Storage: Uses BlockedData with configurable block sizes for efficient memory management
  • Thread Control: Separate configuration for clustering threads, index threads, and intra-query threads

TODOs for IVF:

  • Enable efficient Load/Save (for both static and dynamic IVF)
  • Complete IVF documentation
  • Batch iterator with IVF
  • Benchmarking support for dynamic ivf

@ibhati ibhati changed the title Prepare IVF to support dynamic operations Dynamic IVF Index Implementation Dec 9, 2025
@ibhati ibhati marked this pull request as ready for review December 9, 2025 21:15
@ethanglaser ethanglaser requested a review from Copilot December 11, 2025 23:41
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR introduces Dynamic IVF (Inverted File) Index support for SVS, enabling efficient vector search with dynamic insert and delete operations. The implementation includes comprehensive test coverage, memory optimizations through blocked storage, flexible thread configuration, and full Python bindings. Key enhancements include a new train_only parameter for k-means clustering that decouples centroid training from cluster assignment, and get_distance API for computing distances between queries and indexed vectors.

Key Changes

  • Dynamic IVF index implementation with insert/delete support via BlockedData storage
  • Enhanced k-means clustering with train_only mode and improved training data selection
  • New get_distance API for distance computation in both static and dynamic IVF indices

Reviewed changes

Copilot reviewed 25 out of 25 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/svs/index/ivf/kmeans.cpp Adds comprehensive test coverage for k-means train_only functionality, edge cases, reproducibility, and parameter variations
tests/svs/index/ivf/hierarchical_kmeans.cpp Extends hierarchical k-means tests with train_only verification, level1 cluster configurations, and comparison tests
tests/svs/index/ivf/dynamic_ivf.cpp New test suite for dynamic IVF operations including add/delete, search, compaction, threading, and get_distance
tests/svs/index/ivf/common.cpp Adds utility function tests and verifies train_only + cluster_assignment workflow
tests/integration/ivf/index_search.cpp Integration test for get_distance functionality in static IVF
tests/integration/ivf/index_build.cpp Integration test for train_only build workflow
tests/CMakeLists.txt Registers dynamic_ivf.cpp test file
include/svs/orchestrators/ivf.h Adds get_distance API to IVFInterface
include/svs/orchestrators/dynamic_ivf.h New orchestrator for dynamic IVF with mutable operations
include/svs/index/ivf/kmeans.h Implements train_only parameter and improved training data selection
include/svs/index/ivf/index.h Adds ID mapping and get_distance support to static IVF
include/svs/index/ivf/hierarchical_kmeans.h Implements train_only parameter for hierarchical k-means
include/svs/index/ivf/extensions.h Adds get_distance_ext CPO for extensible distance computation
include/svs/index/ivf/dynamic_ivf.h Core dynamic IVF implementation with BlockedData storage
include/svs/index/ivf/common.h Enhanced compute_matmul validation, improved type handling, and new cluster_assignment utility
include/svs/index/ivf/clustering.h Adds cluster indexing operator for DenseClusteredDataset
examples/python/example_ivf_dynamic.py Python example demonstrating dynamic IVF usage
examples/python/example_ivf.py Python example for static IVF clustering and assembly
bindings/python/tests/test_ivf.py Adds get_distance test coverage
bindings/python/src/python_bindings.cpp Registers dynamic_ivf Python bindings
bindings/python/src/ivf.cpp Updates clustering to support Float16/BFloat16 variants
bindings/python/src/dynamic_ivf.cpp Python bindings for dynamic IVF operations
bindings/python/include/svs/python/ivf.h Adds get_distance to Python IVF interface
bindings/python/include/svs/python/dynamic_ivf.h Header for dynamic IVF Python bindings
bindings/python/CMakeLists.txt Registers dynamic_ivf.cpp source file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

void compute_matmul(
const T* data, const T* centroids, float* results, size_t m, size_t n, size_t k
) {
// Validate parameters to avoid Intel MKL errors
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The early return for zero dimensions is correct, but the comment could be more specific about what would happen without this check. Consider clarifying that MKL functions may have undefined behavior or produce errors with zero dimensions.

Suggested change
// Validate parameters to avoid Intel MKL errors
// Early return for zero dimensions.
// Calling Intel MKL functions (e.g., cblas_sgemm) with zero dimensions may result in undefined behavior or runtime errors.
// This check ensures we avoid such cases.

Copilot uses AI. Check for mistakes.
Comment on lines 49 to 50
// or training_fraction of data, whichever is larger.
// This ensures we have enough training data even for small datasets
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic for min_training_data calculates the minimum between num_centroids * MIN_TRAINING_SAMPLE_MULTIPLIER and data.size(), which caps the minimum at the dataset size. However, the comment says "Use at least MIN_TRAINING_SAMPLE_MULTIPLIER times the number of centroids, or training_fraction of data, whichever is larger," which doesn't accurately describe the interaction with the std::min operation. The comment should clarify that the minimum is capped at the dataset size.

Suggested change
// or training_fraction of data, whichever is larger.
// This ensures we have enough training data even for small datasets
// but no more than the dataset size. (The minimum is capped at data.size().)
// This ensures we have enough training data even for small datasets,
// without exceeding the available data.

Copilot uses AI. Check for mistakes.
if (num_training_data < num_clusters || num_training_data > data.size()) {
// Use at least MIN_TRAINING_SAMPLE_MULTIPLIER times the number of clusters,
// or training_fraction of data, whichever is larger.
// This ensures we have enough training data even for small datasets
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar to the kmeans.h comment, the documentation doesn't fully describe the std::min operation that caps min_training_data at data.size(). The comment should be more precise about how the minimum training data is calculated with respect to the dataset size constraint.

Suggested change
// This ensures we have enough training data even for small datasets
// The minimum is capped at data.size() using std::min, so we never use more training data than available.

Copilot uses AI. Check for mistakes.
Comment on lines 169 to 172
// Verify train_only doesn't take significantly longer (allow some variance)
// In practice, train_only should be faster, but we allow it to be up to 50% longer due
// to variance
CATCH_REQUIRE(train_only_duration.count() <= normal_duration.count() * 1.5);
Copy link

Copilot AI Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Performance tests comparing execution times can be flaky, especially with a 50% tolerance that allows train_only mode to be slower despite the comment stating it should be faster. Consider either removing this assertion or documenting why such a wide tolerance is acceptable for a feature that's expected to improve performance.

Suggested change
// Verify train_only doesn't take significantly longer (allow some variance)
// In practice, train_only should be faster, but we allow it to be up to 50% longer due
// to variance
CATCH_REQUIRE(train_only_duration.count() <= normal_duration.count() * 1.5);
// Note: We do not assert on performance here, as wall-clock timing is unreliable in CI.
// In practice, train_only should be faster, but this is best verified with dedicated benchmarks.

Copilot uses AI. Check for mistakes.
Copy link
Contributor

@ahuber21 ahuber21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few nitpicks here and there. Great contribution overall!
Is there anything I can/should do to independently validate correctness?
I didn't deep-dive into the IVF part itself.

constexpr size_t max_int = static_cast<size_t>(std::numeric_limits<int>::max());
if (m > max_int || n > max_int || k > max_int) {
throw ANNEXCEPTION(
"Matrix dimensions too large for Intel MKL GEMM: m={}, n={}, k={}", m, n, k
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make it more helpful we should include max_int in the error message.


if constexpr (std::is_same_v<T, float>) {
// Cast size_t parameters to int for MKL GEMM functions
int m_int = static_cast<int>(m);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the casting outside the if block to avoid duplication

template <typename Query> double get_distance(size_t id, const Query& query) const {
// Lazily initialize ID mapping on first call
if (id_to_cluster_.empty()) {
const_cast<IVFIndex*>(this)->initialize_id_mapping();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not thread-safe. I didn't find the pattern of const_casting this for lazy init elsewhere in the lib. So I suggest to not introduce it here.

Comment on lines +204 to +205
auto distance_copy = distance_;
svs::distance::maybe_fix_argument(distance_copy, query);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto distance_copy = distance_;
svs::distance::maybe_fix_argument(distance_copy, query);
if constexpr (Dist::must_fix_argument) {
// Fix distance argument if needed (e.g., for cosine similarity)
auto distance_copy = distance_;
svs::distance::maybe_fix_argument(distance_copy, query);
// Call extension for distance computation
return svs::index::ivf::extensions::get_distance_ext(
cluster_, distance_copy, cluster_id, pos, query
);
} else {
// Call extension for distance computation
return svs::index::ivf::extensions::get_distance_ext(
cluster_, distance_, cluster_id, pos, query
);
}

This might be premature optimization. Do you think we can gain a few QPS by avoiding the copy if it's not required?

Also: It states in include/svs/index/ivf/common.h that only L2 and MIP are supported. Is is helpful to mention cosine in the comment?

// performance. This value was chosen based on empirical testing to avoid excessive memory
// allocation while supporting large batch operations typical in high-throughput
// environments.
const size_t MAX_QUERY_BATCH_SIZE = 10000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change to constexpr?

Suggested change
constexpr size_t MAX_QUERY_BATCH_SIZE = 10000;


size_t size() const { return data_.size(); }

// Support for dynamic operations - SimpleData already has resize()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Support for dynamic operations - SimpleData already has resize()
// Support for dynamic operations

data_type is templated at this point, we shouldn't mention SimpleData

Data& view_cluster() { return data_; }

public:
Data data_;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better change this to data_type for consistency? Some more examples of using Data instead of data_type exist throughout the file.


// Use centroid_assignment to compute assignments for this batch
centroid_assignment(
const_cast<Points&>(points), // centroid_assignment expects non-const
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, I'm wary of using const_cast. I peaked into centroid_assignment. I don't see why it can't accept const Data& data. Can we change that?

(If the reason is that MKL doesn't accept const pointers, we should const_cast just before dispatching into MKL instead.)


// Use a small block size for IVF clusters (1MB instead of 1GB default)
auto blocking_params = data::BlockingParameters{
.blocksize_bytes = lib::PowerOfTwo(20) // 2^20 = 1MB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this value?

) {
return kmeans_clustering_impl<BuildType>(
parameters, data, distance, threadpool, integer_type, std::move(logger)
parameters, data, distance, threadpool, integer_type, std::move(logger), train_only
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe split this into two functions? Along the lines of:

auto centroids = kmeans_centroids_impl<BuildType>(...);
std::vector<std::vector<uint32_t>> clusters(parameters.num_centroids_);
if (!train_only) {
  clusters = kmeans_clustering_impl<BuildType>(centroids, ...);
}
return std::make_tuple(centroids, std::move(clusters));

@ibhati
Copy link
Member Author

ibhati commented Dec 15, 2025

Just a few nitpicks here and there. Great contribution overall! Is there anything I can/should do to independently validate correctness? I didn't deep-dive into the IVF part itself.

Thanks @ahuber21 for reviewing this, i will fix your suggestions. Nothing in particular, a high level look is good for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants