Vector database search optimized using cuda
- ✅ SIFT1M dataset preprocessing (
.fvecs → .npy) - ✅ Optional vector normalization
- ✅ Custom CUDA-based KMeans:
kmeans_gpu.cu: core CUDA kernels for assignment and centroid updatekmeans_driver.cpp: runs clustering loop and saves centroids- This implementation uses atomicAdd to update centroids (safe but not optimal)
- Future optimization: shared memory, warp-level primitives, CUDA-based reduction
- ✅ Python IVF index builder (
build_ivf_index.py) using precomputed centroids - ✅ Test current implementation with SIFTSMALL
- ✅ Update project structure below
- ✅ Document the compilation commands
- ✅ Scripts for converting data, checking shape, ...
- ✅ Migrate all steps to GPU
- ✅ Benchmarking CPU vs GPU KMeans
Due to GitHub file size limits, we do not include the full dataset in this repo.
To download:
./scripts/download_sift1m.shPreprocess:
- Convert
.fvecsto.npy:
python3 scripts/convert_fvecs.py- (Optional) Normalize vectors:
python3 scripts/normalize_vectors.pyWe run KMeans clustering on the base vectors to compute K coarse centroids. These centroids serve as the anchors for grouping similar vectors in the inverted index.
Compilation command:
Notice that the test environment was based on Euler, with nvidia/cuda/11.8.0 module loaded. And we choose K=1024 for dataset sift1m.
nvcc -std=c++17 -O3 -Xcompiler -Wall -Xptxas -O3 -Iinclude -Iinclude/cnpy -o bin/kmeans_driver src/cuda/kmeans_driver.cu src/cuda/kmeans_gpu.cu src/cuda/IVFIndex.cu include/cnpy/cnpy.cpp -lcuda -lcudart -lzSample Slurm script test_kmeans_sift1m.sh
#!/usr/bin/env bash
#SBATCH -p instruction
#SBATCH --job-name=kmeans-sift1m
#SBATCH --output=kmeans-sift1m.out
#SBATCH --error=kmeans-sift1m.err
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00
echo "[INFO] Job started on $(hostname) at $(date)"
echo "[INFO] Working directory: $(pwd)"
echo "[INFO] Running binary on SIFT1M..."
# Check if binary exists
if [[ ! -x ./bin/kmeans_driver ]]; then
echo "[ERROR] ./bin/kmeans_driver not found or not executable!"
exit 1
fi
# Run the program with SIFT1M and K=1024
./bin/kmeans_driver data/processed/sift_base_1M.npy \
output/ivf_centroids.npy output/ivf_centroids.bin 1024 20
echo "[INFO] Job finished at $(date)"After running kmeans_driver, ivf_centroids.npy and ivf_centroids.bin are generated and stored in data/processed.
Use the trained centroids to assign each base vector to its closest cluster, producing the inverted index (ie. ivf_list_ids.npy and ivf_offsets.npy.
python3 scripts/build_ivf_index.pyTo inspect the resulting IVF and debug, you can use scripts/debug_ivf_index.py. This script helps you verify the IVF index built from KMeans centroids.
python3 scripts/debug_ivf_index.pyrepo/
├── README.md
├── .gitignore
├── data/
│ ├── raw/ # Original downloaded datasets (e.g., SIFT1M)
│ ├── processed/ # .npy vectors, normalized data, IVF index files
│ │ ├── sift_base.npy
│ │ ├── ivf_centroids.npy # Trained KMeans centroids
│ │ ├── ivf_centroids.bin # Binary version of centroids for GPU use
│ │ ├── ivf_list_ids.npy # Flattened list of vector indices by cluster
│ │ ├── ivf_offsets.npy # Offsets into list_ids for each cluster
│ │ ├── parse.py # Helper script that converts .npy to .txt
│ │ └── sift_query.npy # Query set
│ └── scripts/ # (Future) MS MARCO embedding scripts
├── src/
│ ├── cuda/
│ │ ├── kmeans_gpu.cu # Custom CUDA KMeans kernel code
│ │ ├── kmeans_driver.cpp # Host-side driver to run GPU KMeans
│ └── cpu_baseline/ # CPU baseline evaluation and quality comparison tools
│ │ ├── train_cpu_kmeans.py # sklearn KMeans training (with optional shared initialization)
│ │ ├── init_centroids.py # Extracts initial centroids from GPU run for shared initialization
│ │ ├── compare_centroids.py # L2 distance comparison between CPU and GPU centroids
│ │ ├── compare_clustering_loss.py # Clustering loss (inertia) comparison
│ │ ├── compute_imbalance_ratio.py # Computes imbalance ratio for cluster size distribution
│ │ ├── cluster_size_compare.py # Plots cluster size distribution per cluster ID
│ │ ├── plot_centroid_diff.py # Visualizes centroid-wise L2 distance histogram
│ │ ├── plot_cluster_sizes.py # Visualizes cluster size distribution (sorted by cluster ID)
│ │ └── plot_cluster_size_histogram.py# Histogram of cluster sizes (frequency)
│ └── cpu/ # Serialized cpu implementation
├── include/
│ └── kmeans_gpu.h # CUDA kernel function declarations
│ └── cnpy/ # Embedded C++ library for reading/writing .npy
│ ├── cnpy.h
│ └── cnpy.cpp
├── scripts/
│ ├── convert_fvecs.py # Convert SIFT1M .fvecs → .npy
│ ├── download_sift1m.sh # Script to download and extract SIFT1M dataset
│ ├── npy_to_bin.py # Convert .npy centroid file to binary format for GPU use
│ ├── normalize_vectors.py # Optional: normalize vectors
│ ├── build_ivf_index.py # Build inverted lists using IVF centroids
│ ├── debug_ivf_index.py # Inspect and validate IVF inverted lists
│ ├── benchmark_kmeans.py # Compare sklearn vs CUDA KMeans
│ └── check_sift_stats.py # Inspect shape/dtype of parsed vectors
├── tests/ # planned