Skip to content

rishabh0203iitr/cuda-vector-databases

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

29 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

cuda_vector_databases

Vector database search optimized using cuda


✅ Implemented So Far

  • ✅ SIFT1M dataset preprocessing (.fvecs → .npy)
  • ✅ Optional vector normalization
  • ✅ Custom CUDA-based KMeans:
    • kmeans_gpu.cu: core CUDA kernels for assignment and centroid update
    • kmeans_driver.cpp: runs clustering loop and saves centroids
    • This implementation uses atomicAdd to update centroids (safe but not optimal)
    • Future optimization: shared memory, warp-level primitives, CUDA-based reduction
  • ✅ Python IVF index builder (build_ivf_index.py) using precomputed centroids
  • ✅ Test current implementation with SIFTSMALL
  • ✅ Update project structure below
  • ✅ Document the compilation commands
  • ✅ Scripts for converting data, checking shape, ...
  • ✅ Migrate all steps to GPU
  • ✅ Benchmarking CPU vs GPU KMeans

Dataset: SIFT1M

Due to GitHub file size limits, we do not include the full dataset in this repo.

To download:

./scripts/download_sift1m.sh

Preprocess:

  1. Convert .fvecs to .npy:
python3 scripts/convert_fvecs.py
  1. (Optional) Normalize vectors:
python3 scripts/normalize_vectors.py

IVF Indexing (Inverted File Index)

Run custom GPU KMeans and save IVF centroids:

We run KMeans clustering on the base vectors to compute K coarse centroids. These centroids serve as the anchors for grouping similar vectors in the inverted index.

Compilation command: Notice that the test environment was based on Euler, with nvidia/cuda/11.8.0 module loaded. And we choose K=1024 for dataset sift1m.

nvcc -std=c++17 -O3  -Xcompiler -Wall -Xptxas -O3 -Iinclude -Iinclude/cnpy -o bin/kmeans_driver  src/cuda/kmeans_driver.cu  src/cuda/kmeans_gpu.cu src/cuda/IVFIndex.cu include/cnpy/cnpy.cpp -lcuda -lcudart -lz

Sample Slurm script test_kmeans_sift1m.sh

#!/usr/bin/env bash
#SBATCH -p instruction
#SBATCH --job-name=kmeans-sift1m
#SBATCH --output=kmeans-sift1m.out
#SBATCH --error=kmeans-sift1m.err
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

echo "[INFO] Job started on $(hostname) at $(date)"
echo "[INFO] Working directory: $(pwd)"
echo "[INFO] Running binary on SIFT1M..."

# Check if binary exists
if [[ ! -x ./bin/kmeans_driver ]]; then
  echo "[ERROR] ./bin/kmeans_driver not found or not executable!"
  exit 1
fi

# Run the program with SIFT1M and K=1024
./bin/kmeans_driver data/processed/sift_base_1M.npy \
    output/ivf_centroids.npy output/ivf_centroids.bin 1024 20

echo "[INFO] Job finished at $(date)"

After running kmeans_driver, ivf_centroids.npy and ivf_centroids.bin are generated and stored in data/processed.

Inverted File (IVF) Index Construction

Use the trained centroids to assign each base vector to its closest cluster, producing the inverted index (ie. ivf_list_ids.npy and ivf_offsets.npy.

python3 scripts/build_ivf_index.py

To inspect the resulting IVF and debug, you can use scripts/debug_ivf_index.py. This script helps you verify the IVF index built from KMeans centroids.

python3 scripts/debug_ivf_index.py

Project Structure

repo/
├── README.md
├── .gitignore
├── data/
│   ├── raw/                      # Original downloaded datasets (e.g., SIFT1M)
│   ├── processed/                # .npy vectors, normalized data, IVF index files
│   │   ├── sift_base.npy
│   │   ├── ivf_centroids.npy       # Trained KMeans centroids
│   │   ├── ivf_centroids.bin       # Binary version of centroids for GPU use
│   │   ├── ivf_list_ids.npy        # Flattened list of vector indices by cluster
│   │   ├── ivf_offsets.npy         # Offsets into list_ids for each cluster
│   │   ├── parse.py                # Helper script that converts .npy to .txt
│   │   └── sift_query.npy          # Query set
│   └── scripts/                  # (Future) MS MARCO embedding scripts
├── src/
│   ├── cuda/
│   │   ├── kmeans_gpu.cu         # Custom CUDA KMeans kernel code
│   │   ├── kmeans_driver.cpp     # Host-side driver to run GPU KMeans
│   └── cpu_baseline/             # CPU baseline evaluation and quality comparison tools
│   │   ├── train_cpu_kmeans.py           # sklearn KMeans training (with optional shared initialization)
│   │   ├── init_centroids.py             # Extracts initial centroids from GPU run for shared initialization
│   │   ├── compare_centroids.py          # L2 distance comparison between CPU and GPU centroids
│   │   ├── compare_clustering_loss.py    # Clustering loss (inertia) comparison
│   │   ├── compute_imbalance_ratio.py    # Computes imbalance ratio for cluster size distribution
│   │   ├── cluster_size_compare.py       # Plots cluster size distribution per cluster ID
│   │   ├── plot_centroid_diff.py         # Visualizes centroid-wise L2 distance histogram
│   │   ├── plot_cluster_sizes.py         # Visualizes cluster size distribution (sorted by cluster ID)
│   │   └── plot_cluster_size_histogram.py# Histogram of cluster sizes (frequency)
│   └── cpu/                      # Serialized cpu implementation
├── include/
│   └── kmeans_gpu.h              # CUDA kernel function declarations
│   └── cnpy/                     # Embedded C++ library for reading/writing .npy
│       ├── cnpy.h
│       └── cnpy.cpp
├── scripts/
│   ├── convert_fvecs.py          # Convert SIFT1M .fvecs → .npy
│   ├── download_sift1m.sh        # Script to download and extract SIFT1M dataset
│   ├── npy_to_bin.py             # Convert .npy centroid file to binary format for GPU use
│   ├── normalize_vectors.py      # Optional: normalize vectors
│   ├── build_ivf_index.py        # Build inverted lists using IVF centroids
│   ├── debug_ivf_index.py        # Inspect and validate IVF inverted lists
│   ├── benchmark_kmeans.py       # Compare sklearn vs CUDA KMeans
│   └── check_sift_stats.py       # Inspect shape/dtype of parsed vectors
├── tests/                        # planned

About

Vector database search optimized using cuda

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors