cuda_vector_databases

Vector database search optimized using cuda

✅ Implemented So Far

✅ SIFT1M dataset preprocessing (.fvecs → .npy)
✅ Optional vector normalization
✅ Custom CUDA-based KMeans:
- kmeans_gpu.cu: core CUDA kernels for assignment and centroid update
- kmeans_driver.cpp: runs clustering loop and saves centroids
- This implementation uses atomicAdd to update centroids (safe but not optimal)
- Future optimization: shared memory, warp-level primitives, CUDA-based reduction
✅ Python IVF index builder (build_ivf_index.py) using precomputed centroids
✅ Test current implementation with SIFTSMALL
✅ Update project structure below
✅ Document the compilation commands
✅ Scripts for converting data, checking shape, ...
✅ Migrate all steps to GPU
✅ Benchmarking CPU vs GPU KMeans

Dataset: SIFT1M

Due to GitHub file size limits, we do not include the full dataset in this repo.

To download:

./scripts/download_sift1m.sh

Preprocess:

Convert .fvecs to .npy:

python3 scripts/convert_fvecs.py

(Optional) Normalize vectors:

python3 scripts/normalize_vectors.py

IVF Indexing (Inverted File Index)

Run custom GPU KMeans and save IVF centroids:

We run KMeans clustering on the base vectors to compute K coarse centroids. These centroids serve as the anchors for grouping similar vectors in the inverted index.

Compilation command: Notice that the test environment was based on Euler, with nvidia/cuda/11.8.0 module loaded. And we choose K=1024 for dataset sift1m.

nvcc -std=c++17 -O3  -Xcompiler -Wall -Xptxas -O3 -Iinclude -Iinclude/cnpy -o bin/kmeans_driver  src/cuda/kmeans_driver.cu  src/cuda/kmeans_gpu.cu src/cuda/IVFIndex.cu include/cnpy/cnpy.cpp -lcuda -lcudart -lz

Sample Slurm script test_kmeans_sift1m.sh

#!/usr/bin/env bash
#SBATCH -p instruction
#SBATCH --job-name=kmeans-sift1m
#SBATCH --output=kmeans-sift1m.out
#SBATCH --error=kmeans-sift1m.err
#SBATCH --ntasks=1
#SBATCH --gres=gpu:1
#SBATCH --time=00:15:00

echo "[INFO] Job started on $(hostname) at $(date)"
echo "[INFO] Working directory: $(pwd)"
echo "[INFO] Running binary on SIFT1M..."

# Check if binary exists
if [[ ! -x ./bin/kmeans_driver ]]; then
  echo "[ERROR] ./bin/kmeans_driver not found or not executable!"
  exit 1
fi

# Run the program with SIFT1M and K=1024
./bin/kmeans_driver data/processed/sift_base_1M.npy \
    output/ivf_centroids.npy output/ivf_centroids.bin 1024 20

echo "[INFO] Job finished at $(date)"

After running kmeans_driver, ivf_centroids.npy and ivf_centroids.bin are generated and stored in data/processed.

Inverted File (IVF) Index Construction

Use the trained centroids to assign each base vector to its closest cluster, producing the inverted index (ie. ivf_list_ids.npy and ivf_offsets.npy.

python3 scripts/build_ivf_index.py

To inspect the resulting IVF and debug, you can use scripts/debug_ivf_index.py. This script helps you verify the IVF index built from KMeans centroids.

python3 scripts/debug_ivf_index.py

Project Structure

repo/
├── README.md
├── .gitignore
├── data/
│   ├── raw/                      # Original downloaded datasets (e.g., SIFT1M)
│   ├── processed/                # .npy vectors, normalized data, IVF index files
│   │   ├── sift_base.npy
│   │   ├── ivf_centroids.npy       # Trained KMeans centroids
│   │   ├── ivf_centroids.bin       # Binary version of centroids for GPU use
│   │   ├── ivf_list_ids.npy        # Flattened list of vector indices by cluster
│   │   ├── ivf_offsets.npy         # Offsets into list_ids for each cluster
│   │   ├── parse.py                # Helper script that converts .npy to .txt
│   │   └── sift_query.npy          # Query set
│   └── scripts/                  # (Future) MS MARCO embedding scripts
├── src/
│   ├── cuda/
│   │   ├── kmeans_gpu.cu         # Custom CUDA KMeans kernel code
│   │   ├── kmeans_driver.cpp     # Host-side driver to run GPU KMeans
│   └── cpu_baseline/             # CPU baseline evaluation and quality comparison tools
│   │   ├── train_cpu_kmeans.py           # sklearn KMeans training (with optional shared initialization)
│   │   ├── init_centroids.py             # Extracts initial centroids from GPU run for shared initialization
│   │   ├── compare_centroids.py          # L2 distance comparison between CPU and GPU centroids
│   │   ├── compare_clustering_loss.py    # Clustering loss (inertia) comparison
│   │   ├── compute_imbalance_ratio.py    # Computes imbalance ratio for cluster size distribution
│   │   ├── cluster_size_compare.py       # Plots cluster size distribution per cluster ID
│   │   ├── plot_centroid_diff.py         # Visualizes centroid-wise L2 distance histogram
│   │   ├── plot_cluster_sizes.py         # Visualizes cluster size distribution (sorted by cluster ID)
│   │   └── plot_cluster_size_histogram.py# Histogram of cluster sizes (frequency)
│   └── cpu/                      # Serialized cpu implementation
├── include/
│   └── kmeans_gpu.h              # CUDA kernel function declarations
│   └── cnpy/                     # Embedded C++ library for reading/writing .npy
│       ├── cnpy.h
│       └── cnpy.cpp
├── scripts/
│   ├── convert_fvecs.py          # Convert SIFT1M .fvecs → .npy
│   ├── download_sift1m.sh        # Script to download and extract SIFT1M dataset
│   ├── npy_to_bin.py             # Convert .npy centroid file to binary format for GPU use
│   ├── normalize_vectors.py      # Optional: normalize vectors
│   ├── build_ivf_index.py        # Build inverted lists using IVF centroids
│   ├── debug_ivf_index.py        # Inspect and validate IVF inverted lists
│   ├── benchmark_kmeans.py       # Compare sklearn vs CUDA KMeans
│   └── check_sift_stats.py       # Inspect shape/dtype of parsed vectors
├── tests/                        # planned

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

cuda_vector_databases

✅ Implemented So Far

Dataset: SIFT1M

IVF Indexing (Inverted File Index)

Run custom GPU KMeans and save IVF centroids:

Inverted File (IVF) Index Construction

Project Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
include		include
scripts		scripts
src		src
.gitignore		.gitignore
README.md		README.md
generateData.sh		generateData.sh
run.sh		run.sh

Folders and files

Latest commit

History

Repository files navigation

cuda_vector_databases

✅ Implemented So Far

Dataset: SIFT1M

IVF Indexing (Inverted File Index)

Run custom GPU KMeans and save IVF centroids:

Inverted File (IVF) Index Construction

Project Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages