feat: optional CUDA/GPU acceleration for embeddings#66

Open
sattva1 wants to merge 3 commits into harshkedia177:main from sattva1:sattva/feat/cuda-support

Conversation

sattva1 commented Apr 3, 2026

Summary

Closes #64.

Adds opt-in CUDA support for the embedding pipeline, reducing embedding time from minutes to seconds on GPU-capable machines.

Two activation paths:

  • --cuda CLI flag on analyze, watch, host, and serve
  • AXON_CUDA=1 environment variable (works across all commands without per-command flags)

Design

Uses a module-level configuration in embedder.py rather than threading a cuda parameter through every function signature. _get_model() reads the CUDA state at call time via _resolve_cuda(), which checks both the programmatic flag and the env var. This means all embedding call sites — pipeline, watcher, and search-time embed_query() from MCP/web — automatically use GPU when enabled.
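In sketch form (configure_cuda(), _resolve_cuda(), and AXON_CUDA are from this PR; the bodies and the accepted env-var value are illustrative assumptions, not the exact implementation):

import os

_cuda_enabled: bool | None = None  # set programmatically by the CLI

def configure_cuda(enabled: bool) -> None:
    # Called once by the CLI when --cuda is passed.
    global _cuda_enabled
    _cuda_enabled = enabled

def _resolve_cuda() -> bool:
    # Explicit flag wins; otherwise honour AXON_CUDA, so the env var
    # works across all commands without per-command flags.
    if _cuda_enabled is not None:
        return _cuda_enabled
    return os.environ.get("AXON_CUDA") == "1"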

CUDA validation

When CUDA is requested, _get_model() captures fastembed's RuntimeWarning on CUDAExecutionProvider fallback and converts it to a RuntimeError with actionable install instructions. CLI commands call validate_cuda() before the pipeline starts, so the error surfaces immediately rather than being swallowed by the pipeline's broad except Exception handler.
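Roughly how that capture-and-convert can look (a sketch under the assumption, per the description above, that the fallback warning text names CUDAExecutionProvider):

import warnings
from fastembed import TextEmbedding

def _load_with_fallback_check(model_name: str) -> TextEmbedding:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = TextEmbedding(model_name=model_name, cuda=True)
    # onnxruntime reports the CPU fallback as a warning, not an error;
    # convert it so the failure surfaces before the pipeline starts.
    if any("CUDAExecutionProvider" in str(w.message) for w in caught):
        raise RuntimeError(
            "CUDA requested but unavailable; "
            "install GPU support with: pip install onnxruntime-gpu"
        )
    return model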

Changes

  • src/axon/core/embeddings/embedder.py — configure_cuda(), _resolve_cuda(), validate_cuda(); _get_model() now uses a (model_name, cuda) compound cache key and post-init CUDA fallback detection (sketched after this list).
  • src/axon/cli/main.py — _configure_and_validate_cuda() helper; --cuda flag on analyze, watch, host, serve.
  • tests/core/test_embedder.py — 15 new tests covering the flag, the env var, cache separation, and fallback detection.
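Continuing the sketches above, the compound cache key from the first bullet looks roughly like this (illustrative; the real _get_model() also runs the fallback detection):

_models: dict[tuple[str, bool], TextEmbedding] = {}

def _get_model(model_name: str) -> TextEmbedding:
    cuda = _resolve_cuda()     # read CUDA state at call time, not import time
    key = (model_name, cuda)   # CPU and GPU instances never alias each other
    if key not in _models:
        _models[key] = TextEmbedding(model_name=model_name, cuda=cuda)
    return _models[key]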

Usage

# Via CLI flag
axon analyze --cuda .

# Via environment variable (works with any command)
export AXON_CUDA=1
axon analyze .
axon watch
axon host

Vladislav Miller added 3 commits on April 3, 2026 at 13:15

Commit 1: …ng pipeline

- Add configure_cuda() / validate_cuda() public API and _resolve_cuda()
  to honour both the --cuda flag and AXON_CUDA env var
- Key model cache on (model_name, cuda) tuple to prevent CPU/GPU model
  aliasing; pass cuda=True to TextEmbedding and surface ONNX
  CUDAExecutionProvider fallback warnings as RuntimeError
- Expose --cuda flag on analyze, watch, host, and serve commands via
  shared _configure_and_validate_cuda() helper
Commit 2: …t OOM

fastembed defaults to Device.AUTO, which auto-detects and uses CUDA when
onnxruntime-gpu is installed. On GPUs with limited VRAM (e.g. 8GB), the
nomic model with batch_size=32 causes OOM via a 9.5GB BiasSoftmax
allocation. Pass cuda=False explicitly in the CPU path.

Also fix test isolation: reset _cuda_enabled and AXON_CUDA env var in
the autouse fixture to prevent state leaking between tests.
Commit 3:

The nomic-embed-text-v1.5 model has 12 attention heads and 2048-token
context, making each batch element's attention matrix ~192 MB. At
batch_size=32 this totals ~6.4 GB for attention alone, causing OOM on
both CPU (physical memory) and GPUs with <= 8 GB VRAM. Batch size 8
keeps peak memory under ~2 GB.

This was not an issue with the previous BAAI/bge-small-en-v1.5 model
(6 heads, 512-token limit) but the batch size was never adjusted when
the model was upgraded.

sattva1 commented Apr 3, 2026

Two additional fixes pushed:

1. Explicit cuda=False in CPU path (2365c99)

fastembed defaults to Device.AUTO, which auto-detects and uses CUDA when onnxruntime-gpu is installed — even without --cuda or AXON_CUDA. This caused silent OOM on GPUs with limited VRAM. The CPU path now explicitly passes cuda=False to override auto-detection.
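In sketch form (the actual call site is presumably inside _get_model() in embedder.py):

# Before: Device.AUTO could grab the GPU even without --cuda / AXON_CUDA
model = TextEmbedding(model_name=model_name)
# After: the CPU path pins the device explicitly
model = TextEmbedding(model_name=model_name, cuda=False)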

2. Batch size 32 → 8 (acac2e8)

The default batch size was never adjusted when the model changed from BAAI/bge-small-en-v1.5 (6 heads, 512-token context) to nomic-embed-text-v1.5 (12 heads, 2048-token context). Each batch element's attention matrix is ~192 MB with nomic — at batch_size=32 that's ~6.4 GB for attention alone, causing OOM on both CPU and 8 GB GPUs. Batch size 8 keeps peak attention memory under ~2 GB.
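The figures follow from the attention shape alone, assuming fp32 scores and full 2048-token sequences:

per_element = 12 * 2048 * 2048 * 4   # heads × seq² × fp32 bytes
print(per_element / 2**20)           # 192.0 → ~192 MB per batch element
print(32 * per_element / 1e9)        # ≈ 6.4 GB at batch_size=32
print(8 * per_element / 1e9)         # ≈ 1.6 GB at batch_size=8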
