feat: optional CUDA/GPU acceleration for embeddings#66

Open
sattva1 wants to merge 3 commits into harshkedia177:main from sattva1:sattva/feat/cuda-support

Conversation

sattva1 commented Apr 3, 2026

Summary

Closes #64.

Adds opt-in CUDA support for the embedding pipeline, reducing embedding time from minutes to seconds on GPU-capable machines.

Two activation paths:

  • --cuda CLI flag on analyze, watch, host, and serve
  • AXON_CUDA=1 environment variable (works across all commands without per-command flags)

Design

Uses a module-level configuration in embedder.py rather than threading a cuda parameter through every function signature. _get_model() reads the CUDA state at call time via _resolve_cuda(), which checks both the programmatic flag and the env var. This means all embedding call sites — pipeline, watcher, and search-time embed_query() from MCP/web — automatically use GPU when enabled.
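In sketch form (configure_cuda(), _resolve_cuda(), and AXON_CUDA are from this PR; the bodies and the accepted env-var value are illustrative assumptions, not the exact implementation):

import os

_cuda_enabled: bool | None = None  # set programmatically by the CLI

def configure_cuda(enabled: bool) -> None:
    # Called once by the CLI when --cuda is passed.
    global _cuda_enabled
    _cuda_enabled = enabled

def _resolve_cuda() -> bool:
    # Explicit flag wins; otherwise honour AXON_CUDA, so the env var
    # works across all commands without per-command flags.
    if _cuda_enabled is not None:
        return _cuda_enabled
    return os.environ.get("AXON_CUDA") == "1"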

CUDA validation

When CUDA is requested, _get_model() captures fastembed's RuntimeWarning on CUDAExecutionProvider fallback and converts it to a RuntimeError with actionable install instructions. CLI commands call validate_cuda() before the pipeline starts, so the error surfaces immediately rather than being swallowed by the pipeline's broad except Exception handler.
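Roughly how that capture-and-convert can look (a sketch under the assumption, per the description above, that the fallback warning text names CUDAExecutionProvider):

import warnings
from fastembed import TextEmbedding

def _load_with_fallback_check(model_name: str) -> TextEmbedding:
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        model = TextEmbedding(model_name=model_name, cuda=True)
    # onnxruntime reports the CPU fallback as a warning, not an error;
    # convert it so the failure surfaces before the pipeline starts.
    if any("CUDAExecutionProvider" in str(w.message) for w in caught):
        raise RuntimeError(
            "CUDA requested but unavailable; "
            "install GPU support with: pip install onnxruntime-gpu"
        )
    return model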

Changes

  • src/axon/core/embeddings/embedder.py — configure_cuda(), _resolve_cuda(), validate_cuda(); _get_model() now uses a (model_name, cuda) compound cache key and post-init CUDA fallback detection (sketched after this list).
  • src/axon/cli/main.py — _configure_and_validate_cuda() helper; --cuda flag on analyze, watch, host, serve.
  • tests/core/test_embedder.py — 15 new tests covering the flag, the env var, cache separation, and fallback detection.
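Continuing the sketches above, the compound cache key from the first bullet looks roughly like this (illustrative; the real _get_model() also runs the fallback detection):

_models: dict[tuple[str, bool], TextEmbedding] = {}

def _get_model(model_name: str) -> TextEmbedding:
    cuda = _resolve_cuda()     # read CUDA state at call time, not import time
    key = (model_name, cuda)   # CPU and GPU instances never alias each other
    if key not in _models:
        _models[key] = TextEmbedding(model_name=model_name, cuda=cuda)
    return _models[key]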

Usage

# Via CLI flag
axon analyze --cuda .

# Via environment variable (works with any command)
export AXON_CUDA=1
axon analyze .
axon watch
axon host

Vladislav Miller added 3 commits on April 3, 2026 at 13:15

Commit 1: …ng pipeline

- Add configure_cuda() / validate_cuda() public API and _resolve_cuda()
  to honour both the --cuda flag and AXON_CUDA env var
- Key model cache on (model_name, cuda) tuple to prevent CPU/GPU model
  aliasing; pass cuda=True to TextEmbedding and surface ONNX
  CUDAExecutionProvider fallback warnings as RuntimeError
- Expose --cuda flag on analyze, watch, host, and serve commands via
  shared _configure_and_validate_cuda() helper
Commit 2: …t OOM

fastembed defaults to Device.AUTO, which auto-detects and uses CUDA when
onnxruntime-gpu is installed. On GPUs with limited VRAM (e.g. 8GB), the
nomic model with batch_size=32 causes OOM via a 9.5GB BiasSoftmax
allocation. Pass cuda=False explicitly in the CPU path.

Also fix test isolation: reset _cuda_enabled and AXON_CUDA env var in
the autouse fixture to prevent state leaking between tests.
Commit 3:

The nomic-embed-text-v1.5 model has 12 attention heads and 2048-token
context, making each batch element's attention matrix ~192 MB. At
batch_size=32 this totals ~6.4 GB for attention alone, causing OOM on
both CPU (physical memory) and GPUs with <= 8 GB VRAM. Batch size 8
keeps peak memory under ~2 GB.

This was not an issue with the previous BAAI/bge-small-en-v1.5 model
(6 heads, 512-token limit) but the batch size was never adjusted when
the model was upgraded.

sattva1 commented Apr 3, 2026

Two additional fixes pushed:

1. Explicit cuda=False in CPU path (2365c99)

fastembed defaults to Device.AUTO, which auto-detects and uses CUDA when onnxruntime-gpu is installed — even without --cuda or AXON_CUDA. This caused silent OOM on GPUs with limited VRAM. The CPU path now explicitly passes cuda=False to override auto-detection.
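In sketch form (the actual call site is presumably inside _get_model() in embedder.py):

# Before: Device.AUTO could grab the GPU even without --cuda / AXON_CUDA
model = TextEmbedding(model_name=model_name)
# After: the CPU path pins the device explicitly
model = TextEmbedding(model_name=model_name, cuda=False)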

2. Batch size 32 → 8 (acac2e8)

The default batch size was never adjusted when the model changed from BAAI/bge-small-en-v1.5 (6 heads, 512-token context) to nomic-embed-text-v1.5 (12 heads, 2048-token context). Each batch element's attention matrix is ~192 MB with nomic — at batch_size=32 that's ~6.4 GB for attention alone, causing OOM on both CPU and 8 GB GPUs. Batch size 8 keeps peak attention memory under ~2 GB.
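The figures follow from the attention shape alone, assuming fp32 scores and full 2048-token sequences:

per_element = 12 * 2048 * 2048 * 4   # heads × seq² × fp32 bytes
print(per_element / 2**20)           # 192.0 → ~192 MB per batch element
print(32 * per_element / 1e9)        # ≈ 6.4 GB at batch_size=32
print(8 * per_element / 1e9)         # ≈ 1.6 GB at batch_size=8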
