feat: Modal.com remote GPU inference backend #444

Draft
ofekby wants to merge 59 commits into tobi:main from ofekby:feat/modal-inference

Conversation

@ofekby ofekby commented Mar 19, 2026

Motivation

QMD relies on three local GGUF models (embedding, query expansion, reranking) via node-llama-cpp. Users without a dedicated GPU — or on machines where llama.cpp doesn't build cleanly — can't use hybrid search at all. This PR adds an optional remote GPU backend so those users get the same search quality without local hardware requirements.

What this does

Adds Modal.com remote GPU inference as a transparent alternative backend. When enabled, all three model operations (embed, expand, rerank) run on a T4 GPU via Modal instead of locally.

  • qmd modal deploy / status / test / destroy — full lifecycle CLI
  • Transparent routing — when modal.inference=true in config, qmd query and qmd embed use Modal automatically; no other commands change
  • Same models, same results — the remote backend runs identical GGUF files through llama-server
  • qmd status shows the active inference backend (local vs Modal) so users can tell at a glance which is in use

Architecture

┌─────────────┐    modal npm SDK    ┌──────────────────────────┐
│  qmd (JS)   │ ──────────────────► │  Modal container (T4)    │
│  ModalLLM   │                     │  ┌─ llama-server :8081   │ ← embed
│  implements │                     │  ├─ llama-server :8082   │ ← expand
│  LLM iface  │                     │  └─ llama-server :8083   │ ← rerank
└─────────────┘                     └──────────────────────────┘
  • modal/serve.py — Python Modal app running 3 llama-server subprocesses on separate ports with health checks
  • ghcr.io/ofekby/qmd-llama-server:b8179-sm75 — pre-built Docker image (llama.cpp b8179, CUDA, T4/sm_75), built via CI workflow
  • src/modal.ts — ModalBackend / ModalSession wrapping the Modal npm SDK with retry logic
  • src/llm.ts — ModalLLM implements the LLM interface; getDefaultLLM() routes based on config
  • GPU memory snapshots for fast cold starts (~6s vs ~40s without)
  • Region auto-detection picks the closest Modal region to minimize latency
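
The port-per-model layout in the diagram can be sketched as a small routing table. This is an illustrative sketch only: the helper name and structure are assumptions, and only the operations and port numbers come from the architecture above.

```python
# Illustrative sketch of the port-per-model routing used inside the
# Modal container. Only the ports and operation names come from the
# diagram above; the helper itself is hypothetical, not serve.py code.
LLAMA_SERVER_PORTS = {
    "embed": 8081,   # embedding model
    "expand": 8082,  # query-expansion model
    "rerank": 8083,  # reranking model
}

def server_url(op: str) -> str:
    """Return the local llama-server base URL for a model operation."""
    try:
        return f"http://127.0.0.1:{LLAMA_SERVER_PORTS[op]}"
    except KeyError:
        raise ValueError(f"unknown operation: {op!r}") from None
```

Each Modal method then proxies its HTTP request to the matching local llama-server instance.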

Key design decisions

| Decision | Rationale |
| --- | --- |
| llama-server over llama-cpp-python | llama-cpp-python is abandoned (last release Aug 2025) and doesn't support gemma-embedding |
| Pre-built Docker image via GHCR | CUDA compilation exceeds Modal's image build timeout |
| GPU memory snapshots | Without enable_gpu_snapshot, the snapshot phase runs CPU-only |
| No silent fallback | Hard fail when Modal is unreachable — surprises are worse than clear errors |
| Region auto-detection | Reduces latency for non-US users without manual config |

Test plan

  • 65 unit tests covering config, backend, CLI, region detection, and integration
  • Smoke test via qmd modal test (embed + generate round-trip)
  • Full hybrid query end-to-end: qmd query with expansion → embedding → reranking
  • Clean lifecycle test: deploy → test → query → destroy

Config

New keys in ~/.config/qmd/index.yml:

modal:
  inference: true          # enable/disable Modal backend
  gpu: "T4"               # GPU type
  scaledown_window: 15    # idle timeout (seconds)
  region: "us-east"       # auto-detected, or set manually
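
The routing rule these keys drive can be sketched as follows. The function names and dict shape are hypothetical (the real logic lives in src/llm.ts as getDefaultLLM); the defaults mirror the ones stated in the commits (inference=false, gpu="T4", scaledown_window=15).

```python
# Hypothetical sketch of the backend-selection rule: modal.inference=true
# routes inference through Modal, otherwise everything stays local.
# Defaults are the ones mentioned in the commit history.
DEFAULTS = {"inference": False, "gpu": "T4", "scaledown_window": 15}

def resolve_modal_config(config: dict) -> dict:
    """Merge user-supplied modal config over the defaults."""
    return {**DEFAULTS, **config.get("modal", {})}

def backend_name(config: dict) -> str:
    """Pick the active inference backend, as qmd status would report it."""
    return "modal" if resolve_modal_config(config)["inference"] else "local"
```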

🤖 Generated with Claude Code

ofekby and others added 30 commits March 19, 2026 11:46
Introduces optional remote GPU inference via Modal.com for users
without a local GPU. Uses llama-cpp-python on a T4 GPU with memory
snapshots for fast cold starts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps, chat templates)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds ModalConfig interface with inference, gpu, and scaledown_window
fields. Includes getModalConfig() and setModalConfig() helpers with
defaults (inference=false, gpu="T4", scaledown_window=15).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python Modal app exposing embed(), generate(), rerank(), ping() on a
T4 GPU with memory snapshots. Models baked into image at build time.
CLI entry point for deploy/status/destroy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ModalBackend class calls deployed Modal functions via the modal npm SDK
(gRPC). Includes retry logic (3 attempts for connection errors, 100ms
delay), lazy initialization, and ~/.modal.toml validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
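
The retry behavior described in this commit (3 attempts for connection errors, 100ms delay) can be sketched like this. The function is illustrative; the real implementation is TypeScript in src/modal.ts.

```python
import time

# Sketch of the retry policy above: up to 3 attempts for connection-level
# errors, with a 100ms pause between attempts. Non-connection errors are
# not retried. Names are illustrative, not the src/modal.ts code.
def call_with_retry(fn, attempts: int = 3, delay_s: float = 0.1):
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError as err:
            last_err = err
            if i < attempts - 1:
                time.sleep(delay_s)
    raise last_err
```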
Extracted modal CLI handler into src/cli/modal.ts with pre-flight
checks for python3, modal pip package, and ~/.modal.toml. Returns
testable result objects instead of calling process.exit directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ModalLLM class implementing the LLM interface that delegates
inference to the Modal backend while keeping prompt formatting local.
Modify withLLMSession to route through ModalSession when modal is active,
bypassing LLMSessionManager. Add validateModalConnection for startup
health checks and getDefaultLLM factory for backend selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add modal/ directory to package.json files array so serve.py and
requirements.txt ship with the npm package. Add changelog entry for
the Modal inference backend feature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@app.cls must be outermost, @modal.concurrent inside it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssing

The pip package nvidia-cuda-runtime-cu12 does not place .so files on the
system library path, causing llama-cpp-python to crash at runtime. Switch
to nvidia/cuda:12.4.0-runtime-ubuntu22.04 as the base image per Modal's
CUDA guide so CUDA runtime libraries are available as system packages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama-cpp-python is abandoned (last release Aug 2025) and doesn't support
the gemma-embedding architecture. Replace it with llama-server, which IS
llama.cpp and supports all architectures natively.

Key changes:
- Build llama-server from source (b8179, matching node-llama-cpp v3.17.1)
  during image build with CUDA support
- Run 3 llama-server instances on separate ports (8081 embed, 8082 expand,
  8083 rerank), each loading one model
- Use native /reranking endpoint with --pooling rank instead of the
  logprobs-based chat completion approach
- Proxy HTTP requests from Modal methods to local llama-server instances
- Enable memory snapshots to capture running subprocess state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
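
A hedged sketch of the request body handed to the native reranking endpoint mentioned above (a llama-server started with --pooling rank). The field names follow llama.cpp's rerank API as I understand it; verify against the server build actually deployed (b8179) before relying on them.

```python
import json

# Assumed shape of a llama-server rerank request: a query plus a list of
# candidate documents. The helper name is illustrative.
def build_rerank_request(query: str, documents: list[str]) -> str:
    return json.dumps({"query": query, "documents": documents})
```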
- Add Dockerfile.llama-server: multi-stage build compiling llama-server
  from llama.cpp b8179 with CUDA support
- Add CI workflow to build and push to ghcr.io/ofekby/qmd-llama-server
- Update serve.py to use the pre-built image instead of compiling
  from source during Modal deploy (which times out)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dockerfile builds llama-server from llama.cpp b8179 with CUDA for
  sm_75 (T4) only, with libcuda stub symlink for Docker builds
- serve.py uses ghcr.io/ofekby/qmd-llama-server:b8179-sm75 as base
- CI workflow included for future rebuilds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama-server depends on libmtmd.so and other .so files built alongside
it. Copy the entire build/bin/ directory and configure ldconfig.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ENTRYPOINT interfered with Modal's run_function for model downloads.
Modal runs its own Python process inside the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal caches by tag name and doesn't re-pull on tag updates.
Pin to exact digest to ensure correct image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GPU snapshots allow llama-server to load models onto T4 during
snapshot phase. Subsequent cold starts restore from GPU snapshot
instead of re-loading models (~10x faster).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
End users won't hit Modal's tag caching issue since they deploy fresh.
The digest pin was only needed to bust our development cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Calls ping() after deploy to force container spin-up and snapshot
creation. Users no longer pay the ~40s cost on their first query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add force_build=True to bust Modal's stale image cache
- Add embedBatch() to ModalLLM (required by store.ts)
- Replace getDefaultLlamaCpp() with getDefaultLLM() in store.ts
- Fix embed response flattening in serve.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove force_build=True (cache busted successfully)
- Fix embed response flattening for nested [[...]] format
- Add embedBatch() to ModalLLM
- Route store.ts through getDefaultLLM()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
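
The nested-response flattening fixed here can be sketched as follows: llama-server may return an embedding as a nested [[...]] list per input, which the backend normalizes to one flat vector per text. The function name is illustrative.

```python
# Sketch of the [[...]] flattening described above: accept either a flat
# vector or a single-element nested vector and return the flat form.
def flatten_embedding(raw: list) -> list:
    if raw and isinstance(raw[0], list):
        return raw[0]
    return raw
```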
- Set --ubatch-size 2048 on embed server (matching local model config)
  to handle longer inputs like HyDE expansions (>512 tokens)
- Batch all texts in a single /embedding request to reduce round-trips

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ofekby and others added 16 commits March 19, 2026 17:40
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents test interference when bun runs multiple test files in the
same process with --preload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- docs/superpowers/specs/2026-03-20-modal-embed-design.md: full design for routing
  qmd embed through Modal when modal.inference=true, including tokenize and
  embed endpoints on Modal, LLM interface update, and generateEmbeddings refactor
- .gitignore: add exception for docs/superpowers/specs/*.md
- ModalLLM tokenize tests go in modal-integration.test.ts (already tests ModalLLM)
- generateEmbeddings routing verified by existing store.test.ts tests
- modal-backend.test.ts covers ModalBackend.tokenize()
Proxies /tokenize from llama-server embeddinggemma (port 8081) to
enable tokenization for Modal-only indexing.
Both LlamaCpp and ModalLLM already implement these methods; adding them
to the interface enables generateEmbeddings to route through getDefaultLLM.
Calls QMDInference.tokenize() on Modal, which proxies to llama-server's
/tokenize endpoint for the embeddinggemma model.
ModalLLM delegates to ModalBackend.tokenize() which calls the
QMDInference.tokenize() endpoint on Modal. ModalSession also exposes
these for consistency with LlamaCpp's session interface.
…ocumentByTokens

- Change Store.llm type from LlamaCpp to LLM
- Update generateEmbeddings to use getDefaultLLM() instead of getDefaultLlamaCpp()
- Update chunkDocumentByTokens to use getDefaultLLM() instead of getDefaultLlamaCpp()
- Generalize withLLMSessionForLlm to accept LLM and dispatch to ModalSession for ModalLLM
- Update expandQuery and rerank function signatures to accept LLM instead of LlamaCpp
The rerank function now uses getDefaultLLM() instead of getDefaultLlamaCpp(),
so the test needs to mock the correct function.
The reranking model (qwen3-reranker) was hitting the default ubatch-size
limit of 512 tokens, causing errors like:
  input (1020 tokens) is too large to process

This matches the local model config which uses RERANK_CONTEXT_SIZE 2048.
- Add detokenize endpoint to Modal serve.py (proxies to llama-server /detokenize)
- Add detokenize to ModalBackend and LLM interface
- Add detokenize implementation to ModalLLM
- Replace char-based truncation with token-level truncation in ModalLLM.rerank()
- Match local LlamaCpp behavior: maxDocTokens = ctxSize - overhead - queryTokens
- Add tests for detokenize in modal-backend.test.ts and modal-integration.test.ts

This fixes the 'Context size exceeded' error in Modal reranking by properly
truncating documents to fit within the rerank model's context window, matching
the local behavior exactly.
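
The truncation rule this commit implements (maxDocTokens = ctxSize - overhead - queryTokens) can be sketched on raw token lists. The real code tokenizes and detokenizes via the Modal endpoints; this stand-in only shows the arithmetic.

```python
# Sketch of the token-level truncation above: keep at most
# ctx_size - overhead - query_tokens document tokens.
def truncate_doc(doc_tokens: list[int], ctx_size: int,
                 query_tokens: int, overhead: int) -> list[int]:
    max_doc_tokens = ctx_size - overhead - query_tokens
    if max_doc_tokens <= 0:
        return []
    return doc_tokens[:max_doc_tokens]
```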
@ofekby ofekby marked this pull request as draft March 21, 2026 12:24
@ofekby ofekby force-pushed the feat/modal-inference branch from e6c0279 to 07cbbf2 on March 21, 2026 23:16
The Inference section now indicates whether queries route through
Modal (remote GPU) or the local device, so users can tell at a glance
which backend is active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>