feat: Modal.com remote GPU inference backend #444

Draft
ofekby wants to merge 59 commits into tobi:main from ofekby:feat/modal-inference

Conversation

@ofekby ofekby commented Mar 19, 2026

Motivation

QMD relies on three local GGUF models (embedding, query expansion, reranking) via node-llama-cpp. Users without a dedicated GPU — or on machines where llama.cpp doesn't build cleanly — can't use hybrid search at all. This PR adds an optional remote GPU backend so those users get the same search quality without local hardware requirements.

What this does

Adds Modal.com remote GPU inference as a transparent alternative backend. When enabled, all three model operations (embed, expand, rerank) run on a T4 GPU via Modal instead of locally.

  • qmd modal deploy / status / test / destroy — full lifecycle CLI
  • Transparent routing — when modal.inference=true in config, qmd query and qmd embed use Modal automatically; no other commands change
  • Same models, same results — the remote backend runs identical GGUF files through llama-server
  • qmd status shows the active inference backend (local vs Modal) so users can tell at a glance which is in use

Architecture

┌─────────────┐    modal npm SDK    ┌──────────────────────────┐
│  qmd (JS)   │ ──────────────────► │  Modal container (T4)    │
│  ModalLLM   │                     │  ┌─ llama-server :8081   │ ← embed
│  implements │                     │  ├─ llama-server :8082   │ ← expand
│  LLM iface  │                     │  └─ llama-server :8083   │ ← rerank
└─────────────┘                     └──────────────────────────┘
  • modal/serve.py — Python Modal app running 3 llama-server subprocesses on separate ports with health checks
  • ghcr.io/ofekby/qmd-llama-server:b8179-sm75 — pre-built Docker image (llama.cpp b8179, CUDA, T4/sm_75), built via CI workflow
  • src/modal.ts — ModalBackend / ModalSession wrapping the Modal npm SDK with retry logic
  • src/llm.ts — ModalLLM implements the LLM interface; getDefaultLLM() routes based on config
  • GPU memory snapshots for fast cold starts (~6s vs ~40s without)
  • Region auto-detection picks the closest Modal region to minimize latency
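
The port-per-model layout in the diagram can be sketched as a small routing table. This is an illustrative sketch only: the helper name and structure are assumptions, and only the operations and port numbers come from the architecture above.

```python
# Illustrative sketch of the port-per-model routing used inside the
# Modal container. Only the ports and operation names come from the
# diagram above; the helper itself is hypothetical, not serve.py code.
LLAMA_SERVER_PORTS = {
    "embed": 8081,   # embedding model
    "expand": 8082,  # query-expansion model
    "rerank": 8083,  # reranking model
}

def server_url(op: str) -> str:
    """Return the local llama-server base URL for a model operation."""
    try:
        return f"http://127.0.0.1:{LLAMA_SERVER_PORTS[op]}"
    except KeyError:
        raise ValueError(f"unknown operation: {op!r}") from None
```

Each Modal method then proxies its HTTP request to the matching local llama-server instance.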

Key design decisions

| Decision | Rationale |
| --- | --- |
| llama-server over llama-cpp-python | llama-cpp-python is abandoned (last release Aug 2025) and doesn't support gemma-embedding |
| Pre-built Docker image via GHCR | CUDA compilation exceeds Modal's image build timeout |
| GPU memory snapshots | Without enable_gpu_snapshot, the snapshot phase runs CPU-only |
| No silent fallback | Hard fail when Modal is unreachable — surprises are worse than clear errors |
| Region auto-detection | Reduces latency for non-US users without manual config |

Test plan

  • 65 unit tests covering config, backend, CLI, region detection, and integration
  • Smoke test via qmd modal test (embed + generate round-trip)
  • Full hybrid query end-to-end: qmd query with expansion → embedding → reranking
  • Clean lifecycle test: deploy → test → query → destroy

Config

New keys in ~/.config/qmd/index.yml:

modal:
  inference: true          # enable/disable Modal backend
  gpu: "T4"               # GPU type
  scaledown_window: 15    # idle timeout (seconds)
  region: "us-east"       # auto-detected, or set manually
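
The routing rule these keys drive can be sketched as follows. The function names and dict shape are hypothetical (the real logic lives in src/llm.ts as getDefaultLLM); the defaults mirror the ones stated in the commits (inference=false, gpu="T4", scaledown_window=15).

```python
# Hypothetical sketch of the backend-selection rule: modal.inference=true
# routes inference through Modal, otherwise everything stays local.
# Defaults are the ones mentioned in the commit history.
DEFAULTS = {"inference": False, "gpu": "T4", "scaledown_window": 15}

def resolve_modal_config(config: dict) -> dict:
    """Merge user-supplied modal config over the defaults."""
    return {**DEFAULTS, **config.get("modal", {})}

def backend_name(config: dict) -> str:
    """Pick the active inference backend, as qmd status would report it."""
    return "modal" if resolve_modal_config(config)["inference"] else "local"
```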

🤖 Generated with Claude Code

ofekby and others added 30 commits March 19, 2026 11:46
Introduces optional remote GPU inference via Modal.com for users
without a local GPU. Uses llama-cpp-python on a T4 GPU with memory
snapshots for fast cold starts.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ps, chat templates)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds ModalConfig interface with inference, gpu, and scaledown_window
fields. Includes getModalConfig() and setModalConfig() helpers with
defaults (inference=false, gpu="T4", scaledown_window=15).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Python Modal app exposing embed(), generate(), rerank(), ping() on a
T4 GPU with memory snapshots. Models baked into image at build time.
CLI entry point for deploy/status/destroy.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ModalBackend class calls deployed Modal functions via the modal npm SDK
(gRPC). Includes retry logic (3 attempts for connection errors, 100ms
delay), lazy initialization, and ~/.modal.toml validation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
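
The retry behavior described in this commit (3 attempts for connection errors, 100ms delay) can be sketched like this. The function is illustrative; the real implementation is TypeScript in src/modal.ts.

```python
import time

# Sketch of the retry policy above: up to 3 attempts for connection-level
# errors, with a 100ms pause between attempts. Non-connection errors are
# not retried. Names are illustrative, not the src/modal.ts code.
def call_with_retry(fn, attempts: int = 3, delay_s: float = 0.1):
    last_err = None
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError as err:
            last_err = err
            if i < attempts - 1:
                time.sleep(delay_s)
    raise last_err
```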
Extracted modal CLI handler into src/cli/modal.ts with pre-flight
checks for python3, modal pip package, and ~/.modal.toml. Returns
testable result objects instead of calling process.exit directly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add ModalLLM class implementing the LLM interface that delegates
inference to the Modal backend while keeping prompt formatting local.
Modify withLLMSession to route through ModalSession when modal is active,
bypassing LLMSessionManager. Add validateModalConnection for startup
health checks and getDefaultLLM factory for backend selection.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add modal/ directory to package.json files array so serve.py and
requirements.txt ship with the npm package. Add changelog entry for
the Modal inference backend feature.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@app.cls must be outermost, @modal.concurrent inside it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ssing

The pip package nvidia-cuda-runtime-cu12 does not place .so files on the
system library path, causing llama-cpp-python to crash at runtime. Switch
to nvidia/cuda:12.4.0-runtime-ubuntu22.04 as the base image per Modal's
CUDA guide so CUDA runtime libraries are available as system packages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama-cpp-python is abandoned (last release Aug 2025) and doesn't support
the gemma-embedding architecture. Replace it with llama-server, which IS
llama.cpp and supports all architectures natively.

Key changes:
- Build llama-server from source (b8179, matching node-llama-cpp v3.17.1)
  during image build with CUDA support
- Run 3 llama-server instances on separate ports (8081 embed, 8082 expand,
  8083 rerank), each loading one model
- Use native /reranking endpoint with --pooling rank instead of the
  logprobs-based chat completion approach
- Proxy HTTP requests from Modal methods to local llama-server instances
- Enable memory snapshots to capture running subprocess state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
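
A hedged sketch of the request body handed to the native reranking endpoint mentioned above (a llama-server started with --pooling rank). The field names follow llama.cpp's rerank API as I understand it; verify against the server build actually deployed (b8179) before relying on them.

```python
import json

# Assumed shape of a llama-server rerank request: a query plus a list of
# candidate documents. The helper name is illustrative.
def build_rerank_request(query: str, documents: list[str]) -> str:
    return json.dumps({"query": query, "documents": documents})
```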
- Add Dockerfile.llama-server: multi-stage build compiling llama-server
  from llama.cpp b8179 with CUDA support
- Add CI workflow to build and push to ghcr.io/ofekby/qmd-llama-server
- Update serve.py to use the pre-built image instead of compiling
  from source during Modal deploy (which times out)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Dockerfile builds llama-server from llama.cpp b8179 with CUDA for
  sm_75 (T4) only, with libcuda stub symlink for Docker builds
- serve.py uses ghcr.io/ofekby/qmd-llama-server:b8179-sm75 as base
- CI workflow included for future rebuilds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
llama-server depends on libmtmd.so and other .so files built alongside
it. Copy the entire build/bin/ directory and configure ldconfig.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ENTRYPOINT interfered with Modal's run_function for model downloads.
Modal runs its own Python process inside the container.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Modal caches by tag name and doesn't re-pull on tag updates.
Pin to exact digest to ensure correct image.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GPU snapshots allow llama-server to load models onto T4 during
snapshot phase. Subsequent cold starts restore from GPU snapshot
instead of re-loading models (~10x faster).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
End users won't hit Modal's tag caching issue since they deploy fresh.
The digest pin was only needed to bust our development cache.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Calls ping() after deploy to force container spin-up and snapshot
creation. Users no longer pay the ~40s cost on their first query.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add force_build=True to bust Modal's stale image cache
- Add embedBatch() to ModalLLM (required by store.ts)
- Replace getDefaultLlamaCpp() with getDefaultLLM() in store.ts
- Fix embed response flattening in serve.py

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove force_build=True (cache busted successfully)
- Fix embed response flattening for nested [[...]] format
- Add embedBatch() to ModalLLM
- Route store.ts through getDefaultLLM()

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
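
The nested-response flattening fixed here can be sketched as follows: llama-server may return an embedding as a nested [[...]] list per input, which the backend normalizes to one flat vector per text. The function name is illustrative.

```python
# Sketch of the [[...]] flattening described above: accept either a flat
# vector or a single-element nested vector and return the flat form.
def flatten_embedding(raw: list) -> list:
    if raw and isinstance(raw[0], list):
        return raw[0]
    return raw
```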
- Set --ubatch-size 2048 on embed server (matching local model config)
  to handle longer inputs like HyDE expansions (>512 tokens)
- Batch all texts in a single /embedding request to reduce round-trips

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ofekby and others added 16 commits March 19, 2026 17:40
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Prevents test interference when bun runs multiple test files in the
same process with --preload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- docs/superpowers/specs/2026-03-20-modal-embed-design.md: full design for routing
  qmd embed through Modal when modal.inference=true, including tokenize and
  embed endpoints on Modal, LLM interface update, and generateEmbeddings refactor
- .gitignore: add exception for docs/superpowers/specs/*.md
- ModalLLM tokenize tests go in modal-integration.test.ts (already tests ModalLLM)
- generateEmbeddings routing verified by existing store.test.ts tests
- modal-backend.test.ts covers ModalBackend.tokenize()
Proxies /tokenize from llama-server embeddinggemma (port 8081) to
enable tokenization for Modal-only indexing.
Both LlamaCpp and ModalLLM already implement these methods; adding them
to the interface enables generateEmbeddings to route through getDefaultLLM.
Calls QMDInference.tokenize() on Modal, which proxies to llama-server's
/tokenize endpoint for the embeddinggemma model.
ModalLLM delegates to ModalBackend.tokenize() which calls the
QMDInference.tokenize() endpoint on Modal. ModalSession also exposes
these for consistency with LlamaCpp's session interface.
…ocumentByTokens

- Change Store.llm type from LlamaCpp to LLM
- Update generateEmbeddings to use getDefaultLLM() instead of getDefaultLlamaCpp()
- Update chunkDocumentByTokens to use getDefaultLLM() instead of getDefaultLlamaCpp()
- Generalize withLLMSessionForLlm to accept LLM and dispatch to ModalSession for ModalLLM
- Update expandQuery and rerank function signatures to accept LLM instead of LlamaCpp
The rerank function now uses getDefaultLLM() instead of getDefaultLlamaCpp(),
so the test needs to mock the correct function.
The reranking model (qwen3-reranker) was hitting the default ubatch-size
limit of 512 tokens, causing errors like:
  input (1020 tokens) is too large to process

This matches the local model config which uses RERANK_CONTEXT_SIZE 2048.
- Add detokenize endpoint to Modal serve.py (proxies to llama-server /detokenize)
- Add detokenize to ModalBackend and LLM interface
- Add detokenize implementation to ModalLLM
- Replace char-based truncation with token-level truncation in ModalLLM.rerank()
- Match local LlamaCpp behavior: maxDocTokens = ctxSize - overhead - queryTokens
- Add tests for detokenize in modal-backend.test.ts and modal-integration.test.ts

This fixes the 'Context size exceeded' error in Modal reranking by properly
truncating documents to fit within the rerank model's context window, matching
the local behavior exactly.
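
The truncation rule this commit implements (maxDocTokens = ctxSize - overhead - queryTokens) can be sketched on raw token lists. The real code tokenizes and detokenizes via the Modal endpoints; this stand-in only shows the arithmetic.

```python
# Sketch of the token-level truncation above: keep at most
# ctx_size - overhead - query_tokens document tokens.
def truncate_doc(doc_tokens: list[int], ctx_size: int,
                 query_tokens: int, overhead: int) -> list[int]:
    max_doc_tokens = ctx_size - overhead - query_tokens
    if max_doc_tokens <= 0:
        return []
    return doc_tokens[:max_doc_tokens]
```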
@ofekby ofekby marked this pull request as draft March 21, 2026 12:24
@ofekby ofekby force-pushed the feat/modal-inference branch from e6c0279 to 07cbbf2 on March 21, 2026 23:16
The Inference section now indicates whether queries route through
Modal (remote GPU) or the local device, so users can tell at a glance
which backend is active.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>