feat: add --keep-models flag and QMD_KEEP_MODELS env var for MCP server by bagg-anon · Pull Request #476 · tobi/qmd

bagg-anon · 2026-03-27T19:47:20Z

Problem

When running qmd mcp or qmd mcp --http as a long-lived daemon, the current behaviour disposes model weights after 5 minutes of inactivity (disposeModelsOnInactivity: true hard-coded in createStore). This means the first query after an idle period has to reload the GGUF file from disk — which can take several seconds and causes a brief VRAM spike.

That trade-off makes sense for one-off CLI usage (you don't want to keep VRAM pinned after a single query), but not for a persistent MCP server where the whole point is low-latency responses.

The LlamaCppConfig already has disposeModelsOnInactivity: false as its default for exactly this reason (see the comment in llm.ts: "Keeping models loaded avoids repeated VRAM thrash"). We just had no way to opt into it from the outside.

Solution

Add a --keep-models flag (and QMD_KEEP_MODELS=1 env var) that keeps model weights in memory between queries while still releasing contexts (the lighter per-session VRAM objects) on inactivity.
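The difference between the two settings can be sketched with a small simulation. This is an illustrative model only: `ModelHost`, `InactivityPolicy`, and `reloadsAfterIdle` are hypothetical names, not qmd's or node-llama-cpp's API, and the sketch assumes the inactivity timer always releases contexts and releases weights only when `disposeModelsOnInactivity` is true.

```typescript
// Hypothetical stand-in for the inactivity behaviour described in this PR.
// None of these names come from qmd's source.
interface InactivityPolicy {
  disposeModelsOnInactivity: boolean;
}

class ModelHost {
  weightsLoaded = false;
  contextsOpen = 0;
  loadCount = 0; // number of times weights were (re)loaded from disk
  policy: InactivityPolicy;

  constructor(policy: InactivityPolicy) {
    this.policy = policy;
  }

  // Simulates serving a query: load weights if needed, then open a context.
  query(): void {
    if (!this.weightsLoaded) {
      this.weightsLoaded = true;
      this.loadCount += 1;
    }
    this.contextsOpen += 1;
  }

  // Simulates the inactivity timer firing.
  onInactive(): void {
    this.contextsOpen = 0; // contexts are released in both modes
    if (this.policy.disposeModelsOnInactivity) {
      this.weightsLoaded = false; // weights are dropped only in this mode
    }
  }
}

// Counts weight loads for the sequence: query, idle timeout, query.
function reloadsAfterIdle(policy: InactivityPolicy): number {
  const host = new ModelHost(policy);
  host.query();
  host.onInactive();
  host.query();
  return host.loadCount;
}
```

Under these assumptions, `reloadsAfterIdle({ disposeModelsOnInactivity: true })` returns 2 (the second query pays the reload cost), while the `false` case returns 1.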

Changes

| File | Change |
| --- | --- |
| `src/index.ts` | `StoreOptions` gains `keepModels?: boolean`; `createStore` reads `QMD_KEEP_MODELS` env var and passes the flag through to `LlamaCpp` |
| `src/mcp/server.ts` | `startMcpServer` and `startMcpHttpServer` accept `{ keepModels }` and forward it to `createStore` |
| `src/cli/qmd.ts` | `qmd mcp` gains `--keep-models` flag; it is also forwarded through the daemon spawn args; help text updated |
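A rough sketch of how the option and the env var might combine into the `disposeModelsOnInactivity` setting. This is not the code from the diff: `StoreOptionsSketch`, `resolveKeepModels`, and `toDisposeSetting` are hypothetical names, and two details are assumptions rather than facts from the PR: that an explicit `keepModels` value takes precedence over the env var, and that only the literal value `1` enables it. The environment is passed in as a plain object so the sketch stays self-contained.

```typescript
// Hypothetical shape; the PR only states that StoreOptions gains `keepModels?: boolean`.
interface StoreOptionsSketch {
  keepModels?: boolean;
}

type Env = Record<string, string | undefined>;

// Assumption: an explicit option wins; QMD_KEEP_MODELS=1 is the fallback.
function resolveKeepModels(options: StoreOptionsSketch, env: Env): boolean {
  if (options.keepModels !== undefined) {
    return options.keepModels;
  }
  return env["QMD_KEEP_MODELS"] === "1";
}

// Keeping models loaded is the inverse of disposing them on inactivity.
function toDisposeSetting(
  options: StoreOptionsSketch,
  env: Env
): { disposeModelsOnInactivity: boolean } {
  return { disposeModelsOnInactivity: !resolveKeepModels(options, env) };
}
```

With neither the option nor the env var set, the sketch yields `disposeModelsOnInactivity: true`, which matches the existing default behaviour described above.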

Usage

# stdio transport (e.g. Claude Desktop, Cursor)
qmd mcp --keep-models

# HTTP foreground
qmd mcp --http --keep-models

# HTTP daemon (flag is forwarded to the child process)
qmd mcp --http --daemon --keep-models

# Via environment variable (handy for systemd units or Docker)
QMD_KEEP_MODELS=1 qmd mcp

Why not flip the default?

The 5-minute timeout + disposeModelsOnInactivity: true is the right default for CLI use — most people run qmd query once and move on, and holding VRAM indefinitely would surprise them. The flag is an explicit opt-in for operators who know they are running a server.

Testing

Verified by reading the logic path end-to-end:

--keep-models → keepModels = true → createStore({ keepModels: true }) → disposeModelsOnInactivity: false → model weights stay loaded, contexts are still released on inactivity timer
QMD_KEEP_MODELS=1 works the same way when the flag is absent
Daemon spawn correctly appends --keep-models to child args so the flag survives qmd mcp --http --daemon --keep-models
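The daemon forwarding step in the last item could look roughly like the following. `buildDaemonChildArgs` is a hypothetical helper, and the assumption that the child process is re-invoked with `mcp --http` (without `--daemon`) is illustrative; the actual spawn arguments in `src/cli/qmd.ts` may differ.

```typescript
// Hypothetical helper: builds the child's argument list from the parent's.
// Assumes the child runs `mcp --http` in the foreground of its own process.
function buildDaemonChildArgs(parentArgs: readonly string[]): string[] {
  const childArgs = ["mcp", "--http"];
  // Forward the flag only when the parent was started with it.
  if (parentArgs.indexOf("--keep-models") !== -1) {
    childArgs.push("--keep-models");
  }
  return childArgs;
}
```

Without the explicit forwarding, the flag would be dropped when the parent exits and the child would fall back to the default disposal behaviour, unless `QMD_KEEP_MODELS` is set in the inherited environment.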

When running as a long-lived daemon (qmd mcp or qmd mcp --http), the current behaviour disposes *model weights* after 5 minutes of inactivity (disposeModelsOnInactivity: true in createStore). That means the first query after an idle period has to reload the GGUF file from disk, which can take several seconds and briefly spikes VRAM allocation.

For one-off CLI use this is fine: you don't want to hold VRAM hostage after a single query. But for a long-running MCP server the right tradeoff is to keep the model weights in memory and only release the *contexts* (the lighter per-session objects) on inactivity. That is exactly what disposeModelsOnInactivity: false already does; we just had no way to opt into it from the outside.

Changes:
- StoreOptions gains a keepModels?: boolean field (SDK-level knob)
- createStore reads QMD_KEEP_MODELS=1 as a fallback env var
- startMcpServer / startMcpHttpServer accept { keepModels } and forward it
- qmd mcp gains a --keep-models CLI flag that is also forwarded through the daemon spawn args so it survives qmd mcp --http --daemon --keep-models
- Help text and inline docs updated

Usage:
qmd mcp --keep-models                    # stdio, weights stay warm
qmd mcp --http --keep-models             # HTTP foreground
qmd mcp --http --daemon --keep-models    # HTTP daemon
QMD_KEEP_MODELS=1 qmd mcp                # via env var (e.g. systemd)

bagg-anon wants to merge 1 commit into tobi:main from bagg-anon:feat/keep-models-warm-for-mcp-server
