feat: add --keep-models flag and QMD_KEEP_MODELS env var for MCP server #476

Open
bagg-anon wants to merge 1 commit into tobi:main from bagg-anon:feat/keep-models-warm-for-mcp-server

Problem

When qmd mcp or qmd mcp --http runs as a long-lived daemon, model weights are currently disposed after 5 minutes of inactivity (disposeModelsOnInactivity: true is hard-coded in createStore). As a result, the first query after an idle period has to reload the GGUF file from disk, which can take several seconds and cause a brief VRAM spike.

That trade-off makes sense for one-off CLI usage (you don't want to keep VRAM pinned after a single query), but not for a persistent MCP server where the whole point is low-latency responses.

The LlamaCppConfig already has disposeModelsOnInactivity: false as its default for exactly this reason (see the comment in llm.ts: "Keeping models loaded avoids repeated VRAM thrash"). We just had no way to opt into it from the outside.

Solution

Add a --keep-models flag (and QMD_KEEP_MODELS=1 env var) that keeps model weights in memory between queries while still releasing contexts (the lighter per-session VRAM objects) on inactivity.
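The intended idle behaviour can be pictured with a small state model. This is only a sketch under stated assumptions: `LoadedState` and `onInactivityTimeout` are hypothetical names invented for illustration and do not appear in qmd; the sketch only encodes the rule described above, that contexts are released on the inactivity timer while weights are released only when disposeModelsOnInactivity is true.

```typescript
// Hypothetical model of what is resident in memory; not a qmd type.
interface LoadedState {
  modelsLoaded: boolean;   // GGUF weights (expensive to reload)
  contextsLoaded: boolean; // lighter per-session objects
}

// Assumed behaviour when the inactivity timer fires:
// contexts are always released; weights only if the flag says so.
function onInactivityTimeout(
  state: LoadedState,
  disposeModelsOnInactivity: boolean,
): LoadedState {
  return {
    contextsLoaded: false,
    modelsLoaded: disposeModelsOnInactivity ? false : state.modelsLoaded,
  };
}

const warm: LoadedState = { modelsLoaded: true, contextsLoaded: true };

// Default CLI behaviour: everything is released after the idle period.
const afterIdleDefault = onInactivityTimeout(warm, true);

// With --keep-models: weights stay resident, contexts are released.
const afterIdleKeepModels = onInactivityTimeout(warm, false);
```

Under this model, the next query after an idle period only needs to recreate contexts when --keep-models is set, rather than reloading the weights from disk.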

Changes

| File | Change |
| --- | --- |
| src/index.ts | StoreOptions gains keepModels?: boolean; createStore reads QMD_KEEP_MODELS env var and passes the flag through to LlamaCpp |
| src/mcp/server.ts | startMcpServer and startMcpHttpServer accept { keepModels } and forward it to createStore |
| src/cli/qmd.ts | qmd mcp gains --keep-models flag; it is also forwarded through the daemon spawn args; help text updated |
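As a rough illustration of the src/index.ts change, the option and the environment variable could be combined along these lines. This is a minimal sketch, not the actual diff: `resolveKeepModels` and `buildLlamaConfig` are hypothetical helpers, the interfaces are trimmed to the one field discussed here, and two details are assumptions: that an explicit option takes precedence over the env var, and that only the literal value `1` enables it.

```typescript
// Trimmed, illustrative shapes; the real types in qmd have more fields.
interface StoreOptions {
  keepModels?: boolean;
}

interface LlamaCppConfig {
  disposeModelsOnInactivity: boolean;
}

// The environment is passed in explicitly so the logic is easy to test.
type Env = Record<string, string | undefined>;

// Assumption: an explicit option wins; otherwise fall back to QMD_KEEP_MODELS=1.
function resolveKeepModels(options: StoreOptions, env: Env): boolean {
  if (options.keepModels !== undefined) {
    return options.keepModels;
  }
  return env["QMD_KEEP_MODELS"] === "1";
}

// keepModels is the inverse of disposeModelsOnInactivity.
function buildLlamaConfig(options: StoreOptions, env: Env): LlamaCppConfig {
  return { disposeModelsOnInactivity: !resolveKeepModels(options, env) };
}

const fromFlag = buildLlamaConfig({ keepModels: true }, {});
const fromEnv = buildLlamaConfig({}, { QMD_KEEP_MODELS: "1" });
const cliDefault = buildLlamaConfig({}, {});
```

Taking the environment as a parameter rather than reading a global keeps the helper pure; a caller in a Node process would pass its own environment object.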

Usage

# stdio transport (e.g. Claude Desktop, Cursor)
qmd mcp --keep-models

# HTTP foreground
qmd mcp --http --keep-models

# HTTP daemon (flag is forwarded to the child process)
qmd mcp --http --daemon --keep-models

# Via environment variable (handy for systemd units or Docker)
QMD_KEEP_MODELS=1 qmd mcp

Why not flip the default?

The 5-minute timeout + disposeModelsOnInactivity: true is the right default for CLI use — most people run qmd query once and move on, and holding VRAM indefinitely would surprise them. The flag is an explicit opt-in for operators who know they are running a server.

Testing

Verified by reading the logic path end-to-end:

  • --keep-models → keepModels = true → createStore({ keepModels: true }) → disposeModelsOnInactivity: false → model weights stay loaded, contexts are still released on the inactivity timer
  • QMD_KEEP_MODELS=1 works the same way when the flag is absent
  • Daemon spawn correctly appends --keep-models to child args so the flag survives qmd mcp --http --daemon --keep-models
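The daemon forwarding in the last bullet could look roughly like the following. This is a hedged sketch only: `buildDaemonArgs` is a hypothetical helper, and the assumption that the child is started as `mcp --http` without `--daemon` is mine, not taken from the diff in src/cli/qmd.ts.

```typescript
// Hypothetical helper: derive the child process arguments from the parent's.
// Assumption: the detached child runs the HTTP server in the foreground,
// so --daemon itself is not passed on.
function buildDaemonArgs(parentArgs: readonly string[]): string[] {
  const childArgs = ["mcp", "--http"];
  if (parentArgs.includes("--keep-models")) {
    // Forward the flag so the child also keeps weights loaded.
    childArgs.push("--keep-models");
  }
  return childArgs;
}

const withFlag = buildDaemonArgs(["mcp", "--http", "--daemon", "--keep-models"]);
const withoutFlag = buildDaemonArgs(["mcp", "--http", "--daemon"]);
```

Because a child process typically inherits its parent's environment, QMD_KEEP_MODELS=1 would usually reach the daemon without extra work; the explicit forwarding matters for the CLI flag.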

When running as a long-lived daemon (qmd mcp or qmd mcp --http), the
current behaviour disposes *model weights* after 5 minutes of inactivity
(disposeModelsOnInactivity: true in createStore).  That means the first
query after an idle period has to reload the GGUF file from disk, which
can take several seconds and briefly spikes VRAM allocation.

For one-off CLI use this is fine — you don't want to hold VRAM hostage
after a single query.  But for a long-running MCP server the right
tradeoff is to keep the model weights in memory and only release the
*contexts* (the lighter per-session objects) on inactivity.  That is
exactly what disposeModelsOnInactivity: false already does; we just had
no way to opt into it from the outside.

Changes:
- StoreOptions gains a keepModels?: boolean field (SDK-level knob)
- createStore reads QMD_KEEP_MODELS=1 as a fallback env var
- startMcpServer / startMcpHttpServer accept { keepModels } and forward it
- qmd mcp gains a --keep-models CLI flag that is also forwarded through
  the daemon spawn args so it survives qmd mcp --http --daemon --keep-models
- Help text and inline docs updated

Usage:
  qmd mcp --keep-models                       # stdio, weights stay warm
  qmd mcp --http --keep-models                # HTTP foreground
  qmd mcp --http --daemon --keep-models       # HTTP daemon
  QMD_KEEP_MODELS=1 qmd mcp                   # via env var (e.g. systemd)