feat: add --keep-models flag and QMD_KEEP_MODELS env var for MCP server #476
Open
When running as a long-lived daemon (qmd mcp or qmd mcp --http), the
current behaviour disposes *model weights* after 5 minutes of inactivity
(disposeModelsOnInactivity: true in createStore). That means the first
query after an idle period has to reload the GGUF file from disk, which
can take several seconds and briefly spikes VRAM allocation.
For one-off CLI use this is fine — you don't want to hold VRAM hostage
after a single query. But for a long-running MCP server the right
tradeoff is to keep the model weights in memory and only release the
*contexts* (the lighter per-session objects) on inactivity. That is
exactly what disposeModelsOnInactivity: false already does; we just had
no way to opt into it from the outside.
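The split between weights and contexts described above can be sketched roughly as follows. This is an illustration only: Resource, IdleSweepInput and sweepIdle are hypothetical names, not qmd's actual internals, and the sketch models the timer as a plain idle-time comparison rather than a real timeout.

```typescript
// Illustrative sketch only: these names are hypothetical, not qmd's internals.
interface Resource {
  name: string;
  disposed: boolean;
}

interface IdleSweepInput {
  idleMs: number;                     // time since the last query
  timeoutMs: number;                  // inactivity threshold (5 minutes in the text)
  disposeModelsOnInactivity: boolean; // false keeps the weights warm
  model: Resource;                    // heavy: the loaded GGUF weights
  contexts: Resource[];               // light: per-session objects
}

// Returns the names of everything released by this sweep.
function sweepIdle(input: IdleSweepInput): string[] {
  const released: string[] = [];
  if (input.idleMs < input.timeoutMs) return released; // still active, keep everything

  // Contexts are released on inactivity regardless of the model setting.
  for (const ctx of input.contexts) {
    if (!ctx.disposed) {
      ctx.disposed = true;
      released.push(ctx.name);
    }
  }

  // Weights are only released when the store was created with the flag set to true.
  if (input.disposeModelsOnInactivity && !input.model.disposed) {
    input.model.disposed = true;
    released.push(input.model.name);
  }
  return released;
}
```

With disposeModelsOnInactivity set to false, an idle sweep in this sketch releases only the contexts, so the next query skips the reload from disk.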
Changes:
- StoreOptions gains a keepModels?: boolean field (SDK-level knob)
- createStore reads QMD_KEEP_MODELS=1 as a fallback env var
- startMcpServer / startMcpHttpServer accept { keepModels } and forward it
- qmd mcp gains a --keep-models CLI flag that is also forwarded through
the daemon spawn args so it survives qmd mcp --http --daemon --keep-models
- Help text and inline docs updated
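A minimal sketch of how the option and the env var might combine, assuming the explicit keepModels option wins and QMD_KEEP_MODELS=1 is consulted only when the option is unset (the "fallback" wording above suggests this, but the exact precedence is an assumption). The helper name resolveDisposeModelsOnInactivity is hypothetical, the StoreOptions shown here includes only the new field, and the environment is passed in as a plain object to keep the example self-contained.

```typescript
// Only the field added by this change is shown; the real StoreOptions has more.
interface StoreOptions {
  keepModels?: boolean;
}

type Env = Record<string, string | undefined>;

// Hypothetical helper: maps the new knob onto disposeModelsOnInactivity.
function resolveDisposeModelsOnInactivity(options: StoreOptions, env: Env): boolean {
  // Assumption: an explicit option takes precedence; the env var is a fallback.
  const keepModels = options.keepModels ?? (env.QMD_KEEP_MODELS === "1");
  // Keeping models means NOT disposing them on inactivity.
  return !keepModels;
}
```

Under these assumptions, a caller that passes neither the option nor the env var still gets the existing behaviour (weights disposed after the idle timeout).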
Usage:
qmd mcp --keep-models # stdio, weights stay warm
qmd mcp --http --keep-models # HTTP foreground
qmd mcp --http --daemon --keep-models # HTTP daemon
QMD_KEEP_MODELS=1 qmd mcp # via env var (e.g. systemd)
Problem
When running qmd mcp or qmd mcp --http as a long-lived daemon, the current behaviour disposes model weights after 5 minutes of inactivity (disposeModelsOnInactivity: true hard-coded in createStore). This means the first query after an idle period has to reload the GGUF file from disk, which can take several seconds and causes a brief VRAM spike.
That trade-off makes sense for one-off CLI usage (you don't want to keep VRAM pinned after a single query), but not for a persistent MCP server where the whole point is low-latency responses.
The LlamaCppConfig already has disposeModelsOnInactivity: false as its default for exactly this reason (see the comment in llm.ts: "Keeping models loaded avoids repeated VRAM thrash"). We just had no way to opt into it from the outside.
Solution
Add a --keep-models flag (and QMD_KEEP_MODELS=1 env var) that keeps model weights in memory between queries while still releasing contexts (the lighter per-session VRAM objects) on inactivity.
Changes
- src/index.ts: StoreOptions gains keepModels?: boolean; createStore reads QMD_KEEP_MODELS env var and passes the flag through to LlamaCpp
- src/mcp/server.ts: startMcpServer and startMcpHttpServer accept { keepModels } and forward it to createStore
- src/cli/qmd.ts: qmd mcp gains --keep-models flag; it is also forwarded through the daemon spawn args; help text updated
Usage
qmd mcp --keep-models # stdio, weights stay warm
qmd mcp --http --keep-models # HTTP foreground
qmd mcp --http --daemon --keep-models # HTTP daemon
QMD_KEEP_MODELS=1 qmd mcp # via env var (e.g. systemd)
Why not flip the default?
The 5-minute timeout + disposeModelsOnInactivity: true is the right default for CLI use: most people run qmd query once and move on, and holding VRAM indefinitely would surprise them. The flag is an explicit opt-in for operators who know they are running a server.
Testing
Verified by reading the logic path end-to-end:
- --keep-models → keepModels = true → createStore({ keepModels: true }) → disposeModelsOnInactivity: false → model weights stay loaded, contexts are still released on inactivity timer
- QMD_KEEP_MODELS=1 works the same way when the flag is absent
- The daemon spawn path forwards --keep-models to child args so the flag survives qmd mcp --http --daemon --keep-models
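The daemon forwarding step in the logic path above could look roughly like this. It is a sketch under assumptions: buildDaemonChildArgs is a hypothetical helper, and the idea that the child is re-invoked as qmd mcp --http without --daemon is a guess at the spawn shape, not something the PR text confirms.

```typescript
// Hypothetical helper: builds the argument list for the spawned daemon child.
function buildDaemonChildArgs(parentArgs: string[]): string[] {
  // Assumption: the child runs the HTTP server in the foreground of its own process,
  // so --daemon is not passed on (otherwise it would try to daemonize again).
  const childArgs = ["mcp", "--http"];

  // Forward the opt-in so the detached server keeps model weights loaded too.
  if (parentArgs.includes("--keep-models")) {
    childArgs.push("--keep-models");
  }
  return childArgs;
}
```

Without the explicit forward, the parent process would parse --keep-models and then drop it, and the long-lived child would fall back to disposing weights after the idle timeout.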