
Add LM Studio provider (native /api/v1) #2

Open
jasonwarta wants to merge 3 commits into Pickle-Pixel:main from jasonwarta:feat/lmstudio-provider

Conversation

@jasonwarta

Summary

  • New LMStudioProvider targeting LM Studio's native REST API (/api/v1/models, /api/v1/chat) rather than the OpenAI-compat surface — so we get back tokens/sec, time-to-first-token, and model load time for diagnosing local-inference perf. Logged at debug level; doesn't leak LM Studio specifics into the provider-agnostic QueryResponse.
  • Local providers (Ollama + LM Studio) now register unconditionally instead of being gated on a one-shot boot-time health check. A local server that was down at startup used to stay dropped for the entire MCP process lifetime — now list_models/ask_model reach out live on each call (backed by the existing 30s cache), so a LAN/local server coming online mid-session just works.
  • 3s AbortController timeout on LM Studio healthCheck/listModels so a dead server can't stall list_models calls (see the sketch after this list).
  • Defaults to http://localhost:1234, override with LMSTUDIO_URL for LAN/remote instances.
  • Filters non-chat models (e.g. embeddings) via the type field from /api/v1/models.
  • Route explicitly with lmstudio/<model-key>; JIT-loading is delegated to LM Studio so not-loaded models load on first query.
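
Roughly the shape of the timeout and model filtering described above, as a sketch under assumptions: the helper names and the exact /api/v1/models response shape are illustrative, not the PR's actual code.

```typescript
// Sketch only: helper names and the /api/v1/models response shape are assumed.
const BASE_URL = process.env.LMSTUDIO_URL ?? "http://localhost:1234";

// 3s AbortController timeout so a dead server can't stall list_models.
async function fetchWithTimeout(path: string, timeoutMs = 3000): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(`${BASE_URL}${path}`, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

// Keep chat-capable models and drop embeddings via the `type` field that
// /api/v1/models reports (field names and values here are illustrative).
interface LMStudioModel {
  key: string;
  type: string; // e.g. "llm" | "embeddings"
}

async function listChatModels(): Promise<LMStudioModel[]> {
  const res = await fetchWithTimeout("/api/v1/models");
  if (!res.ok) return [];
  const body = (await res.json()) as { data?: LMStudioModel[] };
  return (body.data ?? []).filter((m) => m.type !== "embeddings");
}
```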

Why the local-provider fix is in this PR

Hit the startup-only registration bug while testing (LAN LM Studio box was asleep at first boot; `/mcp` reconnect didn't respawn the process, had to `claude mcp remove && add`). Ollama has the identical pattern. Fixing only one would leave two different registration behaviors for conceptually identical providers, so both are fixed together. Cloud providers already register based on env-var presence alone, so this brings local in line with cloud.

Known limitations

  • `/api/v1/chat`'s message-array input shape wasn't reverse-engineerable from error messages on the probed instance (string `input` works; the array form expects a content-part discriminator the schema doesn't advertise). `system_prompt` is prepended to the single-string input as a framed prefix (sketched after this list) — works correctly for single-turn prompts, which is all HydraMCP tools currently do.
  • Context size is discovered from `loaded_instances[0].config.context_length` / `max_context_length`; we don't force a specific context length on JIT load. Bump it in the LM Studio UI if you need a larger context.
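
A minimal sketch of the single-string workaround from the first limitation above; the framing markers and function name are assumed, not the PR's literal implementation.

```typescript
// Sketch only: the framing markers and function name are illustrative.
function buildInput(prompt: string, systemPrompt?: string): string {
  // /api/v1/chat accepts a plain string `input`; the message-array form
  // expects a content-part discriminator the probed instance didn't
  // advertise, so the system prompt is prepended as a framed prefix.
  if (!systemPrompt) return prompt;
  return `[system]\n${systemPrompt}\n[/system]\n\n${prompt}`;
}
```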

Test plan

  • `healthCheck()` against live LM Studio on the LAN returns true
  • `listModels()` returns 2 LLMs, filters out 1 embedding model via `type` field
  • `query()` against `mistralai/codestral-22b-v0.1` — PONG response, usage/latency reported
  • `query()` against `meta-llama-3.1-8b-instruct` — JIT-loads and responds, `model_load_time_seconds` captured
  • `system_prompt` + `temperature` + `max_tokens` all honored by the native endpoint
  • `npx tsc --noEmit` clean, `npm run build` clean
  • End-to-end through MCP client: `list_models` shows lmstudio section with both models
  • End-to-end through MCP client: `ask_model lmstudio/meta-llama-3.1-8b-instruct` → JIT-loads and responds in ~3.9s
  • End-to-end through MCP client: `ask_model lmstudio/mistralai/codestral-22b-v0.1` → evicts llama, loads codestral, responds in ~7.5s (one-model-at-a-time LM Studio setup)

🤖 Generated with Claude Code

jasonwarta and others added 2 commits April 13, 2026 16:02
New provider talks to LM Studio's native REST endpoints
(/api/v1/models and /api/v1/chat) rather than the OpenAI-compat
surface, so we surface local-inference detail (tokens/sec, ttft,
model load time) at debug level for diagnosing perf on your own
hardware.

Auto-registers when LM Studio is reachable. Defaults to
http://localhost:1234, override with LMSTUDIO_URL for LAN use.
Filters embedding models via the `type` field from /api/v1/models.
JIT-loading is delegated to LM Studio — not-loaded models load on
first query. Route explicitly with the `lmstudio/` prefix.

Known limitation: /api/v1/chat's message-array input shape isn't
documented on the probed instance, so system_prompt is prepended
to the single-string input form. Single-turn prompts (all HydraMCP
tools today) work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previously Ollama and LM Studio were only registered if a one-shot
boot-time health check succeeded. If the local server happened to be
down or unreachable at startup, the provider was silently dropped for
the entire lifetime of the MCP process — `/mcp` reconnect would not
fix it; only a full Claude Code restart or `claude mcp remove && add`
would. This surprised at least one user who had to remove+re-add to
recover after waking a LAN LM Studio machine.

Local servers restart independently of the MCP process, so gating
registration on a boot-time check is the wrong shape. This aligns them
with cloud providers, which register based on env-var presence alone.
Now listModels and query reach out live on each tool call (backed by
the existing 30s model-list cache) and Promise.allSettled in
MultiProvider means unreachable providers just contribute no models.
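
A sketch of that aggregation pattern; the type and method names are illustrative, not HydraMCP's actual identifiers.

```typescript
// Sketch only: Provider/ModelInfo shapes are assumed for illustration.
interface ModelInfo { provider: string; key: string }
interface Provider { listModels(): Promise<ModelInfo[]> }

async function listAllModels(providers: Provider[]): Promise<ModelInfo[]> {
  // Each provider does its own live check (and 3s timeout) inside listModels.
  const results = await Promise.allSettled(providers.map((p) => p.listModels()));
  // An unreachable provider rejects and simply contributes no models;
  // everything that resolved is flattened into one list.
  return results.flatMap((r) => (r.status === "fulfilled" ? r.value : []));
}
```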

To keep `list_models` snappy when a provider is down, add a 3s
AbortController timeout to LM Studio's healthCheck and listModels.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The existing snippet only showed the bare shell env var, but the
canonical way to pass config to a Claude Code MCP server is via
`claude mcp add -e` (matches the Quick Start example at the top of
the README). Document both so readers see the integration form first
and the raw-env form as an alternative.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>