fix: selective de-streaming, threaded server, and config-driven GET paths by Nannerz · Pull Request #4 · crashr/llama-stream

Nannerz · 2026-03-08T06:21:21Z

Issues fixed with Opus as I found them. I use concurrent sessions so needed to add support for that. Added selective de-streaming so streaming still works for things outside of tool calls. "allowed_get_paths" wasn't getting used so fixed that. Also fixed some traceback spam. Below is an LLM generated summary.

Summary

Five improvements discovered while running llama-stream in production as a reverse proxy in front of llama-swap:

Selective de-streaming: Only force stream: false when the request contains tools, tool_choice, or response_format (the fields that trigger the llama_grammar_init_impl grammar bug in llama-server). Normal chat completions now pass through with native streaming, preventing client timeouts on long responses.
Threaded HTTP server: Replace http.server.HTTPServer with a ThreadingMixIn subclass so concurrent requests don't block each other. The original single-threaded server deadlocks when a long-lived connection (like an SSE endpoint) is active.
reasoning_content support: _simulate_streaming now includes reasoning_content (thinking/reasoning) from the backend response, streamed in chunks before content/tool_calls. Previously this field was silently dropped.
Config-driven GET allowlist: do_GET now reads from the allowed_get_paths config instead of hardcoding /v1/models.
BrokenPipeError handling: Client disconnections during streaming log a debug message instead of a full traceback.

Test plan

curl /v1/models returns instantly (was previously blocked by concurrent SSE connections)
Streaming chat completion without tools streams natively (no timeout)
Chat completion with tools de-streams correctly (grammar bug avoided)
Thinking/reasoning content preserved through de-streaming
Client disconnection during streaming logs cleanly (no traceback)

…ig-driven GET paths - Only force stream:false when tools/tool_choice/response_format are present (avoids client timeouts on normal streaming completions) - Use ThreadingMixIn so concurrent requests don't block each other (fixes deadlock when long-lived SSE connections are active) - Stream reasoning_content (thinking) in chunks before content/tool_calls (previously silently dropped by _simulate_streaming) - Read allowed_get_paths from config instead of hardcoding /v1/models - Catch BrokenPipeError on client disconnect instead of full traceback Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: selective de-streaming, threaded server, and config-driven GET paths#4

fix: selective de-streaming, threaded server, and config-driven GET paths#4
Nannerz wants to merge 1 commit intocrashr:mainfrom
Nannerz:fix/selective-destream-and-threading

Nannerz commented Mar 8, 2026 •

edited by crashr

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Nannerz commented Mar 8, 2026 • edited by crashr Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Nannerz commented Mar 8, 2026 •

edited by crashr

Loading