fix: selective de-streaming, threaded server, and config-driven GET paths #4

Open
Nannerz wants to merge 1 commit into crashr:main from Nannerz:fix/selective-destream-and-threading

Nannerz commented Mar 8, 2026

I fixed these issues with Opus as I ran into them. I run concurrent sessions, so I needed to add support for that. I added selective de-streaming so that streaming still works for requests that don't involve tool calls. "allowed_get_paths" wasn't being used, so I fixed that too, along with some traceback spam. Below is an LLM-generated summary.


Summary

Five fixes for issues found while running llama-stream in production as a reverse proxy in front of llama-swap:

  • Selective de-streaming: Only force stream: false when the request contains tools, tool_choice, or response_format (the fields that trigger the llama_grammar_init_impl grammar bug in llama-server). Normal chat completions now pass through with native streaming, preventing client timeouts on long responses.
  • Threaded HTTP server: Replace http.server.HTTPServer with a ThreadingMixIn subclass so concurrent requests don't block each other. The original single-threaded server handles one request at a time, so a long-lived connection (like an SSE endpoint) stalls every other request until it closes.
  • reasoning_content support: _simulate_streaming now includes reasoning_content (thinking/reasoning) from the backend response, streamed in chunks before content/tool_calls. Previously this field was silently dropped.
  • Config-driven GET allowlist: do_GET now reads from the allowed_get_paths config instead of hardcoding /v1/models.
  • BrokenPipeError handling: Client disconnections during streaming log a debug message instead of a full traceback.
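
The selective de-streaming rule above can be sketched as a small predicate plus a request rewrite. This is a minimal illustration, not the code in this PR: `DESTREAM_FIELDS`, `needs_destream`, and `prepare_upstream_body` are hypothetical names, and treating a field as "present" when its value is not `None` is an assumption.

```python
# Fields that, per the summary above, trigger the grammar bug in llama-server.
DESTREAM_FIELDS = ("tools", "tool_choice", "response_format")


def needs_destream(body: dict) -> bool:
    """Return True when the request carries any grammar-triggering field.

    Assumption: a field counts as present when its value is not None.
    """
    return any(body.get(field) is not None for field in DESTREAM_FIELDS)


def prepare_upstream_body(body: dict) -> tuple[dict, bool]:
    """Return (body to forward upstream, whether to simulate streaming).

    Requests without grammar-triggering fields pass through untouched, so
    the backend streams natively. Otherwise a copy is forwarded with
    stream=False, and the second value records whether the client asked
    for a stream that now has to be simulated from the full response.
    """
    if not needs_destream(body):
        return body, False
    client_wants_stream = bool(body.get("stream", False))
    upstream = dict(body)  # shallow copy so the caller's dict is not mutated
    upstream["stream"] = False
    return upstream, client_wants_stream
```

Returning the "simulate" flag alongside the rewritten body keeps the decision in one place: the handler can forward the request and then pick either a pass-through or a simulated-stream response path.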

Test plan

  • curl /v1/models returns instantly (was previously blocked by concurrent SSE connections)
  • Streaming chat completion without tools streams natively (no timeout)
  • Chat completion with tools de-streams correctly (grammar bug avoided)
  • Thinking/reasoning content preserved through de-streaming
  • Client disconnection during streaming logs cleanly (no traceback)
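
The first test-plan item depends on the threaded server and the config-driven GET allowlist working together. A minimal sketch of that combination follows; `CONFIG`, `ThreadedHTTPServer`, and `Handler` are hypothetical names, only the `allowed_get_paths` key comes from the PR, and the JSON body is a placeholder for what the real proxy would fetch from the backend.

```python
import http.server
import socketserver

# Hypothetical in-memory config; the real project loads its own config file.
CONFIG = {"allowed_get_paths": ["/v1/models"]}


class ThreadedHTTPServer(socketserver.ThreadingMixIn, http.server.HTTPServer):
    """Handle each request in its own thread so slow clients do not block others."""

    daemon_threads = True  # do not keep the process alive for open connections


class Handler(http.server.BaseHTTPRequestHandler):
    def do_GET(self):
        # Compare only the path component, ignoring any query string.
        path = self.path.split("?", 1)[0]
        if path not in CONFIG["allowed_get_paths"]:
            self.send_error(404, "Path not allowed")
            return
        # Placeholder payload; a real proxy would forward the request upstream.
        body = b'{"object": "list", "data": []}'
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Silence per-request logging in this sketch.
        pass
```

Calling `ThreadedHTTPServer(("127.0.0.1", 8080), Handler).serve_forever()` would start it (the address is an arbitrary example). Since Python 3.7 the standard library also ships `http.server.ThreadingHTTPServer`, which is an equivalent ready-made class.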

…ig-driven GET paths

- Only force stream:false when tools/tool_choice/response_format are present
  (avoids client timeouts on normal streaming completions)
- Use ThreadingMixIn so concurrent requests don't block each other
  (fixes deadlock when long-lived SSE connections are active)
- Stream reasoning_content (thinking) in chunks before content/tool_calls
  (previously silently dropped by _simulate_streaming)
- Read allowed_get_paths from config instead of hardcoding /v1/models
- Catch BrokenPipeError on client disconnect instead of full traceback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
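
The reasoning_content ordering and the BrokenPipeError handling described in the commit message can be sketched as below. This is an assumption-laden illustration rather than the PR's `_simulate_streaming`: the function names, the 32-character chunk size, and the simplified chunk shape (no `id`, `model`, or per-tool-call `index` fields) are all hypothetical.

```python
import json
import logging

log = logging.getLogger("llama-stream-sketch")  # hypothetical logger name


def chunk_text(text, size=32):
    """Split text into pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def simulate_streaming(message, chunk_size=32):
    """Yield SSE events for a complete (non-streamed) assistant message.

    Order follows the PR description: reasoning_content first, then
    content, then tool_calls, then the [DONE] terminator.
    """
    for field in ("reasoning_content", "content"):
        for piece in chunk_text(message.get(field) or "", chunk_size):
            delta = {"choices": [{"index": 0, "delta": {field: piece}}]}
            yield f"data: {json.dumps(delta)}\n\n"
    if message.get("tool_calls"):
        # Simplification: emit all tool calls in a single delta.
        delta = {"choices": [{"index": 0,
                              "delta": {"tool_calls": message["tool_calls"]}}]}
        yield f"data: {json.dumps(delta)}\n\n"
    yield "data: [DONE]\n\n"


def write_events(wfile, events):
    """Write events to the client; return False if the client disconnected."""
    try:
        for event in events:
            wfile.write(event.encode("utf-8"))
            wfile.flush()
    except BrokenPipeError:
        # A client hanging up mid-stream is routine, so log quietly.
        log.debug("client disconnected during streaming")
        return False
    return True
```

Inside a request handler this would be called as `write_events(self.wfile, simulate_streaming(message))`, where `message` is the assistant message taken from the backend's non-streamed response.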