Fix streaming thinking tags split across multiple chunks #3206
Conversation
Force-pushed fd1d0c2 to b04532c.
@dsfaccini Thanks for working on this David. I think we need to test every plausible combination of strings and make sure we never lose text or events.
```python
start_tag, end_tag = thinking_tags
# Combine any buffered content with the new content
buffered = self._tag_buffer.get(vendor_part_id, '') if vendor_part_id is not None else ''
```
What if `vendor_part_id` is None? Will none of this work anymore? Should we require one for this method to work?
I believe the correct handling in that case would be to assume it's a `TextPart`
I added one more commit 0818191 to cover and test these cases:
- `optional-content</think>more-content` => `ThinkingPart("...optional-content")` + `TextPart("more-content")`
- `vendor_id is None` and chunk=`<think>start-of-thinking` => `ThinkingPart("start-of-thinking")`
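For reference, a rough pytest-style sketch of those two cases, assuming the generator API from this PR and that `get_parts()` returns the accumulated parts in order (not necessarily the actual tests in the commit):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager
from pydantic_ai.messages import TextPart, ThinkingPart

THINKING_TAGS = ('<think>', '</think>')


def test_end_tag_with_trailing_content():
    manager = ModelResponsePartsManager()
    for chunk in ['<think>optional-content', '</think>more-content']:
        # Drain the generator; here we only care about the final parts
        for _event in manager.handle_text_delta(
            vendor_part_id='content', content=chunk, thinking_tags=THINKING_TAGS
        ):
            pass
    thinking, text = manager.get_parts()
    assert isinstance(thinking, ThinkingPart) and thinking.content == 'optional-content'
    assert isinstance(text, TextPart) and text.content == 'more-content'


def test_start_tag_with_vendor_id_none():
    manager = ModelResponsePartsManager()
    for _event in manager.handle_text_delta(
        vendor_part_id=None, content='<think>start-of-thinking', thinking_tags=THINKING_TAGS
    ):
        pass
    (thinking,) = manager.get_parts()
    assert isinstance(thinking, ThinkingPart) and thinking.content == 'start-of-thinking'
```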
```python
# Clear any state for this vendor_part_id and start thinking part
self._vendor_id_to_part_index.pop(vendor_part_id, None)
self._tag_buffer.pop(vendor_part_id, None)
thinking_event = self.handle_thinking_delta(vendor_part_id=vendor_part_id, content='')
```
If there's `after_start`, we shouldn't need `content=''`
```python
    return False
# Check if the tag starts with any suffix of the content
# E.g., for content="<thi" and tag="<think>", we check if "<think>" starts with "<thi"
for i in range(len(content)):
```
We don't need to look at the entire content, right? Just the last `len(tag)` chars?
This was replaced by the function `_parts_manager.py::_could_be_tag_start()`:

```python
def _could_be_tag_start(self, content: str, tag: str) -> bool:
    """Check if content could be the start of a tag."""
    # Defensive check for content that's already complete or longer than tag.
    # This occurs when buffered content + new chunk exceeds tag length.
    # Example: buffer='<think' + new='<' = '<think<' (7 chars) >= '<think>' (7 chars)
    if len(content) >= len(tag):
        return False
    return tag.startswith(content)
```
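Illustratively, the helper behaves like this on a few inputs (assuming the branch's import path):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager

m = ModelResponsePartsManager()
assert m._could_be_tag_start('<', '<think>')            # could still grow into the tag
assert m._could_be_tag_start('<thi', '<think>')         # proper prefix of the tag
assert not m._could_be_tag_start('<thx', '<think>')     # diverges from the tag
assert not m._could_be_tag_start('<think>', '<think>')  # complete tag, handled elsewhere
assert not m._could_be_tag_start('<think<', '<think>')  # buffer + chunk overflow case
```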
tests/test_parts_manager.py (outdated):

```python
# Build up: "text <thi"
event = manager.handle_text_delta(vendor_part_id='content', content='text <thi', thinking_tags=thinking_tags)
assert event is None
```
If the parts manager is never called again after this, we'll lose this text 😬
This remains a valid concern: if the last chunk happens to be `<thi`, it will be lost (it will be buffered, but the parts manager won't be called again).

I'll add a test for this and remediate.
I just pushed a commit to prevent this, together with new tests adc51e6
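For context, the gist of such a fix is a hook that runs once the stream is exhausted and surfaces any leftover buffer as plain text. A minimal sketch of the idea, as a `ModelResponsePartsManager` method (the method name and internals are assumptions, not necessarily what adc51e6 does):

```python
from collections.abc import Generator

from pydantic_ai.messages import ModelResponseStreamEvent


def flush_tag_buffer(self) -> Generator[ModelResponseStreamEvent, None, None]:
    """Emit any content still held in the tag buffer as plain text.

    Runs once no more chunks are coming, so a trailing partial tag like
    '<thi' becomes a TextPart instead of being silently dropped.
    """
    for vendor_part_id in list(self._tag_buffer):
        buffered = self._tag_buffer.pop(vendor_part_id)
        if buffered:
            # The buffer can no longer complete into a full tag, so route
            # it through the plain-text path.
            yield from self._handle_text_delta_simple(
                vendor_part_id=vendor_part_id, content=buffered
            )
```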
```python
    return self.handle_thinking_delta(vendor_part_id=vendor_part_id, content=combined_content)
else:
    # Not in thinking mode, look for start tag
    if start_tag in combined_content:
```
What if the model outputs `<think>` in the middle of its text, inside a code block with XML? We shouldn't treat that as thinking then, just text.

I don't know if there's a good way to prevent that. Previously, we were relying on the (weak) assumption that `<think>` as a standalone chunk always means the special THINK-START token, whereas `<think>` in regular text output would (maybe?) be split up over multiple chunks/tokens.

But that was not reliable anyway, as models may also be debouncing their own chunk streaming, meaning we'd get multiple tokens at once.

I'm worried about this breaking legitimate XML output though.

Maybe we should only do this at the start of a response, not allowing `<think>` portions in the middle of text output. And/or leave this off by default and require a `ModelProfile` setting to opt into it.
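If we go the opt-in route, it could look roughly like this. The boolean flag is hypothetical, sketched on a stripped-down profile for discussion (the real `ModelProfile` has many more fields):

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    # Stripped-down sketch; `thinking_tags` mirrors the existing profile field,
    # while the flag below is a hypothetical addition.
    thinking_tags: tuple[str, str] = ('<think>', '</think>')
    # When True, only a start tag at the very beginning of the response opens
    # a ThinkingPart; '<think>' appearing later (e.g. inside an XML code block)
    # stays regular text.
    thinking_tags_only_at_response_start: bool = True
```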
> What if the model outputs `<think>` in the middle of its text, inside a code block with XML? We shouldn't treat that as thinking then, just text.

This was my other concern and the secondary reason why I hadn't created a PR for the issue. I was having trouble determining whether this could happen (whether a `<think>` could show up in the middle of a response). I've seen information that Claude models can emit `<reflection>` tags in the middle of a response, but I'm having a hard time finding any concrete references on this.

There does seem to be a model called "Reflection 70B" that is clearly documented to do this. Its output is more structured, though, with distinct `<thinking>`/`<reflection>`/`<output>` tags, so the issue of misinterpreting a `<think>` tag isn't possible there. But yeah, if we have a model-specific profile that can handle the parsing for these cases, that would address the issue.
@DouweM thank you for taking the time to review it! I refactored the event handler into returning a generator and am taking your comments into account. And I agree with your comment about the XML case, especially considering this is a rare edge case, and it might even fix itself in the future (by models producing chunks properly). I have a question: is it weird that all checks passed even though there may be a lot of breaking stuff in the PR? Does it mean that we should add tests to cover these cases, or is there something I may be misunderstanding about the testing/CI process?
> I refactored the event handler into returning a generator and am taking into account your comments.

@dsfaccini Did you mean to have pushed already? It doesn't break any existing tests since there are none that currently hit the …
No, I didn't mean to push anything yet; I wanted to clear up that question to make sure I wasn't misunderstanding something about the test suite. Also, I didn't want to push anything else before clearing up the XML "case".

For now I'll write new tests to cover the edge cases you've pointed out, excluding the possibility (quoted above) of a `<think>` appearing in the middle of regular text.

We should also have a test to ensure that in that case, we treat it as regular text!
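Something along these lines, perhaps (a sketch; it assumes the desired behavior is that a mid-chunk start tag stays text, which is exactly the open question above):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager
from pydantic_ai.messages import TextPart


def test_think_tag_mid_text_stays_text():
    manager = ModelResponsePartsManager()
    # A start tag buried in the middle of a chunk, e.g. quoted inside an XML
    # example, should not open a ThinkingPart.
    for _event in manager.handle_text_delta(
        vendor_part_id='content',
        content='XML example: <think>not real thinking</think>',
        thinking_tags=('<think>', '</think>'),
    ):
        pass
    (part,) = manager.get_parts()
    assert isinstance(part, TextPart)
    assert '<think>' in part.content
```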
…ering

Convert handle_text_delta() from returning a single event to yielding multiple events via a generator pattern. This enables proper handling of thinking tags that may be split across multiple streaming chunks.

Key changes:
- Convert handle_text_delta() return type from ModelResponseStreamEvent | None to Generator[ModelResponseStreamEvent, None, None]
- Add _tag_buffer field to track partial content across chunks
- Implement _handle_text_delta_simple() for non-thinking-tag cases
- Implement _handle_text_delta_with_thinking_tags() with buffering logic
- Add _could_be_tag_start() helper to detect potential split tags
- Update all model implementations (10 files) to iterate over events
- Adapt test_handle_text_deltas_with_think_tags for generator API

Behavior:
- Complete thinking tags work at any position (maintains original behavior)
- Split thinking tags are buffered when starting at position 0 of a chunk
- Split tags only work when vendor_part_id is not None (buffering requirement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Hey guys, first of all, obligatory: @DouweM, this PR got a bit confusing from the back and forth, especially the caveats I overlooked the first time. Trying to support the constraints we've arrived at through discussion:

Constraint 1: …

Constraint 2: I won't change any existing test aside from adapting to the new return type; I'll create a new test file for this.

Constraint 3: The new functions in …

next steps?

My question to you is: …

trajectory of this PR and reasoning for a new one

the current behavior (main branch)

the problem with the current PR

this is the restart prompt I used:

I reviewed the original, that is the main-branch version of this function, `tests/test_parts_manager.py::test_handle_text_deltas_with_think_tags`, and noticed that WE changed it. We had to change tests because we're now returning a generator (which is approved in the PR discussion), but we SHOULD NOT have changed the logic of that function. The function clearly shows the following: a chunk arrives with `<think>` on its own, after a text chunk. The summary is: a thinking tag arrives after a text part, but because 1. it arrives in a chunk of its own and 2. it arrives in full, it is valid.

Our PR wants to support split thinking tags. The question is whether we should support split thinking tags that arrive in any position, and so far the decision is NO: we want to support split thinking tags that arrive, at least, in the first position of their chunk.

So current (main) behavior won't identify chunk 1: `<thi` + chunk 2: `nk>`.

What we (wrongly) disallowed is the previous behavior, where a full text chunk can arrive before a thinking tag, that is: chunk 1: `some text`, chunk 2: `<think>`.

What we are explicitly disallowing, for now, is a split tag that starts mid-chunk: chunk 1: `some text<thi`.
@dsfaccini Please force push into this PR so we keep everything in one place!
Force-pushed 411d969 to 3439159.
…hat look like thinking tags
Fixes two issues with thinking tag detection in streaming responses:
1. Support for tags with trailing content in same chunk:
- START tags: "<think>content" now correctly creates ThinkingPart("content")
- END tags: "</think>after" now correctly closes thinking and creates TextPart("after")
- Works for both complete and split tags across chunks
- Implemented by splitting content at tag boundaries and recursively processing
2. Fix vendor_part_id=None content routing bug:
- When vendor_part_id=None and content follows a start tag (e.g., "<think>thinking"),
content is now routed to the existing ThinkingPart instead of creating a new TextPart
- Added check in _handle_text_delta_simple to detect existing ThinkingPart
Implementation:
- Modified _handle_text_delta_simple to split content at START/END tag boundaries
- Modified _handle_text_delta_with_thinking_tags with symmetric split logic
- Added ThinkingPart detection for vendor_part_id=None case (lines 164-168)
- Kept pragma comments only on architecturally unreachable branches
Tests added (11 new tests in test_parts_manager_split_tags.py):
Fixes #3007
When streaming responses from models like Gemini via LiteLLM, thinking tags can be split across multiple chunks (e.g., chunk 1: `"<thi"`, chunk 2: `"nk>content</think>"`). The existing implementation only detected complete tags that arrived as standalone chunks, causing split tags to be treated as regular text instead of being extracted into `ThinkingPart`.

Changes:

- `ModelResponsePartsManager` now buffers content to accumulate it when it might be part of a split tag
- `handle_text_delta()` now detects complete tags across chunk boundaries while maintaining backward compatibility

Edit: what does this PR do
Constraint 1: `_parts_manager.py::get_parts()` returns `-> Generator[ModelResponseStreamEvent, None, None]` instead of `-> ModelResponseStreamEvent | None`; `if ... is None` checks are replaced with `for event in ...`

Constraint 2: `_parts_manager` …

Constraint 3: `_parts_manager.py` will buffer chunks when a chunk arrives that looks like it will be a `<think>` tag (e.g. `<thi`); a split tag that doesn't start the chunk (e.g. `foo<thi`) will not get buffered, it'll be emitted as a `TextPart`

Cases it does (should) and doesn't cover
- `<think>thinking` -> `ThinkingPart("thinking")`
- `<thi` + `nk>` + `thinking` -> `ThinkingPart("thinking")`
- `<thi` + `nk>th` + `inking` -> `ThinkingPart("thinking")`
- `foo` + `<thi` + `nk>` + ... -> `TextPart("foo")` + `ThinkingPart("")` ...
- `foo<th` -> not covered: a thinking chunk needs to start with something that looks like a thinking part

Edge cases
- A trailing `<thi` will be buffered at first but then flushed and emitted as a `TextPart`; see `test_model_test.py::test_finalize_integration_buffered_content()`
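Putting it together, driving the manager directly with the split-tag stream from #3007 would look roughly like this under the new API (event handling elided; the asserted part contents assume the behavior described above):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager
from pydantic_ai.messages import TextPart, ThinkingPart

manager = ModelResponsePartsManager()
tags = ('<think>', '</think>')

# Chunk 1 looks like a tag prefix, so it's buffered and yields no events;
# chunk 2 completes the tag and streams the thinking content; chunk 3 is text.
for chunk in ['<thi', 'nk>content</think>', 'answer']:
    for event in manager.handle_text_delta(
        vendor_part_id='content', content=chunk, thinking_tags=tags
    ):
        print(event)

thinking, text = manager.get_parts()
assert isinstance(thinking, ThinkingPart) and thinking.content == 'content'
assert isinstance(text, TextPart) and text.content == 'answer'
```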