
Conversation

@dsfaccini (Contributor) commented Oct 20, 2025

Fixes #3007

When streaming responses from models like Gemini via LiteLLM, thinking tags can be split across multiple chunks (e.g., chunk 1: "<thi", chunk 2: "nk>content</think>"). The existing implementation only detected complete tags that arrived as standalone chunks, causing split tags to be treated as regular text instead of being extracted into ThinkingPart.

Changes:

  • Added buffering mechanism to ModelResponsePartsManager to accumulate content when it might be part of a split tag
  • Refactored handle_text_delta() to detect complete tags across chunk boundaries while maintaining backward compatibility
  • Added comprehensive tests covering 2-chunk splits, 3+ chunk splits, false positives, and interleaved scenarios where thinking tags and text are mixed together
  • Models without thinking tags (e.g., Anthropic with native thinking support) are unaffected

Edit: what does this PR do

Constraint 1:

  • _parts_manager.py::get_parts() returns -> Generator[ModelResponseStreamEvent, None, None]
    • instead of -> ModelResponseStreamEvent | None
  • this allows us to return multiple events instead of 0 or 1
  • it also has the nice side effect of replacing multiple `if event is None` checks with `for event in ...` loops

Constraint 2:

  • It doesn't change any existing test aside from adapting to the new return type of the _parts_manager
  • I created a new test file for this specific case (split thinking tags)

Constraint 3:

  • The new functions in _parts_manager.py will buffer chunks when a chunk arrives that looks like it will be a <think> tag
  • But only if the chunk starts with something that looks like a think tag, like <thi
  • That means a chunk like foo<thi will not get buffered, it'll be emitted as a TextPart

Cases it does (or should) and doesn't cover

  • <think>thinking -> ThinkingPart("thinking")
  • <thi + nk> + thinking -> ThinkingPart("thinking")
  • <thi + nk>th + inking -> ThinkingPart("thinking")
  • foo + <thi + nk> + ... -> TextPart("foo") + ThinkingPart("") ...
  • foo<th -> TextPart("foo<th"): to be buffered, a chunk must start with something that looks like a thinking tag
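The cases above can be reproduced with a self-contained buffering state machine (a sketch, not the PR's actual code; unlike the real implementation it buffers only split start tags, not split end tags):

```python
def stream_parts(chunks):
    """Yield ('text' | 'thinking', delta) pairs from a chunk stream."""
    START, END = "<think>", "</think>"
    buffer = ""
    thinking = False
    for chunk in chunks:
        content = buffer + chunk
        buffer = ""
        while content:
            if thinking:
                if END in content:
                    before, content = content.split(END, 1)
                    yield ("thinking", before)
                    thinking = False
                else:
                    yield ("thinking", content)
                    content = ""
            elif content.startswith(START):
                thinking = True
                content = content[len(START):]
                if not content:
                    yield ("thinking", "")  # open an empty thinking part
            elif len(content) < len(START) and START.startswith(content):
                buffer = content  # could still become a start tag: buffer it
                content = ""
            else:
                yield ("text", content)
                content = ""

def collect(chunks):
    """Coalesce consecutive deltas of the same kind into parts."""
    parts = []
    for kind, delta in stream_parts(chunks):
        if parts and parts[-1][0] == kind:
            parts[-1] = (kind, parts[-1][1] + delta)
        else:
            parts.append((kind, delta))
    return parts

assert collect(["<think>thinking"]) == [("thinking", "thinking")]
assert collect(["<thi", "nk>", "thinking"]) == [("thinking", "thinking")]
assert collect(["<thi", "nk>th", "inking"]) == [("thinking", "thinking")]
assert collect(["foo", "<thi", "nk>"]) == [("text", "foo"), ("thinking", "")]
assert collect(["foo<th"]) == [("text", "foo<th")]
```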

Edge cases

@dsfaccini dsfaccini force-pushed the handle-streamed-thinking-over-multiple-chunks branch from fd1d0c2 to b04532c Compare October 20, 2025 20:18
@DouweM (Collaborator) left a comment


@dsfaccini Thanks for working on this David. I think we need to test every plausible combination of strings and make sure we never lose text or events.

start_tag, end_tag = thinking_tags

# Combine any buffered content with the new content
buffered = self._tag_buffer.get(vendor_part_id, '') if vendor_part_id is not None else ''
Collaborator:

What if vendor_part_id is None? Will none of this work anymore? Should we require one for this method to work?

Contributor Author:

I believe the correct handling in that case would be to assume it's a TextPart

Contributor Author:

I added one more commit 0818191 to cover and test these cases:

  1. optional-content</think>more-content => ThinkingPart("...optional-content") + TextPart("more-content")
  2. vendor_id is None and chunk=<think>start-of-thinking => ThinkingPart("start-of-thinking")

# Clear any state for this vendor_part_id and start thinking part
self._vendor_id_to_part_index.pop(vendor_part_id, None)
self._tag_buffer.pop(vendor_part_id, None)
thinking_event = self.handle_thinking_delta(vendor_part_id=vendor_part_id, content='')
Collaborator:

If there's after_start, we shouldn't need content=''

return False
# Check if the tag starts with any suffix of the content
# E.g., for content="<thi" and tag="<think>", we check if "<think>" starts with "<thi"
for i in range(len(content)):
Collaborator:

We don't need to look at the entire content, right, just the last len(tag) chars?

Contributor Author:

This was replaced by the function _parts_manager.py::_could_be_tag_start()

    def _could_be_tag_start(self, content: str, tag: str) -> bool:
        """Check if content could be the start of a tag."""
        # Defensive check for content that's already complete or longer than tag
        # This occurs when buffered content + new chunk exceeds tag length
        # Example: buffer='<think' + new='<' = '<think<' (7 chars) >= '<think>' (7 chars)
        if len(content) >= len(tag):
            return False
        return tag.startswith(content)
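A standalone replica of the helper (outside the class, for illustration) with the boundary cases spelled out:

```python
def could_be_tag_start(content: str, tag: str) -> bool:
    """Check if content could be the start of a tag."""
    # Defensive check: content that is already as long as the tag can no
    # longer be a *partial* prefix of it
    if len(content) >= len(tag):
        return False
    return tag.startswith(content)

assert could_be_tag_start("<thi", "<think>")         # partial prefix: buffer it
assert could_be_tag_start("<", "<think>")            # even a lone "<" qualifies
assert not could_be_tag_start("<think>", "<think>")  # complete tag: handled elsewhere
assert not could_be_tag_start("foo<t", "<think>")    # not a prefix of the tag
assert not could_be_tag_start("<think<", "<think>")  # caught by the length check
```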


# Build up: "text <thi"
event = manager.handle_text_delta(vendor_part_id='content', content='text <thi', thinking_tags=thinking_tags)
assert event is None
Collaborator:

If the parts manager is never called again after this, we'll lose this text 😬

@dsfaccini (Contributor Author), Oct 23, 2025:

this remains a valid concern: if the last chunk happens to be <thi, it will be lost (because it will be buffered but the parts manager won't be called again)

I'll add a test for this and remediate

Contributor Author:

I just pushed a commit to prevent this, together with new tests adc51e6
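A hypothetical sketch of the idea behind such a fix (the names here are invented for illustration, not the PR's actual API): the stream handler calls a finalize step when the model stops, flushing anything still sitting in the tag buffer as plain text so a trailing `<thi` chunk is not silently dropped.

```python
class BufferingManager:
    def __init__(self) -> None:
        self._tag_buffer: dict[str, str] = {}

    def buffer_partial_tag(self, vendor_part_id: str, content: str) -> None:
        # Stands in for the real buffering path inside handle_text_delta()
        self._tag_buffer[vendor_part_id] = (
            self._tag_buffer.get(vendor_part_id, "") + content
        )

    def finalize(self):
        """Flush leftover buffered content as text events at stream end."""
        for part_id, leftover in self._tag_buffer.items():
            if leftover:
                yield ("text", part_id, leftover)
        self._tag_buffer.clear()

m = BufferingManager()
m.buffer_partial_tag("content", "<thi")  # last chunk of the stream
assert list(m.finalize()) == [("text", "content", "<thi")]
assert list(m.finalize()) == []  # buffer cleared after flushing
```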

return self.handle_thinking_delta(vendor_part_id=vendor_part_id, content=combined_content)
else:
# Not in thinking mode, look for start tag
if start_tag in combined_content:
@DouweM (Collaborator), Oct 21, 2025:

What if the model outputs <think> in the middle of its text, inside a code block with XML? We shouldn't treat that as thinking then, just text.

I don't know if there's a good way to prevent that. Previously, we were relying on the (weak) assumption that <think> as a standalone chunk always means the special THINK-START token, whereas <think> in regular text output would (maybe?) be split up over multiple chunks/tokens.

But that was not reliable anyway, as models may also be debouncing their own chunk streaming meaning we'd get multiple tokens at once.

I'm worried about this breaking legitimate XML output though.

Maybe we should only do this at the start of a response, not allowing <think> portions in the middle of text output. And/or leave this off by default and require a ModelProfile setting to opt into it.
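A hypothetical illustration of that opt-in idea (field names are invented, not pydantic-ai's actual ModelProfile API): only treat `<think>` as a thinking tag when the profile enables it, and optionally only before any text has been emitted.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    extract_think_tags: bool = False       # off by default
    think_tags_at_start_only: bool = True  # mid-response "<think>" stays text

def should_extract(profile: Profile, response_so_far: str) -> bool:
    """Decide whether a "<think>" tag should start a ThinkingPart."""
    if not profile.extract_think_tags:
        return False
    if profile.think_tags_at_start_only and response_so_far:
        return False  # text already emitted: treat "<think>" as literal XML
    return True

assert not should_extract(Profile(), "")
assert should_extract(Profile(extract_think_tags=True), "")
assert not should_extract(Profile(extract_think_tags=True), "some text ")
```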

Contributor:

> What if the model outputs <think> in the middle of its text, inside a code block with XML? We shouldn't treat that as thinking then, just text.

This was my other concern and a secondary reason why I hadn't created a PR for the issue. I was having trouble determining whether a <think> could show up in the middle of a response. I've seen claims that Claude models can emit <reflection> tags in the middle of a response, but I'm having a hard time finding any concrete references.
There is, however, a model called "Reflection 70B" that is clearly documented to do this. Its output is more structured, with distinct <thinking>/<reflection>/<output> tags, so misinterpreting a <think> tag isn't possible there. But yes, a model-specific profile that handles the parsing for these cases would address the issue.

@dsfaccini (Contributor Author):

> @dsfaccini Thanks for working on this David. I think we need to test every plausible combination of strings and make sure we never lose text or events.

@DouweM thank you for taking the time to review it! I refactored the event handler to return a generator and am taking your comments into account. I agree with your comment about the XML case, especially considering this is a rare edge case that might even resolve itself in the future (if models produce chunks more cleanly).

I have a question, is it weird that all checks passed even though there may be a lot of breaking stuff in the PR? Does it mean that we should add tests to cover these cases or is there something I may be misunderstanding about the testing/CI process?


DouweM commented Oct 21, 2025

> I refactored the event handler into returning a generator and am taking into account your comments.

@dsfaccini Did you mean to have pushed already?

> I have a question, is it weird that all checks passed even though there may be a lot of breaking stuff in the PR? Does it mean that we should add tests to cover these cases or is there something I may be misunderstanding about the testing/CI process?

It doesn't break any existing tests since there are none that currently hit the <think>-tag related edge cases we're identifying. So we should focus on making the new test suite very exhaustive with all the edge cases we can think of.

@dsfaccini (Contributor Author):

> @dsfaccini Did you mean to have pushed already?
>
> > I have a question, is it weird that all checks passed even though there may be a lot of breaking stuff in the PR? Does it mean that we should add tests to cover these cases or is there something I may be misunderstanding about the testing/CI process?
>
> It doesn't break any existing tests since there are none that currently hit the <think>-tag related edge cases we're identifying. So we should focus on making the new test suite very exhaustive with all the edge cases we can think of.

No, I didn't mean to push anything yet; I wanted to clear up that question to make sure I wasn't misunderstanding something about the test suite. I also didn't want to push anything else before settling the XML case.

> Maybe we should only do this at the start of a response, not allowing <think> portions in the middle of text output. And/or leave this off by default and require a ModelProfile setting to opt into it.

For now I'll write new tests to cover the edge cases you've pointed out, excluding the possibility (quoted above) for <think> tags in the middle of text.


DouweM commented Oct 21, 2025

> excluding the possibility (quoted above) for <think> tags in the middle of text.

We should also have a test to ensure that in that case, we treat it as regular text!

dsfaccini and others added 2 commits October 22, 2025 18:22
…ering

Convert handle_text_delta() from returning a single event to yielding multiple
events via a generator pattern. This enables proper handling of thinking tags
that may be split across multiple streaming chunks.

Key changes:
- Convert handle_text_delta() return type from ModelResponseStreamEvent | None
  to Generator[ModelResponseStreamEvent, None, None]
- Add _tag_buffer field to track partial content across chunks
- Implement _handle_text_delta_simple() for non-thinking-tag cases
- Implement _handle_text_delta_with_thinking_tags() with buffering logic
- Add _could_be_tag_start() helper to detect potential split tags
- Update all model implementations (10 files) to iterate over events
- Adapt test_handle_text_deltas_with_think_tags for generator API

Behavior:
- Complete thinking tags work at any position (maintains original behavior)
- Split thinking tags are buffered when starting at position 0 of chunk
- Split tags only work when vendor_part_id is not None (buffering requirement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@dsfaccini (Contributor Author):

Hey guys, first of all, obligatory: we don't do it because it's easy, we do it because we thought it would be easy. I'm very much punching above my weight here, so thanks a lot for bearing with me.

@DouweM, this PR got a bit confusing from the back and forth, especially the caveats I overlooked the first time when trying to support <think> tags in any position, so I started a new branch based on the following constraints:

constraints we've arrived at through discussion

Constraint 1:

_parts_manager.py::get_parts() should return -> Generator[ModelResponseStreamEvent, None, None] instead of -> ModelResponseStreamEvent | None

  1. that requires a series of small changes to some models/* and tests/models/* and lastly tests/test_parts_manager.py
  2. these changes are merely to account for the new return type

Constraint 2:

I won't change any existing test aside from adapting to the new return type, I'll create a new test file for this

Constraint 3:

The new functions in _parts_manager.py will buffer chunks when a chunk arrives that looks like it will be a <think> tag, but only if the chunk starts with something that looks like it, e.g. <thi.

  1. that means a chunk like foo<thi will not get buffered, it'll be emitted as a TextPart

next steps?

My question to you is:

trajectory of this PR and reasoning for a new one

the current behavior (main branch)

  • the most important test I've identified is the tests/test_parts_manager.py::test_handle_text_deltas_with_think_tags
  • that test shows that the current behavior allows a <think> tag after text (in this case pre-)
    • the reason it doesn't handle split tags is because it requires the whole <think> tag to come alone in a chunk
    • what we want to achieve in this PR, at least what I understand from our discussion is:
      • to accept split <think> tags, e.g. chunk 1: `<thi`, chunk 2: `nk>`
      • but we still require the chunk to start with the think tag, i.e. chunk 1: `<thi`
      • what we don't accept is chunk 1: `foo<thi`, chunk 2: `nk>bar` -> i.e. we take this to be TEXT (like XML in a codeblock)
      • (we may accept that via a ModelProvider setting, per the discussion)

the problem with the current PR

  1. when I read the issue I thought we wanted to support split think tags in whatever combination
  2. that's why I failed to think through the XML in a codeblock case
  3. so after douwe's comments, I restricted the <think> tag identification to the start of the chunk
  4. but I believe I did this across chunks, such that if a TextPart is emitted before the <think> tag, the <think> tag is assumed to be text
  5. I also broke the important test I linked above by removing the first chunk (pre-) here https://github.com/pydantic/pydantic-ai/pull/3206/files#:~:text=event%20%3D%20manager.handle_text_delta(vendor_part_id%3D%27content,%23%20Start%20with%20thinking%20tag%20(no%20prior%20text)
  6. if I hadn't broken it, that test would've caught that I'm disallowing something that is currently allowed (having a <think> chunk after a text chunk)
this is the restart prompt I used

I reviewed the original, that is the main-branch version of this function: tests/test_parts_manager.py::test_handle_text_deltas_with_think_tags, and noticed that WE changed it. We had to change tests because we're now returning a generator (which is approved in the PR discussion), but we SHOULD NOT have changed the logic of that function. The function clearly shows the following: a chunk arrives with pre-, second chunk arrives with <think> (full think tag!), then the test continues (you need to read it)

the summary is: a thinking tag arrives after a text part, but because 1. it arrives in a chunk of its own and 2. it arrives in full, it is valid

our pr wants to support split thinking tags, the question is whether we should support split thinking tags that arrive in any position, and so far the decision is NO: we want to support split thinking tags that arrive, at least, in the first position of their chunk

so current (main) behavior won't identify chunk 1: <thi, chunk 2: nk> as a thinking tag, but our new implementation will

what we (wrongly) disallowed is for the previous behavior, where a full text chunk can arrive before a thinking tag

that is: chunk 1: pre-, chunk 2: <thi, chunk 3: nk>, is marked by our implementation as a TextPart, but it should be marked as TextPart + ThinkingPart

what we are explicitly disallowing, for now, is: chunk 1: pre-<th, chunk 2: ink>, because the thinking tag isn't starting in its own chunk; it starts as part of a TextPart, which we don't want to assume is a thinking tag, because it could be XML in a code block, for example.


DouweM commented Oct 23, 2025

@dsfaccini Please force push into this PR so we keep everything in one place!

@dsfaccini dsfaccini force-pushed the handle-streamed-thinking-over-multiple-chunks branch from 411d969 to 3439159 Compare October 23, 2025 16:55
  Fixes two issues with thinking tag detection in streaming responses:

  1. Support for tags with trailing content in same chunk:
     - START tags: "<think>content" now correctly creates ThinkingPart("content")
     - END tags: "</think>after" now correctly closes thinking and creates TextPart("after")
     - Works for both complete and split tags across chunks
     - Implemented by splitting content at tag boundaries and recursively processing

  2. Fix vendor_part_id=None content routing bug:
     - When vendor_part_id=None and content follows a start tag (e.g., "<think>thinking"),
       content is now routed to the existing ThinkingPart instead of creating a new TextPart
     - Added check in _handle_text_delta_simple to detect existing ThinkingPart

  Implementation:
  - Modified _handle_text_delta_simple to split content at START/END tag boundaries
  - Modified _handle_text_delta_with_thinking_tags with symmetric split logic
  - Added ThinkingPart detection for vendor_part_id=None case (lines 164-168)
  - Kept pragma comments only on architecturally unreachable branches

  Tests added: 11 new tests in test_parts_manager_split_tags.py.

Linked issue: Streaming extract <think> tags split up over multiple chunks