Fix streaming thinking tags split across multiple chunks #3206
Conversation
Force-pushed fd1d0c2 to b04532c.
@dsfaccini Thanks for working on this David. I think we need to test every plausible combination of strings and make sure we never lose text or events.
```python
start_tag, end_tag = thinking_tags
# Combine any buffered content with the new content
buffered = self._tag_buffer.get(vendor_part_id, '') if vendor_part_id is not None else ''
```
What if `vendor_part_id` is None? Will none of this work anymore? Should we require one for this method to work?
I believe the correct handling in that case would be to assume it's a `TextPart`
I added one more commit 0818191 to cover and test these cases:
- `optional-content</think>more-content` => `ThinkingPart("...optional-content")` + `TextPart("more-content")`
- `vendor_id is None` and chunk=`<think>start-of-thinking` => `ThinkingPart("start-of-thinking")`
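For reference, a rough pytest-style sketch of those two cases, assuming the generator API from this PR and that `get_parts()` returns the accumulated parts in order (not necessarily the actual tests in the commit):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager
from pydantic_ai.messages import TextPart, ThinkingPart

THINKING_TAGS = ('<think>', '</think>')


def test_end_tag_with_trailing_content():
    manager = ModelResponsePartsManager()
    for chunk in ['<think>optional-content', '</think>more-content']:
        # Drain the generator; here we only care about the final parts
        for _event in manager.handle_text_delta(
            vendor_part_id='content', content=chunk, thinking_tags=THINKING_TAGS
        ):
            pass
    thinking, text = manager.get_parts()
    assert isinstance(thinking, ThinkingPart) and thinking.content == 'optional-content'
    assert isinstance(text, TextPart) and text.content == 'more-content'


def test_start_tag_with_vendor_id_none():
    manager = ModelResponsePartsManager()
    for _event in manager.handle_text_delta(
        vendor_part_id=None, content='<think>start-of-thinking', thinking_tags=THINKING_TAGS
    ):
        pass
    (thinking,) = manager.get_parts()
    assert isinstance(thinking, ThinkingPart) and thinking.content == 'start-of-thinking'
```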
```python
# Clear any state for this vendor_part_id and start thinking part
self._vendor_id_to_part_index.pop(vendor_part_id, None)
self._tag_buffer.pop(vendor_part_id, None)
thinking_event = self.handle_thinking_delta(vendor_part_id=vendor_part_id, content='')
```
If there's `after_start`, we shouldn't need `content=''`
```python
    return False
# Check if the tag starts with any suffix of the content
# E.g., for content="<thi" and tag="<think>", we check if "<think>" starts with "<thi"
for i in range(len(content)):
```
We don't need to look at the entire content, right? Just the last `len(tag)` chars?
This was replaced by the function `_parts_manager.py::_could_be_tag_start()`:

```python
def _could_be_tag_start(self, content: str, tag: str) -> bool:
    """Check if content could be the start of a tag."""
    # Defensive check for content that's already complete or longer than tag.
    # This occurs when buffered content + new chunk exceeds tag length.
    # Example: buffer='<think' + new='<' = '<think<' (7 chars) >= '<think>' (7 chars)
    if len(content) >= len(tag):
        return False
    return tag.startswith(content)
```
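Illustratively, the helper behaves like this on a few inputs (assuming the branch's import path):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager

m = ModelResponsePartsManager()
assert m._could_be_tag_start('<', '<think>')            # could still grow into the tag
assert m._could_be_tag_start('<thi', '<think>')         # proper prefix of the tag
assert not m._could_be_tag_start('<thx', '<think>')     # diverges from the tag
assert not m._could_be_tag_start('<think>', '<think>')  # complete tag, handled elsewhere
assert not m._could_be_tag_start('<think<', '<think>')  # buffer + chunk overflow case
```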
tests/test_parts_manager.py (outdated):

```python
# Build up: "text <thi"
event = manager.handle_text_delta(vendor_part_id='content', content='text <thi', thinking_tags=thinking_tags)
assert event is None
```
If the parts manager is never called again after this, we'll lose this text 😬
This remains a valid concern: if the last chunk happens to be `<thi`, it will be lost (it will be buffered, but the parts manager won't be called again).

I'll add a test for this and remediate.
I just pushed a commit to prevent this, together with new tests adc51e6
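For context, the gist of such a fix is a hook that runs once the stream is exhausted and surfaces any leftover buffer as plain text. A minimal sketch of the idea, as a `ModelResponsePartsManager` method (the method name and internals are assumptions, not necessarily what adc51e6 does):

```python
from collections.abc import Generator

from pydantic_ai.messages import ModelResponseStreamEvent


def flush_tag_buffer(self) -> Generator[ModelResponseStreamEvent, None, None]:
    """Emit any content still held in the tag buffer as plain text.

    Runs once no more chunks are coming, so a trailing partial tag like
    '<thi' becomes a TextPart instead of being silently dropped.
    """
    for vendor_part_id in list(self._tag_buffer):
        buffered = self._tag_buffer.pop(vendor_part_id)
        if buffered:
            # The buffer can no longer complete into a full tag, so route
            # it through the plain-text path.
            yield from self._handle_text_delta_simple(
                vendor_part_id=vendor_part_id, content=buffered
            )
```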
```python
    return self.handle_thinking_delta(vendor_part_id=vendor_part_id, content=combined_content)
else:
    # Not in thinking mode, look for start tag
    if start_tag in combined_content:
```
What if the model outputs `<think>` in the middle of its text, inside a code block with XML? We shouldn't treat that as thinking then, just text.

I don't know if there's a good way to prevent that. Previously, we were relying on the (weak) assumption that `<think>` as a standalone chunk always means the special THINK-START token, whereas `<think>` in regular text output would (maybe?) be split up over multiple chunks/tokens.

But that was not reliable anyway, as models may also be debouncing their own chunk streaming, meaning we'd get multiple tokens at once.

I'm worried about this breaking legitimate XML output though.

Maybe we should only do this at the start of a response, not allowing `<think>` portions in the middle of text output. And/or leave this off by default and require a `ModelProfile` setting to opt into it.
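If we go the opt-in route, it could look roughly like this. The boolean flag is hypothetical, sketched on a stripped-down profile for discussion (the real `ModelProfile` has many more fields):

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    # Stripped-down sketch; `thinking_tags` mirrors the existing profile field,
    # while the flag below is a hypothetical addition.
    thinking_tags: tuple[str, str] = ('<think>', '</think>')
    # When True, only a start tag at the very beginning of the response opens
    # a ThinkingPart; '<think>' appearing later (e.g. inside an XML code block)
    # stays regular text.
    thinking_tags_only_at_response_start: bool = True
```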
> What if the model outputs `<think>` in the middle of its text, inside a code block with XML? We shouldn't treat that as thinking then, just text.

This was my other concern and the secondary reason why I hadn't created a PR for the issue. I was having trouble determining whether this could happen (whether a `<think>` could show up in the middle of a response). I've seen information that Claude models can emit `<reflection>` tags in the middle of a response, but I'm having a hard time finding any concrete references on this.

There does seem to be a model called "Reflection 70B" that is clearly documented to do this. Its output is more structured, though, with distinct `<thinking>`/`<reflection>`/`<output>` tags, so the issue of misinterpreting a `<think>` tag isn't possible there. But yeah, if we have a model-specific profile that can handle the parsing for these cases, that would address the issue.
@DouweM thank you for taking the time to review it! I refactored the event handler into returning a generator and am taking your comments into account. And I agree with your comment about the XML case, especially considering this is a rare edge case, and it might even fix itself in the future (by models producing chunks properly). I have a question: is it weird that all checks passed even though there may be a lot of breaking stuff in the PR? Does it mean that we should add tests to cover these cases, or is there something I may be misunderstanding about the testing/CI process?
> I refactored the event handler into returning a generator and am taking into account your comments.

@dsfaccini Did you mean to have pushed already? It doesn't break any existing tests since there are none that currently hit the …
No, I didn't mean to push anything yet; I wanted to clear up that question to make sure I wasn't misunderstanding something about the test suite. Also, I didn't want to push anything else before clearing up the XML "case".

For now I'll write new tests to cover the edge cases you've pointed out, excluding the possibility (quoted above) of a `<think>` appearing in the middle of regular text.

We should also have a test to ensure that in that case, we treat it as regular text!
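Something along these lines, perhaps (a sketch; it assumes the desired behavior is that a mid-chunk start tag stays text, which is exactly the open question above):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager
from pydantic_ai.messages import TextPart


def test_think_tag_mid_text_stays_text():
    manager = ModelResponsePartsManager()
    # A start tag buried in the middle of a chunk, e.g. quoted inside an XML
    # example, should not open a ThinkingPart.
    for _event in manager.handle_text_delta(
        vendor_part_id='content',
        content='XML example: <think>not real thinking</think>',
        thinking_tags=('<think>', '</think>'),
    ):
        pass
    (part,) = manager.get_parts()
    assert isinstance(part, TextPart)
    assert '<think>' in part.content
```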
…ering

Convert handle_text_delta() from returning a single event to yielding multiple events via a generator pattern. This enables proper handling of thinking tags that may be split across multiple streaming chunks.

Key changes:
- Convert handle_text_delta() return type from ModelResponseStreamEvent | None to Generator[ModelResponseStreamEvent, None, None]
- Add _tag_buffer field to track partial content across chunks
- Implement _handle_text_delta_simple() for non-thinking-tag cases
- Implement _handle_text_delta_with_thinking_tags() with buffering logic
- Add _could_be_tag_start() helper to detect potential split tags
- Update all model implementations (10 files) to iterate over events
- Adapt test_handle_text_deltas_with_think_tags for generator API

Behavior:
- Complete thinking tags work at any position (maintains original behavior)
- Split thinking tags are buffered when starting at position 0 of a chunk
- Split tags only work when vendor_part_id is not None (buffering requirement)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Hey guys, first of all, obligatory: @DouweM, this PR got a bit confusing from the back and forth, especially the caveats I overlooked the first time. Trying to support the constraints we've arrived at through discussion:

Constraint 1: …

Constraint 2: I won't change any existing test aside from adapting to the new return type; I'll create a new test file for this.

Constraint 3: The new functions in …

next steps?

My question to you is: …

trajectory of this PR and reasoning for a new one

the current behavior (main branch)

the problem with the current PR

this is the restart prompt I used:

I reviewed the original, that is the main-branch version of this function, `tests/test_parts_manager.py::test_handle_text_deltas_with_think_tags`, and noticed that WE changed it. We had to change tests because we're now returning a generator (which is approved in the PR discussion), but we SHOULD NOT have changed the logic of that function. The function clearly shows the following: a chunk arrives with `<think>` on its own, after a text chunk. The summary is: a thinking tag arrives after a text part, but because 1. it arrives in a chunk of its own and 2. it arrives in full, it is valid.

Our PR wants to support split thinking tags. The question is whether we should support split thinking tags that arrive in any position, and so far the decision is NO: we want to support split thinking tags that arrive, at least, in the first position of their chunk.

So current (main) behavior won't identify chunk 1: `<thi` + chunk 2: `nk>`.

What we (wrongly) disallowed is the previous behavior, where a full text chunk can arrive before a thinking tag, that is: chunk 1: `some text`, chunk 2: `<think>`.

What we are explicitly disallowing, for now, is a split tag that starts mid-chunk: chunk 1: `some text<thi`.
@dsfaccini Please force push into this PR so we keep everything in one place!
Force-pushed 411d969 to 3439159.
…hat look like thinking tags
Fixes two issues with thinking tag detection in streaming responses:
1. Support for tags with trailing content in same chunk:
- START tags: "<think>content" now correctly creates ThinkingPart("content")
- END tags: "</think>after" now correctly closes thinking and creates TextPart("after")
- Works for both complete and split tags across chunks
- Implemented by splitting content at tag boundaries and recursively processing
2. Fix vendor_part_id=None content routing bug:
- When vendor_part_id=None and content follows a start tag (e.g., "<think>thinking"),
content is now routed to the existing ThinkingPart instead of creating a new TextPart
- Added check in _handle_text_delta_simple to detect existing ThinkingPart
Implementation:
- Modified _handle_text_delta_simple to split content at START/END tag boundaries
- Modified _handle_text_delta_with_thinking_tags with symmetric split logic
- Added ThinkingPart detection for vendor_part_id=None case (lines 164-168)
- Kept pragma comments only on architecturally unreachable branches
Tests added (11 new tests in test_parts_manager_split_tags.py):
Fixes #3007
When streaming responses from models like Gemini via LiteLLM, thinking tags can be split across multiple chunks (e.g., chunk 1: `"<thi"`, chunk 2: `"nk>content</think>"`). The existing implementation only detected complete tags that arrived as standalone chunks, causing split tags to be treated as regular text instead of being extracted into `ThinkingPart`.

Changes:

- `ModelResponsePartsManager` now buffers content to accumulate it when it might be part of a split tag
- `handle_text_delta()` now detects complete tags across chunk boundaries while maintaining backward compatibility

Edit: what does this PR do
Constraint 1: `_parts_manager.py::get_parts()` returns `-> Generator[ModelResponseStreamEvent, None, None]` instead of `-> ModelResponseStreamEvent | None`; `if ... is None` checks are replaced with `for event in ...`

Constraint 2: `_parts_manager` …

Constraint 3: `_parts_manager.py` will buffer chunks when a chunk arrives that looks like it will be a `<think>` tag (e.g. `<thi`); a split tag that doesn't start the chunk (e.g. `foo<thi`) will not get buffered, it'll be emitted as a `TextPart`

Cases it does (should) and doesn't cover
- `<think>thinking` -> `ThinkingPart("thinking")`
- `<thi` + `nk>` + `thinking` -> `ThinkingPart("thinking")`
- `<thi` + `nk>th` + `inking` -> `ThinkingPart("thinking")`
- `foo` + `<thi` + `nk>` + ... -> `TextPart("foo")` + `ThinkingPart("")` ...
- `foo<th` -> not covered: a thinking chunk needs to start with something that looks like a thinking part

Edge cases
- A trailing `<thi` will be buffered at first but then flushed and emitted as a `TextPart`; see `test_model_test.py::test_finalize_integration_buffered_content()`
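Putting it together, driving the manager directly with the split-tag stream from #3007 would look roughly like this under the new API (event handling elided; the asserted part contents assume the behavior described above):

```python
from pydantic_ai._parts_manager import ModelResponsePartsManager
from pydantic_ai.messages import TextPart, ThinkingPart

manager = ModelResponsePartsManager()
tags = ('<think>', '</think>')

# Chunk 1 looks like a tag prefix, so it's buffered and yields no events;
# chunk 2 completes the tag and streams the thinking content; chunk 3 is text.
for chunk in ['<thi', 'nk>content</think>', 'answer']:
    for event in manager.handle_text_delta(
        vendor_part_id='content', content=chunk, thinking_tags=tags
    ):
        print(event)

thinking, text = manager.get_parts()
assert isinstance(thinking, ThinkingPart) and thinking.content == 'content'
assert isinstance(text, TextPart) and text.content == 'answer'
```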