feat(responses): add reasoning summary generation for providers that expose reasoning content#5451
Conversation
…expose reasoning content When a user requests reasoning summaries (e.g., reasoning.summary = "concise" or "detailed"), and the inference provider exposes reasoning content through Chat Completions (such as Ollama and vLLM with reasoning models like GPT-OSS), Llama Stack now makes a second inference call to summarize that reasoning content and returns it in the OpenAI Responses API format. This addresses providers that do NOT natively support reasoning summaries but DO provide full reasoning text. For providers like OpenAI that hide reasoning content entirely, the behavior is unchanged — no reasoning output is produced since there is nothing to summarize. Closes ogx-ai#4735 Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
This is a cool idea. Do we need to run the summarized reasoning through a guardrail check?
… reasoning summary Add unit tests for should_summarize_reasoning, build_summary_prompt, and summarize_reasoning covering event sequences, text accumulation, sequence numbering, error propagation, and usage chunk collection. Add integration tests for reasoning summary streaming (concise/detailed/auto), non-streaming, event ordering, sequence number validation, negative case (no summary without request), and token usage accounting. Move summarization helpers to utils.py for testability. Track second inference call token usage via stream_options and propagate exceptions instead of silently returning. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
Add recorded API responses for the reasoning summary integration tests using bedrock/openai.gpt-oss-20b, enabling replay mode in CI. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
…s config) Record reasoning summary tests using server:ci-tests config with record-if-missing mode to match CI request hashes. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
Auto-generated by pre-commit hook. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
Thanks for the feedback. Added all of that, please take a look.
@mattf, where do you think this should belong?
@Nehanth We need to take a holistic approach, as some providers like Gemini already send actual summaries. Please take a look at my comment.

Edit: that should not be needed, as I see we run summaries only when reasoning text is present. But something crucial I have noticed is that the event sequence is not what the OpenAI Responses server emits. To avoid adding more technical debt like we have so far, we could correct the reasoning summary events by comparing against official responses from GPT models. For example, on this script: I might be wrong, but I don't see the expected "Summary Event Sequence" being followed. I suggest a thorough test to verify the correctness of the event sequence and how the events are populated, and reflecting that in the tests accordingly. More importantly, the logic should handle the possible scenario of multiple "Summary Event Sequence" blocks. For example:
@mattf Let us know if we can skip fixing the stream sequence here and handle it as a separate PR.
@robinnarsinghranabhat if there's a bug in the stream sequence, please fix it independently. if this pr is introducing the bug, fix it here. |
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor

# Conflicts:
#	docs/docs/api-openai/provider_matrix.md
#	src/llama_stack/providers/inline/responses/builtin/responses/utils.py
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
Correct the streaming event sequence for reasoning summaries to match what OpenAI's Responses API emits. Key changes: - Emit OutputItemAdded (status=in_progress) before summary events and OutputItemDone (status=completed) after, wrapping the full reasoning item lifecycle. - Support multiple summary parts by splitting on paragraph boundaries, each with its own PartAdded/TextDelta/TextDone/PartDone block. - Fix flaky integration test that compared two non-deterministic LLM calls; now asserts usage fields are positive on a single call. - Add missing unit test assertions for summary_index and PartAdded empty text placeholder. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
Bedrock does not always populate prompt_tokens and completion_tokens in usage data. Only assert total_tokens > 0 to avoid false failures in replay mode. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
The internal summarization call does not need streaming since we collect the full text before emitting events. Also extracts _accumulate_usage from _accumulate_chunk_usage for reuse. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
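As a rough illustration of the extracted helper (the field names and dict shape here are assumptions based on OpenAI-style usage payloads, not the actual llama-stack code):

```python
# Sketch only: fold one OpenAI-style usage payload into a running total.
# Extracted so both the streaming-chunk path and the non-streaming path can reuse it.
def _accumulate_usage(total: dict[str, int], usage: dict[str, int] | None) -> None:
    if not usage:
        return
    for key in ("prompt_tokens", "completion_tokens", "total_tokens"):
        total[key] = total.get(key, 0) + (usage.get(key) or 0)
```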
Generate recordings for all reasoning summary tests so the Bedrock CI suite passes in replay mode. Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
```python
raw_paragraphs = full_text.split("\n\n")
paragraphs = [p for p in raw_paragraphs if p.strip()]
if not paragraphs:
    paragraphs = [full_text]

seq = start_sequence_number
for summary_index, paragraph_text in enumerate(paragraphs):
    seq += 1
```
The reason I'm splitting into paragraphs is that @robinnarsinghranabhat's review here noted that OpenAI can emit multiple summary part sequences within a single reasoning item. Splitting on paragraph breaks gives us the multipart support to match that behavior. If the summary is a single paragraph, it is emitted as one part.
simplify for now, just emit the whole content at once. if we need to later we can introduce streaming.
@mattf Just to clarify, are you asking to remove the paragraph splitting or to skip the intermediate summary streaming events entirely and just attach the summary to the reasoning output item in OutputItemDone?
Hey @mattf, I've removed the intermediate summary streaming events and attached the summary directly to the reasoning output item in OutputItemDone. Since we are removing the streaming for reasoning summary, there's no point in emitting the events.
Let me know if this is fine or if anything needs to change. Thanks!
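For readers following along, a minimal sketch of the final approach under discussion; the item shape mirrors the JSON examples later in this PR, while the function name is hypothetical:

```python
from typing import Any


# Hypothetical helper: no intermediate summary streaming events are emitted; the summary
# (if any) is attached to the reasoning item that goes out with OutputItemDone.
def finalize_reasoning_item(item: dict[str, Any], summary_text: str | None) -> dict[str, Any]:
    item["summary"] = [{"type": "summary_text", "text": summary_text}] if summary_text else []
    item["status"] = "completed"
    return item  # caller wraps this in the response.output_item.done event
```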
…call Signed-off-by: Nehanth <nehanthnarendrula@gmail.com> Made-with: Cursor
Summary
Implements reasoning summary generation for the Responses API (`reasoning.summary`) for inference providers that expose full reasoning content through Chat Completions but do NOT natively provide summaries — such as Ollama and vLLM with reasoning models (e.g., GPT-OSS).

Closes #4735
How it works
When `reasoning.summary` is set to `"concise"`, `"detailed"`, or `"auto"`, and the provider returns reasoning content via Chat Completions, Llama Stack makes a second inference call to summarize that reasoning text. The summary is streamed back using the existing Responses API event types (`reasoning_summary_text.delta`, `reasoning_summary_text.done`, etc.) and populates the `summary` field on the `reasoning` output item.

- `concise` / `auto`: one-to-two sentence summary focusing on the final conclusion
- `detailed`: thorough summary preserving key logical steps and decisions
- no summary requested: behavior unchanged (the item's `summary` stays `[]`)

For providers like OpenAI that hide reasoning content entirely (only exposing `reasoning_tokens` in usage), no reasoning output is produced since there is nothing to summarize. A future follow-up will add a passthrough path for providers that natively support reasoning summaries via their own Responses API.

Key design decisions

- The second inference call's token usage is tracked and merged into the response's `usage` totals via `stream_options={"include_usage": true}`, so callers see the full cost.
- `should_summarize_reasoning`, `build_summary_prompt`, and `summarize_reasoning` live in `utils.py` for testability and to keep the streaming orchestrator focused (see the sketch below).
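For orientation, here is a minimal sketch of what the first two helpers might look like (the prompt wording and signatures are assumptions, not the PR's actual code):

```python
# Rough sketch of the utils.py helpers named above; exact signatures in the PR may differ.
def should_summarize_reasoning(summary_mode: str | None, reasoning_text: str) -> bool:
    """Summarize only when a summary was requested and the provider exposed reasoning text."""
    return summary_mode in ("concise", "detailed", "auto") and bool(reasoning_text.strip())


def build_summary_prompt(summary_mode: str, reasoning_text: str) -> str:
    """Build the instruction for the second (summarization) inference call."""
    if summary_mode == "detailed":
        style = "Write a thorough summary preserving the key logical steps and decisions."
    else:  # "concise" and "auto" share the short form
        style = "Write a one-to-two sentence summary focusing on the final conclusion."
    return f"{style}\n\nReasoning:\n{reasoning_text}"
```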
Test plan

Automated tests
Unit tests (26 tests, all passing):
- `TestShouldSummarizeReasoning` — 4 tests covering None, concise, detailed, auto
- `TestBuildSummaryPrompt` — 3 tests for concise, detailed, auto prompt generation
- `TestSummarizeReasoning` — 8 tests: event sequence, text accumulation, sequence numbers, item_id propagation, error propagation, empty stream, non-streaming fallback, usage chunk collection

Integration tests (11 test cases):
- `test_reasoning_summary_streaming[concise/detailed/auto]` — verifies all 4 summary event types are emitted and deltas concatenate to the final text
- `test_reasoning_summary_non_streaming` — summary field populated on the reasoning item
- `test_reasoning_summary_event_ordering` — summary events appear after reasoning text events
- `test_reasoning_summary_sequence_numbers` — strictly increasing sequence numbers
- `test_reasoning_no_summary_without_request` — no summary events when not requested
- `test_reasoning_summary_usage_included` — token usage with summary exceeds without

Manual end-to-end tests
Tested with vLLM (`openai/gpt-oss-20b`) and Ollama (`gpt-oss:20b`) through the Llama Stack `starter` distribution.

Test 1: `summary: "concise"`

Request:
Response (abbreviated):
{ "output": [ { "type": "reasoning", "summary": [ { "text": "The reasoning concludes that the answer is 4.", "type": "summary_text" } ], "content": [ { "text": "Answer is 4.", "type": "reasoning_text" } ], "status": "completed" }, { "type": "message", "content": [ { "text": "2 + 2 = 4.", "type": "output_text" } ], "status": "completed" } ], "usage": { "input_tokens": 189, "output_tokens": 249, "total_tokens": 438 } }Test 2:
summary: "detailed"Request:
Response (abbreviated):
{ "output": [ { "type": "reasoning", "summary": [ { "text": "**Euclid's proof that there are infinitely many primes** ...[full structured proof with logical steps preserved]", "type": "summary_text" } ], "content": [ { "text": "The user says: \"Prove that there are infinitely many prime numbers.\" ...[full chain-of-thought reasoning]", "type": "reasoning_text" } ], "status": "completed" }, { "type": "message", "content": [ { "text": "**Proof (Euclid's classic argument)** ...", "type": "output_text" } ], "status": "completed" } ], "usage": { "input_tokens": 1715, "output_tokens": 2768, "total_tokens": 4483 } }Test 3: No summary requested (no regression)
Request:
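Same assumed client, with no `reasoning.summary` set, to check for regressions:

```python
resp = client.responses.create(
    model="openai/gpt-oss-20b",
    input="What is 2 + 2?",  # no reasoning.summary: the item's summary should stay []
)
```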
Response (abbreviated):
{ "output": [ { "type": "reasoning", "summary": [], "content": [ { "text": "Simple math.", "type": "reasoning_text" } ], "status": "completed" }, { "type": "message", "content": [ { "text": "2 + 2 = **4**.", "type": "output_text" } ], "status": "completed" } ], "usage": { "input_tokens": 73, "output_tokens": 22, "total_tokens": 95 } }Comparison with OpenAI's native output
type"reasoning""reasoning"summarylist[{text, type: "summary_text"}]contentstatus"completed"Future work
Made with Cursor