
feat(responses): add reasoning summary generation for providers that expose reasoning content#5451

Merged
cdoern merged 36 commits into ogx-ai:main from Nehanth:reasoning-output
Apr 21, 2026
Conversation

Contributor

@Nehanth Nehanth commented Apr 6, 2026

Summary

Implements reasoning summary generation for the Responses API (reasoning.summary) for inference providers that expose full reasoning content through Chat Completions but do NOT natively provide summaries — such as Ollama and vLLM with reasoning models (e.g. GPT-OSS).

Closes #4735

How it works

When reasoning.summary is set to "concise", "detailed", or "auto", and the provider returns reasoning content via Chat Completions, Llama Stack makes a second inference call to summarize that reasoning text. The summary is streamed back using the existing Responses API event types (reasoning_summary_text.delta, reasoning_summary_text.done, etc.) and populates the summary field on the reasoning output item.

  • concise / auto: One-to-two sentence summary focusing on the final conclusion
  • detailed: Thorough summary preserving key logical steps and decisions
  • No summary requested: No second call is made — behavior is unchanged (summary: [])
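
A minimal sketch of this flow, assuming simplified signatures: the helper names (should_summarize_reasoning, build_summary_prompt, summarize_reasoning) are the ones this PR adds to utils.py, but the prompt wording and the client call below are illustrative, not the exact implementation.

def should_summarize_reasoning(summary_mode: str | None, reasoning_text: str) -> bool:
    # Summarize only when a mode was requested AND the provider exposed reasoning.
    return summary_mode in ("concise", "detailed", "auto") and bool(reasoning_text.strip())

def build_summary_prompt(summary_mode: str, reasoning_text: str) -> str:
    if summary_mode == "detailed":
        instruction = "Write a thorough summary preserving the key logical steps and decisions."
    else:  # "concise" and "auto" both aim for a one-to-two sentence conclusion
        instruction = "Summarize the final conclusion in one or two sentences."
    return f"{instruction}\n\nReasoning:\n{reasoning_text}"

async def summarize_reasoning(client, model: str, summary_mode: str, reasoning_text: str) -> str:
    # The second inference call described above: a plain chat completion
    # over the captured reasoning text.
    completion = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_summary_prompt(summary_mode, reasoning_text)}],
    )
    return completion.choices[0].message.content or ""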

For providers like OpenAI that hide reasoning content entirely (only exposing reasoning_tokens in usage), no reasoning output is produced since there is nothing to summarize. A future follow-up will add a passthrough path for providers that natively support reasoning summaries via their own Responses API.

Key design decisions

  • Token usage tracking: The second inference call's token usage is folded into the response's usage totals via stream_options={"include_usage": true}, so callers see the full cost (sketched after this list).
  • Error propagation: Failures during summary generation propagate to the caller rather than being silently swallowed.
  • Helpers in utils.py: should_summarize_reasoning, build_summary_prompt, and summarize_reasoning live in utils.py for testability and to keep the streaming orchestrator focused.
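
A hedged sketch of the usage folding mentioned above: _accumulate_usage is named in a later commit on this PR, but the field handling below is an assumption based on the usage objects shown in the test plan.

def _accumulate_usage(totals: dict, extra: dict | None) -> dict:
    # Fold one call's token counts into the running totals so the caller
    # sees the combined cost of the primary call and the summary call.
    if not extra:
        return totals
    for key in ("input_tokens", "output_tokens", "total_tokens"):
        totals[key] = totals.get(key, 0) + extra.get(key, 0)
    return totals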

Test plan

Automated tests

Unit tests (26 tests, all passing):

uv run pytest tests/unit/providers/inline/responses/builtin/responses/test_streaming.py -v
  • TestShouldSummarizeReasoning — 4 tests covering None, concise, detailed, auto (a representative case is sketched after this list)
  • TestBuildSummaryPrompt — 3 tests for concise, detailed, auto prompt generation
  • TestSummarizeReasoning — 8 tests: event sequence, text accumulation, sequence numbers, item_id propagation, error propagation, empty stream, non-streaming fallback, usage chunk collection
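
A hypothetical shape for one of those cases, assuming should_summarize_reasoning takes the requested mode and the reasoning text; the real assertions in test_streaming.py may differ.

import pytest

from llama_stack.providers.inline.responses.builtin.responses.utils import (
    should_summarize_reasoning,
)

@pytest.mark.parametrize("mode,expected", [
    (None, False),       # no summary requested: no second call is made
    ("concise", True),
    ("detailed", True),
    ("auto", True),
])
def test_should_summarize_reasoning(mode, expected):
    assert should_summarize_reasoning(mode, "some reasoning text") is expected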

Integration tests (11 test cases):

uv run pytest tests/integration/responses/test_reasoning.py -v --stack-config=server:starter:8321 --text-model=ollama/gpt-oss:20b --inference-mode=live
  • test_reasoning_summary_streaming[concise/detailed/auto] — Verifies all 4 summary event types are emitted and deltas concatenate to final text
  • test_reasoning_summary_non_streaming — Summary field populated on reasoning item
  • test_reasoning_summary_event_ordering — Summary events appear after reasoning text events
  • test_reasoning_summary_sequence_numbers — Strictly increasing sequence numbers
  • test_reasoning_no_summary_without_request — No summary events when not requested
  • test_reasoning_summary_usage_included — Token usage with summary exceeds without

Manual end-to-end tests

Tested with vLLM (openai/gpt-oss-20b) and Ollama (gpt-oss:20b) through the Llama Stack starter distribution.

Test 1: summary: "concise"

Request:

curl -s http://localhost:8321/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/openai/gpt-oss-20b",
    "input": "What is 2 + 2?",
    "reasoning": { "effort": "low", "summary": "concise" }
  }'

Response (abbreviated):

{
  "output": [
    {
      "type": "reasoning",
      "summary": [
        {
          "text": "The reasoning concludes that the answer is 4.",
          "type": "summary_text"
        }
      ],
      "content": [
        { "text": "Answer is 4.", "type": "reasoning_text" }
      ],
      "status": "completed"
    },
    {
      "type": "message",
      "content": [
        { "text": "2 + 2 = 4.", "type": "output_text" }
      ],
      "status": "completed"
    }
  ],
  "usage": {
    "input_tokens": 189,
    "output_tokens": 249,
    "total_tokens": 438
  }
}

438 total tokens — includes both the primary reasoning call and the second summarization call.

Test 2: summary: "detailed"

Request:

curl -s http://localhost:8321/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/openai/gpt-oss-20b",
    "input": "Prove that there are infinitely many prime numbers.",
    "reasoning": { "effort": "high", "summary": "detailed" }
  }'

Response (abbreviated):

{
  "output": [
    {
      "type": "reasoning",
      "summary": [
        {
          "text": "**Euclid's proof that there are infinitely many primes** ...[full structured proof with logical steps preserved]",
          "type": "summary_text"
        }
      ],
      "content": [
        {
          "text": "The user says: \"Prove that there are infinitely many prime numbers.\" ...[full chain-of-thought reasoning]",
          "type": "reasoning_text"
        }
      ],
      "status": "completed"
    },
    {
      "type": "message",
      "content": [
        { "text": "**Proof (Euclid's classic argument)** ...", "type": "output_text" }
      ],
      "status": "completed"
    }
  ],
  "usage": {
    "input_tokens": 1715,
    "output_tokens": 2768,
    "total_tokens": 4483
  }
}

4,483 total tokens — the detailed summary preserves logical steps, consuming more tokens than concise mode.

Test 3: No summary requested (no regression)

Request:

curl -s http://localhost:8321/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/openai/gpt-oss-20b",
    "input": "What is 2 + 2?",
    "reasoning": { "effort": "low" }
  }'

Response (abbreviated):

{
  "output": [
    {
      "type": "reasoning",
      "summary": [],
      "content": [
        { "text": "Simple math.", "type": "reasoning_text" }
      ],
      "status": "completed"
    },
    {
      "type": "message",
      "content": [
        { "text": "2 + 2 = **4**.", "type": "output_text" }
      ],
      "status": "completed"
    }
  ],
  "usage": {
    "input_tokens": 73,
    "output_tokens": 22,
    "total_tokens": 95
  }
}

95 total tokens — no second inference call is made. Compare with Test 1's 438 tokens for the same prompt with summary: "concise" to see the additional cost.

Comparison with OpenAI's native output

Field     OpenAI Spec                          Our Output
type      "reasoning"                          "reasoning"
summary   list[{text, type: "summary_text"}]   Same
content   Hidden (OpenAI never exposes)        Exposed (from vLLM/Ollama)
status    Not returned by OpenAI               "completed"

Future work

  • Add a passthrough path for providers that natively support reasoning summaries via their Responses API (e.g., OpenAI)

Made with Cursor

…expose reasoning content

When a user requests reasoning summaries (e.g., reasoning.summary = "concise" or "detailed"),
and the inference provider exposes reasoning content through Chat Completions (such as Ollama
and vLLM with reasoning models like GPT-OSS), Llama Stack now makes a second inference call
to summarize that reasoning content and returns it in the OpenAI Responses API format.

This addresses providers that do NOT natively support reasoning summaries but DO provide full
reasoning text. For providers like OpenAI that hide reasoning content entirely, the behavior
is unchanged — no reasoning output is produced since there is nothing to summarize.

Closes ogx-ai#4735

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
@meta-cla bot added the CLA Signed label Apr 6, 2026
@robinnarsinghranabhat
Contributor

Observations:

  • The OpenAI Responses API doesn't send actual reasoning back by default. It expects clients to use responses in stateful mode, where the server-side implementation internally binds the actual reasoning tokens as input to the LLM. For stateless mode, when customers want to manage conversation history themselves (e.g., appending the previous response's output to the new user input), OpenAI provides an encrypted-reasoning field option, and the reasoning item users get back is encrypted (see the sketch after this list).
  • When we ask for a reasoning summary, I assume they must internally do something like what this PR is doing. But we might want to settle on the best approach, or at least the prompts, for generating these "reasoning summaries". For example, should they be checked for harmful content, as the Responses docs say?
  • We should make users aware of the additional costs (tokens, latency) this option introduces, and of the fact that it serves only a decorative purpose (I could be wrong here). Moreover, we wouldn't want users to get confused and try to use reasoning summaries for multi-turn scenarios.
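
A rough sketch of the stateless pattern from the first bullet, assuming the standard OpenAI Python client; the model name and prompts are arbitrary, and feeding response.output back as input follows OpenAI's documented stateless usage.

from openai import OpenAI

client = OpenAI()

# Ask for encrypted reasoning so it can be carried across stateless turns.
first = client.responses.create(
    model="gpt-5",
    store=False,  # stateless: nothing persisted server-side
    input=[{"role": "user", "content": "What is 2 + 2?"}],
    reasoning={"effort": "low", "summary": "concise"},
    include=["reasoning.encrypted_content"],
)

# Stateless multi-turn: append the previous output items (including the
# encrypted reasoning item) to the next request's input.
second = client.responses.create(
    model="gpt-5",
    store=False,
    input=[*first.output, {"role": "user", "content": "Now double it."}],
    include=["reasoning.encrypted_content"],
)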

Thoughts on coding practice:

  • Move the helper functions to a utility module (keeping the responses layer clean)
  • Errors show up in the server logs, but users don't see them
  • Explain choices such as temperature=0.3 and the prompt used
  • Add tests

Contributor

iamemilio commented Apr 6, 2026

This is a cool idea. Do we need to run the summarized reasoning through a guardrail check?

Collaborator

@mattf mattf left a comment


this is a good direction, it needs -

  • unit tests
  • integration tests
  • support for token usage
  • to not silently fail

i wouldn't mix guardrails into this. they can be implemented by the app on the output before passing it along.

Nehanth added 5 commits April 6, 2026 16:37
… reasoning summary

Add unit tests for should_summarize_reasoning, build_summary_prompt, and
summarize_reasoning covering event sequences, text accumulation, sequence
numbering, error propagation, and usage chunk collection. Add integration
tests for reasoning summary streaming (concise/detailed/auto), non-streaming,
event ordering, sequence number validation, negative case (no summary without
request), and token usage accounting. Move summarization helpers to utils.py
for testability. Track second inference call token usage via stream_options
and propagate exceptions instead of silently returning.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Add recorded API responses for the reasoning summary integration tests
using bedrock/openai.gpt-oss-20b, enabling replay mode in CI.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
…s config)

Record reasoning summary tests using server:ci-tests config with
record-if-missing mode to match CI request hashes.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Auto-generated by pre-commit hook.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Contributor Author

Nehanth commented Apr 6, 2026

this is a good direction, it needs -

  • unit tests
  • integration tests
  • support for token usage
  • to not silently fail

i wouldn't mix guardrails into this. they can be implemented by the app on the output before passing it along.

Thanks for the feedback.

Added all of that, please take a look.

Contributor Author

Nehanth commented Apr 6, 2026

Observations:

  • The OpenAI Responses API doesn't send actual reasoning back by default. It expects clients to use responses in stateful mode, where the server-side implementation internally binds the actual reasoning tokens as input to the LLM. For stateless mode, when customers want to manage conversation history themselves (e.g., appending the previous response's output to the new user input), OpenAI provides an encrypted-reasoning field option, and the reasoning item users get back is encrypted.
  • When we ask for a reasoning summary, I assume they must internally do something like what this PR is doing. But we might want to settle on the best approach, or at least the prompts, for generating these "reasoning summaries". For example, should they be checked for harmful content, as the Responses docs say?
  • We should make users aware of the additional costs (tokens, latency) this option introduces, and of the fact that it serves only a decorative purpose (I could be wrong here). Moreover, we wouldn't want users to get confused and try to use reasoning summaries for multi-turn scenarios.

Thoughts on coding practice:

  • Move the helper functions to a utility module (keeping the responses layer clean)
  • Errors show up in the server logs, but users don't see them
  • Explain choices such as temperature=0.3 and the prompt used
  • Add tests

@mattf, where in the LLS docs do you think this belongs: making users aware of the additional costs (tokens, latency) this option introduces, noting that it serves only a decorative purpose, and warning users not to confuse reasoning summaries with something to use for multi-turn scenarios?

@Nehanth Nehanth requested a review from mattf April 6, 2026 21:55
Contributor

robinnarsinghranabhat commented Apr 8, 2026

@Nehanth We need to take a holistic approach, as some providers like Gemini already send actual summaries. Please take a look at my comment.

Edit: that should not be needed, as I see we run summaries only when reasoning text is present. But something crucial I have noticed is that the event sequence is not what the OpenAI Responses server emits.

To avoid adding more technical debt, we could correct the reasoning summary events by comparing against the official Responses API with GPT models.

For example, with this script:

from openai import OpenAI
client = OpenAI()

tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get weather forecast",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    },
    {
        "type": "function",
        "name": "get_events",
        "description": "Get events happening in a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    }
]

response = client.responses.create(
    model="gpt-5",
    stream=True,
    reasoning={"effort": "high", "summary": "detailed"},
    tools=tools,
    input=[{
        "role": "user",
        "content": """
Plan a 2-day trip to Boston.

Steps:
1. Check weather
2. Based on weather, fetch relevant events
3. Create itinerary
4. Critically evaluate your itinerary and improve it
"""
    }],
    include=["reasoning.encrypted_content"]
)

I might be wrong, but I don't see the expected summary event sequence below being followed:

ResponseOutputItemAddedEvent
ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryPartDoneEvent
ResponseOutputItemDoneEvent
...

I suggest doing a thorough test to verify the correctness of the event sequence and how the events are populated, and reflecting that in the tests accordingly.

More importantly, the logic should handle the possible scenario of multiple "Summary Event Sequence" blocks.

For example :

ResponseOutputItemAddedEvent

ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryPartDoneEvent

ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryTextDoneEvent
ResponseReasoningSummaryPartDoneEvent

ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryTextDoneEvent
ResponseReasoningSummaryPartDoneEvent

ResponseOutputItemDoneEvent

Contributor

robinnarsinghranabhat commented Apr 8, 2026

@mattf Let us know if we can skip fixing the stream sequence here and address it in a separate PR.

Collaborator

mattf commented Apr 8, 2026

@robinnarsinghranabhat if there's a bug in the stream sequence, please fix it independently. if this pr is introducing the bug, fix it here.

Nehanth added 5 commits April 8, 2026 17:10
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor

# Conflicts:
#	docs/docs/api-openai/provider_matrix.md
#	src/llama_stack/providers/inline/responses/builtin/responses/utils.py
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Correct the streaming event sequence for reasoning summaries to match
what OpenAI's Responses API emits. Key changes:

- Emit OutputItemAdded (status=in_progress) before summary events and
  OutputItemDone (status=completed) after, wrapping the full reasoning
  item lifecycle.
- Support multiple summary parts by splitting on paragraph boundaries,
  each with its own PartAdded/TextDelta/TextDone/PartDone block.
- Fix flaky integration test that compared two non-deterministic LLM
  calls; now asserts usage fields are positive on a single call.
- Add missing unit test assertions for summary_index and PartAdded
  empty text placeholder.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Bedrock does not always populate prompt_tokens and completion_tokens
in usage data. Only assert total_tokens > 0 to avoid false failures
in replay mode.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Nehanth added 7 commits April 10, 2026 12:39
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
The internal summarization call does not need streaming since we collect
the full text before emitting events. Also extracts _accumulate_usage
from _accumulate_chunk_usage for reuse.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Generate recordings for all reasoning summary tests so the Bedrock CI
suite passes in replay mode.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Comment on lines +713 to +720
raw_paragraphs = full_text.split("\n\n")
paragraphs = [p for p in raw_paragraphs if p.strip()]
if not paragraphs:
    paragraphs = [full_text]

seq = start_sequence_number
for summary_index, paragraph_text in enumerate(paragraphs):
    seq += 1
Collaborator

why are you doing this?

Contributor Author

The reason I'm splitting into paragraphs is that @robinnarsinghranabhat's review here noted that OpenAI can emit multiple summary part sequences within a single reasoning item. Splitting on paragraph breaks gives us the multipart support to match that behavior. If the summary is a single paragraph, it is just emitted as one part.

Collaborator

simplify for now, just emit the whole content at once. if we need to later we can introduce streaming.

Contributor Author

@Nehanth Nehanth Apr 17, 2026

@mattf Just to clarify, are you asking to remove the paragraph splitting or to skip the intermediate summary streaming events entirely and just attach the summary to the reasoning output item in OutputItemDone?

Contributor Author

Hey @mattf, I've removed the intermediate summary streaming events and attached the summary directly to the reasoning output item in OutputItemDone. Since we're removing streaming for the reasoning summary, there's no point in emitting those events.

Let me know if this is fine or if anything needs to change. Thanks!
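
For reference, a rough sketch of the resulting final event: the item fields are taken from the abbreviated responses in the test plan above, while the event wrapper shape and wire name are assumptions.

# The summary now arrives attached to the reasoning item in OutputItemDone,
# with no intermediate reasoning_summary_* streaming events.
output_item_done = {
    "type": "response.output_item.done",  # assumed wire name for OutputItemDone
    "item": {
        "type": "reasoning",
        "summary": [{"type": "summary_text", "text": "The reasoning concludes that the answer is 4."}],
        "content": [{"type": "reasoning_text", "text": "Answer is 4."}],
        "status": "completed",
    },
}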

@Nehanth Nehanth requested a review from mattf April 14, 2026 19:07

@cdoern cdoern enabled auto-merge April 21, 2026 14:39
@cdoern cdoern added this pull request to the merge queue Apr 21, 2026
Merged via the queue into ogx-ai:main with commit 4958b3d Apr 21, 2026
67 checks passed