
feat(responses): add reasoning summary generation for providers that expose reasoning content#5451

Merged
cdoern merged 36 commits into ogx-ai:main from Nehanth:reasoning-output
Apr 21, 2026
Conversation

Contributor

@Nehanth Nehanth commented Apr 6, 2026

Summary

Implements reasoning summary generation for the Responses API (reasoning.summary) for inference providers that expose full reasoning content through Chat Completions but do NOT natively provide summaries — such as Ollama and vLLM with reasoning models (e.g. GPT-OSS).

Closes #4735

How it works

When reasoning.summary is set to "concise", "detailed", or "auto", and the provider returns reasoning content via Chat Completions, Llama Stack makes a second inference call to summarize that reasoning text. The summary is streamed back using the existing Responses API event types (reasoning_summary_text.delta, reasoning_summary_text.done, etc.) and populates the summary field on the reasoning output item.

  • concise / auto: One-to-two sentence summary focusing on the final conclusion
  • detailed: Thorough summary preserving key logical steps and decisions
  • No summary requested: No second call is made — behavior is unchanged (summary: [])
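
A minimal sketch of this flow, assuming simplified signatures: the helper names (should_summarize_reasoning, build_summary_prompt, summarize_reasoning) are the ones this PR adds to utils.py, but the prompt wording and the client call below are illustrative, not the exact implementation.

def should_summarize_reasoning(summary_mode: str | None, reasoning_text: str) -> bool:
    # Summarize only when a mode was requested AND the provider exposed reasoning.
    return summary_mode in ("concise", "detailed", "auto") and bool(reasoning_text.strip())

def build_summary_prompt(summary_mode: str, reasoning_text: str) -> str:
    if summary_mode == "detailed":
        instruction = "Write a thorough summary preserving the key logical steps and decisions."
    else:  # "concise" and "auto" both aim for a one-to-two sentence conclusion
        instruction = "Summarize the final conclusion in one or two sentences."
    return f"{instruction}\n\nReasoning:\n{reasoning_text}"

async def summarize_reasoning(client, model: str, summary_mode: str, reasoning_text: str) -> str:
    # The second inference call described above: a plain chat completion
    # over the captured reasoning text.
    completion = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_summary_prompt(summary_mode, reasoning_text)}],
    )
    return completion.choices[0].message.content or ""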

For providers like OpenAI that hide reasoning content entirely (only exposing reasoning_tokens in usage), no reasoning output is produced since there is nothing to summarize. A future follow-up will add a passthrough path for providers that natively support reasoning summaries via their own Responses API.

Key design decisions

  • Token usage tracking: The second inference call's token usage is folded into the response's usage totals via stream_options={"include_usage": true}, so callers see the full cost (sketched after this list).
  • Error propagation: Failures during summary generation propagate to the caller rather than being silently swallowed.
  • Helpers in utils.py: should_summarize_reasoning, build_summary_prompt, and summarize_reasoning live in utils.py for testability and to keep the streaming orchestrator focused.
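
A hedged sketch of the usage folding mentioned above: _accumulate_usage is named in a later commit on this PR, but the field handling below is an assumption based on the usage objects shown in the test plan.

def _accumulate_usage(totals: dict, extra: dict | None) -> dict:
    # Fold one call's token counts into the running totals so the caller
    # sees the combined cost of the primary call and the summary call.
    if not extra:
        return totals
    for key in ("input_tokens", "output_tokens", "total_tokens"):
        totals[key] = totals.get(key, 0) + extra.get(key, 0)
    return totals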

Test plan

Automated tests

Unit tests (26 tests, all passing):

uv run pytest tests/unit/providers/inline/responses/builtin/responses/test_streaming.py -v
  • TestShouldSummarizeReasoning — 4 tests covering None, concise, detailed, auto (a representative case is sketched after this list)
  • TestBuildSummaryPrompt — 3 tests for concise, detailed, auto prompt generation
  • TestSummarizeReasoning — 8 tests: event sequence, text accumulation, sequence numbers, item_id propagation, error propagation, empty stream, non-streaming fallback, usage chunk collection
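
A hypothetical shape for one of those cases, assuming should_summarize_reasoning takes the requested mode and the reasoning text; the real assertions in test_streaming.py may differ.

import pytest

from llama_stack.providers.inline.responses.builtin.responses.utils import (
    should_summarize_reasoning,
)

@pytest.mark.parametrize("mode,expected", [
    (None, False),       # no summary requested: no second call is made
    ("concise", True),
    ("detailed", True),
    ("auto", True),
])
def test_should_summarize_reasoning(mode, expected):
    assert should_summarize_reasoning(mode, "some reasoning text") is expected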

Integration tests (11 test cases):

uv run pytest tests/integration/responses/test_reasoning.py -v --stack-config=server:starter:8321 --text-model=ollama/gpt-oss:20b --inference-mode=live
  • test_reasoning_summary_streaming[concise/detailed/auto] — Verifies all 4 summary event types are emitted and deltas concatenate to final text
  • test_reasoning_summary_non_streaming — Summary field populated on reasoning item
  • test_reasoning_summary_event_ordering — Summary events appear after reasoning text events
  • test_reasoning_summary_sequence_numbers — Strictly increasing sequence numbers
  • test_reasoning_no_summary_without_request — No summary events when not requested
  • test_reasoning_summary_usage_included — Token usage with summary exceeds without

Manual end-to-end tests

Tested with vLLM (openai/gpt-oss-20b) and Ollama (gpt-oss:20b) through the Llama Stack starter distribution.

Test 1: summary: "concise"

Request:

curl -s http://localhost:8321/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/openai/gpt-oss-20b",
    "input": "What is 2 + 2?",
    "reasoning": { "effort": "low", "summary": "concise" }
  }'

Response (abbreviated):

{
  "output": [
    {
      "type": "reasoning",
      "summary": [
        {
          "text": "The reasoning concludes that the answer is 4.",
          "type": "summary_text"
        }
      ],
      "content": [
        { "text": "Answer is 4.", "type": "reasoning_text" }
      ],
      "status": "completed"
    },
    {
      "type": "message",
      "content": [
        { "text": "2 + 2 = 4.", "type": "output_text" }
      ],
      "status": "completed"
    }
  ],
  "usage": {
    "input_tokens": 189,
    "output_tokens": 249,
    "total_tokens": 438
  }
}

438 total tokens — includes both the primary reasoning call and the second summarization call.

Test 2: summary: "detailed"

Request:

curl -s http://localhost:8321/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/openai/gpt-oss-20b",
    "input": "Prove that there are infinitely many prime numbers.",
    "reasoning": { "effort": "high", "summary": "detailed" }
  }'

Response (abbreviated):

{
  "output": [
    {
      "type": "reasoning",
      "summary": [
        {
          "text": "**Euclid's proof that there are infinitely many primes** ...[full structured proof with logical steps preserved]",
          "type": "summary_text"
        }
      ],
      "content": [
        {
          "text": "The user says: \"Prove that there are infinitely many prime numbers.\" ...[full chain-of-thought reasoning]",
          "type": "reasoning_text"
        }
      ],
      "status": "completed"
    },
    {
      "type": "message",
      "content": [
        { "text": "**Proof (Euclid's classic argument)** ...", "type": "output_text" }
      ],
      "status": "completed"
    }
  ],
  "usage": {
    "input_tokens": 1715,
    "output_tokens": 2768,
    "total_tokens": 4483
  }
}

4,483 total tokens — the detailed summary preserves logical steps, consuming more tokens than concise mode.

Test 3: No summary requested (no regression)

Request:

curl -s http://localhost:8321/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "vllm/openai/gpt-oss-20b",
    "input": "What is 2 + 2?",
    "reasoning": { "effort": "low" }
  }'

Response (abbreviated):

{
  "output": [
    {
      "type": "reasoning",
      "summary": [],
      "content": [
        { "text": "Simple math.", "type": "reasoning_text" }
      ],
      "status": "completed"
    },
    {
      "type": "message",
      "content": [
        { "text": "2 + 2 = **4**.", "type": "output_text" }
      ],
      "status": "completed"
    }
  ],
  "usage": {
    "input_tokens": 73,
    "output_tokens": 22,
    "total_tokens": 95
  }
}

95 total tokens — no second inference call is made. Compare with Test 1's 438 tokens for the same prompt with summary: "concise" to see the additional cost.

Comparison with OpenAI's native output

Field     OpenAI Spec                          Our Output
type      "reasoning"                          "reasoning"
summary   list[{text, type: "summary_text"}]   Same
content   Hidden (OpenAI never exposes)        Exposed (from vLLM/Ollama)
status    Not returned by OpenAI               "completed"

Future work

  • Add a passthrough path for providers that natively support reasoning summaries via their Responses API (e.g., OpenAI)

Made with Cursor

…expose reasoning content

When a user requests reasoning summaries (e.g., reasoning.summary = "concise" or "detailed"),
and the inference provider exposes reasoning content through Chat Completions (such as Ollama
and vLLM with reasoning models like GPT-OSS), Llama Stack now makes a second inference call
to summarize that reasoning content and returns it in the OpenAI Responses API format.

This addresses providers that do NOT natively support reasoning summaries but DO provide full
reasoning text. For providers like OpenAI that hide reasoning content entirely, the behavior
is unchanged — no reasoning output is produced since there is nothing to summarize.

Closes ogx-ai#4735

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
@meta-cla bot added the CLA Signed label Apr 6, 2026
@robinnarsinghranabhat
Contributor

Observations:

  • The OpenAI Responses API doesn't send actual reasoning back by default. It expects clients to use responses in stateful mode, where the server-side implementation internally binds the actual reasoning tokens as input to the LLM. For stateless mode, when customers want to manage conversation history themselves (e.g., appending the previous response's output to the new user input), OpenAI provides an encrypted-reasoning field option, and the reasoning item users get back is encrypted (see the sketch after this list).
  • When we ask for a reasoning summary, I assume they must internally do something like what this PR is doing. But we might want to settle on the best approach, or at least the prompts, for generating these "reasoning summaries". For example, should they be checked for harmful content, as the Responses docs say?
  • We should make users aware of the additional costs (tokens, latency) this option introduces, and of the fact that it serves only a decorative purpose (I could be wrong here). Moreover, we wouldn't want users to get confused and try to use reasoning summaries for multi-turn scenarios.
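
A rough sketch of the stateless pattern from the first bullet, assuming the standard OpenAI Python client; the model name and prompts are arbitrary, and feeding response.output back as input follows OpenAI's documented stateless usage.

from openai import OpenAI

client = OpenAI()

# Ask for encrypted reasoning so it can be carried across stateless turns.
first = client.responses.create(
    model="gpt-5",
    store=False,  # stateless: nothing persisted server-side
    input=[{"role": "user", "content": "What is 2 + 2?"}],
    reasoning={"effort": "low", "summary": "concise"},
    include=["reasoning.encrypted_content"],
)

# Stateless multi-turn: append the previous output items (including the
# encrypted reasoning item) to the next request's input.
second = client.responses.create(
    model="gpt-5",
    store=False,
    input=[*first.output, {"role": "user", "content": "Now double it."}],
    include=["reasoning.encrypted_content"],
)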

Thoughts on coding practice:

  • Move the helper functions to a utility module (keeping the responses layer clean)
  • Errors show up in the server logs, but users don't see them
  • Explain choices such as temperature=0.3 and the prompt used
  • Add tests

Contributor

iamemilio commented Apr 6, 2026

This is a cool idea. Do we need to run the summarized reasoning through a guardrail check?

Collaborator

@mattf mattf left a comment


this is a good direction, it needs -

  • unit tests
  • integration tests
  • support for token usage
  • to not silently fail

i wouldn't mix guardrails into this. they can be implemented by the app on the output before passing it along.

Nehanth added 5 commits April 6, 2026 16:37
… reasoning summary

Add unit tests for should_summarize_reasoning, build_summary_prompt, and
summarize_reasoning covering event sequences, text accumulation, sequence
numbering, error propagation, and usage chunk collection. Add integration
tests for reasoning summary streaming (concise/detailed/auto), non-streaming,
event ordering, sequence number validation, negative case (no summary without
request), and token usage accounting. Move summarization helpers to utils.py
for testability. Track second inference call token usage via stream_options
and propagate exceptions instead of silently returning.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Add recorded API responses for the reasoning summary integration tests
using bedrock/openai.gpt-oss-20b, enabling replay mode in CI.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
…s config)

Record reasoning summary tests using server:ci-tests config with
record-if-missing mode to match CI request hashes.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Auto-generated by pre-commit hook.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Contributor Author

Nehanth commented Apr 6, 2026

this is a good direction, it needs -

  • unit tests
  • integration tests
  • support for token usage
  • to not silently fail

i wouldn't mix guardrails into this. they can be implemented by the app on the output before passing it along.

Thanks for the feedback.

Added all of that, please take a look.

Contributor Author

Nehanth commented Apr 6, 2026

Observations:

  • The OpenAI Responses API doesn't send actual reasoning back by default. It expects clients to use responses in stateful mode, where the server-side implementation internally binds the actual reasoning tokens as input to the LLM. For stateless mode, when customers want to manage conversation history themselves (e.g., appending the previous response's output to the new user input), OpenAI provides an encrypted-reasoning field option, and the reasoning item users get back is encrypted.
  • When we ask for a reasoning summary, I assume they must internally do something like what this PR is doing. But we might want to settle on the best approach, or at least the prompts, for generating these "reasoning summaries". For example, should they be checked for harmful content, as the Responses docs say?
  • We should make users aware of the additional costs (tokens, latency) this option introduces, and of the fact that it serves only a decorative purpose (I could be wrong here). Moreover, we wouldn't want users to get confused and try to use reasoning summaries for multi-turn scenarios.

Thoughts on coding practice:

  • Move the helper functions to a utility module (keeping the responses layer clean)
  • Errors show up in the server logs, but users don't see them
  • Explain choices such as temperature=0.3 and the prompt used
  • Add tests

@mattf, where in the LLS docs do you think this belongs: making users aware of the additional costs (tokens, latency) this option introduces, noting that it serves only a decorative purpose, and warning users not to confuse reasoning summaries with something to use for multi-turn scenarios?

@Nehanth Nehanth requested a review from mattf April 6, 2026 21:55
Contributor

robinnarsinghranabhat commented Apr 8, 2026

@Nehanth We need to take a holistic approach, as some providers like Gemini already send actual summaries. Please take a look at my comment.

Edit: that should not be needed, as I see we run summaries only when reasoning text is present. But something crucial I have noticed is that the event sequence is not what the OpenAI Responses server emits.

To avoid adding more technical debt, we could correct the reasoning summary events by comparing against the official Responses API with GPT models.

For example, with this script:

from openai import OpenAI
client = OpenAI()

tools = [
    {
        "type": "function",
        "name": "get_weather",
        "description": "Get weather forecast",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    },
    {
        "type": "function",
        "name": "get_events",
        "description": "Get events happening in a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string"}
            },
            "required": ["city"]
        }
    }
]

response = client.responses.create(
    model="gpt-5",
    stream=True,
    reasoning={"effort": "high", "summary": "detailed"},
    tools=tools,
    input=[{
        "role": "user",
        "content": """
Plan a 2-day trip to Boston.

Steps:
1. Check weather
2. Based on weather, fetch relevant events
3. Create itinerary
4. Critically evaluate your itinerary and improve it
"""
    }],
    include=["reasoning.encrypted_content"]
)

I might be wrong, but I don't see the expected summary event sequence below being followed:

ResponseOutputItemAddedEvent
ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryPartDoneEvent
ResponseOutputItemDoneEvent
...

I suggest doing a thorough test to verify the correctness of the event sequence and how the events are populated, and reflecting that in the tests accordingly.

More importantly, the logic should handle the possible scenario of multiple "Summary Event Sequence" blocks.

For example :

ResponseOutputItemAddedEvent

ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryPartDoneEvent

ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryTextDoneEvent
ResponseReasoningSummaryPartDoneEvent

ResponseReasoningSummaryPartAddedEvent
ResponseReasoningSummaryTextDeltaEvent * N
ResponseReasoningSummaryTextDoneEvent
ResponseReasoningSummaryPartDoneEvent

ResponseOutputItemDoneEvent

Contributor

robinnarsinghranabhat commented Apr 8, 2026

@mattf Let us know if we can skip fixing the stream sequence here and address it in a separate PR.

Collaborator

mattf commented Apr 8, 2026

@robinnarsinghranabhat if there's a bug in the stream sequence, please fix it independently. if this pr is introducing the bug, fix it here.

Nehanth added 5 commits April 8, 2026 17:10
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor

# Conflicts:
#	docs/docs/api-openai/provider_matrix.md
#	src/llama_stack/providers/inline/responses/builtin/responses/utils.py
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Correct the streaming event sequence for reasoning summaries to match
what OpenAI's Responses API emits. Key changes:

- Emit OutputItemAdded (status=in_progress) before summary events and
  OutputItemDone (status=completed) after, wrapping the full reasoning
  item lifecycle.
- Support multiple summary parts by splitting on paragraph boundaries,
  each with its own PartAdded/TextDelta/TextDone/PartDone block.
- Fix flaky integration test that compared two non-deterministic LLM
  calls; now asserts usage fields are positive on a single call.
- Add missing unit test assertions for summary_index and PartAdded
  empty text placeholder.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Bedrock does not always populate prompt_tokens and completion_tokens
in usage data. Only assert total_tokens > 0 to avoid false failures
in replay mode.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Nehanth added 7 commits April 10, 2026 12:39
Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
The internal summarization call does not need streaming since we collect
the full text before emitting events. Also extracts _accumulate_usage
from _accumulate_chunk_usage for reuse.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Generate recordings for all reasoning summary tests so the Bedrock CI
suite passes in replay mode.

Signed-off-by: Nehanth <nehanthnarendrula@gmail.com>
Made-with: Cursor
Comment on lines +713 to +720
raw_paragraphs = full_text.split("\n\n")
paragraphs = [p for p in raw_paragraphs if p.strip()]
if not paragraphs:
    paragraphs = [full_text]

seq = start_sequence_number
for summary_index, paragraph_text in enumerate(paragraphs):
    seq += 1
Collaborator

why are you doing this?

Contributor Author

The reason I'm splitting into paragraphs is that @robinnarsinghranabhat's review here noted that OpenAI can emit multiple summary part sequences within a single reasoning item. Splitting on paragraph breaks gives us the multipart support to match that behavior. If the summary is a single paragraph, it is just emitted as one part.

Collaborator

simplify for now, just emit the whole content at once. if we need to later we can introduce streaming.

Contributor Author

@Nehanth Nehanth Apr 17, 2026

@mattf Just to clarify, are you asking to remove the paragraph splitting or to skip the intermediate summary streaming events entirely and just attach the summary to the reasoning output item in OutputItemDone?

Contributor Author

Hey @mattf, I've removed the intermediate summary streaming events and attached the summary directly to the reasoning output item in OutputItemDone. Since we're removing streaming for the reasoning summary, there's no point in emitting those events.

Let me know if this is fine or if anything needs to change. Thanks!
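
For reference, a rough sketch of the resulting final event: the item fields are taken from the abbreviated responses in the test plan above, while the event wrapper shape and wire name are assumptions.

# The summary now arrives attached to the reasoning item in OutputItemDone,
# with no intermediate reasoning_summary_* streaming events.
output_item_done = {
    "type": "response.output_item.done",  # assumed wire name for OutputItemDone
    "item": {
        "type": "reasoning",
        "summary": [{"type": "summary_text", "text": "The reasoning concludes that the answer is 4."}],
        "content": [{"type": "reasoning_text", "text": "Answer is 4."}],
        "status": "completed",
    },
}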

@Nehanth Nehanth requested a review from mattf April 14, 2026 19:07

@cdoern cdoern enabled auto-merge April 21, 2026 14:39
@cdoern cdoern added this pull request to the merge queue Apr 21, 2026
Merged via the queue into ogx-ai:main with commit 4958b3d Apr 21, 2026
67 checks passed