fix(copilot): reduce premium request inflation, enable thinking, and use dynamic API limits#485

Merged
luispater merged 3 commits into router-for-me:main from
kunish:fix/copilot-premium-request-inflation
Apr 3, 2026
Conversation


@kunish kunish commented Apr 3, 2026

Summary

  • Reduce premium request inflation by correctly detecting agent-initiated requests (tool loops, continuations) vs user-initiated requests, setting X-Initiator: agent to avoid consuming premium quota
  • Enable thinking/reasoning for Copilot Claude models (Opus 4.5/4.6, Sonnet 4/4.5/4.6) with low/medium/high effort levels
  • Use dynamic API limits from the Copilot /models endpoint to prevent "prompt token count exceeds the limit of 128000" errors — the proxy now reads max_prompt_tokens per account type (128K individual, 168K business) instead of hardcoding 200K
  • Fix GitLab Duo context-1m beta header not being applied when routing through the Anthropic gateway
  • Fix flaky parallel tests that shared global model registry state

Key changes

Premium request inflation (X-Initiator)

  • Replace naive last-message role check with isAgentInitiated() that detects tool_result content, preceding assistant tool_use, and Responses API function calls
  • Correctly marks tool loop continuations as agent even when Claude Code wraps tool results in role: "user" messages
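The detection described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: the `message`/`block` types and the `isAgentInitiated` signature are assumptions standing in for the proxy's real request types.

```go
package main

import "fmt"

// message is a simplified stand-in for a chat message.
type message struct {
	Role    string
	Content []block
}

// block is a content block; Type may be "text", "tool_result", or "tool_use".
type block struct {
	Type string
}

// isAgentInitiated reports whether the request continues a tool loop:
// either the last message carries tool_result content (Claude Code wraps
// these in role:"user" messages), or the preceding assistant turn issued
// a tool_use call. A naive last-message role check would miss both cases.
func isAgentInitiated(msgs []message) bool {
	if len(msgs) == 0 {
		return false
	}
	last := msgs[len(msgs)-1]
	for _, b := range last.Content {
		if b.Type == "tool_result" {
			return true
		}
	}
	if len(msgs) >= 2 && msgs[len(msgs)-2].Role == "assistant" {
		for _, b := range msgs[len(msgs)-2].Content {
			if b.Type == "tool_use" {
				return true
			}
		}
	}
	return false
}

func main() {
	toolLoop := []message{
		{Role: "assistant", Content: []block{{Type: "tool_use"}}},
		{Role: "user", Content: []block{{Type: "tool_result"}}},
	}
	fresh := []message{{Role: "user", Content: []block{{Type: "text"}}}}
	fmt.Println(isAgentInitiated(toolLoop), isAgentInitiated(fresh))
}
```

A request classified this way would get `X-Initiator: agent` instead of `user`, so the continuation is not billed as a fresh premium request.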

Thinking support

  • Add Thinking: &ThinkingSupport{Levels: []string{"low", "medium", "high"}} to Copilot Claude model definitions
  • Set reasoning.summary: "auto" default when reasoning.effort is present (matches vscode-copilot-chat behavior)
  • Include reasoning.encrypted_content for reasoning content reuse across turns
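The `reasoning.summary` default can be sketched like this. The map-based payload shape and function name are assumptions for illustration; the real executor works on its own request types.

```go
package main

import "fmt"

// applyReasoningDefaults mirrors the described behavior: when the request
// carries reasoning.effort but no reasoning.summary, default the summary
// to "auto" (matching vscode-copilot-chat). An explicit summary is kept.
func applyReasoningDefaults(body map[string]any) {
	reasoning, ok := body["reasoning"].(map[string]any)
	if !ok {
		return
	}
	if _, hasEffort := reasoning["effort"]; !hasEffort {
		return
	}
	if _, hasSummary := reasoning["summary"]; !hasSummary {
		reasoning["summary"] = "auto"
	}
}

func main() {
	body := map[string]any{"reasoning": map[string]any{"effort": "high"}}
	applyReasoningDefaults(body)
	fmt.Println(body["reasoning"].(map[string]any)["summary"])
}
```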

Dynamic context limits

  • Add CopilotModelLimits struct and Limits() method to extract capabilities.limits from the Copilot /models API response
  • Override static ContextLength with max_prompt_tokens from the API — this is the hard limit that triggers 400 errors
  • Override static MaxCompletionTokens with max_output_tokens from the API
  • Fallback to static values when API limits are unavailable
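The override-with-fallback logic can be sketched as below. The `CopilotModelLimits` field names follow the PR description, but the exact struct and helper signature are assumptions.

```go
package main

import "fmt"

// CopilotModelLimits mirrors capabilities.limits from the Copilot
// /models API response.
type CopilotModelLimits struct {
	MaxPromptTokens int
	MaxOutputTokens int
}

// effectiveLimits prefers API-provided limits and falls back to the
// static model definition when the API omits a value. max_prompt_tokens
// is the hard limit that triggers 400 errors upstream.
func effectiveLimits(api *CopilotModelLimits, staticCtx, staticOut int) (ctx, out int) {
	ctx, out = staticCtx, staticOut
	if api == nil {
		return
	}
	if api.MaxPromptTokens > 0 {
		ctx = api.MaxPromptTokens
	}
	if api.MaxOutputTokens > 0 {
		out = api.MaxOutputTokens
	}
	return
}

func main() {
	// Individual account: API reports 128K prompt tokens, overriding the
	// static 200K context length.
	ctx, out := effectiveLimits(&CopilotModelLimits{MaxPromptTokens: 128000, MaxOutputTokens: 16384}, 200000, 8192)
	fmt.Println(ctx, out)
	// API limits unavailable: static values win.
	ctx, out = effectiveLimits(nil, 200000, 8192)
	fmt.Println(ctx, out)
}
```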

Other fixes

  • Strip context-1m-2025-08-07 beta (unsupported by Copilot)
  • Normalize reasoning_text → reasoning_content field mapping for streaming/non-streaming
  • Local token counting via tiktoken for CountTokens endpoint
  • Set the Openai-Intent header to conversation-edits
  • Fix GitLab Duo gitlab_duo_force_context_1m attribute not being read by Claude executor

Test plan

  • go build ./... compiles successfully
  • go test ./internal/auth/copilot/ ./internal/runtime/executor/ ./internal/registry/ — all pass
  • TestCopilotModelEntry_Limits — 7 scenarios covering nil/missing/partial/individual/business limits
  • TestGitLabExecutorExecuteStreamUsesAnthropicGateway — previously failing, now fixed
  • TestUseGitHubCopilotResponsesEndpoint_RegistryResponsesOnlyModel — flaky test fixed

Copilot AI review requested due to automatic review settings April 3, 2026 11:24

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enhances the GitHub Copilot executor by implementing local token counting using tiktoken, refining the detection of agent-initiated requests for accurate billing headers, and adding support for reasoning models. It also introduces a new management endpoint for quotas, strips unsupported Anthropic beta headers, and ensures default settings for the Responses API. Feedback was provided regarding the use of the passed context in the CountTokens method to ensure proper cancellation and tracing propagation.


Copilot AI left a comment


Pull request overview

This PR adjusts the GitHub Copilot integration to reduce premium request inflation (via corrected headers/defaults), prevent context overflow confusion from unsupported Anthropic betas, and enable “thinking” support for Copilot-hosted Claude models.

Changes:

  • Add Responses API defaults, forward upstream response headers for non-streaming calls, and implement local tiktoken-based token counting for Copilot.
  • Strip unsupported Anthropic betas (notably context-1m-2025-08-07) from translated request bodies and refine X-Initiator detection logic.
  • Enable level-based thinking support for Copilot Claude models and add reasoning_text fallback in the OpenAI→Claude response translator.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
internal/translator/openai/claude/openai_claude_response.go Adds fallback to reasoning_text when reasoning_content is absent/empty during translation.
internal/runtime/executor/github_copilot_executor.go Applies Responses defaults, strips unsupported betas, forwards non-stream response headers, adds local token counting, and changes initiator detection.
internal/runtime/executor/github_copilot_executor_test.go Expands/adjusts tests for initiator behavior, defaults, token counting, and beta stripping.
internal/registry/model_definitions.go Enables level-based thinking support (low/medium/high) for Copilot Claude models.
internal/api/server.go Registers GET /copilot-quota management route.
.github/workflows/fork-docker.yml Adds a branch-scoped Docker build/push workflow.


@kunish kunish force-pushed the fix/copilot-premium-request-inflation branch 2 times, most recently from b078e5d to 0cbd45e on April 3, 2026 11:48
This commit addresses three issues with Claude Code through GitHub
Copilot:

1. **Premium request inflation**: Responses API requests were missing
   Openai-Intent headers and proper defaults, causing Copilot to bill
   each tool-loop continuation as a new premium request. Fixed by adding
   isAgentInitiated() heuristic (checks for tool_result content or
   preceding assistant tool_use), applying Responses API defaults
   (store, include, reasoning.summary), and local tiktoken-based token
   counting to avoid extra API calls.

2. **Context overflow**: Claude Code's modelSupports1M() hardcodes
   opus-4-6 as 1M-capable, but Copilot only supports ~128K-200K.
   Fixed by stripping the context-1m-2025-08-07 beta from translated
   request bodies. Also forwards response headers in non-streaming
   Execute() and registers the GET /copilot-quota management API route.

3. **Thinking not working**: Add ThinkingSupport with level-based
   reasoning to Claude models in the static definitions. Normalize
   Copilot's non-standard 'reasoning_text' response field to
   'reasoning_content' before passing to the SDK translator. Use
   caller-provided context in CountTokens instead of Background().
@kunish kunish force-pushed the fix/copilot-premium-request-inflation branch from 0cbd45e to 59af2c5 on April 3, 2026 12:24
@kunish kunish requested a review from Copilot April 3, 2026 12:31

Copilot AI left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.



kunish added 2 commits April 3, 2026 20:51
…t detection

- Strip SSE `data:` prefix before normalizing reasoning_text→reasoning_content
  in streaming mode; re-wrap afterward for the translator
- Iterate all choices in normalizeGitHubCopilotReasoningField (not just
  choices[0]) to support n>1 requests
- Remove over-broad tool-role fallback in isAgentInitiated that scanned
  all messages for role:"tool", aligning with opencode's approach of only
  detecting active tool loops — genuine user follow-ups after tool use are
  no longer mis-classified as agent-initiated
- Add 5 reasoning normalization tests; update 2 X-Initiator tests to match
  refined semantics
The Copilot API enforces per-account prompt token limits (128K individual,
168K business) that differ from the static 200K context length advertised
by the proxy. This mismatch caused Claude Code to accumulate context
beyond the actual limit, triggering "prompt token count exceeds the limit
of 128000" errors.

Changes:
- Extract max_prompt_tokens and max_output_tokens from the Copilot
  /models API response (capabilities.limits) and use them as the
  authoritative ContextLength and MaxCompletionTokens values
- Add CopilotModelLimits struct and Limits() helper to parse limits
  from the existing Capabilities map
- Fix GitLab Duo context-1m beta header not being set when routing
  through the Anthropic gateway (gitlab_duo_force_context_1m attr
  was set but only gin headers were checked)
- Fix flaky parallel tests that shared global model registry state
@kunish kunish changed the title fix(copilot): reduce premium request inflation and enable thinking fix(copilot): reduce premium request inflation, enable thinking, and use dynamic API limits Apr 3, 2026
@luispater luispater merged commit 98509f6 into router-for-me:main Apr 3, 2026
2 checks passed