Skip to content

fix: Stability improvements#41

Merged
frac merged 3 commits intomainfrom
stability_improvements
Apr 20, 2026
Merged

fix: Stability improvements#41
frac merged 3 commits intomainfrom
stability_improvements

Conversation

@frac
Copy link
Copy Markdown
Collaborator

@frac frac commented Apr 20, 2026

Changes Summary

Tier 1 — Critical stability fixes

  1. src/routes/responses.ts — Added try-catch around both streaming and non-streaming for await loops. Streaming errors are caught and the response is properly ended. Non-streaming path has a responseSent guard to prevent double res.json() calls.
  2. src/mcp.ts — Added client.close() in the finally block of callMcpTool() to prevent MCP connection leaks. src/routes/responses/mcpStream.ts — Added mcp.close() in the finally block of listMcpToolsStream().
  3. src/routes/responses/utils.ts — Added res.destroyed and res.writableEnded check before writing in writeWithBackpressure().
  4. Non-streaming double-send — Fixed as part of feat: Add otel and fix CVEs #1 with the responseSent flag.
  5. src/routes/responses/innerStream.ts — Replaced res.status(401).json() inside the generator with throw new Error(). Removed unused res parameter from innerRunStream.

Tier 2 — Reliability improvements

  1. src/routes/responses.ts — Added AbortController wired to res.on("close") for client disconnect detection. Signal is checked in the main tool loop to stop processing when client disconnects.
  2. src/mcp.ts — Added configurable MCP_TIMEOUT_MS (default 30s) timeout on MCP tool calls using Promise.race.
  3. src/routes/responses/handleOneTurn.ts — Added configurable LLM_REQUEST_TIMEOUT_MS (default 5 min) timeout passed to the OpenAI SDK create() call.
  4. src/middleware/validation.ts — Replaced console.warn with req.log.warn for structured logging with request context. Added req.log.error for non-Zod errors.

Tier 3 — Robustness hardening

  1. src/routes/responses/innerStream.ts — Replaced fragile "mcp_" + id.split("_")[1] ID derivation with generateUniqueId("mcp") and matching by tool name/arguments/server_label instead.
  2. src/routes/responses/innerStream.ts — Added proper type check after listMcpToolsStream instead of blind as cast on .at(-1).
  3. src/routes/responses/messageFormatting.ts — Added default case with log.warn() for unknown content types that would otherwise be silently dropped.
  4. src/routes/responses/innerStream.ts — Made MAX_ITERATIONS configurable via MAX_TOOL_ITERATIONS env var (default: 5).
  5. src/routes/responses/mcpStream.ts — Made callApprovedMCPToolStream consistent with closeOutputItem.ts: only push to payload on success, log warning on error.

Supporting changes

  • eslint.config.js — Added AbortController and AbortSignal to globals
  • Test mocks — Updated createMockRes() to include on, writableFinished, writableEnded; updated validation test mock to include req.log; updated MCP test mock to include client.close(); updated innerStream.test.ts to match new signature (no res param) and test thrown error instead of res.status(401)

Changes Summary

  Tier 1 — Critical stability fixes

  1. src/routes/responses.ts — Added try-catch around both streaming and non-streaming for await loops. Streaming errors are caught and the response is properly ended.
  Non-streaming path has a responseSent guard to prevent double res.json() calls.
  2. src/mcp.ts — Added client.close() in the finally block of callMcpTool() to prevent MCP connection leaks.
  src/routes/responses/mcpStream.ts — Added mcp.close() in the finally block of listMcpToolsStream().
  3. src/routes/responses/utils.ts — Added res.destroyed and res.writableEnded check before writing in writeWithBackpressure().
  4. Non-streaming double-send — Fixed as part of #1 with the responseSent flag.
  5. src/routes/responses/innerStream.ts — Replaced res.status(401).json() inside the generator with throw new Error(). Removed unused res parameter from innerRunStream.

  Tier 2 — Reliability improvements

  6. src/routes/responses.ts — Added AbortController wired to res.on("close") for client disconnect detection. Signal is checked in the main tool loop to stop processing
  when client disconnects.
  7. src/mcp.ts — Added configurable MCP_TIMEOUT_MS (default 30s) timeout on MCP tool calls using Promise.race.
  8. src/routes/responses/handleOneTurn.ts — Added configurable LLM_REQUEST_TIMEOUT_MS (default 5 min) timeout passed to the OpenAI SDK create() call.
  9. src/middleware/validation.ts — Replaced console.warn with req.log.warn for structured logging with request context. Added req.log.error for non-Zod errors.

  Tier 3 — Robustness hardening

  10. src/routes/responses/innerStream.ts — Replaced fragile "mcp_" + id.split("_")[1] ID derivation with generateUniqueId("mcp") and matching by tool
  name/arguments/server_label instead.
  11. src/routes/responses/innerStream.ts — Added proper type check after listMcpToolsStream instead of blind as cast on .at(-1).
  12. src/routes/responses/messageFormatting.ts — Added default case with log.warn() for unknown content types that would otherwise be silently dropped.
  13. src/routes/responses/innerStream.ts — Made MAX_ITERATIONS configurable via MAX_TOOL_ITERATIONS env var (default: 5).
  14. src/routes/responses/mcpStream.ts — Made callApprovedMCPToolStream consistent with closeOutputItem.ts: only push to payload on success, log warning on error.

  Supporting changes

  - eslint.config.js — Added AbortController and AbortSignal to globals
  - Test mocks — Updated createMockRes() to include on, writableFinished, writableEnded; updated validation test mock to include req.log; updated MCP test mock to include
  client.close(); updated innerStream.test.ts to match new signature (no res param) and test thrown error instead of res.status(401)
@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 20, 2026

⚠️ Deprecation Warning: The deny-licenses option is deprecated for possible removal in the next major release. For more information, see issue 997.

Dependency Review

✅ No vulnerabilities or license issues or OpenSSF Scorecard issues found.

Scanned Files

None

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR focuses on improving stability and resource handling in the Responses API flow (streaming + non-streaming), especially around client disconnects, MCP lifecycle cleanup, and safer response writing.

Changes:

  • Hardened streaming/non-streaming response handling with better error/close behavior and reduced double-send risk.
  • Added timeouts and connection cleanup for upstream calls (LLM + MCP) and improved client-disconnect signaling.
  • Improved robustness in MCP tool/approval flows and added structured logging in request validation.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
src/routes/responses.ts Adds disconnect-driven abort controller; wraps streaming/non-streaming loops with improved error handling.
src/routes/responses/utils.ts Adds pre-checks before writing to the response stream.
src/routes/responses/innerStream.ts Removes direct Express response usage; improves MCP list-tools typing; adds configurable max iterations; adds abort-aware loop exit.
src/routes/responses/handleOneTurn.ts Adds configurable LLM streaming request timeout via AbortSignal.
src/routes/responses/messageFormatting.ts Adds logging for unknown content parts that are dropped during message formatting.
src/routes/responses/mcpStream.ts Ensures MCP client is closed in finally; only updates payload on successful approved MCP tool calls.
src/routes/responses/mcpStream.test.ts Updates MCP client mock to include close().
src/routes/responses/innerStream.test.ts Updates tests for new innerRunStream signature and thrown unauthorized behavior.
src/routes/responses/test_helpers/mocks.ts Updates Express response mock with on, writableFinished, writableEnded.
src/mcp.ts Adds MCP tool-call timeout and ensures client is closed in finally.
src/middleware/validation.ts Switches validation logging to req.log and logs unexpected errors.
src/middleware/validation.test.ts Updates request mock to include req.log.
eslint.config.js Adds AbortController/AbortSignal globals for linting.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/routes/responses/innerStream.ts Outdated
Comment thread src/routes/responses/innerStream.ts
Comment thread src/routes/responses/handleOneTurn.ts
Comment thread src/mcp.ts Outdated
Comment thread src/mcp.ts Outdated
Comment thread src/routes/responses/innerStream.ts
Comment thread src/routes/responses.ts
Comment thread src/routes/responses/utils.ts Outdated
Copy link
Copy Markdown

@phvalguima phvalguima left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

I think one thing we missed is a generic settings class. We keep loading env vars across the code base.

Comment thread src/routes/responses/handleOneTurn.ts
Comment thread src/mcp.ts Outdated
@frac
Copy link
Copy Markdown
Collaborator Author

frac commented Apr 20, 2026

Created src/lib/config.ts — a single centralized config module that:

  • Declares all 16 environment variables in one place with typed parsing helpers (parseIntEnv, parseBoolEnv, parseStringEnv)
  • Validates at startup (fails fast on invalid values like non-numeric PORT)
  • Documents each setting with JSDoc comments
  • Groups settings by area (Server, Upstream/LLM, MCP, Tool loop, Logging, OpenTelemetry)
  • Exports a frozen config object (as const)

Migrated 7 source files to import from config instead of reading process.env directly:

  • src/index.ts — removed local parseIntEnv, uses config.port, config.host, etc.
  • src/mcp.ts — removed MCP_TIMEOUT_MS constant, uses config.mcpTimeoutMs
  • src/lib/logger.ts — removed LOG_LEVEL/LOG_PRETTY constants, uses config.logLevel/config.logPretty
  • src/lib/otel.ts — uses config.otelDisabled
  • src/routes/responses/types.ts — uses config.otelGenaiCaptureToolContent
  • src/routes/responses/handleOneTurn.ts — removed parseIntEnv, uses config.openaiBaseUrl, config.upstreamMaxConnections, etc.
  • src/routes/responses/innerStream.ts — uses config.maxToolIterations

Result: Zero process.env access in source files outside of config.ts. No more risk of typos or inconsistent defaults.

  Created src/lib/config.ts — a single centralized config module that:
  - Declares all 16 environment variables in one place with typed parsing helpers (parseIntEnv, parseBoolEnv, parseStringEnv)
  - Validates at startup (fails fast on invalid values like non-numeric PORT)
  - Documents each setting with JSDoc comments
  - Groups settings by area (Server, Upstream/LLM, MCP, Tool loop, Logging, OpenTelemetry)
  - Exports a frozen config object (as const)

  Migrated 7 source files to import from config instead of reading process.env directly:
  - src/index.ts — removed local parseIntEnv, uses config.port, config.host, etc.
  - src/mcp.ts — removed MCP_TIMEOUT_MS constant, uses config.mcpTimeoutMs
  - src/lib/logger.ts — removed LOG_LEVEL/LOG_PRETTY constants, uses config.logLevel/config.logPretty
  - src/lib/otel.ts — uses config.otelDisabled
  - src/routes/responses/types.ts — uses config.otelGenaiCaptureToolContent
  - src/routes/responses/handleOneTurn.ts — removed parseIntEnv, uses config.openaiBaseUrl, config.upstreamMaxConnections, etc.
  - src/routes/responses/innerStream.ts — uses config.maxToolIterations

  Result: Zero process.env access in source files outside of config.ts. No more risk of typos or inconsistent defaults.
Comment thread src/routes/responses/innerStream.ts Outdated
Here's what was fixed from the PR feedback:

  1. Reverted MCP ID derivation — Pedro was right. The original "mcp_" + item.approval_request_id.split("_")[1] correctly maps mcpr_<ID> to mcp_<ID>. Our name/args
  matching was wrong because duplicate calls (same tool, same args) would collide.
  2. Threaded abort signal into LLM call — The disconnect signal now flows all the way from postCreateResponse -> innerRunStream -> handleOneTurnStream ->
  client.chat.completions.create(), combined with the timeout using AbortSignal.any([signal, AbortSignal.timeout(...)]). Client disconnect now actually cancels in-flight
  LLM requests.
  3. Fixed leaked timer in Promise.race — Added .finally(() => clearTimeout(timeoutId)) so the timeout timer is cleaned up when callTool wins the race.
  4. Added abort check in non-streaming loop — Non-streaming path now also breaks on abortController.signal.aborted, matching the streaming path.
  5. Wrapped res.write() in try-catch — Protects against synchronous throws between the destroyed check and the write call.
@frac frac merged commit 7b03cef into main Apr 20, 2026
9 checks passed
@frac frac deleted the stability_improvements branch April 20, 2026 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants