Skip to content

fix: harden MCP exit stack cleanup to prevent cross-task cancel-scope errors#2805

Open
GhimBoon wants to merge 5 commits intoChainlit:mainfrom
GhimBoon:fix/mcp-exit-stack-leak
Open

fix: harden MCP exit stack cleanup to prevent cross-task cancel-scope errors#2805
GhimBoon wants to merge 5 commits intoChainlit:mainfrom
GhimBoon:fix/mcp-exit-stack-leak

Conversation

@GhimBoon
Copy link
Contributor

@GhimBoon GhimBoon commented Feb 26, 2026

Summary

This change fixes intermittent MCP connection/disconnection failures that surfaced as:

  • 400 responses during MCP lifecycle operations
  • RuntimeError: Attempted to exit cancel scope in a different task
  • orphaned cleanup failures after reconnect attempts

Root cause

MCP AsyncExitStack cleanup was not consistently guaranteed in all failure paths, and some exceptions are wrapped in BaseExceptionGroup, which prevented targeted suppression/handling of the known cancel-scope mismatch case.

What changed

  • Added robust MCP exit stack close handling that supports both RuntimeError and BaseExceptionGroup forms of the cancel-scope mismatch.
  • Added a recursive detector for cancel-scope mismatch inside nested exception groups.
  • Ensured failed connect flows always close the temporary exit stack in finally when ownership is not transferred.
  • Updated disconnect/session teardown paths to use the safe close helper consistently.
  • Guarded reconnect-time on_mcp_disconnect callback so callback failures do not break cleanup flow.
  • Expanded unit tests to cover:
    • BaseExceptionGroup-wrapped cancel-scope mismatch
    • recursive mismatch detection helper behavior
    • existing safe-close behavior

Validation

  • pytest tests/test_session.py -v → 46 passed
  • pytest tests/test_mcp.py -v → 39 passed

Impact

  • No functional change to successful MCP connections.
  • Improves stability and cleanup correctness on failed/partial MCP lifecycle events.
  • Reduces noisy task/cleanup errors and prevents resource leaks.

Fixes #2182


Summary by cubic

Hardened MCP exit stack cleanup to prevent cross-task cancel-scope errors and resource leaks across connect, disconnect, and reconnect. Fixes intermittent 400s and runtime errors; successful connections are unaffected.

  • Bug Fixes

    • Added safe_mcp_exit_stack_close to suppress anyio cancel-scope mismatches (including BaseExceptionGroup) with clearer logging.
    • In connect_mcp: guarded on_mcp_disconnect errors, removed old session entries on reconnect, and closed un-stored exit stacks in finally; applied safe close in disconnect and session.delete.
  • Refactors

    • Minor lint and logging improvements.

Written for commit c4bb6e6. Summary will update on new commits.

…eaks

When an MCP connection fails (network error, auth failure, initialization
timeout), the AsyncExitStack is never closed because the except block
raises HTTPException before cleanup. The abandoned exit stack is later
garbage-collected in a different asyncio task, causing:

  RuntimeError: Attempted to exit cancel scope in a different task
  than it was entered in

This corrupts anyio's cancel-scope stack and can spin a CPU core at 100%.

Changes:
- Add safe_mcp_exit_stack_close() helper that suppresses the cross-task
  cancel scope RuntimeError from anyio during MCP cleanup
- connect_mcp: track whether exit_stack was stored via a flag; close it
  in a finally block when the connection was not successfully stored
- connect_mcp: properly delete the old session entry when reconnecting
- disconnect_mcp: use safe_mcp_exit_stack_close instead of bare
  try/except
- WebsocketSession.delete: use safe_mcp_exit_stack_close for consistent
  cleanup
- Add tests for safe_mcp_exit_stack_close and cancel-scope error handling

Fixes Chainlit#2182
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working unit-tests Has unit tests. labels Feb 26, 2026
Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 issues found across 3 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/chainlit/server.py">

<violation number="1" location="backend/chainlit/server.py:1422">
P2: exit_stack_stored is set to True before on_mcp_connect runs; if the callback raises, the finally block skips cleanup and the partially initialized session remains stored, leaking the exit stack/session.</violation>
</file>

<file name="backend/chainlit/session.py">

<violation number="1" location="backend/chainlit/session.py:34">
P2: BaseExceptionGroup is referenced without a compatibility import, but the project supports Python 3.10 where BaseExceptionGroup is undefined. This will raise NameError during exception matching on 3.10, breaking MCP exit stack cleanup.</violation>
</file>

Since this is your first cubic review, here's how it works:

  • cubic automatically reviews your code and comments on bugs and improvements
  • Teach cubic by replying to its comments. cubic learns from your replies and gets better over time
  • Add one-off context when rerunning by tagging @cubic-dev-ai with guidance or docs links (including llms.txt)
  • Ask questions if you need clarification on any suggestion

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Copy link
Contributor

@cubic-dev-ai cubic-dev-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 issue found across 2 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="backend/chainlit/session.py">

<violation number="1" location="backend/chainlit/session.py:47">
P2: BaseExceptionGroup is no longer caught; since it inherits from BaseException (not Exception), BaseExceptionGroup-wrapped cancel-scope errors will bypass `except Exception`, so `_is_cancel_scope_error` never runs and cleanup errors can propagate again.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files. unit-tests Has unit tests.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Disconnecting an MCP server causes a CPU core to get stuck at 100%

1 participant