Skip to content

fix: surface bundle subprocess stderr in runtime logs (#116)#132

Merged
mgoldsborough merged 3 commits intomainfrom
worktree-issue-116-bundle-stderr
Apr 29, 2026
Merged

fix: surface bundle subprocess stderr in runtime logs (#116)#132
mgoldsborough merged 3 commits intomainfrom
worktree-issue-116-bundle-stderr

Conversation

@mgoldsborough
Copy link
Copy Markdown
Contributor

Closes #116.

Summary

  • Drain the MCP bundle subprocess's transport.stderr into a per-source 50-line ring buffer and a default-on dimmed [bundle:<name>] live print. Attached at construction so initialize-time crashes are captured.
  • Add the previously missing stdio onclose handler so subprocess exits are detected eagerly instead of on the next tool call. Include stderrTail in the source.crashed payload across all transport modes.
  • Render stderrTail in ConsoleEventSink, dimmed and indented under the error line, so the traceback is visible without digging through separate streams.
  • Suppress source.crashed on graceful stop() via a stopping flag — transport.close() fires onclose synchronously, which would otherwise surface every deliberate teardown as a crash. This was a latent bug across remote/in-process; adding stdio made it loud.

Why default-on (not NB_DEBUG-gated)

Bundle stderr is the bundle author's deliberate diagnostic output — Python uses stderr for tracebacks, warnings, deprecation notices, logs. Different concern than NB's own protocol tracing (NB_DEBUG=mcp). The cost asymmetry is severe: hiding signal causes the multi-hour debug loop the issue was filed about; showing dimmed prefixed lines is recoverable with shell redirection or by quieting the bundle. Visual prefix + dim formatting makes the channel tunable by eye without a config flag — same pattern as Docker / foreman / overmind.

Architecture

Three concerns kept separate:

  1. Live drain — bytes-to-lines, attached for the subprocess's lifetime.
  2. Tail capture — bounded ring buffer (50 lines, 8 KB per line) feeding the crash event payload.
  3. Surfacing — dev console (default-on) + run.error.data.stderrTail (additive field for downstream consumers like a future web UI).

Edge cases handled: partial lines (print(end="")), CRLF endings, runaway-line truncation, lossy UTF-8 decode, restart cycles (ring buffer reset at start()), graceful stops (suppressed via stopping flag).

Reader is attached at transport construction, not after connect(), because StdioClientTransport.stderr is a PassThrough exposed synchronously from the constructor — the SDK's own doc-comment calls this out specifically to avoid losing early child output. No race, no monkey-patching of start().

Did NOT introduce a new bundle.log engine event for live lines. Out of scope per the issue, and live console output covers the dev-loop case. Web UI surfacing falls out cheaply later by reading event.data.stderrTail.

Test plan

  • bun run verify passes (2141 unit + 213 web + 416 integration + 17 smoke, exit 0)
  • Smoke output shows live [bundle:echo] ... lines streaming through during real subprocess runs
  • Manual: trigger a Python ModuleNotFoundError in a bundle and confirm the traceback appears under the crash line in bun run dev
  • Manual: bun run dev, restart a bundle (stop + start) and confirm no spurious source.crashed events fire

When an MCP bundle subprocess crashes, the developer previously saw
only a generic source.crashed event with no payload — the actual
traceback / import error / config failure was written to stderr but
nothing in NB ever read it. Two-hour debug loops were the symptom; the
unconsumed pipe was the cause.

- Drain transport.stderr into a per-source 50-line ring buffer and a
  default-on dimmed [bundle:<name>] live print. Attached at construction
  so initialize-time crashes are captured (the SDK exposes the
  PassThrough synchronously from the constructor for exactly this
  reason).
- Add the previously missing stdio onclose handler so subprocess exits
  are detected eagerly instead of on the next tool call. Include
  stderrTail in the source.crashed event payload across all transport
  modes for shape consistency.
- Render stderrTail in ConsoleEventSink, dimmed and indented under the
  error line, so the traceback is visible without digging through
  separate streams.
- Suppress source.crashed on graceful stop() via a stopping flag —
  transport.close() fires onclose synchronously, which would otherwise
  surface every deliberate teardown as a crash.

bundle stderr is intentionally default-on, not gated behind NB_DEBUG.
It is the bundle author's deliberate diagnostic output (Python uses
stderr for tracebacks, warnings, logs), a different concern than NB's
own protocol tracing. Visual prefix + dim formatting makes the channel
tunable by eye without a config flag.
Addresses QA review on PR #132.

- Consolidate four near-identical onclose handlers + execute()'s catch
  emit into a single emitSourceCrashed helper. Adds a `dead` guard so a
  subprocess death observed by both the transport's onclose and the
  in-flight execute()'s catch fires source.crashed exactly once.
  Whichever path runs first wins; its payload is canonical.
- Set this.stopping = true in cleanupOnStartFailure so a failed start
  doesn't surface as a crash event for a source the listener never saw
  alive. The OAuth retry path resets the flag before the second connect
  attempt so successful retries re-arm crash detection.
- Refactor attachStderrReader to take the stream directly instead of a
  StdioClientTransport, which makes the chunk/CRLF/partial-line/runaway
  logic testable against a synthetic Readable.
- Add 13 unit tests covering: chunk-boundary line splitting, partial-
  line buffering across chunks, end-of-stream flush, CRLF stripping,
  ring-buffer FIFO eviction at 50, runaway-line truncation, dead-guard
  de-dup, payload shape with stderrTail, graceful stop suppression,
  failed-start suppression, restart re-arms detection, ring-buffer
  reset across start cycles.
- Document the default-on [bundle:<name>] channel in AGENTS.md
  Debugging section as `log.ts` instructed.
@mgoldsborough
Copy link
Copy Markdown
Contributor Author

Addressed QA review (commit 9a2b3be).

Fixed:

  • Warning 2 (duplicate emit): Consolidated all four onclose handlers and execute()'s catch into a single emitSourceCrashed helper with dead/stopping guards. A subprocess death observed by both the transport's onclose and an in-flight execute()'s catch now fires exactly one source.crashed. Whichever path runs first wins; its payload is canonical.
  • Warning 3 (cleanupOnStartFailure): Sets stopping = true so a failed-start no longer emits a parallel crash event for a source the listener never saw alive. The OAuth retry path explicitly clears the flag before the second connect, so successful retries re-arm crash detection.
  • Warning 1 (test coverage): Added 13 unit tests in test/unit/mcp-source-stderr.test.ts covering: chunk-boundary line splitting, cross-chunk partial-line buffering, end-of-stream flush, CRLF stripping, ring-buffer FIFO eviction at 50, runaway-line truncation, dead-guard de-dup, payload shape with stderrTail, graceful-stop suppression, failed-start suppression, restart re-arms detection, ring-buffer reset across start cycles. Bounded to high-value cases — silent regressions (buffer cap, partial-line flush) and the new lifecycle invariants (dead/stopping guards).
  • Suggestion 1 (CLAUDE.md): Added a "Bundle subprocess stderr (default-on)" subsection under Debug Logging documenting the [bundle:<name>] channel, why it's not gated, and how to quiet a chatty bundle.
  • Suggestion 3 (helper): Done as part of warning 2 — emitSourceCrashed collapses ~50 LoC of duplicated onclose blocks. Net negative LOC despite adding the helper.

Skipped, with reasoning:

  • Suggestion 2 (console.error event-loop blocking): Premature. Node PassThrough applies backpressure — a bundle writing faster than we drain blocks on its own write(). The system self-throttles. A bundle that genuinely streams 100 MB to stderr is itself the problem; the fix would belong at that bundle, not in our drain.
  • Suggestion 4 (empty-string stderrTail): Cosmetic. Uniform shape simplifies downstream consumers (no ?. chains).

Verify: bun run verify clean — 2154 unit (+13) / 213 web / 416 integration / 17 smoke.

@mgoldsborough mgoldsborough added the qa-reviewed QA review completed with no critical issues label Apr 29, 2026
scripts/repro-issue-116.ts spawns a deliberately-broken Python MCP
server, runs the production McpSource against it, and asserts:

  1. start() throws (subprocess dies during initialize)
  2. Live [bundle:...] drain rendered stderr to the developer
  3. Pre-crash output was captured (proves attach-at-construction)
  4. Exactly one source.crashed event was emitted (dead-guard works)
  5. source.crashed.stderrTail contains the Python traceback

Exits non-zero on any contract violation. Runs in ~1s and exercises
the whole drain → ring buffer → console renderer → crash event chain
against a real subprocess — closer to user-visible behavior than the
unit tests, faster than spinning up `bun run dev` and configuring a
bundle by hand.
@mgoldsborough mgoldsborough merged commit 39f8199 into main Apr 29, 2026
4 checks passed
@mgoldsborough mgoldsborough deleted the worktree-issue-116-bundle-stderr branch April 29, 2026 09:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qa-reviewed QA review completed with no critical issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Surface bundle subprocess stderr in runtime logs

1 participant