Skip to content

docs(perf): Chapter 3 + Task-Manager-first diagnosis in perf-diagnose skill#619

Merged
srid merged 1 commit intomasterfrom
perf/618-residual-leak
Apr 17, 2026
Merged

docs(perf): Chapter 3 + Task-Manager-first diagnosis in perf-diagnose skill#619
srid merged 1 commit intomasterfrom
perf/618-residual-leak

Conversation

@srid
Copy link
Copy Markdown
Member

@srid srid commented Apr 17, 2026

Summary

Documents the Chapter 3 investigation so future agents don't burn three days chasing proxy metrics the way we did.

Source of truth edits (regenerated into .claude/ via just ai::apm):

  • docs/perf-investigations/memory-learnings.md — new Chapter 3 section covering perf(memory): cut canvas-toggle retention (-89% Contexts, -91% SVG bytes) #614 (closed without merge, the false trail) and fix(xterm): patch IntersectionObserver retention (-367 MB/30 mode-toggles) #617 (the one-line WeakRef that actually moved Task Manager Memory Footprint by −81%).
  • agents/.apm/skills/perf-diagnose/SKILL.md — runbook additions:
    • New "Ground truth: Task Manager Memory Footprint, not proxies" rule at the top. performance.memory, system/Context count, and closure:* count are all proxies — they can drift 100× from Task Manager and mislead you into declaring a fix that does nothing.
    • New "Quiet-session A/B" step. Active agent terminals grow scrollback buffers legitimately; a busy-session measurement looks indistinguishable from retention.
    • New third leak shape: "Callback retained past dispose()". When a Window.<Observer> native registry (or a DevTools extension wrapping it) holds the callback closure past the explicit observer.disconnect(), the callback keeps this (and its whole service graph) reachable. Fix pattern: wrap this in WeakRef inside the callback.
    • Promoted diff-heap.mjs + find-retainers.mjs to the top of the analyzer list with "start here" language. Sort heap diffs by bytes, not count — a 220 MB Uint32Array leak drowns any number of 40-byte Context churn.

New committed tooling (had been sitting untracked):

  • docs/perf-investigations/scripts/diff-heap.mjs — per-class byte-delta between two heap snapshots.
  • docs/perf-investigations/scripts/find-retainers.mjs — BFS from GC roots to every instance of a target class, grouped by path signature.

Why

The three-day version of the story: I jumped straight to heap snapshots, found tens of thousands of retained system/Context objects, and shipped six refactoring commits on #614 that reduced that count 89%. Task Manager didn't move. The actual load-bearing retention was a single IntersectionObserver callback in xterm's RenderService holding 220 MB of Uint32Array BufferLines. One heap diff sorted by bytes (not count) named the culprit in one line of output; one WeakRef wrap fixed it.

Codifying these so the next agent doesn't re-tread the path.

Test plan

  • just fmt clean
  • just ai::apm regenerated .claude/skills/perf-diagnose/SKILL.md to match APM source
  • ci/apm-sync validates .claude/ matches sources
  • ci/fmt + ci/nix green

🤖 Generated with Claude Code

…osis

Adds the IntersectionObserver / BufferLine retention story to
memory-learnings.md (#617, upstream xtermjs/xterm.js#5820 + #5821).

Skill updates:

- New "Ground truth: Task Manager Memory Footprint, not proxies" section.
  performance.memory, system/Context count, closure:* count are all
  proxies that can diverge from Task Manager by 100x. #614 reduced
  Context growth 89% across six commits with zero Memory Footprint
  improvement — the cautionary tale.

- New "Quiet-session A/B" requirement. Active agent terminals grow
  xterm scrollback legitimately; measurements on a busy session look
  indistinguishable from retention. #618 closed after a quiet-session
  A/B showed the +69 MB residual was all agent-stream activity.

- New leak shape "Callback retained past dispose()" — when an
  observer's disconnect()/dispose() doesn't fully release the callback
  closure in practice (DevTools instrumentation, extensions, native
  registry quirks). Fix pattern: wrap `this` in WeakRef inside the
  callback. This is what #617 did for xterm's RenderService.

- Promoted diff-heap.mjs + find-retainers.mjs to the top of the
  analyzer list, sorted by "start here". Sort heap diffs by bytes, not
  count — a 220 MB Uint32Array leak dominates any number of 40-byte
  Context churn.

Also commits the two diagnostic scripts that had been sitting
untracked in docs/perf-investigations/scripts/.

Regenerated .claude/skills/perf-diagnose/SKILL.md via `just ai::apm`.
@srid srid force-pushed the perf/618-residual-leak branch from 5d08b2b to b3b5655 Compare April 17, 2026 23:47
@srid srid merged commit eabeb9d into master Apr 17, 2026
4 checks passed
@srid srid deleted the perf/618-residual-leak branch April 17, 2026 23:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant