From 11eac1bb67ffe325f51b13b855bee6c2e760c3fb Mon Sep 17 00:00:00 2001 From: Sridhar Ratnakumar Date: Fri, 17 Apr 2026 19:59:10 -0400 Subject: [PATCH 1/4] docs(blog): the leak that wasn't in any Context MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Blog post tracing the IntersectionObserver / BufferLine retention story — wrong turn in #614, byte-delta heap diff as the pivot, the one-line WeakRef patch in xtermjs/xterm.js#5821. Written in the voice of the debugging-story canon: cold open on the symptom, the wrong turns compressed rather than storied, anticlimactic fix. Targets readers from backend / systems backgrounds who may not be fluent in web-frontend memory tooling — links to MDN and Chrome DevTools docs for the web-specific concepts. Credit: I drove; Claude Code (https://claude.com/claude-code) did the agent-side work. Under docs/perf-investigations/ alongside the technical chapters in memory-learnings.md. --- .../the-leak-that-wasnt-in-any-context.md | 242 ++++++++++++++++++ 1 file changed, 242 insertions(+) create mode 100644 docs/perf-investigations/the-leak-that-wasnt-in-any-context.md diff --git a/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md new file mode 100644 index 00000000..39ec1493 --- /dev/null +++ b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md @@ -0,0 +1,242 @@ +# The leak that wasn't in any `Context` + +_One afternoon, two xterm.js contributions, and a reminder that proxy +metrics can be wrong by three orders of magnitude._ + +[Kolu](https://github.com/juspay/kolu) is a browser cockpit for +coding agents — `claude`, `aider`, `codex`, whatever ships next week. 
+The terminal is the universal interface: every pane is a real +[xterm.js](https://xtermjs.org/) in the browser, connected over +WebSocket to a PTY on the server, and Kolu watches what you already +do (the repos you `cd` into, the agents you run) to populate its +UI. No agent adapters, no preferences pane. Run a new agent once +and it appears in the command palette the next time you need it. + +Yesterday I shipped [canvas mode](https://x.com/sridca/status/2044953014100726221): +instead of stacking terminals in a sidebar, you drag them around a +freeform 2D canvas like desktop windows. Cute demo, popular feature, +and — within hours of me updating the always-on Kolu instance on my +headless dev box — the thing that made the tab footprint climb to +1.2 GB. + +Toggle canvas-on, toggle canvas-off, repeat thirty times. Chrome Task +Manager kept climbing. Stop toggling, leave the tab alone, come back +in an hour: still 1.2 GB. Close the tab. Reopen. 300 MB again. +Toggle thirty times. 1.2 GB. + +This is the story of finding the leak, told honestly: the two +wrong hours, the one good diff, the one-line fix, and the two small +patches I upstreamed to xterm.js along the way. I drove; [Claude +Code](https://claude.com/claude-code) did the agent-side work. + +## The wrong turn + +Kolu uses [SolidJS](https://www.solidjs.com/), which tracks +reactivity through `system/Context` objects — V8's name for the +block of memory that holds a closure's captured variables. If a +component's scope fails to clean up on unmount, its `Context` +lingers, and everything that scope closes over lingers with it. +Classic retention. + +So I took the usual first steps. Open Chrome DevTools → Memory tab. +Take a [heap snapshot](https://developer.chrome.com/docs/devtools/memory-problems/heap-snapshots) +before, thirty toggles, snapshot after. Look at instance count growth +per class. Tens of thousands of new `system/Context` and `closure` +objects between the two snapshots. Chase the retainer chains. 
Find +the usual SolidJS-shaped culprits: + +- Inline JSX event handlers (`<button onClick={() =>
terminal.focus()}>`) + that share a V8 lexical scope with everything else in the component + body. One closure in that scope captures something heavy; the whole + scope gets pinned. +- Third-party component libraries (`@corvu/resizable`, + `@thisbeyond/solid-dnd`) that register internal contexts and don't + always tear them down cleanly. + +Six commits landed on [a branch](https://github.com/juspay/kolu/pull/614) +over the afternoon. Replaced the two libraries with 200 lines of +custom code. Delegated every inline handler to the parent. `Context` +count per 30-toggle run went from +11,025 down to +1,208. An 89% +reduction. I wrote the PR, drew a mermaid graph of the staircase, +shipped to my dev box. + +Chrome Task Manager showed no change. Zero. Identical to before. + +## What I was actually measuring + +Chrome's [Task Manager](https://developer.chrome.com/docs/devtools/memory) +(`⇧⎋`) shows, per-tab, the number the operating system assigns to the +tab process. It includes: + +- The live JavaScript heap (post-GC). +- GPU memory — textures, compositor layers. +- The renderer's baseline, ~100-150 MB. +- **Native-side DOM state.** SVG element attributes, detached + canvases, and — this turned out to matter — [`ArrayBuffer`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer) + backing stores. This is the layer that holds the actual bytes of a + typed array, sitting outside what [`performance.memory`](https://web.dev/articles/monitor-total-page-memory-usage) + can see. +- V8's code cache. + +`system/Context` count is a sub-metric of the first bullet. Reducing +it by 89% is meaningful if that's where the leak is. It's invisible +if the leak is in the fourth bullet. + +The leak was in the fourth bullet. + +## The one-line fix that took hours to find + +I threw the PR away and started over with a different analyzer: +aggregate `self_size` bytes per class across a snapshot pair, sort by +byte growth. 
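The aggregation step is mechanical enough to sketch. Assuming
Chrome's `.heapsnapshot` JSON layout (a flat `nodes` array whose
per-node field order is declared in `snapshot.meta.node_fields`,
with names interned in `strings`), the core is two small functions.
This is the shape of the analyzer, not the repo's actual script:

```javascript
// Sum self_size bytes and instance counts per "type:name" class.
// The field layout is read from snapshot.meta, not hard-coded.
function aggregateByClass(snap) {
  const { node_fields: fields, node_types: types } = snap.snapshot.meta;
  const stride = fields.length;
  const typeIdx = fields.indexOf("type");
  const nameIdx = fields.indexOf("name");
  const sizeIdx = fields.indexOf("self_size");
  const typeNames = types[typeIdx]; // enum of node-type strings
  const totals = new Map();
  for (let i = 0; i < snap.nodes.length; i += stride) {
    const type = typeNames[snap.nodes[i + typeIdx]];
    const name = snap.strings[snap.nodes[i + nameIdx]];
    const key = `${type}:${name}`;
    const t = totals.get(key) ?? { bytes: 0, count: 0 };
    t.bytes += snap.nodes[i + sizeIdx];
    t.count += 1;
    totals.set(key, t);
  }
  return totals;
}

// Diff two aggregations, sorted by byte growth, descending.
function diffByBytes(before, after) {
  const rows = [];
  for (const [key, b] of after) {
    const a = before.get(key) ?? { bytes: 0, count: 0 };
    rows.push({ key, dBytes: b.bytes - a.bytes, dCount: b.count - a.count });
  }
  return rows.sort((x, y) => y.dBytes - x.dBytes);
}
```

Feed it `JSON.parse` of the before/after snapshot files and print
the top few rows.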
Five minutes of code, one line of output: + +``` + dBytes dCount Class + 220,963,752 175,594 native:system/JSArrayBufferData + 10,535,640 175,594 object:Uint32Array +``` + +220 megabytes. 175,594 retained [`Uint32Array`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Uint32Array)s +per 30 toggles. + +The number factored obviously: 30 toggles × 7 terminals × ~830 +scrollback lines per terminal = 174,300. Every `xterm.js BufferLine` +of every `Terminal` instance that had ever existed during those +thirty toggles was still in memory. `terminal.dispose()` had fired +for every one. The buffers were supposed to be gone. + +A BFS walk from the GC root to every retained `Uint32Array` found the +same retainer chain for all 175,594 of them: + +``` +Window.IntersectionObserver (native browser registry) + → callback closure + → RenderService (this) + → _bufferService.buffers + → BufferLine + → Uint32Array +``` + +xterm's [`RenderService`](https://github.com/xtermjs/xterm.js/blob/master/src/browser/services/RenderService.ts) +wires an [`IntersectionObserver`](https://developer.mozilla.org/en-US/docs/Web/API/IntersectionObserver) +(a browser API for "tell me when this element scrolls into or out of +view") to the terminal's DOM element so it can pause rendering when +the terminal isn't visible. Perfectly reasonable. The callback is an +arrow function — it closes over `this` (the `RenderService` with its +whole service graph). On dispose, xterm calls `observer.disconnect()`. +In a clean environment, that releases the callback and the service +graph can GC. + +In my environment, the callback stayed alive. Maybe a Chrome +extension monkey-patched `window.IntersectionObserver`. Maybe DevTools +was instrumenting it. I don't know. I spent some time trying to find +out and gave up. The heap snapshot told me one thing that mattered: +the callback was still in the native registry, holding `this`. 
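The walk that produced the chain above is a textbook shortest-path
BFS. A sketch over an already-decoded retainer graph (a real
analyzer first builds this adjacency map from the snapshot's
`edges` array; the `Map` below is a stand-in for that step):

```javascript
// Shortest retainer path from a GC root to a target node, BFS over
// an adjacency map of node -> [nodes it references].
function retainerPath(graph, root, target) {
  const prev = new Map([[root, null]]); // node -> predecessor on path
  const queue = [root];
  while (queue.length > 0) {
    const node = queue.shift();
    if (node === target) {
      // Reconstruct the path by walking predecessors back to the root.
      const path = [];
      for (let n = target; n !== null; n = prev.get(n)) path.unshift(n);
      return path;
    }
    for (const next of graph.get(node) ?? []) {
      if (!prev.has(next)) {
        prev.set(next, node);
        queue.push(next);
      }
    }
  }
  return null; // target is not retained from this root
}
```

Run once per retained `Uint32Array`; when every run returns the
same chain, you have your culprit.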
+ +You can break this chain defensively without knowing who's holding +what. `WeakRef` a reference that tells the GC "hold this only if +someone else is": + +```diff + if ('IntersectionObserver' in w) { +- const observer = new w.IntersectionObserver( +- e => this._handleIntersectionChange(e[e.length - 1]), +- { threshold: 0 } +- ); ++ const weakSelf = new WeakRef(this); ++ const observer = new w.IntersectionObserver( ++ e => weakSelf.deref()?._handleIntersectionChange(e[e.length - 1]), ++ { threshold: 0 } ++ ); + observer.observe(screenElement); + this._observerDisposable.value = toDisposable(() => observer.disconnect()); + } +``` + +While the `RenderService` has live strong references (which it does, +as long as the terminal is on screen), `weakSelf.deref()` returns it +and the handler runs exactly as before. When `terminal.dispose()` +drops the strong references, `deref()` starts returning `undefined` +and the entire `BufferService → BufferLine → Uint32Array` graph +becomes collectable — which is what `disconnect()` was supposed to +guarantee but doesn't, in practice. + +Deploy. Fresh tab, thirty toggles, quiet session: **Task Manager +footprint flat.** The original +367 MB/30-toggles regression dropped +to zero. + +## The xterm.js side + +Two small contributions to xterm.js came out of this line of work: + +- [xtermjs/xterm.js#5817](https://github.com/xtermjs/xterm.js/pull/5817) + — two cases where xterm's `Disposable` pattern registered a child + disposable via `= new MutableDisposable()` but forgot the + `this._register(...)` wrapper. Both leaked a `setInterval` past + dispose. Six lines of source, found by walking a separate retainer + chain during an earlier investigation. +- [xtermjs/xterm.js#5821](https://github.com/xtermjs/xterm.js/pull/5821) + — the `WeakRef` patch above. One line of real code, plus a comment + explaining why. + +Both patches look laughably small. Both took hours of measurement, +retainer-walking, and wrong turns to find. 
That's the shape of this +kind of work; the ratio of code-volume to investigation-time is +always roughly zero. + +I consume them via the juspay/xterm.js fork and `pnpm.overrides`, +stacked as a Kolu-consumption branch: + +```json +"@xterm/xterm": "github:juspay/xterm.js#fix/kolu-xterm-fixes-built" +``` + +When upstream merges, the override collapses to a plain version bump. + +## What I'd tell past-me + +Three things to internalise if you came here from a backend or +systems-programming background and web-frontend memory tooling feels +murky: + +**The browser's Task Manager is the only ground truth.** Everything +else — `performance.memory.usedJSHeapSize`, heap snapshot class +counts, anything derived from the JS-side heap alone — is a proxy for +what the tab actually uses. Proxies can drift from the truth by +orders of magnitude, because the truth includes native DOM state, +GPU buffers, and compositor layers that JS introspection can't reach. +Before claiming a fix works: fresh tab, Task Manager baseline, +reproducer, Task Manager after. No exceptions. + +**Sort heap diffs by bytes, not by instance count.** A 220 MB leak +across 175,594 `Uint32Array` instances dominates any amount of churn +in `system/Context` or `closure` counts. The biggest class by bytes +almost always holds everything else via its closure chain; fixing +something smaller first gets you zero footprint improvement. + +**`.disconnect()`, `.dispose()`, and `removeEventListener()` are +best-effort in the presence of browser extensions, DevTools, and +native registries.** If a callback closes over heavy state and lives +past its owner, the graph stays alive. `WeakRef` is cheap insurance: +one `.deref()?.` in the callback path, zero behavioural change when +the reference is live, clean GC when it isn't. Use it defensively on +any callback you hand to `IntersectionObserver`, `MutationObserver`, +`ResizeObserver`, or `EventTarget.addEventListener`. 
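That advice reduces to a few lines. A hypothetical generic helper,
assuming nothing beyond standard `WeakRef` semantics; `weakCallback`
is my name for it, not an xterm.js or Kolu API:

```javascript
// Wrap a method call so the registry holds only a WeakRef to the
// receiver. While the receiver is strongly referenced elsewhere,
// behaviour is unchanged; once it isn't, the callback becomes a
// no-op and the receiver's graph can be collected.
function weakCallback(receiver, method) {
  const ref = new WeakRef(receiver);
  return (...args) => ref.deref()?.[method](...args);
}
```

Usage would look like
`new IntersectionObserver(weakCallback(service, "handleEntries"), { threshold: 0 })`,
where `service` and `handleEntries` stand in for whatever object and
method you would otherwise have closed over.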
+ +The commit hash is +[c9794db](https://github.com/juspay/kolu/pull/617). My always-on Kolu +tab sits at 300 MB now, and stays there. + +The full investigation history — including the wrong turns I glossed +over here — lives in Kolu's repo alongside the tools that did the +work: + +- [`docs/perf-investigations/memory-learnings.md`](https://github.com/juspay/kolu/blob/master/docs/perf-investigations/memory-learnings.md) + — three chapters of leak-hunts, with all the failed theories + preserved. +- [`agents/.apm/skills/perf-diagnose/SKILL.md`](https://github.com/juspay/kolu/blob/master/agents/.apm/skills/perf-diagnose/SKILL.md) + — the runbook future Claude Code sessions will read before they + re-tread the proxy-metric path I spent the afternoon on. +- [`docs/perf-investigations/scripts/`](https://github.com/juspay/kolu/tree/master/docs/perf-investigations/scripts) + — the heap-snapshot analyzers, including the byte-delta diff that + named the leak in one line. From 045e0594072fef2bb6627edb9d0831d744f18c1d Mon Sep 17 00:00:00 2001 From: Sridhar Ratnakumar Date: Fri, 17 Apr 2026 20:00:11 -0400 Subject: [PATCH 2/4] add bus-stop / Claude-on-phone anecdote for #5817 --- .../the-leak-that-wasnt-in-any-context.md | 7 +++++-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md index 39ec1493..f115b088 100644 --- a/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md +++ b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md @@ -173,8 +173,11 @@ Two small contributions to xterm.js came out of this line of work: — two cases where xterm's `Disposable` pattern registered a child disposable via `= new MutableDisposable()` but forgot the `this._register(...)` wrapper. Both leaked a `setInterval` past - dispose. Six lines of source, found by walking a separate retainer - chain during an earlier investigation. + dispose. 
Six lines of source. Diagnosed during an earlier pass of + this same investigation [standing at a bus stop and on the bus + itself](https://x.com/sridca/status/2045164268341895434), typing + instructions to Claude Code on my phone and watching the retainer + walks come back. - [xtermjs/xterm.js#5821](https://github.com/xtermjs/xterm.js/pull/5821) — the `WeakRef` patch above. One line of real code, plus a comment explaining why. From f032a0f06947d9ec9a7d50f76cbb54934d5a8878 Mon Sep 17 00:00:00 2001 From: Sridhar Ratnakumar Date: Fri, 17 Apr 2026 20:06:40 -0400 Subject: [PATCH 3/4] reorder: bus-stop #5817 fix (GPU) precedes the wrong turn on #614 --- .../the-leak-that-wasnt-in-any-context.md | 67 +++++++++++-------- 1 file changed, 40 insertions(+), 27 deletions(-) diff --git a/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md index f115b088..621ffcc8 100644 --- a/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md +++ b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md @@ -29,16 +29,34 @@ wrong hours, the one good diff, the one-line fix, and the two small patches I upstreamed to xterm.js along the way. I drove; [Claude Code](https://claude.com/claude-code) did the agent-side work. +## First pass: the bus-stop fix + +The first pass at the leak happened earlier that day, [from a bus +stop and on the bus](https://x.com/sridca/status/2045164268341895434), +typing instructions to Claude Code on my phone and watching retainer +walks come back between WhatsApp messages. That pass found a +dispose-registration gap inside xterm itself: two `MutableDisposable` +fields in `RenderService` and `WebglRenderer` were declared with `= +new MutableDisposable()` but never wrapped in `this._register(...)`. 
+Without that registration, xterm's `Disposable` base class never +disposed them on teardown, so a `setInterval` for the cursor blink +and a debounced resize task kept ticking past `terminal.dispose()`. +Six lines of source. [xtermjs/xterm.js#5817](https://github.com/xtermjs/xterm.js/pull/5817). + +Deploy. Chrome Task Manager, GPU Memory column: dropped from steady- +climbing to flat. Memory Footprint column: unchanged. GPU was a +symptom of its own leak, not the big one. + ## The wrong turn Kolu uses [SolidJS](https://www.solidjs.com/), which tracks -reactivity through `system/Context` objects — V8's name for the -block of memory that holds a closure's captured variables. If a -component's scope fails to clean up on unmount, its `Context` -lingers, and everything that scope closes over lingers with it. -Classic retention. +reactivity through [`system/Context`](https://developer.chrome.com/docs/devtools/memory-problems/heap-snapshots#system-context) +objects — V8's name for the block of memory that holds a closure's +captured variables. If a component's scope fails to clean up on +unmount, its `Context` lingers, and everything that scope closes +over lingers with it. Classic retention. -So I took the usual first steps. Open Chrome DevTools → Memory tab. +Claude took the usual first steps. Open Chrome DevTools → Memory tab. Take a [heap snapshot](https://developer.chrome.com/docs/devtools/memory-problems/heap-snapshots) before, thirty toggles, snapshot after. Look at instance count growth per class. Tens of thousands of new `system/Context` and `closure` @@ -54,11 +72,11 @@ the usual SolidJS-shaped culprits: always tear them down cleanly. Six commits landed on [a branch](https://github.com/juspay/kolu/pull/614) -over the afternoon. Replaced the two libraries with 200 lines of -custom code. Delegated every inline handler to the parent. `Context` -count per 30-toggle run went from +11,025 down to +1,208. An 89% -reduction. 
I wrote the PR, drew a mermaid graph of the staircase, -shipped to my dev box. +over the afternoon. Claude replaced the two libraries with 200 lines +of custom code. Delegated every inline handler to the parent. +`Context` count per 30-toggle run went from +11,025 down to +1,208. +An 89% reduction. Claude wrote the PR, drew a mermaid graph of the +staircase. I deployed to my dev box. Chrome Task Manager showed no change. Zero. Identical to before. @@ -86,9 +104,9 @@ The leak was in the fourth bullet. ## The one-line fix that took hours to find -I threw the PR away and started over with a different analyzer: -aggregate `self_size` bytes per class across a snapshot pair, sort by -byte growth. Five minutes of code, one line of output: +I told Claude to throw the PR away and start over with a different +analyzer: aggregate `self_size` bytes per class across a snapshot +pair, sort by byte growth. Five minutes of code, one line of output: ``` dBytes dCount Class @@ -105,8 +123,9 @@ of every `Terminal` instance that had ever existed during those thirty toggles was still in memory. `terminal.dispose()` had fired for every one. The buffers were supposed to be gone. -A BFS walk from the GC root to every retained `Uint32Array` found the -same retainer chain for all 175,594 of them: +Claude then walked BFS from the GC root to every retained +`Uint32Array`. Every one of the 175,594 instances came back with the +same retainer chain: ``` Window.IntersectionObserver (native browser registry) @@ -167,20 +186,14 @@ to zero. ## The xterm.js side -Two small contributions to xterm.js came out of this line of work: +Two upstream contributions fell out of the day's work: - [xtermjs/xterm.js#5817](https://github.com/xtermjs/xterm.js/pull/5817) - — two cases where xterm's `Disposable` pattern registered a child - disposable via `= new MutableDisposable()` but forgot the - `this._register(...)` wrapper. Both leaked a `setInterval` past - dispose. Six lines of source. 
Diagnosed during an earlier pass of - this same investigation [standing at a bus stop and on the bus - itself](https://x.com/sridca/status/2045164268341895434), typing - instructions to Claude Code on my phone and watching the retainer - walks come back. + — the bus-stop patch above. Register the two `MutableDisposable` + fields. Six lines of source. Dropped the GPU-memory leak. - [xtermjs/xterm.js#5821](https://github.com/xtermjs/xterm.js/pull/5821) - — the `WeakRef` patch above. One line of real code, plus a comment - explaining why. + — the `WeakRef` patch. One line of real code plus a comment + explaining why. Dropped the Memory-Footprint leak. Both patches look laughably small. Both took hours of measurement, retainer-walking, and wrong turns to find. That's the shape of this From 5316841d3cbbe5a27662025e58824e0d8bcb3df1 Mon Sep 17 00:00:00 2001 From: Sridhar Ratnakumar Date: Fri, 17 Apr 2026 20:07:54 -0400 Subject: [PATCH 4/4] clarify Memory Footprint is the aggregate column; bus-to-pool beat --- .../the-leak-that-wasnt-in-any-context.md | 50 +++++++++++-------- 1 file changed, 29 insertions(+), 21 deletions(-) diff --git a/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md index 621ffcc8..37bbbdf2 100644 --- a/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md +++ b/docs/perf-investigations/the-leak-that-wasnt-in-any-context.md @@ -31,10 +31,10 @@ Code](https://claude.com/claude-code) did the agent-side work. ## First pass: the bus-stop fix -The first pass at the leak happened earlier that day, [from a bus -stop and on the bus](https://x.com/sridca/status/2045164268341895434), +The first pass at the leak happened earlier that day [on the bus to +the swimming pool, then again checking it on the way back](https://x.com/sridca/status/2045164268341895434), typing instructions to Claude Code on my phone and watching retainer -walks come back between WhatsApp messages. 
That pass found a +walks come back between stops. That pass found a dispose-registration gap inside xterm itself: two `MutableDisposable` fields in `RenderService` and `WebglRenderer` were declared with `= new MutableDisposable()` but never wrapped in `this._register(...)`. @@ -83,24 +83,32 @@ Chrome Task Manager showed no change. Zero. Identical to before. ## What I was actually measuring Chrome's [Task Manager](https://developer.chrome.com/docs/devtools/memory) -(`⇧⎋`) shows, per-tab, the number the operating system assigns to the -tab process. It includes: - -- The live JavaScript heap (post-GC). -- GPU memory — textures, compositor layers. -- The renderer's baseline, ~100-150 MB. -- **Native-side DOM state.** SVG element attributes, detached - canvases, and — this turned out to matter — [`ArrayBuffer`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer) - backing stores. This is the layer that holds the actual bytes of a - typed array, sitting outside what [`performance.memory`](https://web.dev/articles/monitor-total-page-memory-usage) - can see. -- V8's code cache. - -`system/Context` count is a sub-metric of the first bullet. Reducing -it by 89% is meaningful if that's where the leak is. It's invisible -if the leak is in the fourth bullet. - -The leak was in the fourth bullet. +(`⇧⎋`) has three columns that matter for a tab: `JavaScript Memory`, +`GPU Memory`, and `Memory Footprint`. The first two are what they +sound like. `Memory Footprint` is the one that matters: the total +resident size the operating system assigns to the tab's renderer +process. 
It's an aggregate — it rolls up the JS heap, the GPU +textures, Chrome's per-renderer baseline (~100-150 MB), V8's code +cache, and a category that isn't called out as its own column but +turned out to be the big one here: + +**Native-side state backing the DOM and typed-array objects.** SVG +element attributes, detached canvases, and — the one that mattered — +[`ArrayBuffer`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/ArrayBuffer) +backing stores. An `ArrayBuffer` is the raw byte block that a typed +array (a `Uint32Array` etc.) is a typed view of; it lives outside +what [`performance.memory`](https://web.dev/articles/monitor-total-page-memory-usage) +can see. Kilobytes of typed-array object metadata in the JS heap can +correspond to megabytes of `ArrayBuffer` bytes in the native heap. +The JS-side instance count tells you how many arrays exist; the +aggregate `Memory Footprint` tells you how much memory they actually +cost. + +`system/Context` count is a JS-heap metric. Reducing it by 89% is +meaningful if the leak is there. It's invisible if the leak is in +native-side `ArrayBuffer` bytes. + +The leak was in native-side `ArrayBuffer` bytes. ## The one-line fix that took hours to find