feat: compose__effective_context debug tool by mgoldsborough · Pull Request #129 · NimbleBrainInc/nimblebrain

mgoldsborough · 2026-04-29T04:13:47Z

Summary

Single-call answer to "what's in the system prompt for this conversation, with provenance per layer." Replaces the manual jsonl-grep workflow that the recent prod incidents (Opus 4.7 thinking, skills tools surface) each required for diagnosis. Closes #119.

Two modes, one tool

compose__effective_context({
  conversation_id?,   // defaults to current conv (the F7 pattern)
  run_id?,            // historical mode trigger
  bundle?,            // filter to one bundle's contributions
})

Live (default). Re-gathers the same inputs runtime.chat() uses (workspace, identity, registry, skills, instructions, apps, prefs) and runs composeSystemPromptTraced to produce the full per-layer breakdown for the current state.

Historical (`run_id` set). Reads the skills.loaded event for the given run from the conv jsonl and audits each layer-3 skill:

Re-reads the on-disk body, computes its SHA-256, compares to the recorded contentHash (from feat: emit contentHash on skills.loaded events #125).
Match → display the body verbatim (`hashStatus: "match"`).
Drift + matching _versions/ snapshot → display the snapshot body (`hashStatus: "recovered"`).
Drift, no snapshot → display current with a warning that the body may differ (`hashStatus: "drift"`).
File deleted → `hashStatus: "missing"`.

Three structural changes

New traced compose function. Added composeSystemPromptTraced to prompt/compose.ts returning {text, layers, totalTokens}. The existing composeSystemPrompt is now a thin wrapper that calls the traced variant and returns .text. Joined layer texts are byte-identical to the legacy string output — verified by a property test.
New compose platform source. New file src/tools/platform/compose.ts exposing one tool (`compose__effective_context`). Visible to agents; read-only; no security implications.
Two private runtime helpers made public. buildAppsList(wsId) and readPromptOverlays(wsId) — needed by the tool's live-mode gather path. Workspace-scoped, no privilege escalation.

Layer kinds

kind	granularity	subItems
`core_skill`	one per priority ≤ 10 context skill	—
`default_identity`	one (only when no core skills)	—
`user_context_skill`	one per priority > 10 context skill	—
`user_prefs`	one	—
`participants`	one (shared convs only)	—
`workspace_context`	one	—
`org_overlay` / `workspace_overlay`	one each (when non-empty)	—
`layer3_skills`	one section	per-skill (hash-verified, bundle-attributed)
`apps`	one section	per-app (bundle-attributed)
`app_state` / `focused_app` / `matched_skill`	one each (when set)	—

Granular for operator-authored content (skills, apps), single rows for runtime-derived state.

Honest scope limits

Historical mode reconstructs layer-3 skills only. Identity, prefs, overlays, apps, focused-app, and app-state aren't recorded in events with enough detail to reconstruct; the tool surfaces a `non_l3_layers_omitted` warning directing the caller at live mode for the rest.
Mutation detection is layer-3-only. Hashes only exist for L3 skills (feat: emit contentHash on skills.loaded events #125). Other layers can drift between run-time and inspect-time without a hash check. Future PRs can extend hashing if needed.

Test plan

bun test test/unit/prompt.test.ts — 70 pass (62 existing + 8 new for composeSystemPromptTraced)
bun test test/integration/compose-effective-context.test.ts — 8 pass (live mode, historical match/drift, bundle filter, conv-id resolution)
bun run test:unit — 2149 pass (was 2137; added 12)
bun run verify:static — clean (only pre-existing biome warnings on unrelated files)

Refs

SKILL_MANAGEMENT_DESIGN.md § Phase 1 (`compose_effective_context`)
Closes Phase 1 gap: implement compose_effective_context debug surface #119
Builds on feat: emit contentHash on skills.loaded events #125 (contentHash telemetry — historical hash verification depends on it)

Single-call answer to "what's in the system prompt for this conversation, with provenance per layer." Replaces the manual jsonl-grep workflow that the recent prod incidents (Opus 4.7 thinking, skills tools surface) each required for diagnosis. ## What it does Two modes against one tool: - Live (default): re-gathers the same inputs `runtime.chat()` uses (workspace, identity, registry, skills, instructions, apps, prefs) and runs `composeSystemPromptTraced` to produce the full per-layer breakdown for the current state. Honest answer to "what would compose right now if a turn started in this conversation." - Historical (`run_id` set): reads the `skills.loaded` event for the given run from the conv jsonl and audits each layer-3 skill — re-reads the on-disk body, computes its SHA-256, compares to the recorded `contentHash`. Match → display the body verbatim. Drift → look up `_versions/` snapshots; if one matches the recorded hash, display that body (`hashStatus: "recovered"`). No match → display current with `hashStatus: "drift"` and a warning. Bundle filter narrows layers + subItems to one bundle's contributions (apps section row + layer-3 skills under the bundle's affined dir). ## Three structural changes 1. New traced compose function. Added `composeSystemPromptTraced(...)` to `prompt/compose.ts` returning `{text, layers, totalTokens}`. Existing `composeSystemPrompt` is now a one-liner that calls the traced variant and returns `.text`. Layer kinds enumerated: core_skill, default_identity, user_context_skill, user_prefs, participants, workspace_context, org_overlay, workspace_overlay, layer3_skills, apps, app_state, focused_app, matched_skill. `apps` and `layer3_skills` carry `subItems` for per-app / per-skill filtering and inspection without re-parsing the section text. The joined `.text` is byte-identical to the legacy string output — verified by a property test. 2. New `compose` platform source. New file `src/tools/platform/compose.ts` exposing one tool (`compose__effective_context`). Visible to agents; read-only; no security implications. Registered via the existing `createPlatformSources` factory. 3. Two private runtime helpers made public. `buildAppsList(wsId)` and `readPromptOverlays(wsId)` — needed by the tool's live-mode gather path. Workspace-scoped, no privilege escalation. Doc comments explain the new visibility. ## Honest scope limits - Historical mode reconstructs layer-3 skills only. Identity, prefs, overlays, apps, focused-app, and app-state aren't recorded in events with enough detail to reconstruct, and snapshotting full bodies into events was rejected as too expensive (10-100× event size). The tool surfaces a `non_l3_layers_omitted` warning in historical mode directing the caller at live mode for the rest. - Mutation detection is layer-3-only. Hashes only exist for L3 skills (PR #125). Other layers can drift between run-time and inspect-time without a hash check. Future PRs can extend hashing to bundle instructions / overlays / app instructions if needed. ## Tests - test/unit/prompt.test.ts +8 tests for composeSystemPromptTraced (lossless reconstruction, per-skill rows, bundle attribution, overlay emission, token totals). - test/integration/compose-effective-context.test.ts +8 tests for the tool (live mode, historical match/drift, bundle filter, conv-id resolution via context fallback / explicit-wins). 2149 unit tests pass, 8 integration tests pass, verify:static green. Refs: - SKILL_MANAGEMENT_DESIGN.md § Phase 1 ("compose_effective_context") - Closes #119

…-id gate QA pass on PR #129. Six items addressed: W1. Test gap on the recovered hash-status path (the headline value of historical mode). Added an integration test that plants a `_versions/<basename>.<ts>.md` snapshot whose body hashes to the recorded contentHash, edits the live file, and asserts the audit returns hashStatus="recovered" + the snapshot path. W2. Bundle filter coverage was layer3-only. Exported `applyBundleFilter` + `ComposeResponse` and added a focused unit-test file covering all filter branches: apps section per-subItem narrowing, focused_app ownBundleMatches branch, layers without bundle attribution dropped, mixed subItems pared, totalTokens recomputed against survivors. S1. Validated conv_id shape with `CONVERSATION_ID_RE` before any filesystem access. Without this, a malformed id (`../foo`) would reach `existsSync` first, leaking boolean fs existence. Existing test fixtures updated to use valid 16-hex-char ids. S3. Promoted `deriveBundleFromSkillPath` to an export of `prompt/compose.ts` and removed the byte-for-byte duplicate in `tools/platform/compose.ts`. The original justification ("avoid importing a private helper") was moot once `composeSystemPromptTraced` itself was public. S4. Cached `deriveBundleFromSkillPath(entry.sourcePath)` to a const in the layer3_skills subItems map — was being computed twice per skill. S5. Added a comment on the `path.startsWith("/")` POSIX-only check in `auditL3Skill` documenting the deployed-targets assumption (Linux, macOS) and what a future Windows port would need to broaden. S6. Historical-mode `text` now leads with a banner — "[historical mode — layer 3 only. Use live mode for the full composition; see warnings[] for details.]" — so a programmatic caller treating `text` as "the prompt as recorded" sees the incompleteness up front. Filed S2 (no ConversationAccessContext on historical event reads) as #130 — pre-existing pattern that also exists in skills.ts; deserves its own audit, not band-aid in this PR. 2154 unit tests pass (was 2149; +5 bundle-filter unit), 14 compose-tool tests pass (was 8; +5 unit + 1 integration). verify:static green.

mgoldsborough · 2026-04-29T07:07:11Z

QA pass — pushed `f6b3fbd`.

W1 (recovered path uncovered): Fixed. New integration test `recovers the loaded body from a _versions/ snapshot when the live file has drifted` plants the snapshot, edits the live file, asserts `hashStatus="recovered"` + the snapshot path is reported. The headline historical-mode feature now has a regression test.

W2 (bundle filter only L3): Fixed via a new unit-test file `test/unit/tools/platform/compose-bundle-filter.test.ts`. Exported `applyBundleFilter` + `ComposeResponse` and added 5 tests covering: apps section per-subItem narrowing, focused_app `ownBundleMatches` branch (which the L3 integration test couldn't reach), layers without bundle attribution dropped, mixed subItems pared, totalTokens recomputed. The same per-subItem code path covers both apps and L3, but now both branches plus the layer-level branch are explicitly asserted.

S1 (conv-id existsSync before validation): Fixed. Gated up front with `CONVERSATION_ID_RE.test(convId)`. Existing test fixtures updated to use valid 16-hex-char ids — they had been using sentinel strings (`conv_filter`, `conv_ws`) that never matched the regex.

S2 (no access check on historical reads): Filed as #130 ("compose__effective_context: add ConversationAccessContext check on historical reads"). The same pattern exists in `skills.ts::skills__loading_log` and `skills__active_for` — fixing one without the other would be band-aid; the issue tracks the broader audit.

S3 (deriveBundleFromSkillPath duplicated): Fixed. Promoted to an export of `prompt/compose.ts` and removed the duplicate. Original justification was moot once the parent function went public.

S4 (called twice per skill): Fixed with a `const bundle = …` cache.

S5 (POSIX-only path check): Comment added documenting the deployed-targets assumption (Linux/macOS) and what a future Windows port would need.

S6 (historical `text` misleading): Fixed with a leading banner — `[historical mode — layer 3 only. Use live mode for the full composition; see warnings[] for details.]` — so a programmatic caller reading `text` sees the incompleteness up front instead of silently getting an L3 slice.

Tests: 2154 unit (was 2149; +5 bundle-filter unit). 14 compose-tool tests (was 8; +5 unit + 1 integration). `verify:static` green.

QA pass on PR #129 (round 2). Five items: 1. Live mode silently dropped the workspace identity override. `composeLive` was passing `runtime.getContextSkills()` directly — which is the global static set — but `runtime.chat()` augments it per-request with `makeIdentitySkill(workspace.identity)` (a priority-1 core context skill). Net effect: any workspace with a custom identity would be told by the trace that DEFAULT_IDENTITY was in its prompt, when in reality LegalBot/whatever override was. Fix: export `makeIdentitySkill` from runtime.ts, fetch `workspace.identity` in composeLive, append the same override skill the chat path appends. Regression test: integration test plants `workspace.identity` and asserts the trace contains a core_skill row matching the override AND that no default_identity row is emitted. 2. `hasProxiedTools` was hard-coded to `false`. The apps section in the prompt emits a 3-sentence block about discovering tools via `nb__search` only when `proxied.length > 0` — common for any Tier 2/3 workspace (more than DEFAULT_MAX_DIRECT_TOOLS tools). Trace was understating the prompt for those workspaces. Fix: replicate runtime.chat()'s split — call `surfaceTools(allTools, null, {})` and read `proxied.length`. 3. Tool list wasn't orgRole-filtered before L3 selection. runtime.chat filters via `isToolVisibleToRole(toolName, identity?.orgRole)` before feeding tools into `selectLayer3Skills` — non-admin users get a narrower active set. Live mode skipped this filter, so a non-admin caller would see L3 skills that wouldn't actually be selected for them. Fix: apply the same filter before the L3 selection. 4. `hashStatus: "no-recorded-hash"` declared in the union but never produced. With `contentHash` required (per the previous QA round), this branch can't fire. Removed the status from the union — type honest about what we actually emit. 5. Replaced the tautological "lossless trace" property test with a structural-integrity check. The previous test compared `composeSystemPromptTraced(...).text` to itself (because `composeSystemPrompt` is now a thin wrapper around it). Real safety net is the existing 62 string-variant tests now exercising the wrapper; the structural test asserts every emitted layer carries a known kind, non-empty id/source/text, and non-negative tokens. Items 6-13 from the QA review were already addressed in `f6b3fbd` (the earlier QA pass): recovered-path test, bundle filter unit coverage, conv-id regex gate, deduplicated bundle helper, POSIX-only comment, historical text banner. Reviewer ran on an older commit. 2154 unit tests pass. 85 affected tests across 3 files green.

mgoldsborough · 2026-04-29T07:22:02Z

QA round 2 — pushed `fe44a33`.

Five new findings from this review, all real:

(1) Live mode dropped the workspace identity override. Critical fidelity bug — the trace would tell an operator using `workspace.identity = "You are LegalBot..."` that `DEFAULT_IDENTITY` was in the prompt. Fixed by exporting `makeIdentitySkill` from runtime.ts and replicating the per-workspace override append in `composeLive`. Added an integration regression test that plants `workspace.identity`, calls the tool, and asserts the override appears as a `core_skill` row + `default_identity` does not.

(2) `hasProxiedTools: false` hard-coded. The reviewer was right — my "informational; safe default" comment was wrong. The apps section emits a 3-sentence block based on `proxied.length > 0`, common for Tier 2/3 workspaces. Fix: call `surfaceTools(allTools, null, {})` and read `proxied.length` exactly like `runtime.chat()` does.

(3) Tool list not orgRole-filtered before L3 selection. Same direction as #2 — non-admin users see more tools than they get, inflating L3 selection. Fix: apply `isToolVisibleToRole(toolName, identity?.orgRole)` filter before passing tool names to `selectLayer3Skills`.

(4) `hashStatus: "no-recorded-hash"` declared but never produced. With `contentHash` required (last QA round), this branch is unreachable. Dropped from the union — type now describes what's actually emitted.

(5) Tautological lossless-trace test. Reviewer is right: `composeSystemPrompt` is now a thin wrapper, so the test compared the same value to itself. Replaced with a structural-integrity check (every emitted layer has a known kind, non-empty id/source/text, non-negative tokens). The 62 existing string-variant tests are the real safety net for text reproduction now that they exercise the wrapper.

Items 6-13 from this review were already addressed in `f6b3fbd` (the previous QA pass — see PR comment trail). The reviewer appears to have run on an older commit. Specifically:

1. Recovered hash status test ✓ (`historical mode > recovers the loaded body from a _versions/ snapshot...`)
1. Bundle filter against apps section ✓ (`test/unit/tools/platform/compose-bundle-filter.test.ts` — apps subItem narrowing, ownBundleMatches, mixed subs, totalTokens recompute)
1. POSIX-only path comment ✓
1. `deriveBundleFromSkillPath` deduplicated ✓
1. Conv-id regex gate ✓
1. ConversationAccessContext on historical reads → tracked as compose__effective_context: add ConversationAccessContext check on historical reads #130 (broader audit)
1. Historical `text` banner ✓
1. `deriveBundleFromSkillPath` cached to a const ✓

Tests: 2154 unit pass, 85 tests across the 3 affected files green. `verify:static` clean.

QA round 3 on PR #129 — three new findings + one comment fix. 3. **applyBundleFilter produced a self-inconsistent response.** After filtering, `layers` was the filtered subset and `totalTokens` was re-summed, but `text` was left untouched — a caller using `r.totalTokens` to budget context against `r.text` would be misled by mismatched fields. Fixed: rebuild `text` from `filtered.map(l => l.text).join(SEPARATOR)` so all three fields describe the same subset. Note: section-internal text isn't re- formatted from pared subItems (apps section keeps its full original text); consumers wanting per-bundle prompt slices should walk the subItems. New unit test asserts post-filter text equals the joined filtered-layer texts. 4. **conversation_id was cosmetic in live mode.** The tool's description implied convId drove workspace selection, but live mode uses `runtime.requireWorkspaceId()` (current request context's workspace) — convId is just echoed back as a label. Fixed the description to be honest: "live mode composes against the calling request's workspace; convId is echoed as a label and does NOT select the workspace. To inspect a different workspace, call from within that workspace's context." Architecturally the right answer is to look up the conv to find its workspace, but conv ids don't encode the workspace and a global resolver across all workspace dirs is overkill for v1. Description honesty closes the gap. 5. **composeLive's docstring claimed "everything else is workspace- scoped and faithful."** `runtime.getContextSkills()` is global — the workspace-scoping comes from the per-request identity override (which fe44a33 added). Refreshed the doc to spell out what's global vs. what's workspace-scoped via the override append, plus what's skipped vs. runtime.chat() and why. 17. **`contextAssembled` event was fetched and mostly discarded** in composeHistorical — only `totalTokens` is used. Added a comment explaining the narrow read is intentional (the design contract for historical mode v1 is "L3 only," and surfacing other recorded sources would imply we can reconstruct layers we can't). QA reviewer's "unchanged from 2nd pass" items 1, 2, 6, 7, 8 (and 9, 10 from earlier passes) were already addressed in fe44a33 and f6b3fbd respectively. Reviewer ran on an older commit — code at fe44a33 has the workspace identity override (line 214), surfaceTools proxied check (line 241), isToolVisibleToRole filter (line 239), removed `no-recorded-hash` from the union (line 414), structural- integrity property test, recovered-path test, and the bundle-filter unit test file. PR comment will point reviewer at the actual diff. 2155 unit pass (was 2154; +1 bundle filter consistency). 86 tests across 3 affected files green. verify:static clean.

mgoldsborough · 2026-04-29T07:45:54Z

QA round 3 — pushed `b927fe5`.

Two important notes for the reviewer up front:

(a) Your items marked "unchanged from 2nd pass" — items 1, 2, 6, 7, 8 — and items 9, 10 from earlier passes — are all already fixed. They were addressed in commit `fe44a33` (round 2) and `f6b3fbd` (round 1). The current code at `fe44a33` has:

Item 1 (workspace identity override): `makeIdentitySkill` imported (compose.ts:50), called (compose.ts:214) + integration regression test in `compose-effective-context.test.ts`
Item 2 (`hasProxiedTools`): replaced with `surfaceTools(allTools, null, {}).proxied.length > 0` (compose.ts:241)
Item 6 (orgRole filter): `isToolVisibleToRole(t.name, identity?.orgRole)` filter applied at compose.ts:239
Item 7 (`no-recorded-hash`): removed from the type union (compose.ts:414, only 4 statuses now)
Item 8 (tautological test): replaced with structural-integrity check in prompt.test.ts
Item 9 (recovered hash status test): `recovers the loaded body from a _versions/ snapshot...` in compose-effective-context.test.ts
Item 10 (bundle filter coverage): `test/unit/tools/platform/compose-bundle-filter.test.ts` covers apps subItem, ownBundleMatches, no-attribution drop, mixed subs, totalTokens recompute

The reviewer appears to have run against `ff68dfd` (the original PR open) rather than the latest. Force-pushing isn't possible since these are real commits — please pull `b927fe5` and re-review.

(b) Items 11-16 (suggestions from the 1st/2nd passes) are also all addressed in `f6b3fbd` or tracked as #130 (access-control audit) and #131 (gather-inputs refactor).

Three genuinely-new findings from this round, all fixed:

3 (bundle filter self-inconsistent): Real bug. After `applyBundleFilter`, `layers` and `totalTokens` were filtered but `text` wasn't — a caller using `totalTokens` to budget against `text` would be misled. Fixed by rebuilding `text` from the filtered layer texts. New unit test asserts the joined-text invariant after filtering.

4 (`conversation_id` cosmetic in live mode): The reviewer is right — the description implied convId drove workspace selection but live mode uses the current request's workspace. Fixed the description to be honest: "live mode composes against the calling request's workspace; convId is echoed as a label and does NOT select the workspace." Looking up the conv's workspace would require a global resolver across all workspace dirs (conv ids don't encode workspace) — overkill for v1.

5 (composeLive docstring inaccurate): Real. The comment claimed "everything else is workspace-scoped and faithful" but `getContextSkills()` is global; the workspace-scoping comes from the identity-override append. Refreshed the docstring to spell out what's global, what's workspace-scoped via the override, what's skipped vs. `runtime.chat()`, and why.

17 (contextAssembled fetched-and-discarded): Added a comment explaining the narrow read. Historical mode v1's contract is "L3 only" — surfacing the rest of the event's data (per-source breakdown, exclusions) would imply we can reconstruct layers we can't. A future "rich historical mode" PR could expand this; not in scope here.

Tests: 2155 unit pass, 86 tests across the 3 affected files green. `verify:static` clean.

mgoldsborough mentioned this pull request Apr 29, 2026

compose__effective_context: add ConversationAccessContext check on historical reads #130

Open

mgoldsborough mentioned this pull request Apr 29, 2026

Extract runtime.chat()'s gather-inputs phase so the compose tool can't drift from it #131

Open

mgoldsborough added the qa-reviewed QA review completed with no critical issues label Apr 29, 2026

mgoldsborough merged commit f170fe3 into main Apr 29, 2026
4 checks passed

mgoldsborough deleted the feat/compose-effective-context branch April 29, 2026 08:13

This was referenced Apr 29, 2026

compose__diff(run_a, run_b): show how the prompt changed between two runs #133

Open

Hash bundle instructions / overlays / app instructions in their telemetry #134

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: compose__effective_context debug tool#129

feat: compose__effective_context debug tool#129
mgoldsborough merged 4 commits intomainfrom
feat/compose-effective-context

mgoldsborough commented Apr 29, 2026

Uh oh!

mgoldsborough commented Apr 29, 2026

Uh oh!

mgoldsborough commented Apr 29, 2026

Uh oh!

mgoldsborough commented Apr 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant