feat: compose__effective_context debug tool#129
Conversation
Single-call answer to "what's in the system prompt for this conversation,
with provenance per layer." Replaces the manual jsonl-grep workflow that
the recent prod incidents (Opus 4.7 thinking, skills tools surface) each
required for diagnosis.
## What it does
Two modes against one tool:
- Live (default): re-gathers the same inputs `runtime.chat()` uses
(workspace, identity, registry, skills, instructions, apps, prefs)
and runs `composeSystemPromptTraced` to produce the full per-layer
breakdown for the current state. Honest answer to "what would
compose right now if a turn started in this conversation."
- Historical (`run_id` set): reads the `skills.loaded` event for the
given run from the conv jsonl and audits each layer-3 skill —
re-reads the on-disk body, computes its SHA-256, compares to the
recorded `contentHash`. Match → display the body verbatim. Drift
→ look up `_versions/` snapshots; if one matches the recorded
hash, display that body (`hashStatus: "recovered"`). No match
→ display current with `hashStatus: "drift"` and a warning.
Bundle filter narrows layers + subItems to one bundle's contributions
(apps section row + layer-3 skills under the bundle's affined dir).
## Three structural changes
1. New traced compose function. Added `composeSystemPromptTraced(...)`
to `prompt/compose.ts` returning `{text, layers, totalTokens}`.
Existing `composeSystemPrompt` is now a one-liner that calls the
traced variant and returns `.text`. Layer kinds enumerated:
core_skill, default_identity, user_context_skill, user_prefs,
participants, workspace_context, org_overlay, workspace_overlay,
layer3_skills, apps, app_state, focused_app, matched_skill.
`apps` and `layer3_skills` carry `subItems` for per-app / per-skill
filtering and inspection without re-parsing the section text. The
joined `.text` is byte-identical to the legacy string output —
verified by a property test.
2. New `compose` platform source. New file
`src/tools/platform/compose.ts` exposing one tool
(`compose__effective_context`). Visible to agents; read-only; no
security implications. Registered via the existing
`createPlatformSources` factory.
3. Two private runtime helpers made public. `buildAppsList(wsId)` and
`readPromptOverlays(wsId)` — needed by the tool's live-mode gather
path. Workspace-scoped, no privilege escalation. Doc comments
explain the new visibility.
## Honest scope limits
- Historical mode reconstructs layer-3 skills only. Identity, prefs,
overlays, apps, focused-app, and app-state aren't recorded in
events with enough detail to reconstruct, and snapshotting full
bodies into events was rejected as too expensive (10-100× event
size). The tool surfaces a `non_l3_layers_omitted` warning in
historical mode directing the caller at live mode for the rest.
- Mutation detection is layer-3-only. Hashes only exist for L3
skills (PR #125). Other layers can drift between run-time and
inspect-time without a hash check. Future PRs can extend hashing
to bundle instructions / overlays / app instructions if needed.
## Tests
- test/unit/prompt.test.ts +8 tests for composeSystemPromptTraced
(lossless reconstruction, per-skill rows, bundle attribution,
overlay emission, token totals).
- test/integration/compose-effective-context.test.ts +8 tests for
the tool (live mode, historical match/drift, bundle filter,
conv-id resolution via context fallback / explicit-wins).
2149 unit tests pass, 8 integration tests pass, verify:static green.
Refs:
- SKILL_MANAGEMENT_DESIGN.md § Phase 1 ("compose_effective_context")
- Closes #119
…-id gate QA pass on PR #129. Six items addressed: W1. Test gap on the recovered hash-status path (the headline value of historical mode). Added an integration test that plants a `_versions/<basename>.<ts>.md` snapshot whose body hashes to the recorded contentHash, edits the live file, and asserts the audit returns hashStatus="recovered" + the snapshot path. W2. Bundle filter coverage was layer3-only. Exported `applyBundleFilter` + `ComposeResponse` and added a focused unit-test file covering all filter branches: apps section per-subItem narrowing, focused_app ownBundleMatches branch, layers without bundle attribution dropped, mixed subItems pared, totalTokens recomputed against survivors. S1. Validated conv_id shape with `CONVERSATION_ID_RE` before any filesystem access. Without this, a malformed id (`../foo`) would reach `existsSync` first, leaking boolean fs existence. Existing test fixtures updated to use valid 16-hex-char ids. S3. Promoted `deriveBundleFromSkillPath` to an export of `prompt/compose.ts` and removed the byte-for-byte duplicate in `tools/platform/compose.ts`. The original justification ("avoid importing a private helper") was moot once `composeSystemPromptTraced` itself was public. S4. Cached `deriveBundleFromSkillPath(entry.sourcePath)` to a const in the layer3_skills subItems map — was being computed twice per skill. S5. Added a comment on the `path.startsWith("/")` POSIX-only check in `auditL3Skill` documenting the deployed-targets assumption (Linux, macOS) and what a future Windows port would need to broaden. S6. Historical-mode `text` now leads with a banner — "[historical mode — layer 3 only. Use live mode for the full composition; see warnings[] for details.]" — so a programmatic caller treating `text` as "the prompt as recorded" sees the incompleteness up front. Filed S2 (no ConversationAccessContext on historical event reads) as #130 — pre-existing pattern that also exists in skills.ts; deserves its own audit, not band-aid in this PR. 2154 unit tests pass (was 2149; +5 bundle-filter unit), 14 compose-tool tests pass (was 8; +5 unit + 1 integration). verify:static green.
|
QA pass — pushed `f6b3fbd`. W1 (recovered path uncovered): Fixed. New integration test `recovers the loaded body from a _versions/ snapshot when the live file has drifted` plants the snapshot, edits the live file, asserts `hashStatus="recovered"` + the snapshot path is reported. The headline historical-mode feature now has a regression test. W2 (bundle filter only L3): Fixed via a new unit-test file `test/unit/tools/platform/compose-bundle-filter.test.ts`. Exported `applyBundleFilter` + `ComposeResponse` and added 5 tests covering: apps section per-subItem narrowing, focused_app `ownBundleMatches` branch (which the L3 integration test couldn't reach), layers without bundle attribution dropped, mixed subItems pared, totalTokens recomputed. The same per-subItem code path covers both apps and L3, but now both branches plus the layer-level branch are explicitly asserted. S1 (conv-id existsSync before validation): Fixed. Gated up front with `CONVERSATION_ID_RE.test(convId)`. Existing test fixtures updated to use valid 16-hex-char ids — they had been using sentinel strings (`conv_filter`, `conv_ws`) that never matched the regex. S2 (no access check on historical reads): Filed as #130 ("compose__effective_context: add ConversationAccessContext check on historical reads"). The same pattern exists in `skills.ts::skills__loading_log` and `skills__active_for` — fixing one without the other would be band-aid; the issue tracks the broader audit. S3 (deriveBundleFromSkillPath duplicated): Fixed. Promoted to an export of `prompt/compose.ts` and removed the duplicate. Original justification was moot once the parent function went public. S4 (called twice per skill): Fixed with a `const bundle = …` cache. S5 (POSIX-only path check): Comment added documenting the deployed-targets assumption (Linux/macOS) and what a future Windows port would need. S6 (historical `text` misleading): Fixed with a leading banner — `[historical mode — layer 3 only. Use live mode for the full composition; see warnings[] for details.]` — so a programmatic caller reading `text` sees the incompleteness up front instead of silently getting an L3 slice. Tests: 2154 unit (was 2149; +5 bundle-filter unit). 14 compose-tool tests (was 8; +5 unit + 1 integration). `verify:static` green. |
QA pass on PR #129 (round 2). Five items: 1. Live mode silently dropped the workspace identity override. `composeLive` was passing `runtime.getContextSkills()` directly — which is the global static set — but `runtime.chat()` augments it per-request with `makeIdentitySkill(workspace.identity)` (a priority-1 core context skill). Net effect: any workspace with a custom identity would be told by the trace that DEFAULT_IDENTITY was in its prompt, when in reality LegalBot/whatever override was. Fix: export `makeIdentitySkill` from runtime.ts, fetch `workspace.identity` in composeLive, append the same override skill the chat path appends. Regression test: integration test plants `workspace.identity` and asserts the trace contains a core_skill row matching the override AND that no default_identity row is emitted. 2. `hasProxiedTools` was hard-coded to `false`. The apps section in the prompt emits a 3-sentence block about discovering tools via `nb__search` only when `proxied.length > 0` — common for any Tier 2/3 workspace (more than DEFAULT_MAX_DIRECT_TOOLS tools). Trace was understating the prompt for those workspaces. Fix: replicate runtime.chat()'s split — call `surfaceTools(allTools, null, {})` and read `proxied.length`. 3. Tool list wasn't orgRole-filtered before L3 selection. runtime.chat filters via `isToolVisibleToRole(toolName, identity?.orgRole)` before feeding tools into `selectLayer3Skills` — non-admin users get a narrower active set. Live mode skipped this filter, so a non-admin caller would see L3 skills that wouldn't actually be selected for them. Fix: apply the same filter before the L3 selection. 4. `hashStatus: "no-recorded-hash"` declared in the union but never produced. With `contentHash` required (per the previous QA round), this branch can't fire. Removed the status from the union — type honest about what we actually emit. 5. Replaced the tautological "lossless trace" property test with a structural-integrity check. The previous test compared `composeSystemPromptTraced(...).text` to itself (because `composeSystemPrompt` is now a thin wrapper around it). Real safety net is the existing 62 string-variant tests now exercising the wrapper; the structural test asserts every emitted layer carries a known kind, non-empty id/source/text, and non-negative tokens. Items 6-13 from the QA review were already addressed in `f6b3fbd` (the earlier QA pass): recovered-path test, bundle filter unit coverage, conv-id regex gate, deduplicated bundle helper, POSIX-only comment, historical text banner. Reviewer ran on an older commit. 2154 unit tests pass. 85 affected tests across 3 files green.
|
QA round 2 — pushed `fe44a33`. Five new findings from this review, all real: (1) Live mode dropped the workspace identity override. Critical fidelity bug — the trace would tell an operator using `workspace.identity = "You are LegalBot..."` that `DEFAULT_IDENTITY` was in the prompt. Fixed by exporting `makeIdentitySkill` from runtime.ts and replicating the per-workspace override append in `composeLive`. Added an integration regression test that plants `workspace.identity`, calls the tool, and asserts the override appears as a `core_skill` row + `default_identity` does not. (2) `hasProxiedTools: false` hard-coded. The reviewer was right — my "informational; safe default" comment was wrong. The apps section emits a 3-sentence block based on `proxied.length > 0`, common for Tier 2/3 workspaces. Fix: call `surfaceTools(allTools, null, {})` and read `proxied.length` exactly like `runtime.chat()` does. (3) Tool list not orgRole-filtered before L3 selection. Same direction as #2 — non-admin users see more tools than they get, inflating L3 selection. Fix: apply `isToolVisibleToRole(toolName, identity?.orgRole)` filter before passing tool names to `selectLayer3Skills`. (4) `hashStatus: "no-recorded-hash"` declared but never produced. With `contentHash` required (last QA round), this branch is unreachable. Dropped from the union — type now describes what's actually emitted. (5) Tautological lossless-trace test. Reviewer is right: `composeSystemPrompt` is now a thin wrapper, so the test compared the same value to itself. Replaced with a structural-integrity check (every emitted layer has a known kind, non-empty id/source/text, non-negative tokens). The 62 existing string-variant tests are the real safety net for text reproduction now that they exercise the wrapper. Items 6-13 from this review were already addressed in `f6b3fbd` (the previous QA pass — see PR comment trail). The reviewer appears to have run on an older commit. Specifically:
Tests: 2154 unit pass, 85 tests across the 3 affected files green. `verify:static` clean. |
QA round 3 on PR #129 — three new findings + one comment fix. 3. **applyBundleFilter produced a self-inconsistent response.** After filtering, `layers` was the filtered subset and `totalTokens` was re-summed, but `text` was left untouched — a caller using `r.totalTokens` to budget context against `r.text` would be misled by mismatched fields. Fixed: rebuild `text` from `filtered.map(l => l.text).join(SEPARATOR)` so all three fields describe the same subset. Note: section-internal text isn't re- formatted from pared subItems (apps section keeps its full original text); consumers wanting per-bundle prompt slices should walk the subItems. New unit test asserts post-filter text equals the joined filtered-layer texts. 4. **conversation_id was cosmetic in live mode.** The tool's description implied convId drove workspace selection, but live mode uses `runtime.requireWorkspaceId()` (current request context's workspace) — convId is just echoed back as a label. Fixed the description to be honest: "live mode composes against the calling request's workspace; convId is echoed as a label and does NOT select the workspace. To inspect a different workspace, call from within that workspace's context." Architecturally the right answer is to look up the conv to find its workspace, but conv ids don't encode the workspace and a global resolver across all workspace dirs is overkill for v1. Description honesty closes the gap. 5. **composeLive's docstring claimed "everything else is workspace- scoped and faithful."** `runtime.getContextSkills()` is global — the workspace-scoping comes from the per-request identity override (which fe44a33 added). Refreshed the doc to spell out what's global vs. what's workspace-scoped via the override append, plus what's skipped vs. runtime.chat() and why. 17. **`contextAssembled` event was fetched and mostly discarded** in composeHistorical — only `totalTokens` is used. Added a comment explaining the narrow read is intentional (the design contract for historical mode v1 is "L3 only," and surfacing other recorded sources would imply we can reconstruct layers we can't). QA reviewer's "unchanged from 2nd pass" items 1, 2, 6, 7, 8 (and 9, 10 from earlier passes) were already addressed in fe44a33 and f6b3fbd respectively. Reviewer ran on an older commit — code at fe44a33 has the workspace identity override (line 214), surfaceTools proxied check (line 241), isToolVisibleToRole filter (line 239), removed `no-recorded-hash` from the union (line 414), structural- integrity property test, recovered-path test, and the bundle-filter unit test file. PR comment will point reviewer at the actual diff. 2155 unit pass (was 2154; +1 bundle filter consistency). 86 tests across 3 affected files green. verify:static clean.
|
QA round 3 — pushed `b927fe5`. Two important notes for the reviewer up front: (a) Your items marked "unchanged from 2nd pass" — items 1, 2, 6, 7, 8 — and items 9, 10 from earlier passes — are all already fixed. They were addressed in commit `fe44a33` (round 2) and `f6b3fbd` (round 1). The current code at `fe44a33` has:
The reviewer appears to have run against `ff68dfd` (the original PR open) rather than the latest. Force-pushing isn't possible since these are real commits — please pull `b927fe5` and re-review. (b) Items 11-16 (suggestions from the 1st/2nd passes) are also all addressed in `f6b3fbd` or tracked as #130 (access-control audit) and #131 (gather-inputs refactor). Three genuinely-new findings from this round, all fixed: 3 (bundle filter self-inconsistent): Real bug. After `applyBundleFilter`, `layers` and `totalTokens` were filtered but `text` wasn't — a caller using `totalTokens` to budget against `text` would be misled. Fixed by rebuilding `text` from the filtered layer texts. New unit test asserts the joined-text invariant after filtering. 4 (`conversation_id` cosmetic in live mode): The reviewer is right — the description implied convId drove workspace selection but live mode uses the current request's workspace. Fixed the description to be honest: "live mode composes against the calling request's workspace; convId is echoed as a label and does NOT select the workspace." Looking up the conv's workspace would require a global resolver across all workspace dirs (conv ids don't encode workspace) — overkill for v1. 5 (composeLive docstring inaccurate): Real. The comment claimed "everything else is workspace-scoped and faithful" but `getContextSkills()` is global; the workspace-scoping comes from the identity-override append. Refreshed the docstring to spell out what's global, what's workspace-scoped via the override, what's skipped vs. `runtime.chat()`, and why. 17 (contextAssembled fetched-and-discarded): Added a comment explaining the narrow read. Historical mode v1's contract is "L3 only" — surfacing the rest of the event's data (per-source breakdown, exclusions) would imply we can reconstruct layers we can't. A future "rich historical mode" PR could expand this; not in scope here. Tests: 2155 unit pass, 86 tests across the 3 affected files green. `verify:static` clean. |
Summary
Single-call answer to "what's in the system prompt for this conversation, with provenance per layer." Replaces the manual jsonl-grep workflow that the recent prod incidents (Opus 4.7 thinking, skills tools surface) each required for diagnosis. Closes #119.
Two modes, one tool
Live (default). Re-gathers the same inputs
runtime.chat()uses (workspace, identity, registry, skills, instructions, apps, prefs) and runscomposeSystemPromptTracedto produce the full per-layer breakdown for the current state.Historical (`run_id` set). Reads the
skills.loadedevent for the given run from the conv jsonl and audits each layer-3 skill:contentHash(from feat: emit contentHash on skills.loaded events #125)._versions/snapshot → display the snapshot body (`hashStatus: "recovered"`).Three structural changes
New traced compose function. Added
composeSystemPromptTracedtoprompt/compose.tsreturning{text, layers, totalTokens}. The existingcomposeSystemPromptis now a thin wrapper that calls the traced variant and returns.text. Joined layer texts are byte-identical to the legacy string output — verified by a property test.New
composeplatform source. New filesrc/tools/platform/compose.tsexposing one tool (`compose__effective_context`). Visible to agents; read-only; no security implications.Two private runtime helpers made public.
buildAppsList(wsId)andreadPromptOverlays(wsId)— needed by the tool's live-mode gather path. Workspace-scoped, no privilege escalation.Layer kinds
core_skilldefault_identityuser_context_skilluser_prefsparticipantsworkspace_contextorg_overlay/workspace_overlaylayer3_skillsappsapp_state/focused_app/matched_skillGranular for operator-authored content (skills, apps), single rows for runtime-derived state.
Honest scope limits
Test plan
bun test test/unit/prompt.test.ts— 70 pass (62 existing + 8 new forcomposeSystemPromptTraced)bun test test/integration/compose-effective-context.test.ts— 8 pass (live mode, historical match/drift, bundle filter, conv-id resolution)bun run test:unit— 2149 pass (was 2137; added 12)bun run verify:static— clean (only pre-existing biome warnings on unrelated files)Refs
SKILL_MANAGEMENT_DESIGN.md§ Phase 1 (`compose_effective_context`)