Skip to content

feat: compose__effective_context debug tool#129

Merged
mgoldsborough merged 4 commits intomainfrom
feat/compose-effective-context
Apr 29, 2026
Merged

feat: compose__effective_context debug tool#129
mgoldsborough merged 4 commits intomainfrom
feat/compose-effective-context

Conversation

@mgoldsborough
Copy link
Copy Markdown
Contributor

Summary

Single-call answer to "what's in the system prompt for this conversation, with provenance per layer." Replaces the manual jsonl-grep workflow that the recent prod incidents (Opus 4.7 thinking, skills tools surface) each required for diagnosis. Closes #119.

Two modes, one tool

compose__effective_context({
  conversation_id?,   // defaults to current conv (the F7 pattern)
  run_id?,            // historical mode trigger
  bundle?,            // filter to one bundle's contributions
})

Live (default). Re-gathers the same inputs runtime.chat() uses (workspace, identity, registry, skills, instructions, apps, prefs) and runs composeSystemPromptTraced to produce the full per-layer breakdown for the current state.

Historical (`run_id` set). Reads the skills.loaded event for the given run from the conv jsonl and audits each layer-3 skill:

  • Re-reads the on-disk body, computes its SHA-256, compares to the recorded contentHash (from feat: emit contentHash on skills.loaded events #125).
  • Match → display the body verbatim (`hashStatus: "match"`).
  • Drift + matching _versions/ snapshot → display the snapshot body (`hashStatus: "recovered"`).
  • Drift, no snapshot → display current with a warning that the body may differ (`hashStatus: "drift"`).
  • File deleted → `hashStatus: "missing"`.

Three structural changes

  1. New traced compose function. Added composeSystemPromptTraced to prompt/compose.ts returning {text, layers, totalTokens}. The existing composeSystemPrompt is now a thin wrapper that calls the traced variant and returns .text. Joined layer texts are byte-identical to the legacy string output — verified by a property test.

  2. New compose platform source. New file src/tools/platform/compose.ts exposing one tool (`compose__effective_context`). Visible to agents; read-only; no security implications.

  3. Two private runtime helpers made public. buildAppsList(wsId) and readPromptOverlays(wsId) — needed by the tool's live-mode gather path. Workspace-scoped, no privilege escalation.

Layer kinds

kind granularity subItems
core_skill one per priority ≤ 10 context skill
default_identity one (only when no core skills)
user_context_skill one per priority > 10 context skill
user_prefs one
participants one (shared convs only)
workspace_context one
org_overlay / workspace_overlay one each (when non-empty)
layer3_skills one section per-skill (hash-verified, bundle-attributed)
apps one section per-app (bundle-attributed)
app_state / focused_app / matched_skill one each (when set)

Granular for operator-authored content (skills, apps), single rows for runtime-derived state.

Honest scope limits

  • Historical mode reconstructs layer-3 skills only. Identity, prefs, overlays, apps, focused-app, and app-state aren't recorded in events with enough detail to reconstruct; the tool surfaces a `non_l3_layers_omitted` warning directing the caller at live mode for the rest.
  • Mutation detection is layer-3-only. Hashes only exist for L3 skills (feat: emit contentHash on skills.loaded events #125). Other layers can drift between run-time and inspect-time without a hash check. Future PRs can extend hashing if needed.

Test plan

  • bun test test/unit/prompt.test.ts — 70 pass (62 existing + 8 new for composeSystemPromptTraced)
  • bun test test/integration/compose-effective-context.test.ts — 8 pass (live mode, historical match/drift, bundle filter, conv-id resolution)
  • bun run test:unit — 2149 pass (was 2137; added 12)
  • bun run verify:static — clean (only pre-existing biome warnings on unrelated files)

Refs

Single-call answer to "what's in the system prompt for this conversation,
with provenance per layer." Replaces the manual jsonl-grep workflow that
the recent prod incidents (Opus 4.7 thinking, skills tools surface) each
required for diagnosis.

## What it does

Two modes against one tool:

  - Live (default): re-gathers the same inputs `runtime.chat()` uses
    (workspace, identity, registry, skills, instructions, apps, prefs)
    and runs `composeSystemPromptTraced` to produce the full per-layer
    breakdown for the current state. Honest answer to "what would
    compose right now if a turn started in this conversation."

  - Historical (`run_id` set): reads the `skills.loaded` event for the
    given run from the conv jsonl and audits each layer-3 skill —
    re-reads the on-disk body, computes its SHA-256, compares to the
    recorded `contentHash`. Match → display the body verbatim. Drift
    → look up `_versions/` snapshots; if one matches the recorded
    hash, display that body (`hashStatus: "recovered"`). No match
    → display current with `hashStatus: "drift"` and a warning.

Bundle filter narrows layers + subItems to one bundle's contributions
(apps section row + layer-3 skills under the bundle's affined dir).

## Three structural changes

1. New traced compose function. Added `composeSystemPromptTraced(...)`
   to `prompt/compose.ts` returning `{text, layers, totalTokens}`.
   Existing `composeSystemPrompt` is now a one-liner that calls the
   traced variant and returns `.text`. Layer kinds enumerated:
     core_skill, default_identity, user_context_skill, user_prefs,
     participants, workspace_context, org_overlay, workspace_overlay,
     layer3_skills, apps, app_state, focused_app, matched_skill.
   `apps` and `layer3_skills` carry `subItems` for per-app / per-skill
   filtering and inspection without re-parsing the section text. The
   joined `.text` is byte-identical to the legacy string output —
   verified by a property test.

2. New `compose` platform source. New file
   `src/tools/platform/compose.ts` exposing one tool
   (`compose__effective_context`). Visible to agents; read-only; no
   security implications. Registered via the existing
   `createPlatformSources` factory.

3. Two private runtime helpers made public. `buildAppsList(wsId)` and
   `readPromptOverlays(wsId)` — needed by the tool's live-mode gather
   path. Workspace-scoped, no privilege escalation. Doc comments
   explain the new visibility.

## Honest scope limits

  - Historical mode reconstructs layer-3 skills only. Identity, prefs,
    overlays, apps, focused-app, and app-state aren't recorded in
    events with enough detail to reconstruct, and snapshotting full
    bodies into events was rejected as too expensive (10-100× event
    size). The tool surfaces a `non_l3_layers_omitted` warning in
    historical mode directing the caller at live mode for the rest.

  - Mutation detection is layer-3-only. Hashes only exist for L3
    skills (PR #125). Other layers can drift between run-time and
    inspect-time without a hash check. Future PRs can extend hashing
    to bundle instructions / overlays / app instructions if needed.

## Tests

  - test/unit/prompt.test.ts +8 tests for composeSystemPromptTraced
    (lossless reconstruction, per-skill rows, bundle attribution,
    overlay emission, token totals).

  - test/integration/compose-effective-context.test.ts +8 tests for
    the tool (live mode, historical match/drift, bundle filter,
    conv-id resolution via context fallback / explicit-wins).

2149 unit tests pass, 8 integration tests pass, verify:static green.

Refs:
- SKILL_MANAGEMENT_DESIGN.md § Phase 1 ("compose_effective_context")
- Closes #119
…-id gate

QA pass on PR #129. Six items addressed:

W1. Test gap on the recovered hash-status path (the headline value of
    historical mode). Added an integration test that plants a
    `_versions/<basename>.<ts>.md` snapshot whose body hashes to the
    recorded contentHash, edits the live file, and asserts the audit
    returns hashStatus="recovered" + the snapshot path.

W2. Bundle filter coverage was layer3-only. Exported `applyBundleFilter`
    + `ComposeResponse` and added a focused unit-test file covering all
    filter branches: apps section per-subItem narrowing, focused_app
    ownBundleMatches branch, layers without bundle attribution dropped,
    mixed subItems pared, totalTokens recomputed against survivors.

S1. Validated conv_id shape with `CONVERSATION_ID_RE` before any
    filesystem access. Without this, a malformed id (`../foo`) would
    reach `existsSync` first, leaking boolean fs existence. Existing
    test fixtures updated to use valid 16-hex-char ids.

S3. Promoted `deriveBundleFromSkillPath` to an export of `prompt/compose.ts`
    and removed the byte-for-byte duplicate in `tools/platform/compose.ts`.
    The original justification ("avoid importing a private helper") was
    moot once `composeSystemPromptTraced` itself was public.

S4. Cached `deriveBundleFromSkillPath(entry.sourcePath)` to a const in
    the layer3_skills subItems map — was being computed twice per skill.

S5. Added a comment on the `path.startsWith("/")` POSIX-only check in
    `auditL3Skill` documenting the deployed-targets assumption (Linux,
    macOS) and what a future Windows port would need to broaden.

S6. Historical-mode `text` now leads with a banner —
    "[historical mode — layer 3 only. Use live mode for the full
    composition; see warnings[] for details.]" — so a programmatic
    caller treating `text` as "the prompt as recorded" sees the
    incompleteness up front.

Filed S2 (no ConversationAccessContext on historical event reads) as
#130 — pre-existing pattern that also exists
in skills.ts; deserves its own audit, not band-aid in this PR.

2154 unit tests pass (was 2149; +5 bundle-filter unit), 14 compose-tool
tests pass (was 8; +5 unit + 1 integration). verify:static green.
@mgoldsborough
Copy link
Copy Markdown
Contributor Author

QA pass — pushed `f6b3fbd`.

W1 (recovered path uncovered): Fixed. New integration test `recovers the loaded body from a _versions/ snapshot when the live file has drifted` plants the snapshot, edits the live file, asserts `hashStatus="recovered"` + the snapshot path is reported. The headline historical-mode feature now has a regression test.

W2 (bundle filter only L3): Fixed via a new unit-test file `test/unit/tools/platform/compose-bundle-filter.test.ts`. Exported `applyBundleFilter` + `ComposeResponse` and added 5 tests covering: apps section per-subItem narrowing, focused_app `ownBundleMatches` branch (which the L3 integration test couldn't reach), layers without bundle attribution dropped, mixed subItems pared, totalTokens recomputed. The same per-subItem code path covers both apps and L3, but now both branches plus the layer-level branch are explicitly asserted.

S1 (conv-id existsSync before validation): Fixed. Gated up front with `CONVERSATION_ID_RE.test(convId)`. Existing test fixtures updated to use valid 16-hex-char ids — they had been using sentinel strings (`conv_filter`, `conv_ws`) that never matched the regex.

S2 (no access check on historical reads): Filed as #130 ("compose__effective_context: add ConversationAccessContext check on historical reads"). The same pattern exists in `skills.ts::skills__loading_log` and `skills__active_for` — fixing one without the other would be band-aid; the issue tracks the broader audit.

S3 (deriveBundleFromSkillPath duplicated): Fixed. Promoted to an export of `prompt/compose.ts` and removed the duplicate. Original justification was moot once the parent function went public.

S4 (called twice per skill): Fixed with a `const bundle = …` cache.

S5 (POSIX-only path check): Comment added documenting the deployed-targets assumption (Linux/macOS) and what a future Windows port would need.

S6 (historical `text` misleading): Fixed with a leading banner — `[historical mode — layer 3 only. Use live mode for the full composition; see warnings[] for details.]` — so a programmatic caller reading `text` sees the incompleteness up front instead of silently getting an L3 slice.

Tests: 2154 unit (was 2149; +5 bundle-filter unit). 14 compose-tool tests (was 8; +5 unit + 1 integration). `verify:static` green.

QA pass on PR #129 (round 2). Five items:

1. Live mode silently dropped the workspace identity override.
   `composeLive` was passing `runtime.getContextSkills()` directly —
   which is the global static set — but `runtime.chat()` augments it
   per-request with `makeIdentitySkill(workspace.identity)` (a
   priority-1 core context skill). Net effect: any workspace with a
   custom identity would be told by the trace that DEFAULT_IDENTITY
   was in its prompt, when in reality LegalBot/whatever override was.
   Fix: export `makeIdentitySkill` from runtime.ts, fetch
   `workspace.identity` in composeLive, append the same override skill
   the chat path appends.

   Regression test: integration test plants `workspace.identity` and
   asserts the trace contains a core_skill row matching the override
   AND that no default_identity row is emitted.

2. `hasProxiedTools` was hard-coded to `false`. The apps section in
   the prompt emits a 3-sentence block about discovering tools via
   `nb__search` only when `proxied.length > 0` — common for any
   Tier 2/3 workspace (more than DEFAULT_MAX_DIRECT_TOOLS tools).
   Trace was understating the prompt for those workspaces. Fix:
   replicate runtime.chat()'s split — call `surfaceTools(allTools,
   null, {})` and read `proxied.length`.

3. Tool list wasn't orgRole-filtered before L3 selection. runtime.chat
   filters via `isToolVisibleToRole(toolName, identity?.orgRole)`
   before feeding tools into `selectLayer3Skills` — non-admin users
   get a narrower active set. Live mode skipped this filter, so a
   non-admin caller would see L3 skills that wouldn't actually be
   selected for them. Fix: apply the same filter before the L3
   selection.

4. `hashStatus: "no-recorded-hash"` declared in the union but never
   produced. With `contentHash` required (per the previous QA round),
   this branch can't fire. Removed the status from the union — type
   honest about what we actually emit.

5. Replaced the tautological "lossless trace" property test with a
   structural-integrity check. The previous test compared
   `composeSystemPromptTraced(...).text` to itself (because
   `composeSystemPrompt` is now a thin wrapper around it). Real safety
   net is the existing 62 string-variant tests now exercising the
   wrapper; the structural test asserts every emitted layer carries a
   known kind, non-empty id/source/text, and non-negative tokens.

Items 6-13 from the QA review were already addressed in `f6b3fbd` (the
earlier QA pass): recovered-path test, bundle filter unit coverage,
conv-id regex gate, deduplicated bundle helper, POSIX-only comment,
historical text banner. Reviewer ran on an older commit.

2154 unit tests pass. 85 affected tests across 3 files green.
@mgoldsborough
Copy link
Copy Markdown
Contributor Author

QA round 2 — pushed `fe44a33`.

Five new findings from this review, all real:

(1) Live mode dropped the workspace identity override. Critical fidelity bug — the trace would tell an operator using `workspace.identity = "You are LegalBot..."` that `DEFAULT_IDENTITY` was in the prompt. Fixed by exporting `makeIdentitySkill` from runtime.ts and replicating the per-workspace override append in `composeLive`. Added an integration regression test that plants `workspace.identity`, calls the tool, and asserts the override appears as a `core_skill` row + `default_identity` does not.

(2) `hasProxiedTools: false` hard-coded. The reviewer was right — my "informational; safe default" comment was wrong. The apps section emits a 3-sentence block based on `proxied.length > 0`, common for Tier 2/3 workspaces. Fix: call `surfaceTools(allTools, null, {})` and read `proxied.length` exactly like `runtime.chat()` does.

(3) Tool list not orgRole-filtered before L3 selection. Same direction as #2 — non-admin users see more tools than they get, inflating L3 selection. Fix: apply `isToolVisibleToRole(toolName, identity?.orgRole)` filter before passing tool names to `selectLayer3Skills`.

(4) `hashStatus: "no-recorded-hash"` declared but never produced. With `contentHash` required (last QA round), this branch is unreachable. Dropped from the union — type now describes what's actually emitted.

(5) Tautological lossless-trace test. Reviewer is right: `composeSystemPrompt` is now a thin wrapper, so the test compared the same value to itself. Replaced with a structural-integrity check (every emitted layer has a known kind, non-empty id/source/text, non-negative tokens). The 62 existing string-variant tests are the real safety net for text reproduction now that they exercise the wrapper.

Items 6-13 from this review were already addressed in `f6b3fbd` (the previous QA pass — see PR comment trail). The reviewer appears to have run on an older commit. Specifically:

    1. Recovered hash status test ✓ (`historical mode > recovers the loaded body from a _versions/ snapshot...`)
    1. Bundle filter against apps section ✓ (`test/unit/tools/platform/compose-bundle-filter.test.ts` — apps subItem narrowing, ownBundleMatches, mixed subs, totalTokens recompute)
    1. POSIX-only path comment ✓
    1. `deriveBundleFromSkillPath` deduplicated ✓
    1. Conv-id regex gate ✓
    1. ConversationAccessContext on historical reads → tracked as compose__effective_context: add ConversationAccessContext check on historical reads #130 (broader audit)
    1. Historical `text` banner ✓
    1. `deriveBundleFromSkillPath` cached to a const ✓

Tests: 2154 unit pass, 85 tests across the 3 affected files green. `verify:static` clean.

QA round 3 on PR #129 — three new findings + one comment fix.

3. **applyBundleFilter produced a self-inconsistent response.** After
   filtering, `layers` was the filtered subset and `totalTokens` was
   re-summed, but `text` was left untouched — a caller using
   `r.totalTokens` to budget context against `r.text` would be misled
   by mismatched fields. Fixed: rebuild `text` from
   `filtered.map(l => l.text).join(SEPARATOR)` so all three fields
   describe the same subset. Note: section-internal text isn't re-
   formatted from pared subItems (apps section keeps its full original
   text); consumers wanting per-bundle prompt slices should walk the
   subItems. New unit test asserts post-filter text equals the joined
   filtered-layer texts.

4. **conversation_id was cosmetic in live mode.** The tool's
   description implied convId drove workspace selection, but live mode
   uses `runtime.requireWorkspaceId()` (current request context's
   workspace) — convId is just echoed back as a label. Fixed the
   description to be honest: "live mode composes against the calling
   request's workspace; convId is echoed as a label and does NOT
   select the workspace. To inspect a different workspace, call from
   within that workspace's context."

   Architecturally the right answer is to look up the conv to find its
   workspace, but conv ids don't encode the workspace and a global
   resolver across all workspace dirs is overkill for v1. Description
   honesty closes the gap.

5. **composeLive's docstring claimed "everything else is workspace-
   scoped and faithful."** `runtime.getContextSkills()` is global — the
   workspace-scoping comes from the per-request identity override
   (which fe44a33 added). Refreshed the doc to spell out what's global
   vs. what's workspace-scoped via the override append, plus what's
   skipped vs. runtime.chat() and why.

17. **`contextAssembled` event was fetched and mostly discarded** in
    composeHistorical — only `totalTokens` is used. Added a comment
    explaining the narrow read is intentional (the design contract for
    historical mode v1 is "L3 only," and surfacing other recorded
    sources would imply we can reconstruct layers we can't).

QA reviewer's "unchanged from 2nd pass" items 1, 2, 6, 7, 8 (and
9, 10 from earlier passes) were already addressed in fe44a33 and
f6b3fbd respectively. Reviewer ran on an older commit — code at
fe44a33 has the workspace identity override (line 214), surfaceTools
proxied check (line 241), isToolVisibleToRole filter (line 239),
removed `no-recorded-hash` from the union (line 414), structural-
integrity property test, recovered-path test, and the bundle-filter
unit test file. PR comment will point reviewer at the actual diff.

2155 unit pass (was 2154; +1 bundle filter consistency). 86 tests
across 3 affected files green. verify:static clean.
@mgoldsborough
Copy link
Copy Markdown
Contributor Author

QA round 3 — pushed `b927fe5`.

Two important notes for the reviewer up front:

(a) Your items marked "unchanged from 2nd pass" — items 1, 2, 6, 7, 8 — and items 9, 10 from earlier passes — are all already fixed. They were addressed in commit `fe44a33` (round 2) and `f6b3fbd` (round 1). The current code at `fe44a33` has:

  • Item 1 (workspace identity override): `makeIdentitySkill` imported (compose.ts:50), called (compose.ts:214) + integration regression test in `compose-effective-context.test.ts`
  • Item 2 (`hasProxiedTools`): replaced with `surfaceTools(allTools, null, {}).proxied.length > 0` (compose.ts:241)
  • Item 6 (orgRole filter): `isToolVisibleToRole(t.name, identity?.orgRole)` filter applied at compose.ts:239
  • Item 7 (`no-recorded-hash`): removed from the type union (compose.ts:414, only 4 statuses now)
  • Item 8 (tautological test): replaced with structural-integrity check in prompt.test.ts
  • Item 9 (recovered hash status test): `recovers the loaded body from a _versions/ snapshot...` in compose-effective-context.test.ts
  • Item 10 (bundle filter coverage): `test/unit/tools/platform/compose-bundle-filter.test.ts` covers apps subItem, ownBundleMatches, no-attribution drop, mixed subs, totalTokens recompute

The reviewer appears to have run against `ff68dfd` (the original PR open) rather than the latest. Force-pushing isn't possible since these are real commits — please pull `b927fe5` and re-review.

(b) Items 11-16 (suggestions from the 1st/2nd passes) are also all addressed in `f6b3fbd` or tracked as #130 (access-control audit) and #131 (gather-inputs refactor).

Three genuinely-new findings from this round, all fixed:

3 (bundle filter self-inconsistent): Real bug. After `applyBundleFilter`, `layers` and `totalTokens` were filtered but `text` wasn't — a caller using `totalTokens` to budget against `text` would be misled. Fixed by rebuilding `text` from the filtered layer texts. New unit test asserts the joined-text invariant after filtering.

4 (`conversation_id` cosmetic in live mode): The reviewer is right — the description implied convId drove workspace selection but live mode uses the current request's workspace. Fixed the description to be honest: "live mode composes against the calling request's workspace; convId is echoed as a label and does NOT select the workspace." Looking up the conv's workspace would require a global resolver across all workspace dirs (conv ids don't encode workspace) — overkill for v1.

5 (composeLive docstring inaccurate): Real. The comment claimed "everything else is workspace-scoped and faithful" but `getContextSkills()` is global; the workspace-scoping comes from the identity-override append. Refreshed the docstring to spell out what's global, what's workspace-scoped via the override, what's skipped vs. `runtime.chat()`, and why.

17 (contextAssembled fetched-and-discarded): Added a comment explaining the narrow read. Historical mode v1's contract is "L3 only" — surfacing the rest of the event's data (per-source breakdown, exclusions) would imply we can reconstruct layers we can't. A future "rich historical mode" PR could expand this; not in scope here.

Tests: 2155 unit pass, 86 tests across the 3 affected files green. `verify:static` clean.

@mgoldsborough mgoldsborough added the qa-reviewed QA review completed with no critical issues label Apr 29, 2026
@mgoldsborough mgoldsborough merged commit f170fe3 into main Apr 29, 2026
4 checks passed
@mgoldsborough mgoldsborough deleted the feat/compose-effective-context branch April 29, 2026 08:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

qa-reviewed QA review completed with no critical issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 1 gap: implement compose_effective_context debug surface

1 participant