refactor(orchestration): migrate prompt builders to file-based templates #459

Merged
cosarah merged 12 commits into main from refactor/orchestration-prompts on Apr 20, 2026

@wyuc wyuc commented Apr 19, 2026

Summary

Phase 1 of a two-phase refactor of the orchestration prompts subsystem.

Migrates three orchestration-tier prompt builders (buildStructuredPrompt, buildDirectorPrompt, buildPBLSystemPrompt) from inline TS template literals to file-based markdown templates, sharing the loader infrastructure already used by the generation pipeline.

Unblocks much of Phase 2 content optimization via markdown edits (LaTeX few-shot examples, speech guidelines, snippet additions). A subset — viewport bounds (1000×562), role-specific length targets, and role guidelines — still lives as TS template literals in prompt-builder.ts and will migrate in a follow-up extraction pass.

What changed

  • lib/generation/prompts/ → lib/prompts/ (project-level loader, used by both generation + orchestration)
  • lib/orchestration/prompt-builder.ts 890 → 314 lines (thin assembler)
  • lib/orchestration/director-prompt.ts 290 → 271 lines (thin assembler)
  • lib/pbl/pbl-system-prompt.ts 93 → 30 lines (thin assembler)
  • New lib/orchestration/summarizers/ — 5 focused modules
  • New lib/orchestration/types.ts — shared WhiteboardActionRecord + AgentTurnSummary
  • New templates: lib/prompts/templates/{agent-system,director,pbl-design}/system.md
  • New snippet: lib/prompts/snippets/speech-guidelines.md
  • New lib/prompts/README.md — conventions, syntax, gotchas, local-testing recipe for template authors
  • PROMPT_IDS gains as const satisfies Record<string, PromptId> type guard
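
The PROMPT_IDS type guard in the last bullet can be sketched roughly as follows. This is a minimal illustration, assuming TypeScript 4.9 or later (where the satisfies operator was introduced); the key names and the members of the PromptId union are assumptions, not the repository's actual definitions.

```typescript
// Hypothetical union; the real PromptId in lib/prompts likely has more members.
type PromptId = 'agent-system' | 'director' | 'pbl-design';

// `as const` keeps each value as a string literal type; `satisfies` checks
// every value against PromptId without widening the object's inferred type.
const PROMPT_IDS = {
  AGENT_SYSTEM: 'agent-system',
  DIRECTOR: 'director',
  PBL_DESIGN: 'pbl-design',
  // TYPO: 'directr', // would not compile: not assignable to PromptId
} as const satisfies Record<string, PromptId>;

// The literal type survives, so this is typed as 'director', not string.
const directorId: 'director' = PROMPT_IDS.DIRECTOR;
```

With this shape, adding a constant whose value is missing from the union fails at compile time, which is the constant/union drift the guard is meant to prevent.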

Test posture

  • tests/prompts/loader.test.ts — bounds the loader infrastructure (3 tests)
  • tests/prompts/templates.test.ts — bounds behavior-level invariants (8 tests): no unresolved {{...}} placeholders in any rendered prompt, role dispatch (teacher → LEAD TEACHER, student → not), scene-type action stripping (quiz scene strips spotlight/laser), director output-spec shape
  • pnpm eval:whiteboard — bounds end-to-end composition. 1-scenario smoke run on this branch passed (agent loop ran 3 turns, 2 screenshots captured, scores within sampling noise of pre-refactor baseline)
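
As an illustration of the scene-type invariant listed above, here is a small sketch of action stripping and the kind of structural check that bounds it. Only the name getEffectiveActions and the spotlight/laser actions come from this PR; the signature, the scene-type union, and the 'speech' action are assumptions for the example.

```typescript
// Assumed scene types; the PR only names slide and quiz scenes explicitly.
type SceneType = 'slide' | 'quiz';

// Actions that only make sense on slide scenes (per the PR description).
const SLIDE_ONLY_ACTIONS: readonly string[] = ['spotlight', 'laser'];

// Hypothetical signature: returns the actions an agent may use in a scene.
function getEffectiveActions(allowed: readonly string[], sceneType: SceneType): string[] {
  if (sceneType === 'slide') {
    return allowed.slice();
  }
  // Non-slide scenes: strip spotlight and laser.
  return allowed.filter((action) => SLIDE_ONLY_ACTIONS.indexOf(action) === -1);
}

// Behavior-level check: assert what must hold, not the exact prompt bytes.
const quizActions = getEffectiveActions(['speech', 'spotlight', 'laser'], 'quiz');
const slideActions = getEffectiveActions(['speech', 'spotlight', 'laser'], 'slide');
```

A test built on this asserts that quizActions contains neither spotlight nor laser, which stays valid however the surrounding prompt prose changes in Phase 2.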

An earlier iteration of this branch added ~1900 lines of byte-equal snapshot tests as a refactor safety net. Those were removed after review surfaced that byte-equality was the wrong invariant for a refactor whose goal is enabling Phase 2 content changes — every intentional Phase 2 tweak would have produced large snapshot diffs for no benefit beyond what the eval and structural assertions already cover.

Phase 2 deferrals (deliberate)

  • ROLE_GUIDELINES, buildLengthGuidelines, buildWhiteboardGuidelines remain as TS template literals in prompt-builder.ts — role-conditional prose; snippet extraction deferred until eval-driven changes show what needs the treatment.
  • lib/prompts/templates/slide-content/{system,user}.md still uses snake_case placeholders (pre-existing). Normalize when generation pipeline gets a similar pass. The new README flags this.

Test plan

  • npx tsc --noEmit clean
  • pnpm lint zero errors
  • pnpm check (Prettier) clean
  • pnpm test tests/prompts — 10/10 pass
  • 1-scenario end-to-end smoke eval passed
  • CI green

Note: 3 unrelated failures in tests/server/provider-config.test.ts are environment-config tests that don't sandbox process.env; they fail in any setup with .env.local containing provider keys (verified same 3 fail on a fresh main checkout). Not a regression from this branch.

🤖 Generated with Claude Code

wyuc added a commit that referenced this pull request Apr 19, 2026
Addresses two items flagged by PR #459 final review:

1. lib/prompts/README.md — conventions, syntax, gotchas, and local-testing
   recipe for template authors. Previously there was no doc for the stated
   goal of "non-engineers can edit prompts."

2. tests/prompts/templates.test.ts — 8 structural assertions covering:
   - no surviving {{...}} placeholders in any rendered template
   - role dispatch (teacher → LEAD TEACHER, student → not)
   - scene-type action stripping (quiz scene has no spotlight/laser)
   - director prompt output spec mentions next_agent

   These replace the removed byte-equal snapshot suite at much lower
   maintenance cost — they assert behaviors the refactor must preserve,
   not exact bytes.
@wyuc wyuc force-pushed the refactor/orchestration-prompts branch from dd2617b to 45de0f8 Compare April 19, 2026 16:29

wyuc commented Apr 19, 2026

Agent Code Review Process

This PR was developed and reviewed using superpowers:subagent-driven-development. 20 subagent dispatches total — 7 implementers, 5 spec-compliance reviewers, 5 code-quality reviewers, 3 follow-up fix implementers, 2 final PR-level reviews. Every task ran the cycle: implementer → spec review → code-quality review.

Substantive findings and how they were addressed

| Stage | Reviewer finding | Action taken |
| --- | --- | --- |
| Task 2 snapshot coverage | The original convertMessagesToOpenAI test passed currentAgentId equal to the message's agentId, so the cross-agent role-conversion branch was never exercised. Variant 7 ("whiteboard-open scene") was mis-named: the actual spotlight/laser strip path (non-slide scene types) was uncovered. | Split into same-agent + cross-agent tests; added a real quiz-scene variant to cover the strip path; added an assistant-role variant. |
| Task 4 placeholder convention | agent-system template used snake_case placeholders while other templates (slide-actions, slide-content) use camelCase. Inconsistency would compound in Task 5. | Renamed all 17 placeholders to camelCase; extracted the static ## Speech Guidelines (CRITICAL) block into a reusable speech-guidelines.md snippet. |
| Final PR review: type placement | WhiteboardActionRecord and AgentTurnSummary lived in lib/orchestration/director-prompt.ts but were imported by 6 modules including the neutral summarizers/ modules, creating an upward import direction. | Extracted to lib/orchestration/types.ts; updated all 8 importers. |
| User observation | Byte-equal snapshot tests (~1900 lines of .snap) locked the wrong invariant for this refactor: the goal is a stable substrate for future prompt editing, not frozen bytes. Phase 2 tweaks would produce large snapshot diffs per commit for no benefit beyond what the eval already provides. | Removed the full snapshot suite. |
| Post-update review | Missing lib/prompts/README.md (non-engineer editability was a stated goal). The PR description's "Phase 2 is pure markdown editing" claim didn't hold: viewport bounds, length targets, and role guidelines remain in TS. Two trailing empty ci: commits (workflow re-trigger attempts) polluted history. | Added lib/prompts/README.md (syntax, naming conventions, silent-passthrough gotcha, local-testing recipe); added 8 structural assertion tests (tests/prompts/templates.test.ts) covering no-unresolved-placeholders, role dispatch, scene-type action stripping, and director output-spec shape; softened Phase 2 language in PR body; squashed out the empty commits. |

Final state

  • pnpm test tests/prompts passes (10 tests: 3 loader + 8 structural)
  • npx tsc --noEmit, pnpm lint, pnpm check all clean
  • 1-scenario end-to-end smoke eval (econ-tech-innovation) passed: agent loop ran 3 turns, 2 screenshots captured
  • Previous CI run on dee3abd was all green; this incremental (~235 lines README + tests, plus squash) needs CI to re-trigger after the force-push.

wyuc and others added 11 commits April 20, 2026 00:38
Move file-based prompt loader from lib/generation/prompts/ to
project-level lib/prompts/ so it can be shared by orchestration
and PBL prompts in subsequent commits.

- Templates and snippets moved alongside the loader
- getPromptsDir() now points at lib/prompts/
- Generation pipeline imports updated to @/lib/prompts
  (outline-generator, scene-generator, search-query-builder,
   scene-outlines-stream route)
- Add tests/prompts/loader.test.ts as sanity gate

Pure file move — no behavior change.

Snapshot tests cover variant matrix:
- buildStructuredPrompt: role × scene-type × peer/ledger/discussion/profile
- convertMessagesToOpenAI: mixed message kinds
- summarizeConversation: truncation
- buildDirectorPrompt: Q&A vs discussion, ledger, profile
- buildPBLSystemPrompt: default config

These lock current output so the body refactor in subsequent
commits can be verified byte-equal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Split convertMessagesToOpenAI test: same-agent (existing) +
  cross-agent (new). The original test passed currentAgentId
  matching the message's agentId, so the cross-agent role-conversion
  branch was never exercised.

- Add 'teacher / quiz scene' snapshot to lock the spotlight/laser
  strip path in getEffectiveActions. The previous variant-7 test
  (whiteboard-open) only triggered the mutual-exclusion warning,
  not the strip — renamed it to reflect what it actually tests.

- Add assistant-role variant (was uncovered: ROLE_GUIDELINES,
  buildLengthGuidelines, buildWhiteboardGuidelines all branch on
  role).

Move state-context, virtual whiteboard ledger replay, peer context,
message converter, and conversation summary out of the 890-line
prompt-builder.ts into focused modules under lib/orchestration/summarizers/.

prompt-builder.ts now only owns prompt assembly. director-graph
imports updated to the new locations.

Snapshot tests pass byte-equal — no behavior change.

The buildStructuredPrompt template literal is now
lib/prompts/templates/agent-system/system.md, assembled by a
thin variable-pass through the shared prompt loader.

Per-variant ternaries (format example, ordering, spotlight
examples, mutual-exclusion note) are kept as module-level
constants in prompt-builder.ts. ROLE_GUIDELINES,
buildLengthGuidelines, and buildWhiteboardGuidelines stay too
(they may further migrate to template partials in Phase 2).

Also: PROMPT_IDS gains a 'satisfies Record<string, PromptId>'
clause to prevent constant/union drift as more IDs are added.

Snapshot tests (12 prompt-builder + 4 director + 1 pbl) pass
byte-equal.
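
The "thin variable-pass" shape described in this commit could look roughly like the sketch below. It is not the repository's code: the template text, the guideline strings, the variable names, and the interpolate helper are stand-ins chosen for illustration, and the real assembler reads the template from lib/prompts/templates/agent-system/system.md through the shared loader.

```typescript
// Roles named in the PR; the guideline strings below are placeholders.
type Role = 'teacher' | 'assistant' | 'student';

// Stand-in for the markdown template file (contents assumed).
const AGENT_SYSTEM_TEMPLATE = [
  '# Agent: {{agentName}}',
  '{{roleGuidelines}}',
  '{{spotlightExamples}}',
].join('\n');

// Per-variant text stays in TS as module-level constants.
const ROLE_GUIDELINES: Record<Role, string> = {
  teacher: 'You are the LEAD TEACHER.',
  assistant: 'You support the teacher.',
  student: 'You are a student in the class.',
};
const SPOTLIGHT_EXAMPLES = 'Use spotlight to focus on one slide element.';

// Simplified interpolation: unknown placeholders pass through unchanged.
function interpolate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) => {
    const value = Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : undefined;
    return typeof value === 'string' ? value : match;
  });
}

// Thin assembler: pick the variants, then hand variables to the template.
function buildStructuredPrompt(agentName: string, role: Role, isSlideScene: boolean): string {
  return interpolate(AGENT_SYSTEM_TEMPLATE, {
    agentName,
    roleGuidelines: ROLE_GUIDELINES[role],
    spotlightExamples: isSlideScene ? SPOTLIGHT_EXAMPLES : '',
  });
}
```

The point of the split is that the markdown file owns the prose and layout, while the TS side only decides which variant strings to pass in.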

- Rename agent-system placeholders snake_case → camelCase to match
  the convention already used by slide-actions/user.md and other
  generation templates. Settles convention before Task 5 lands two
  more templates that would otherwise inherit drift.

- Extract '## Speech Guidelines (CRITICAL)' into
  lib/prompts/snippets/speech-guidelines.md for reuse by director
  and PBL prompts in Task 5.

Snapshot tests pass byte-equal — placeholder name changes are
invisible in rendered output, and snippet inclusion happens at
load time.
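
To see why a placeholder rename leaves rendered output unchanged, consider this in-memory sketch of the two loader steps. The function names processSnippets and interpolateVariables and both regex shapes appear elsewhere in this PR; the bodies, the snippet text, the in-memory snippet table, and the render helper are assumptions made so the example runs without touching the file system.

```typescript
// In-memory stand-in for lib/prompts/snippets/ (snippet body is invented).
const SNIPPETS: Record<string, string> = {
  'speech-guidelines': '## Speech Guidelines (CRITICAL)\nKeep spoken text brief.',
};

function lookup(table: Record<string, string>, key: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(table, key) ? table[key] : undefined;
}

// Step 1: expand {{snippet:id}} references when the template is loaded.
function processSnippets(template: string): string {
  return template.replace(/\{\{snippet:(\w[\w-]*)\}\}/g, (match: string, id: string) =>
    lookup(SNIPPETS, id) ?? match);
}

// Step 2: substitute {{name}} variables supplied by the caller.
function interpolateVariables(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    lookup(vars, key) ?? match);
}

// Hypothetical helper combining both steps.
function render(template: string, vars: Record<string, string>): string {
  return interpolateVariables(processSnippets(template), vars);
}

// Same prompt, before and after the snake_case -> camelCase rename.
const before = render('Hi {{agent_name}}.\n{{snippet:speech-guidelines}}', { agent_name: 'Ada' });
const after = render('Hi {{agentName}}.\n{{snippet:speech-guidelines}}', { agentName: 'Ada' });
```

Because the placeholder name exists only in the template and in the variables object, before and after are the same string as long as both sides are renamed together.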

director-prompt.ts and pbl-system-prompt.ts now use the shared
template loader. Bodies collapse to thin variable assembly.

Three orchestration-tier prompts now share one infrastructure:
agent-system, director, pbl-design.

Snapshot tests pass byte-equal.

…mmary to types module

These two interfaces are imported by 6+ modules including
summarizers/ — having them live in director-prompt.ts created
an awkward upward import direction (summarizers reaching back
to a sibling prompt builder for types).

Move them to lib/orchestration/types.ts and update all callers.
director-prompt.ts now imports them too.

Pure type-location refactor — snapshots pass byte-equal.

Per review feedback: byte-equal snapshots were the wrong invariant
for this refactor. The goal was a stable substrate for editing
prompts, not a frozen contract. Snapshots would have generated
maintenance churn (~100-200 line diffs per intentional Phase 2
prompt tweak) without providing value the eval doesn't already.

Test posture going forward:
- tests/prompts/loader.test.ts (3 tests) — bounds the loader
  infrastructure (template load, snippet inclusion, variable
  interpolation, missing-id behavior)
- pnpm eval:whiteboard — bounds end-to-end agent loop, template
  composition, and integration with chat/director/state-manager

Removed: 1879 lines of .snap + 320 lines of test scaffolding +
fixtures. Net PR diff drops by ~2200 lines.

PR #461 (interactive mode clean) added 7 new templates under
lib/generation/prompts/templates/ (the pre-refactor location).
Rebase onto origin/main landed them at the old path; git's rename
detection didn't forward them through the directory move in db56c40.

Manually move them to lib/prompts/templates/ to match the new
convention, and fix scene-generator.ts's relative import
('./prompts/types' → '@/lib/prompts/types') since the dir no
longer exists as a sibling.

No behavior change — templates read from PROMPT_IDS values via
getPromptsDir() which already points at lib/prompts/.
@wyuc wyuc force-pushed the refactor/orchestration-prompts branch from 45de0f8 to 0274e2c Compare April 19, 2026 16:42
@cosarah cosarah left a comment


Behavior preservation looks good to me for the three migrated prompts — summarizer extraction, types.ts move, and template naming convention are all improvements. Two things worth doing before merge, plus one real bug that's low-impact today but worth fixing.

Important

  • Structural tests cover only about half the conditional surface. tests/prompts/templates.test.ts asserts role dispatch and scene-type stripping but skips the branches Phase 2 is most likely to touch:

    • Director discussion-context branch — lib/orchestration/director-prompt.ts:51-60
    • Peer context section — lib/orchestration/prompt-builder.ts:145 (largest single conditional block in agent-system)
    • Assistant role — teacher and student are tested; buildLengthGuidelines/buildWhiteboardGuidelines assistant branches are silent
    • Language directive — fixture always sets languageDirective: 'zh-CN'; the null path isn't exercised
    • PBL {{issueCount}} appears three times in the template; no assertion confirms all three are filled

    These are small additions, not a re-snapshot.

  • README doesn't tell editors what is still in TS. The PR body correctly notes viewport bounds (1000×562), role-specific length targets, and ROLE_GUIDELINES still live in prompt-builder.ts. A contributor reading lib/prompts/README.md expecting a pure-markdown workflow won't know to go there. A short "Still in TS" section would prevent a real friction point.

Minor

  • Snippet typos ship silently. lib/prompts/loader.ts:44 returns the literal string `{{snippet:${id}}}` on a missing snippet and only logs a warn. The UNRESOLVED_PLACEHOLDER regex at tests/prompts/templates.test.ts:88 is /\{\{\w[\w-]*\}\}/, which does not match `{{snippet:foo}}` — the : breaks it. A typo like `{{snippet:speach-guidelines}}` would reach the LLM with no test failure. Either broaden the regex (e.g. /\{\{(\w[\w-]*|snippet:[\w-]+)\}\}/) or have loadSnippet throw on missing files.

  • interpolateVariables regex is narrower than processSnippets. loader.ts:102 uses \{\{(\w+)\}\} — a kebab-case placeholder like `{{next-agent}}` would silently pass through. The README mandates camelCase so it's not a bug today; worth either a comment above the regex or a lint test that scans every templates/*/{system,user}.md for non-camelCase placeholders.

  • lib/orchestration/summarizers/peer-context.ts has a single consumer. Not a blocker; flag if the one-caller situation persists after Phase 2 — at that point folding it back into prompt-builder.ts removes indirection.
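
The two regex gaps in the Minor items can be checked directly. The sketch below uses the three patterns exactly as quoted in this review; the constant names other than UNRESOLVED_PLACEHOLDER and the results object are only for illustration.

```typescript
// Pattern quoted from tests/prompts/templates.test.ts in the review.
const UNRESOLVED_PLACEHOLDER = /\{\{\w[\w-]*\}\}/;
// Broadened pattern suggested in the review.
const BROADENED_PLACEHOLDER = /\{\{(\w[\w-]*|snippet:[\w-]+)\}\}/;
// Pattern quoted for interpolateVariables in loader.ts.
const INTERPOLATE_PATTERN = /\{\{(\w+)\}\}/;

const typoSnippet = '{{snippet:speach-guidelines}}';
const kebabPlaceholder = '{{next-agent}}';

const results = {
  // false: the ':' stops the match, so a surviving snippet typo is not flagged
  originalFlagsSnippetTypo: UNRESOLVED_PLACEHOLDER.test(typoSnippet),
  // true: the extra alternative covers the snippet: form
  broadenedFlagsSnippetTypo: BROADENED_PLACEHOLDER.test(typoSnippet),
  // false: \w+ excludes '-', so a kebab-case placeholder is never substituted
  interpolateMatchesKebab: INTERPOLATE_PATTERN.test(kebabPlaceholder),
  // true: a kebab-case leftover in rendered output is still flagged as unresolved
  originalFlagsKebab: UNRESOLVED_PLACEHOLDER.test(kebabPlaceholder),
};
```

So a kebab-case placeholder is caught only if a test happens to render the template that contains it, while a misspelled snippet reference is not caught by the original pattern at all.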

Verdict

Approve with nits. The testing and README suggestions would substantially improve Phase 2's downstream safety; the snippet-regex fix is real but low-impact. Nothing here is structural.

- loadSnippet now throws on missing file instead of silently returning
  a literal {{snippet:id}} string. A typo like {{snippet:speach-guidelines}}
  now fails at load time instead of reaching the LLM.

- README gains a "Still in TypeScript" section listing the role-conditional
  content that still lives in prompt-builder.ts (ROLE_GUIDELINES,
  buildLengthGuidelines, buildWhiteboardGuidelines) so contributors
  expecting a pure-markdown workflow know where to look.

- Expand tests/prompts/templates.test.ts to cover the conditional branches
  Phase 2 is most likely to touch (9 new assertions):
  - assistant role dispatch
  - peer-context section toggles on agentResponses presence
  - language constraint toggles on stage.languageDirective presence
  - director Q&A vs discussion mode branching
  - pbl-design {{issueCount}} substituted at all 3 occurrences
  - placeholder-naming-convention lint scans every template for
    non-camelCase placeholders (slide-content grandfathered)

- Comment above interpolateVariables regex documents why kebab-case
  placeholders pass through silently (the lint test now catches them).

- New test in loader.test.ts locks the throw-on-missing-snippet behavior.
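
A minimal sketch of the throw-on-missing-snippet behavior described in this commit, using an in-memory table in place of the snippet files. The function names loadSnippet and processSnippets come from the PR; the bodies, the error message, and the snippet text are assumptions.

```typescript
// In-memory stand-in for the snippet directory (contents invented).
const SNIPPET_FILES: Record<string, string> = {
  'speech-guidelines': '## Speech Guidelines (CRITICAL)',
};

// Fail loudly instead of returning the literal {{snippet:id}} string.
function loadSnippet(id: string): string {
  const content = Object.prototype.hasOwnProperty.call(SNIPPET_FILES, id)
    ? SNIPPET_FILES[id]
    : undefined;
  if (content === undefined) {
    throw new Error(`Snippet not found: ${id}`);
  }
  return content;
}

// Expanding a template now surfaces a misspelled snippet id immediately.
function processSnippets(template: string): string {
  return template.replace(/\{\{snippet:(\w[\w-]*)\}\}/g, (_match: string, id: string) =>
    loadSnippet(id));
}

// Helper for tests: reports whether expanding the template throws.
function throwsOnLoad(template: string): boolean {
  try {
    processSnippets(template);
    return false;
  } catch (error) {
    return true;
  }
}
```

Under these assumptions, throwsOnLoad('{{snippet:speach-guidelines}}') is true, so the typo fails at load time rather than reaching the LLM.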

wyuc commented Apr 20, 2026

Thanks @cosarah. All 4 handled in 0c5fefa.

Brief on why each slipped:

  • Tests half-covered the conditionals — I wrote the structural assertions from memory after removing the snapshots, rather than walking the template conditionals. Re-did it by reading the {{...}} call sites; 9 more assertions cover peer/language/assistant/discussion/issueCount.
  • README missing "Still in TS" — I documented how the infra works but not where each kind of content lives. A Phase 2 contributor arrives needing the inverse map.
  • loadSnippet silent passthrough — pre-existing behavior inherited from the generation loader; I didn't audit its error paths when promoting it. Now throws.
  • Regex asymmetry — didn't put interpolateVariables's \w+ next to processSnippets's \w[\w-]*. Added a lint test so any kebab-case drift fails loud.
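
The placeholder-naming lint mentioned in the last item might be sketched as below. The function name, both patterns, and the decision to skip snippet: references are assumptions; the actual test reads every template file and grandfathers slide-content, which this in-memory version leaves out.

```typescript
// Returns the placeholder names in a template that are not camelCase.
function findNonCamelCasePlaceholders(template: string): string[] {
  // Fresh regex per call so the global flag's lastIndex starts at 0.
  const placeholderPattern = /\{\{([^{}]+)\}\}/g;
  const camelCase = /^[a-z][a-zA-Z0-9]*$/;
  const offenders: string[] = [];

  let match: RegExpExecArray | null = placeholderPattern.exec(template);
  while (match !== null) {
    const name = match[1] ?? '';
    // Snippet references follow their own kebab-case convention; skip them.
    const isSnippetRef = name.indexOf('snippet:') === 0;
    if (!isSnippetRef && !camelCase.test(name)) {
      offenders.push(name);
    }
    match = placeholderPattern.exec(template);
  }
  return offenders;
}

// Example: one valid name, two offenders, one snippet reference.
const offenders = findNonCamelCasePlaceholders(
  '{{agentName}} {{next-agent}} {{agent_name}} {{snippet:speech-guidelines}}',
);
```

Here offenders contains next-agent and agent_name, so a template that drifts from camelCase fails the lint even though interpolation itself would let it pass through silently.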

Leaving the peer-context.ts single-consumer note for Phase 2 as you suggested.

@cosarah cosarah left a comment


tested locally, LGTM

@cosarah cosarah merged commit f40c92f into main Apr 20, 2026
3 checks passed
@wyuc wyuc deleted the refactor/orchestration-prompts branch April 20, 2026 05:46