refactor(orchestration): migrate prompt builders to file-based templates by wyuc · Pull Request #459 · THU-MAIC/OpenMAIC

wyuc · 2026-04-19T10:52:57Z

Summary

Phase 1 of a two-phase refactor of the orchestration prompts subsystem.

Migrates three orchestration-tier prompt builders (buildStructuredPrompt, buildDirectorPrompt, buildPBLSystemPrompt) from inline TS template literals to file-based markdown templates, sharing the loader infrastructure already used by the generation pipeline.

Unblocks much of Phase 2 content optimization via markdown edits (LaTeX few-shot examples, speech guidelines, snippet additions). A subset — viewport bounds (1000×562), role-specific length targets, and role guidelines — still lives as TS template literals in prompt-builder.ts and will migrate in a follow-up extraction pass.

What changed

lib/generation/prompts/ → lib/prompts/ (project-level loader, used by both generation + orchestration)
lib/orchestration/prompt-builder.ts 890 → 314 lines (thin assembler)
lib/orchestration/director-prompt.ts 290 → 271 lines (thin assembler)
lib/pbl/pbl-system-prompt.ts 93 → 30 lines (thin assembler)
New lib/orchestration/summarizers/ — 5 focused modules
New lib/orchestration/types.ts — shared WhiteboardActionRecord + AgentTurnSummary
New templates: lib/prompts/templates/{agent-system,director,pbl-design}/system.md
New snippet: lib/prompts/snippets/speech-guidelines.md
New lib/prompts/README.md — conventions, syntax, gotchas, local-testing recipe for template authors
PROMPT_IDS gains as const satisfies Record<string, PromptId> type guard
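
As a rough illustration of the `as const satisfies Record<string, PromptId>` guard in the last item, the sketch below uses a hypothetical three-member `PromptId` union and hypothetical key names; the real union and constant in lib/prompts may look different.

```typescript
// Hypothetical union; the actual PromptId members are not shown in this PR text.
type PromptId = "agent-system" | "director" | "pbl-design";

// `satisfies` checks every value against PromptId at compile time, while
// `as const` keeps each value as its own literal type instead of widening to string.
const PROMPT_IDS = {
  AGENT_SYSTEM: "agent-system",
  DIRECTOR: "director",
  PBL_DESIGN: "pbl-design",
  // TYPO: "directr", // would be a compile error: not assignable to PromptId
} as const satisfies Record<string, PromptId>;

// Each entry can be used wherever a PromptId is expected, with no cast.
const allIds: PromptId[] = [
  PROMPT_IDS.AGENT_SYSTEM,
  PROMPT_IDS.DIRECTOR,
  PROMPT_IDS.PBL_DESIGN,
];
```

With this shape, adding a constant whose value is not in the union fails type-checking, which is the constant/union drift the guard is meant to catch.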

Test posture

tests/prompts/loader.test.ts — bounds the loader infrastructure (3 tests)
tests/prompts/templates.test.ts — bounds behavior-level invariants (8 tests): no unresolved {{...}} placeholders in any rendered prompt, role dispatch (teacher → LEAD TEACHER, student → not), scene-type action stripping (quiz scene strips spotlight/laser), director output-spec shape
pnpm eval:whiteboard — bounds end-to-end composition. 1-scenario smoke run on this branch passed (agent loop ran 3 turns, 2 screenshots captured, scores within sampling noise of pre-refactor baseline)

An earlier iteration of this branch added ~1900 lines of byte-equal snapshot tests as a refactor safety net. Those were removed after review surfaced that byte-equality was the wrong invariant for a refactor whose goal is enabling Phase 2 content changes — every intentional Phase 2 tweak would have produced large snapshot diffs for no benefit beyond what the eval and structural assertions already cover.
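
To make the contrast with byte-equal snapshots concrete, here is a minimal sketch of a behavior-level check. `renderAgentPrompt` is a hypothetical stand-in with invented prompt text; the real tests render the actual templates.

```typescript
type Role = "teacher" | "student";

// Hypothetical renderer standing in for the real prompt builders.
function renderAgentPrompt(role: Role): string {
  const roleLine = role === "teacher" ? "Role: LEAD TEACHER" : "Role: student";
  return ["# Agent system prompt", roleLine, "Keep answers short."].join("\n");
}

// Placeholder pattern as quoted in the review of tests/prompts/templates.test.ts.
const UNRESOLVED_PLACEHOLDER = /\{\{\w[\w-]*\}\}/;

// Returns the violated invariants; an empty list means the prompt passes.
function checkInvariants(role: Role, prompt: string): string[] {
  const problems: string[] = [];
  if (UNRESOLVED_PLACEHOLDER.test(prompt)) {
    problems.push("unresolved placeholder");
  }
  const hasLead = prompt.indexOf("LEAD TEACHER") !== -1;
  if (role === "teacher" && !hasLead) {
    problems.push("teacher prompt lacks LEAD TEACHER");
  }
  if (role === "student" && hasLead) {
    problems.push("student prompt contains LEAD TEACHER");
  }
  return problems;
}
```

Rewording the surrounding prose leaves these checks green, whereas a snapshot would report a diff for every such edit.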

Phase 2 deferrals (deliberate)

ROLE_GUIDELINES, buildLengthGuidelines, buildWhiteboardGuidelines remain as TS template literals in prompt-builder.ts — role-conditional prose; snippet extraction deferred until eval-driven changes show what needs the treatment.
lib/prompts/templates/slide-content/{system,user}.md still uses snake_case placeholders (pre-existing). Normalize when generation pipeline gets a similar pass. The new README flags this.

Test plan

Note: 3 unrelated failures in tests/server/provider-config.test.ts are environment-config tests that don't sandbox process.env; they fail in any setup with .env.local containing provider keys (verified same 3 fail on a fresh main checkout). Not a regression from this branch.

🤖 Generated with Claude Code


wyuc · 2026-04-19T16:30:52Z

Agent Code Review Process

This PR was developed and reviewed using superpowers:subagent-driven-development. 22 subagent dispatches total: 7 implementers, 5 spec-compliance reviewers, 5 code-quality reviewers, 3 follow-up fix implementers, 2 final PR-level reviews. Every task ran the cycle: implementer → spec review → code-quality review.

Substantive findings and how they were addressed

| Stage | Reviewer finding | Action taken |
| --- | --- | --- |
| Task 2 snapshot coverage | The original `convertMessagesToOpenAI` test passed `currentAgentId` equal to the message's `agentId`, so the cross-agent role-conversion branch was never exercised. Variant 7 ("whiteboard-open scene") was mis-named: the actual spotlight/laser strip path (non-slide scene types) was uncovered. | Split into same-agent + cross-agent tests; added a real quiz-scene variant to cover the strip path; added an assistant-role variant. |
| Task 4 placeholder convention | agent-system template used `snake_case` placeholders while other templates (slide-actions, slide-content) use `camelCase`. Inconsistency would compound in Task 5. | Renamed all 17 placeholders to camelCase; extracted the static `## Speech Guidelines (CRITICAL)` block into a reusable `speech-guidelines.md` snippet. |
| Final PR review: type placement | `WhiteboardActionRecord` and `AgentTurnSummary` lived in `lib/orchestration/director-prompt.ts` but were imported by 6 modules including the neutral `summarizers/` modules, creating an upward import direction. | Extracted to `lib/orchestration/types.ts`; updated all 8 importers. |
| User observation | Byte-equal snapshot tests (~1900 lines of .snap) locked the wrong invariant for this refactor: the goal is a stable substrate for future prompt editing, not frozen bytes. Phase 2 tweaks would produce large snapshot diffs per commit for no benefit beyond what the eval already provides. | Removed the full snapshot suite. |
| Post-update review | Missing `lib/prompts/README.md` (non-engineer editability was a stated goal). The PR description's "Phase 2 is pure markdown editing" claim didn't hold: viewport bounds, length targets, and role guidelines remain in TS. Two trailing empty `ci:` commits (workflow re-trigger attempts) polluted history. | Added `lib/prompts/README.md` (syntax, naming conventions, silent-passthrough gotcha, local-testing recipe); added 8 structural assertion tests (`tests/prompts/templates.test.ts`) covering no-unresolved-placeholders, role dispatch, scene-type action stripping, and director output-spec shape; softened Phase 2 language in PR body; squashed out the empty commits. |

Final state

pnpm test tests/prompts passes (11 tests: 3 loader + 8 structural)
npx tsc --noEmit, pnpm lint, pnpm check all clean
1-scenario end-to-end smoke eval (econ-tech-innovation) passed: agent loop ran 3 turns, 2 screenshots captured
Previous CI run on dee3abd was all green; this incremental (~235 lines README + tests, plus squash) needs CI to re-trigger after the force-push.

Move file-based prompt loader from lib/generation/prompts/ to project-level lib/prompts/ so it can be shared by orchestration and PBL prompts in subsequent commits.
- Templates and snippets moved alongside the loader
- getPromptsDir() now points at lib/prompts/
- Generation pipeline imports updated to @/lib/prompts (outline-generator, scene-generator, search-query-builder, scene-outlines-stream route)
- Add tests/prompts/loader.test.ts as sanity gate

Pure file move, no behavior change.

Snapshot tests cover variant matrix:
- buildStructuredPrompt: role × scene-type × peer/ledger/discussion/profile
- convertMessagesToOpenAI: mixed message kinds
- summarizeConversation: truncation
- buildDirectorPrompt: Q&A vs discussion, ledger, profile
- buildPBLSystemPrompt: default config

These lock current output so the body refactor in subsequent commits can be verified byte-equal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Split convertMessagesToOpenAI test: same-agent (existing) + cross-agent (new). The original test passed currentAgentId matching the message's agentId, so the cross-agent role-conversion branch was never exercised.
- Add 'teacher / quiz scene' snapshot to lock the spotlight/laser strip path in getEffectiveActions. The previous variant-7 test (whiteboard-open) only triggered the mutual-exclusion warning, not the strip; renamed it to reflect what it actually tests.
- Add assistant-role variant (was uncovered: ROLE_GUIDELINES, buildLengthGuidelines, buildWhiteboardGuidelines all branch on role).

Move state-context, virtual whiteboard ledger replay, peer context, message converter, and conversation summary out of the 890-line prompt-builder.ts into focused modules under lib/orchestration/summarizers/. prompt-builder.ts now only owns prompt assembly. director-graph imports updated to the new locations. Snapshot tests pass byte-equal — no behavior change.

The buildStructuredPrompt template literal is now lib/prompts/templates/agent-system/system.md, assembled by a thin variable-pass through the shared prompt loader. Per-variant ternaries (format example, ordering, spotlight examples, mutual-exclusion note) are kept as module-level constants in prompt-builder.ts. ROLE_GUIDELINES, buildLengthGuidelines, and buildWhiteboardGuidelines stay too (they may further migrate to template partials in Phase 2). Also: PROMPT_IDS gains a 'satisfies Record<string, PromptId>' clause to prevent constant/union drift as more IDs are added. Snapshot tests (12 prompt-builder + 4 director + 1 pbl) pass byte-equal.
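
A minimal sketch of the "thin assembler" shape this commit describes, under stated assumptions: the template lives in an in-memory string rather than lib/prompts/templates/agent-system/system.md, and the placeholder names, the constant names, and the `speech` action are invented for illustration. Only the `\{\{(\w+)\}\}` pattern comes from the review of loader.ts.

```typescript
// In-memory stand-in for the markdown template; the text is invented.
const AGENT_SYSTEM_TEMPLATE = [
  "You are {{agentName}}.",
  "Allowed actions: {{allowedActions}}",
  "{{spotlightExamples}}",
].join("\n");

// Per-variant text stays in TypeScript as module-level constants (hypothetical names).
const SPOTLIGHT_EXAMPLES = "Example: spotlight the key element before explaining it.";
const NO_SPOTLIGHT_EXAMPLES = "";

type SceneType = "slide" | "quiz";

// Leaving unknown names untouched is an assumption about the real loader.
function interpolateVariables(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : match,
  );
}

// The builder only picks variant constants and passes variables through.
function buildAgentPrompt(agentName: string, sceneType: SceneType): string {
  const isSlide = sceneType === "slide";
  return interpolateVariables(AGENT_SYSTEM_TEMPLATE, {
    agentName,
    allowedActions: isSlide ? "speech, spotlight, laser" : "speech",
    spotlightExamples: isSlide ? SPOTLIGHT_EXAMPLES : NO_SPOTLIGHT_EXAMPLES,
  });
}
```

The design keeps wording in the template and branching in TypeScript, so a prose edit touches only markdown while the ternaries remain type-checked.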

- Rename agent-system placeholders snake_case → camelCase to match the convention already used by slide-actions/user.md and other generation templates. Settles convention before Task 5 lands two more templates that would otherwise inherit drift.
- Extract '## Speech Guidelines (CRITICAL)' into lib/prompts/snippets/speech-guidelines.md for reuse by director and PBL prompts in Task 5.

Snapshot tests pass byte-equal: placeholder name changes are invisible in rendered output, and snippet inclusion happens at load time.

director-prompt.ts and pbl-system-prompt.ts now use the shared template loader. Bodies collapse to thin variable assembly. Three orchestration-tier prompts now share one infrastructure: agent-system, director, pbl-design. Snapshot tests pass byte-equal.

…mmary to types module These two interfaces are imported by 6+ modules including summarizers/ — having them live in director-prompt.ts created an awkward upward import direction (summarizers reaching back to a sibling prompt builder for types). Move them to lib/orchestration/types.ts and update all callers. director-prompt.ts now imports them too. Pure type-location refactor — snapshots pass byte-equal.

Per review feedback: byte-equal snapshots were the wrong invariant for this refactor. The goal was a stable substrate for editing prompts, not a frozen contract. Snapshots would have generated maintenance churn (~100-200 line diffs per intentional Phase 2 prompt tweak) without providing value the eval doesn't already.

Test posture going forward:
- tests/prompts/loader.test.ts (3 tests): bounds the loader infrastructure (template load, snippet inclusion, variable interpolation, missing-id behavior)
- pnpm eval:whiteboard: bounds end-to-end agent loop, template composition, and integration with chat/director/state-manager

Removed: 1879 lines of .snap + 320 lines of test scaffolding + fixtures. Net PR diff drops by ~2200 lines.

Addresses two items flagged by PR #459 final review:
1. lib/prompts/README.md: conventions, syntax, gotchas, and local-testing recipe for template authors. Previously there was no doc for the stated goal of "non-engineers can edit prompts."
2. tests/prompts/templates.test.ts: 8 structural assertions covering:
   - no surviving {{...}} placeholders in any rendered template
   - role dispatch (teacher → LEAD TEACHER, student → not)
   - scene-type action stripping (quiz scene has no spotlight/laser)
   - director prompt output spec mentions next_agent

These replace the removed byte-equal snapshot suite at much lower maintenance cost: they assert behaviors the refactor must preserve, not exact bytes.

PR #461 (interactive mode clean) added 7 new templates under lib/generation/prompts/templates/ (the pre-refactor location). Rebase onto origin/main landed them at the old path; git's rename detection didn't forward them through the directory move in db56c40. Manually move them to lib/prompts/templates/ to match the new convention, and fix scene-generator.ts's relative import ('./prompts/types' → '@/lib/prompts/types') since the dir no longer exists as a sibling. No behavior change — templates read from PROMPT_IDS values via getPromptsDir() which already points at lib/prompts/.

cosarah

Behavior preservation looks good to me for the three migrated prompts — summarizer extraction, types.ts move, and template naming convention are all improvements. Two things worth doing before merge, plus one real bug that's low-impact today but worth fixing.

Important

Structural tests cover only about half the conditional surface. tests/prompts/templates.test.ts asserts role dispatch and scene-type stripping but skips the branches Phase 2 is most likely to touch:
- Director discussion-context branch — lib/orchestration/director-prompt.ts:51-60
- Peer context section — lib/orchestration/prompt-builder.ts:145 (largest single conditional block in agent-system)
- Assistant role — teacher and student are tested; buildLengthGuidelines/buildWhiteboardGuidelines assistant branches are silent
- Language directive — fixture always sets languageDirective: 'zh-CN'; the null path isn't exercised
- PBL {{issueCount}} appears three times in the template; no assertion confirms all three are filled
These are small additions, not a re-snapshot.
README doesn't tell editors what is still in TS. The PR body correctly notes viewport bounds (1000×562), role-specific length targets, and ROLE_GUIDELINES still live in prompt-builder.ts. A contributor reading lib/prompts/README.md expecting a pure-markdown workflow won't know to go there. A short "Still in TS" section would prevent a real friction point.
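
For the `{{issueCount}}` point, a small sketch of what an "all occurrences filled" assertion could look like. The template sentence and the helper names are invented; the real pbl-design template is not reproduced in this review.

```typescript
// Invented template text with three occurrences of the same placeholder.
const pblTemplate =
  "Design exactly {{issueCount}} issues. Number them 1 to {{issueCount}}. Stop after {{issueCount}}.";

// Global replace so every occurrence is substituted, not just the first.
function fill(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match: string, key: string) =>
    Object.prototype.hasOwnProperty.call(vars, key) ? vars[key] : match,
  );
}

// Counts non-overlapping occurrences of `needle` in `haystack`.
function countOccurrences(haystack: string, needle: string): number {
  return haystack.split(needle).length - 1;
}

const renderedPbl = fill(pblTemplate, { issueCount: "4" });
const leftover = countOccurrences(renderedPbl, "{{issueCount}}");
const filled = countOccurrences(renderedPbl, "4");
```

Asserting both `leftover === 0` and `filled === 3` catches a replacement that drops the global flag as well as a template edit that removes one occurrence.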

Minor

Snippet typos ship silently. lib/prompts/loader.ts:44 returns the literal string `{{snippet:${id}}}` on a missing snippet and only logs a warn. The UNRESOLVED_PLACEHOLDER regex at tests/prompts/templates.test.ts:88 is /\{\{\w[\w-]*\}\}/, which does not match `{{snippet:foo}}` — the : breaks it. A typo like `{{snippet:speach-guidelines}}` would reach the LLM with no test failure. Either broaden the regex (e.g. /\{\{(\w[\w-]*|snippet:[\w-]+)\}\}/) or have loadSnippet throw on missing files.
interpolateVariables regex is narrower than processSnippets. loader.ts:102 uses \{\{(\w+)\}\} — a kebab-case placeholder like `{{next-agent}}` would silently pass through. The README mandates camelCase so it's not a bug today; worth either a comment above the regex or a lint test that scans every templates/*/{system,user}.md for non-camelCase placeholders.
lib/orchestration/summarizers/peer-context.ts has a single consumer. Not a blocker; flag if the one-caller situation persists after Phase 2 — at that point folding it back into prompt-builder.ts removes indirection.
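
The regex behavior described in the first two items can be checked directly. The three patterns below are the ones quoted above; the constant names and sample strings are illustrative.

```typescript
// Test-side pattern quoted from tests/prompts/templates.test.ts.
const NARROW_UNRESOLVED = /\{\{\w[\w-]*\}\}/;
// Broadened alternative suggested in the review.
const BROAD_UNRESOLVED = /\{\{(\w[\w-]*|snippet:[\w-]+)\}\}/;
// Loader-side pattern quoted for interpolateVariables.
const INTERPOLATE = /\{\{(\w+)\}\}/;

const leakedSnippet = "Intro\n{{snippet:speach-guidelines}}\nOutro";
const kebabPlaceholder = "Next speaker: {{next-agent}}";

// The colon stops the narrow pattern, so a misspelled snippet goes unnoticed.
const narrowCatchesSnippet = NARROW_UNRESOLVED.test(leakedSnippet); // false
const broadCatchesSnippet = BROAD_UNRESOLVED.test(leakedSnippet); // true

// \w does not include "-", so the loader never sees kebab-case names,
// while the test-side pattern would still flag them in rendered output.
const loaderSeesKebab = INTERPOLATE.test(kebabPlaceholder); // false
const narrowCatchesKebab = NARROW_UNRESOLVED.test(kebabPlaceholder); // true
```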

Verdict

Approve with nits. The testing and README suggestions would substantially improve Phase 2's downstream safety; the snippet-regex fix is real but low-impact. Nothing here is structural.

- loadSnippet now throws on missing file instead of silently returning a literal {{snippet:id}} string. A typo like {{snippet:speach-guidelines}} now fails at load time instead of reaching the LLM.
- README gains a "Still in TypeScript" section listing the role-conditional content that still lives in prompt-builder.ts (ROLE_GUIDELINES, buildLengthGuidelines, buildWhiteboardGuidelines) so contributors expecting a pure-markdown workflow know where to look.
- Expand tests/prompts/templates.test.ts to cover the conditional branches Phase 2 is most likely to touch (9 new assertions):
  - assistant role dispatch
  - peer-context section toggles on agentResponses presence
  - language constraint toggles on stage.languageDirective presence
  - director Q&A vs discussion mode branching
  - pbl-design {{issueCount}} substituted at all 3 occurrences
  - placeholder-naming-convention lint scans every template for non-camelCase placeholders (slide-content grandfathered)
- Comment above interpolateVariables regex documents why kebab-case placeholders pass through silently (the lint test now catches them).
- New test in loader.test.ts locks the throw-on-missing-snippet behavior.
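
A minimal sketch of the throw-on-missing-snippet behavior, assuming an in-memory snippet store in place of files under lib/prompts/snippets/. The snippet body, the error message, and the exact `{{snippet:...}}` pattern are assumptions; the real loader.ts may differ.

```typescript
// In-memory stand-in for the snippets directory; the body text is invented.
const SNIPPETS: Record<string, string> = {
  "speech-guidelines": "## Speech Guidelines (CRITICAL)\nSpeak naturally.",
};

// Throws instead of returning the literal placeholder, so typos fail at load time.
function loadSnippet(id: string): string {
  if (!Object.prototype.hasOwnProperty.call(SNIPPETS, id)) {
    throw new Error("Missing snippet: " + id);
  }
  return SNIPPETS[id];
}

// Expands every {{snippet:id}} reference; the id pattern allows kebab-case.
function processSnippets(template: string): string {
  return template.replace(/\{\{snippet:(\w[\w-]*)\}\}/g, (_match: string, id: string) =>
    loadSnippet(id),
  );
}

const expanded = processSnippets("Intro\n{{snippet:speech-guidelines}}");

let typoMessage = "";
try {
  processSnippets("Intro\n{{snippet:speach-guidelines}}");
} catch (err) {
  typoMessage = err instanceof Error ? err.message : String(err);
}
```

Failing fast here means a misspelled snippet id surfaces in the loader test rather than as stray braces in a prompt sent to the model.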

wyuc · 2026-04-20T05:09:13Z

Thanks @cosarah. All 4 handled in 0c5fefa.

Brief on why each slipped:

Tests half-covered the conditionals — I wrote the structural assertions from memory after removing the snapshots, rather than walking the template conditionals. Re-did it by reading the {{...}} call sites; 9 more assertions cover peer/language/assistant/discussion/issueCount.
README missing "Still in TS" — I documented how the infra works but not where each kind of content lives. A Phase 2 contributor arrives needing the inverse map.
loadSnippet silent passthrough — pre-existing behavior inherited from the generation loader; I didn't audit its error paths when promoting it. Now throws.
Regex asymmetry — didn't put interpolateVariables's \w+ next to processSnippets's \w[\w-]*. Added a lint test so any kebab-case drift fails loud.

Leaving the peer-context.ts single-consumer note for Phase 2 as you suggested.
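
For reference, a rough sketch of a placeholder-naming lint of the kind described in the regex-asymmetry item. It scans a template string rather than files on disk, and the function name and camelCase pattern are assumptions; the grandfathered slide-content templates would need an allowlist that is omitted here.

```typescript
// camelCase: lowercase first letter, then letters or digits only.
const CAMEL_CASE = /^[a-z][a-zA-Z0-9]*$/;

// Returns placeholder names that break the camelCase convention.
// Snippet references ({{snippet:...}}) are skipped because their ids are kebab-case by design.
function findNonCamelCasePlaceholders(template: string): string[] {
  const offenders: string[] = [];
  const pattern = /\{\{([^{}]+)\}\}/g;
  let match: RegExpExecArray | null = pattern.exec(template);
  while (match !== null) {
    const name = match[1];
    const isSnippet = name.indexOf("snippet:") === 0;
    if (!isSnippet && !CAMEL_CASE.test(name)) {
      offenders.push(name);
    }
    match = pattern.exec(template);
  }
  return offenders;
}

const sample =
  "{{agentName}} {{next-agent}} {{snippet:speech-guidelines}} {{scene_type}}";
const offenders = findNonCamelCasePlaceholders(sample);
```

Because the scan uses a wider pattern than interpolateVariables, it reports exactly the names the loader would silently leave in place.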

cosarah

tested locally, LGTM

wyuc force-pushed the refactor/orchestration-prompts branch from dd2617b to 45de0f8 Compare April 19, 2026 16:29

wyuc and others added 11 commits April 20, 2026 00:38

wyuc force-pushed the refactor/orchestration-prompts branch from 45de0f8 to 0274e2c Compare April 19, 2026 16:42

cosarah reviewed Apr 20, 2026


cosarah approved these changes Apr 20, 2026


cosarah merged commit f40c92f into main Apr 20, 2026
3 checks passed

wyuc deleted the refactor/orchestration-prompts branch April 20, 2026 05:46

cosarah merged 12 commits into main from refactor/orchestration-prompts
