Maestro

Conduct any output type at scale.

Maestro is an AI orchestration framework that defeats context window limits by decomposing large creative tasks into independent, parallelizable units of work. You describe what you want in plain English, Maestro plans the work, generates detailed instructions for each piece, executes them through specialized Claude instances, heals broken code automatically, runs visual QA, and assembles the final artifact.

It produces PowerPoint decks, Excel models, Word documents, HTML pages, code, and markdown, all from a single natural language goal.

"Create a 30-slide McKinsey-style presentation on the future of AI"
    ↓
30 independent Claude calls, each generating one slide
    ↓
Automatic code healing, visual QA via Gemini, critic review
    ↓
One merged .pptx file

Why This Exists

Claude can write excellent python-pptx code for a single slide. But ask it to generate a 30-slide deck in one shot and you will hit the output token limit around slide 8. The code gets truncated mid-line: from pptx.util impor, missing the t. The whole script fails.

Every context-window-bound AI tool hits this wall eventually. The standard answer is "make it shorter." Maestro's answer is "make it parallel."


The Core Problem

LLM output has hard token limits. For code-generating tasks (python-pptx, openpyxl), these limits are brutal:

  • A single python-pptx slide with charts can be 300-500 lines of code
  • The Anthropic SDK enforces a non-streaming timeout for requests that would exceed ~10 minutes
  • Even at 32K max output tokens, chart-heavy slides overflow
  • Truncated code has no graceful failure mode: prs.save(OUTPUT_PA is a SyntaxError

Maestro solves this by never asking one Claude instance to produce more than one slide's worth of output. Each "Movement" is a self-contained unit that fits comfortably in a single context window.


How It Works

The pipeline has five phases, each named after a musical concept:

┌─────────────────────────────────────────────────────────┐
│                     COMPOSITION                         │
│  User defines: goal, output type, constraints           │
└───────────────────────┬─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                     ARRANGEMENT                         │
│  Claude decomposes the goal into N Movements            │
│  "30-slide deck" → 30 independent work units            │
└───────────────────────┬─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│              SCORE → CONDUCT (per Movement)             │
│                                                         │
│  For each Movement:                                     │
│    1. Score Generator writes a detailed blueprint       │
│    2. Conductor executes it through the Instrument      │
│    3. Motif captures what was produced (continuity)     │
│    4. Checkpoint saves progress (resumable)              │
│                                                         │
│  Failed? → Retry with error history injected into Score │
└───────────────────────┬─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                      ASSEMBLY                           │
│                                                         │
│  For code-generating outputs (pptx, xlsx):              │
│    1. Extract code blocks from each Movement output     │
│    2. CodeDoctor: deterministic healing (7 fix types)   │
│    3. Run each in isolated subprocess                   │
│    4. On failure: LLM healing loop (send error + code   │
│       to Claude, get fixed code, retry)                 │
│    5. Merge per-movement files into one artifact        │
│                                                         │
│  For text outputs (md, html, docx):                     │
│    Concatenate with section dividers                    │
└───────────────────────┬─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                    VISUAL QA                            │
│                                                         │
│  SlideDoctor (pptx):                                    │
│    PPTX → PDF (LibreOffice) → JPEGs → Gemini Flash      │
│    Detects: overlapping text, cut-off elements,         │
│    broken layout, unreadable text                       │
│    Generates python-pptx fixes → applies in subprocess  │
│                                                         │
│  WebDoctor (html):                                      │
│    HTML → headless Chrome screenshot → Gemini Flash     │
│    Same detect → fix → apply pattern                    │
└───────────────────────┬─────────────────────────────────┘
                        ▼
┌─────────────────────────────────────────────────────────┐
│                      CRITIQUE                           │
│                                                         │
│  Claude evaluates the final artifact against the goal   │
│  If rejected: re-conducts flagged Movements with        │
│  critique feedback injected, then re-assembles          │
└─────────────────────────────────────────────────────────┘

Architecture Deep Dive

The Orchestrator (cli.py)

The MaestroOrchestrator class runs the full pipeline. It is a linear state machine, not a DAG executor. Movements execute sequentially (parallel execution via Agent Teams is defined in the schema but not yet implemented). The key design choice: each Movement's output is checkpointed to disk immediately, so a crash at Movement 22 of 30 doesn't lose Movements 1-21.

The Arranger (core/arranger.py)

Uses Claude to decompose a goal into Movements. The system prompt enforces critical sizing rules learned through failure:

  • pptx: exactly 1 slide per Movement (python-pptx code for 2+ slides overflows tokens)
  • xlsx: 1 tab/section per Movement
  • docx: 1 chapter per Movement

The Arranger outputs a JSON plan with movement titles, descriptions, output types, and dependency edges. The schema constrains output_type to an explicit set of valid literals, a guard added after Sonnet invented "pptx_slide" as a type name.

The Score Generator (core/score_generator.py)

Writes a detailed markdown blueprint for each Movement. This is the instruction set that the Conductor will follow. The Score includes:

  • What this Movement must produce
  • Continuity context from previous Movements (via Motifs)
  • Style and format requirements from the Composition's constraints

Motif system: After each Movement completes, Haiku generates a 2-3 sentence summary of what was produced. These Motifs are passed to subsequent Score Generators to maintain narrative continuity across Movements. Only the 5 most recent Motifs are included to prevent O(n^2) token growth.
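
The windowing logic is simple enough to sketch. The function name and output format here are assumptions for illustration, not the actual core/score_generator.py code; only the window size of 5 comes from the description above:

```python
MAX_RECENT_MOTIFS = 5  # the constant described above

def build_motif_context(motifs: list[str], window: int = MAX_RECENT_MOTIFS) -> str:
    """Join only the most recent Motif summaries into a prompt section.

    Passing every Motif to every Score would grow each prompt linearly,
    O(n^2) tokens over the whole run. A fixed window caps the growth.
    """
    recent = motifs[-window:]
    offset = len(motifs) - len(recent)
    lines = [f"- Movement {offset + i + 1}: {m}" for i, m in enumerate(recent)]
    return "## Continuity (recent Movements)\n" + "\n".join(lines)
```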

The Conductor (core/conductor.py)

Executes one Movement by sending the Score to Claude through the appropriate Instrument. Instruments are wrappers around Claude Code skills (SKILL.md files in ~/.claude/skills/). The Conductor appends instrument-specific guidance. For pptx, this is ~50 lines covering:

  • OUTPUT_PATH variable contract
  • The "no native charts" constraint (ChartData code is 5-10x more verbose than shape-based visuals)
  • Text fitting rules (font sizes relative to container widths)
  • Slide dimension boundaries (13.333 x 7.5 inches)

The Assembly System (core/assembly.py)

This is where the most battle-tested code lives. For pptx/xlsx outputs, assembly means:

  1. Extract code blocks from each Movement's raw LLM output
  2. Heal each block through the CodeDoctor (deterministic fixes)
  3. Execute each in an isolated subprocess
  4. Detect stray output files (scripts that save to hardcoded paths instead of OUTPUT_PATH)
  5. Retry failures through the LLM healing loop
  6. Merge successful per-movement files into one artifact

The merge for pptx uses python-pptx + lxml deep copy. It copies slide backgrounds, all shapes, and their XML elements from source presentations into a base presentation.

The critical lesson: never concatenate movement code into one script. Early versions did this. One truncated movement killed the entire script. Each movement now runs in its own subprocess with its own Python process.
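
That lesson can be sketched in a few lines (illustrative only; the real assembly code also handles stray output files, retries, and richer error capture):

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_movement_code(code: str, output_path: str, timeout: int = 120):
    """Execute one Movement's generated script in a fresh interpreter.

    A truncated or crashing Movement only fails its own subprocess;
    every other Movement's output file is unaffected.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "movement.py"
        # Prepend OUTPUT_PATH so the generated code knows where to save.
        script.write_text(f"OUTPUT_PATH = {output_path!r}\n" + code)
        result = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=timeout,
        )
    return result.returncode, result.stderr
```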

The SubprocessConductor (core/subprocess_conductor.py)

When --persist mode is enabled, each Movement is executed in a completely fresh Python subprocess. No shared memory, no accumulated context. This is the anti-compaction guarantee: Claude Code compresses context as conversations grow, which can degrade output quality. Running each Movement in isolation means Movement 30 gets the same output quality as Movement 1.

The subprocess reads a job spec from a temp JSON file, loads its own .env, creates its own Anthropic client, executes the Movement, writes output to disk, and exits.


The Musical Metaphor

Every concept in Maestro maps to a musical term:

Framework Concept Musical Analogy What It Actually Does
Composition A piece of music to be performed The user's goal definition (a Python file)
Arrangement How a piece is structured for an ensemble Claude's decomposition plan
Movement A self-contained section of a larger work One context-window-sized chunk of work
Score Written notation for performers Claude-generated blueprint for one Movement
Instrument A specific tool for producing sound The Claude skill that executes a Movement (pptx, xlsx, etc.)
Conductor Leads the orchestra through the performance Routes a Movement through its Instrument
Motif A recurring musical phrase that ties a piece together 2-3 sentence summary passed between Movements for continuity
Performance The final rendered piece The output artifact (the .pptx, .xlsx, .html file)
Critique Post-performance review Claude's quality evaluation with pass/fail

The metaphor is not decoration. It solves a real naming problem: the word "task" is overloaded in every AI framework. "Prompt" is ambiguous. "Step" implies sequentiality. "Movement" communicates exactly what it is: a self-contained, independently executable section that contributes to a larger whole.


The Doctor System

Maestro has a three-layer quality assurance system. Each "Doctor" operates at a different stage and uses a different strategy.

CodeDoctor (core/code_doctor.py) - Pre-Execution

The CodeDoctor applies deterministic fixes to generated code before it runs. This is free and instant. For pptx, it applies 7 specific fixes:

  1. OUTPUT_PATH injection - Always prepends OUTPUT_PATH = "/path/to/output.pptx" at the top. Earlier versions had a bug where this was conditional on the string OUTPUT_PATH not appearing in the code. If the code referenced prs.save(OUTPUT_PATH) without defining it, the injection was skipped. This single bug caused 15/16 failures in the 30-slide Future of AI deck.

  2. Fresh Presentation enforcement - Regex-replaces Presentation(some_file) with Presentation(). Generated code sometimes tries to load an existing file that doesn't exist.

  3. INPUT_PATH stripping - Removes os.environ.get("INPUT_PATH") patterns that have no external file to load.

  4. Missing import injection - Scans code for usage patterns (Inches(, RGBColor(, MSO_SHAPE) and injects the corresponding import if missing. 10 rules for pptx, 5 for xlsx.

  5. Chart code stripping - Comments out any line containing ChartData, add_chart, or XL_CHART_TYPE. Chart code is too verbose for the token budget.

  6. Save call injection - If the code uses prs but never calls prs.save(), appends it.

  7. Hardcoded path replacement - Regex-replaces prs.save("my_deck.pptx") with prs.save(OUTPUT_PATH).

After deterministic fixes, the CodeDoctor runs a Python compile() check and attempts to fix common syntax errors (unclosed triple-quotes, unclosed parentheses).
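
A sketch of this layer, showing two of the seven fixes plus the compile gate. The regex patterns are illustrative, not the framework's actual rules:

```python
import re

def apply_deterministic_fixes(code: str, output_path: str) -> str:
    """Sketch of a few pre-execution fixes (illustrative patterns only)."""
    # Fix 1: unconditionally prepend OUTPUT_PATH. The conditional version
    # caused the 15/16-failure bug described above.
    code = f"OUTPUT_PATH = {output_path!r}\n" + code
    # Fix 2: force a fresh Presentation() instead of loading a missing file.
    code = re.sub(r"Presentation\([^)]+\)", "Presentation()", code)
    # Fix 7: redirect hardcoded save paths to OUTPUT_PATH.
    code = re.sub(r"prs\.save\((['\"])[^'\"]+\1\)", "prs.save(OUTPUT_PATH)", code)
    return code

def compiles(code: str) -> bool:
    """Cheap syntax gate before spending a subprocess (or LLM tokens) on it."""
    try:
        compile(code, "<movement>", "exec")
        return True
    except SyntaxError:
        return False
```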

Layer 2: LLM Healing - If execution fails after deterministic fixes, the CodeDoctor sends the code + error message to Claude (using the healer_model, default Haiku for speed/cost). Claude returns fixed code. The deterministic fixes are re-applied to the LLM output (belt and suspenders). This loops up to assembly_retries times.

SlideDoctor (core/slide_doctor.py) - Post-Assembly Visual QA for PPTX

After assembly produces a .pptx file, the SlideDoctor runs a visual inspection pipeline:

  1. PPTX to PDF via LibreOffice headless (soffice --headless --convert-to pdf)
  2. PDF to JPEGs via pdftoppm -jpeg -r 150
  3. Each JPEG to Gemini Flash with a structured analysis prompt looking for: overlapping text, misalignment, cut-off elements, broken layout, empty space, unreadable text, missing elements
  4. Fix generation via Claude: issues are described and Claude generates a python-pptx fix script
  5. Fix application in a subprocess with the PPTX_PATH passed as an environment variable

The SlideDoctor gracefully degrades. If LibreOffice, pdftoppm, or the Gemini API key is missing, it skips visual QA with a log message. No crash, no error.
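
The first two steps of that pipeline reduce to building two external commands. This sketch uses the flags documented above; the output paths and function name are illustrative:

```python
from pathlib import Path

def slide_qa_commands(pptx: str, workdir: str) -> list[list[str]]:
    """Build the render commands for steps 1 and 2 of the visual QA pipeline."""
    pdf = Path(workdir) / (Path(pptx).stem + ".pdf")
    return [
        # Step 1: PPTX -> PDF via headless LibreOffice
        ["soffice", "--headless", "--convert-to", "pdf",
         "--outdir", workdir, pptx],
        # Step 2: PDF -> one JPEG per page at 150 dpi
        ["pdftoppm", "-jpeg", "-r", "150", str(pdf),
         str(Path(workdir) / "slide")],
    ]
```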

WebDoctor (core/web_doctor.py) - Post-Assembly Visual QA for HTML

Same pattern as SlideDoctor, but for web outputs:

  1. Render in headless Chrome (--headless=new --screenshot --window-size=1920,1080)
  2. Screenshot to Gemini Flash looking for: broken layout, overflow, invisible text, missing content, alignment issues, contrast problems
  3. Fix generation via Claude: generates a Python script that modifies the HTML
  4. Fix application in subprocess with HTML_PATH passed as environment variable

Chrome is discovered automatically on macOS and Linux (/Applications/Google Chrome.app/..., which google-chrome, etc.).


Patterns Adapted from the Claude Code Harness

Several design patterns in Maestro were adapted from studying how the Claude Code harness (the CLI tool itself) approaches complex, multi-step work.

Isolated Subprocess Execution

Claude Code runs generated code in isolated subprocesses rather than in the main process. Maestro adopted this pattern for assembly: each Movement's python-pptx code runs in its own subprocess.Popen with a fresh Python interpreter. This means:

  • A crash in one Movement's code cannot affect other Movements
  • Each Movement gets clean namespace (no variable collisions)
  • Memory is fully reclaimed between Movements
  • Timeout enforcement is per-Movement (120s), not global

The SubprocessConductor extends this further: when --persist mode is active, even the LLM call itself runs in a subprocess, guaranteeing zero state accumulation across Movements.

The Director Pattern (Retry with Error Context)

When a Movement fails validation, Maestro doesn't just retry blindly. It injects the full error history into the Score:

## RETRY - Previous Attempts Failed
Attempt 1: NameError: name 'OUTPUT_PATH' is not defined
Attempt 2: SyntaxError: unexpected EOF while parsing

Fix ALL issues and produce correct, complete output.

This mirrors how Claude Code feeds tool execution errors back into the conversation, allowing the model to learn from each failure within the same task.
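
A minimal sketch of how such a retry Score could be assembled (the function name is assumed; the header format matches the example above):

```python
def build_retry_score(base_score: str, errors: list[str]) -> str:
    """Append the error history of previous attempts to the Score,
    so the next attempt can avoid repeating the same failures."""
    if not errors:
        return base_score
    lines = ["## RETRY - Previous Attempts Failed"]
    lines += [f"Attempt {i}: {e}" for i, e in enumerate(errors, start=1)]
    lines.append("\nFix ALL issues and produce correct, complete output.")
    return base_score + "\n\n" + "\n".join(lines)
```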

Streaming for Long-Running Operations

The Anthropic SDK rejects non-streaming requests that would exceed ~10 minutes (calculated from max_tokens). Maestro's create_message() wrapper automatically switches to streaming mode for any request with max_tokens > 8192:

if max_tokens > 8192:
    with client.messages.stream(**kwargs) as stream:
        return stream.get_final_message()

This is transparent to callers. They get back a standard Message object regardless of whether streaming was used. This pattern was discovered the hard way when bumping conductor tokens to 32K caused immediate SDK timeouts.
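
Filled out, the wrapper might look like this. A duck-typed sketch: the threshold and the streaming fallback follow the description above, but the actual utils/client.py may differ:

```python
STREAMING_THRESHOLD = 8192  # above this, non-streaming requests can hit the ~10-minute SDK limit

def create_message(client, **kwargs):
    """Call messages.create, silently switching to streaming for large max_tokens.

    Callers get a normal Message object either way.
    """
    if kwargs.get("max_tokens", 0) > STREAMING_THRESHOLD:
        # stream() is a context manager; get_final_message() blocks until
        # the stream ends and returns the fully assembled message.
        with client.messages.stream(**kwargs) as stream:
            return stream.get_final_message()
    return client.messages.create(**kwargs)
```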

Deterministic Pre-Processing Before LLM Calls

The CodeDoctor's Layer 1 (deterministic fixes) runs before every execution, not just on failure. This is the same philosophy as Claude Code's tool input validation: fix what you can cheaply before spending tokens on LLM calls. The 7 regex-based pptx fixes are essentially free and prevent the majority of runtime errors.

Checkpoint-Based Resume

Every successful Movement is checkpointed to disk as a JSON file. If the process crashes, --resume picks up from the last checkpoint. This mirrors how Claude Code preserves conversation state across sessions, allowing you to continue a complex task without re-doing completed work.
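
The checkpoint mechanics can be sketched with a plain JSON file. A minimal illustration only; the real checkpoint state also covers Motifs and the Arrangement:

```python
import json
from pathlib import Path

def save_checkpoint(run_dir: str, completed: dict[int, str]) -> None:
    """Persist which Movements finished and where their outputs live."""
    Path(run_dir, "checkpoint.json").write_text(json.dumps(completed))

def load_checkpoint(run_dir: str) -> dict[int, str]:
    """On --resume, return previously completed Movements (empty if none)."""
    path = Path(run_dir) / "checkpoint.json"
    if not path.exists():
        return {}
    # JSON object keys are strings; restore the integer Movement indices.
    return {int(k): v for k, v in json.loads(path.read_text()).items()}
```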

Skill System (Instrument Abstraction)

Instruments map output types to Claude Code skill files (~/.claude/skills/pptx/SKILL.md). The skill content is loaded and injected into the system prompt. This mirrors how Claude Code loads skill definitions at runtime. If no skill file exists for a given output type (like code or markdown), the Instrument falls back to Claude's native capabilities.

Structured Output Validation Gate

Before proceeding past the Conductor phase, every Movement output is validated:

  • Not empty
  • Not too short (< 50 chars)
  • Contains a fenced ```python code block (for pptx/xlsx outputs)
  • Code block is substantial (> 5 lines)

This gate catches cases where the model returns a brief apology instead of actual code. Failed validation triggers a retry with error context injection.
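
A sketch of that gate, with the thresholds taken from the list above (names and return shape are assumptions):

```python
import re

def validate_movement_output(text: str, needs_code: bool) -> list[str]:
    """Return a list of validation failures; an empty list means pass."""
    problems = []
    if not text.strip():
        problems.append("empty output")
    elif len(text) < 50:
        problems.append("output too short")
    if needs_code:
        blocks = re.findall(r"```python\n(.*?)```", text, re.DOTALL)
        if not blocks:
            # Catches a brief apology returned instead of actual code.
            problems.append("no python code block")
        elif max(len(b.splitlines()) for b in blocks) <= 5:
            problems.append("code block too small")
    return problems
```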


What Works Really Well

1. The 1-Slide-per-Movement Strategy

This is the single most important architectural decision. By setting max_movements equal to the target slide count, each Movement generates exactly one slide's worth of python-pptx code. This keeps each code block well within token limits (200-400 lines vs. 2000+ for a full deck). Movement success rate went from ~50% to ~90% after adopting this strategy.

2. Shape-Based Visuals Instead of Native Charts

python-pptx's ChartData / add_chart API produces 500+ lines per chart. A simple bar chart built from colored rectangles and textboxes takes 30-50 lines. The visual quality is comparable (and sometimes better, because you get pixel-perfect control). This constraint is auto-injected for all pptx compositions.
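
The geometry behind a rectangle-based bar chart is just arithmetic. Here is a sketch of the layout math in pure Python; the function is hypothetical (not part of Maestro), and each returned tuple is what you would feed to python-pptx's slide.shapes.add_shape with Inches():

```python
def bar_layout(values, chart_left, chart_top, chart_w, chart_h, gap=0.2):
    """Compute (left, top, width, height) in inches for a bar chart drawn
    as plain rectangles, scaled so the tallest bar fills the chart area.

    Each tuple maps to:
        slide.shapes.add_shape(MSO_SHAPE.RECTANGLE,
            Inches(left), Inches(top), Inches(width), Inches(height))
    """
    n = len(values)
    bar_w = (chart_w - gap * (n - 1)) / n
    peak = max(values)
    bars = []
    for i, v in enumerate(values):
        h = chart_h * (v / peak)
        left = chart_left + i * (bar_w + gap)
        top = chart_top + (chart_h - h)  # bars grow up from a shared baseline
        bars.append((round(left, 3), round(top, 3), round(bar_w, 3), round(h, 3)))
    return bars
```

Roughly 30-50 lines of this kind of code replaces a 500-line native chart, which is the whole point of the constraint.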

3. Motif Continuity Threading

The Motif system keeps Movements narratively coherent without sharing full context. Movement 15's Score includes a summary of what Movements 10-14 produced. This is enough for Claude to maintain consistent design language, color schemes, and narrative flow without consuming tokens on the full output of every previous Movement.

4. Graceful Partial Assembly

If 3 out of 30 Movements fail, Maestro still produces a 27-slide deck. The Assembly phase reports which Movements are missing and continues. This is a critical property for large compositions: getting 90% of a 30-slide deck is vastly more useful than getting 0% because one slide crashed the pipeline.

5. The Two-Layer Healing Pipeline

Deterministic fixes (free, instant) catch ~80% of issues. LLM healing (costs a few cents) catches most of the rest. The combination means that even when the Conductor produces imperfect code, the Assembly phase usually recovers without user intervention.

6. Visual QA Catches What Code Validation Cannot

A python-pptx script can run perfectly and produce a slide where the title overlaps the subtitle, text is too small to read, or elements are positioned outside the visible area. The SlideDoctor renders the actual slide as an image and uses Gemini Flash to catch these visual issues. This catches a class of errors that no amount of code analysis can detect.

7. Resume from Checkpoint

A 30-slide deck takes ~45 minutes and ~75 API calls. If it crashes at slide 25, --resume picks up from slide 25. Movement outputs, Motifs, and the Arrangement are all persisted. This makes large compositions practical for production use.


Known Flaws and Limitations

1. Sequential Execution Only

Movements execute one at a time. The parallel field exists in the Composition schema but is not implemented. For a 30-slide deck, this means 30 sequential Claude calls. Parallel execution would cut wall-clock time by 5-10x for independent Movements.

2. The Critic Is Partially Blind to Binary Outputs

For pptx/xlsx files, the Critic extracts text content (slide titles, cell values) but cannot see the visual layout. It can tell you "only 14 slides instead of 30" but not "the chart on slide 5 has overlapping labels." The SlideDoctor partially covers this gap, but the Critic's evaluation of binary formats is fundamentally limited.

3. Chart-Heavy Slides Are Still Fragile

The "no native charts" constraint works well for most business slides, but some compositions genuinely need complex charts (scatter plots, multi-series line charts). Building these from rectangles and textboxes is possible but awkward. There is no good solution for chart-heavy compositions within the current token budget.

4. Assembly Merge Loses Slide Masters

The pptx merge copies shapes and backgrounds but uses slide_layouts[6] (blank) for every slide. Custom slide masters, layouts, and themes from individual Movements are not preserved. Every slide in the final deck uses the same blank layout. This means footer templates, page numbers, and master-level branding do not carry over.

5. No Real Dependency Graph Execution

The depends_on field exists in Movement definitions, but the executor ignores it. Movements always run in index order. For compositions where Movement 5 genuinely needs the output of Movement 3 (like a summary slide that references earlier content), the Motif system provides approximate continuity, but not exact data.

6. Visual QA Requires External Dependencies

SlideDoctor needs LibreOffice and pdftoppm. WebDoctor needs Chrome. Both need a Gemini API key. All three gracefully degrade (skip QA instead of crashing), but the default experience on a machine without these tools has no visual quality assurance.

7. The Conductor's OUTPUT_PATH Contract Is Fragile

The Conductor's system prompt tells Claude that OUTPUT_PATH will be available, but Claude sometimes generates code that defines its own output path, saves to a hardcoded filename, or references OUTPUT_PATH without defining it. The CodeDoctor patches most of these cases, but the underlying issue is that the contract between "what the Conductor promises" and "what the Assembler provides" is enforced through regex heuristics, not a formal interface.

8. Token Budget Estimation Is Absent

There is no pre-flight check for whether a Composition's goal is feasible within the token/cost budget. A 100-slide deck will happily start executing, spending API credits on each Movement, even if the total cost is $50+. Users must estimate costs manually.

9. Motif Window Is Fixed at 5

The MAX_RECENT_MOTIFS = 5 constant means Movement 20 only sees summaries from Movements 15-19. For compositions where early Movements establish design foundations (color palettes, visual language), this context is lost by mid-composition. The Motif window should ideally be adaptive or include "pinned" Motifs for foundational decisions.

10. Single Model per Role

Each role (Arranger, Scorer, Conductor, Critic) uses one model for the entire run. You cannot say "use Opus for the first 3 slides and Sonnet for the rest." The per-role model overrides (arranger_model, scorer_model, critic_model) help, but per-Movement model selection is not supported.


Quick Start

Prerequisites

  • Python 3.12+
  • uv (recommended) or pip
  • An Anthropic API key (ANTHROPIC_API_KEY)
  • (Optional) Gemini API key (GEMINI_API_KEY) for visual QA
  • (Optional) LibreOffice + pdftoppm for slide visual QA
  • (Optional) Chrome/Chromium for web visual QA

Install

git clone https://github.com/earlyaidopters/maestro.git
cd maestro
uv sync

Set up your API key

Create a .env file in the project root:

ANTHROPIC_API_KEY=sk-ant-...
GEMINI_API_KEY=...    # optional, for visual QA

Run an example

# Preview how Maestro would arrange a composition (no API cost beyond arrangement)
uv run maestro arrange maestro/compositions/examples/investor_deck.py

# Full run
uv run maestro run maestro/compositions/examples/investor_deck.py

# List past runs
uv run maestro runs

# Browse templates
uv run maestro templates

Writing a Composition

A Composition is a Python file that defines a composition variable:

from maestro.compositions.composition import Composition

composition = Composition(
    title="Q1 Board Deck",
    goal=(
        "Create a 12-slide board deck for Q1 2026 results. "
        "Cover revenue ($4.2M, +34% YoY), key wins, churn analysis, and Q2 outlook."
    ),
    output_type="pptx",
    constraints=[
        "Executive audience, no jargon",
        "Dark professional theme: navy/white/gold",
        "Max 4 bullet points per slide",
    ],
    max_movements=12,         # 1 slide per movement
    model_tier="balanced",    # fast | balanced | powerful
    critique_loops=1,
)

Then run it:

uv run maestro run path/to/my_deck.py

Composition Fields

Field Type Default Description
title str required Human-readable name
goal str required Natural language description of what to create
output_type str required pptx, xlsx, docx, html, code, markdown
constraints list[str] [] Style, audience, format rules
context_files list[str] [] File paths to include as reference material
max_movements int auto Force movement count (set = slide count for pptx)
model_tier str "balanced" fast (Haiku), balanced (Sonnet), powerful (Opus)
arranger_model str "balanced" Override model for the Arranger
scorer_model str "balanced" Override model for Score generation
critic_model str "balanced" Override model for Critique
healer_model str "fast" Override model for CodeDoctor LLM healing
critique_loops int 1 How many critique-fix cycles to allow
movement_retries int 2 Retry count for failed Movements
visual_qa bool True Enable visual QA (SlideDoctor/WebDoctor)
visual_qa_loops int 1 Visual QA cycles
assembly_retries int 2 LLM healing attempts per failed assembly

Model Tiers

Tier Model Best For
fast Claude Haiku 4.5 Quick drafts, simple documents
balanced Claude Sonnet 4.6 Most work (default)
powerful Claude Opus 4.6 High-stakes presentations
balanced-1m Claude Sonnet 4.6 (1M context) Very large compositions
powerful-1m Claude Opus 4.6 (1M context) Maximum quality + context

The Arranger always gets the arranger_model (defaults to balanced). The Score Generator always gets the scorer_model. Only the Conductor (actual execution) varies by model_tier.


Templates

Maestro ships with 10 composition templates covering common business artifacts:

Template Category Output
board_deck Business pptx
investor_pitch Business pptx
competitive_analysis Strategy pptx
technical_spec Engineering markdown
saas_financial_model Finance xlsx
market_entry Strategy pptx
code_review Engineering markdown
due_diligence_report Finance docx
onboarding_guide Operations docx
sales_proposal Sales pptx

# Browse templates
uv run maestro templates

# Instantiate a template (interactive variable collection)
uv run maestro use board_deck --output my_deck.py

# Run the generated composition
uv run maestro run my_deck.py

Templates define TEMPLATE_META with variable placeholders. maestro use prompts you to fill in the blanks (company name, quarter, metrics) and generates a runnable Composition file.


CLI Reference

maestro run <composition.py>        # Full pipeline execution
maestro run <comp.py> --dry-run     # Preview arrangement only
maestro run <comp.py> --persist     # Isolated subprocess per Movement
maestro run <comp.py> --resume      # Resume from last checkpoint

maestro arrange <composition.py>    # Preview arrangement (alias for --dry-run)
maestro runs                        # List past runs
maestro templates                   # Browse template library
maestro use <template> [-o file]    # Instantiate a template

Configuration

Output Directory Structure

maestro/
├── scores/                        # Run artifacts (auto-created, gitignored)
│   └── <run_name>/
│       ├── arrangement.json       # The Arranger's decomposition plan
│       ├── checkpoint.json        # Resume state
│       ├── motifs.json            # Continuity summaries
│       ├── movement_01_score.md   # Score for Movement 1
│       ├── movement_01_output.txt # Raw Conductor output for Movement 1
│       └── ...
├── performances/                  # Final output files (auto-created, gitignored)
│   └── my_deck_20260219.pptx
└── ...

Environment Variables

Variable Required Description
ANTHROPIC_API_KEY Yes Your Anthropic API key
GEMINI_API_KEY No Google Gemini API key for visual QA

Session Logs

The session_logs/ directory contains detailed post-mortems from real production runs. These document every failure mode encountered, the root cause analysis, and the fix applied.

These logs are the honest engineering record of building this framework. They document the 7 failed runs before the first successful 10-slide deck. They show the OUTPUT_PATH injection bug that killed 15 of 30 slides. They track every regex fix, every streaming workaround, every lesson learned.

If you are adapting Maestro for your own use, read these logs. They will save you from re-discovering the same failure modes.


Project Structure

maestro/
├── maestro/
│   ├── cli.py                    # CLI + MaestroOrchestrator (full pipeline)
│   ├── compositions/
│   │   ├── composition.py        # Pydantic models (Composition, Movement, Score, etc.)
│   │   ├── template_library.py   # Template discovery and instantiation
│   │   ├── templates/            # 10 built-in composition templates
│   │   └── examples/             # Ready-to-run example compositions
│   ├── core/
│   │   ├── arranger.py           # Decomposes goals into Movements
│   │   ├── score_generator.py    # Writes detailed blueprint per Movement
│   │   ├── conductor.py          # Executes Movement through Instrument
│   │   ├── subprocess_conductor.py # Isolated subprocess execution
│   │   ├── assembly.py           # Combines outputs into final artifact
│   │   ├── code_doctor.py        # Two-layer code healing pipeline
│   │   ├── slide_doctor.py       # Visual QA for PowerPoint
│   │   ├── web_doctor.py         # Visual QA for HTML
│   │   └── critic.py             # Quality evaluation with feedback loop
│   ├── instruments/
│   │   ├── instrument.py         # Loads Claude Code skill files
│   │   └── registry.py           # Maps output types to Instruments
│   ├── utils/
│   │   ├── client.py             # Anthropic client factory (streaming auto-switch)
│   │   ├── gemini_client.py      # Gemini client factory (graceful degradation)
│   │   ├── models.py             # Model IDs, token limits, tier configuration
│   │   ├── logger.py             # Rich-based musical vocabulary logger
│   │   └── file_utils.py         # Score/performance file management
│   └── agents/
│       └── movement_worker.py    # Standalone subprocess for --persist mode
├── scores/                       # Run artifacts (gitignored)
├── performances/                 # Output files (gitignored)
├── session_logs/                 # Engineering post-mortems from real runs
├── pyproject.toml
└── .gitignore

License

MIT


Built by the Early AI Adopters community.
