Merged
**.github/workflows/ci.yml** (31 additions, 0 deletions)
```yaml
name: CI

on:
  push:
    branches: ["master"]
  pull_request:
    branches: ["master"]
```
**Comment on lines +5 to +7 (Copilot AI, Apr 15, 2026)**

The workflow is only triggered for the master branch. The repo/diff context suggests the default branch may be main, in which case CI won't run for pushes/PRs to the default branch and the README CI badge may show no results. Consider triggering on both main and master, or updating the branch filter to match the repo's actual default branch.

Suggested change:

```diff
-    branches: ["master"]
-  pull_request:
-    branches: ["master"]
+    branches: ["main", "master"]
+  pull_request:
+    branches: ["main", "master"]
```

```yaml
permissions:
  contents: read

jobs:
  eval:
    name: Eval Suite
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"
          cache-dependency-path: evals/package-lock.json

      - name: Install eval dependencies
        run: cd evals && npm ci

      - name: Run evals
        run: cd evals && npm test
```
**CHANGELOG.md** (68 additions, 0 deletions)
# Changelog

All notable changes to ControlFlow are documented here.

The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

---

## [1.0.0] — 2026-04-15

### Added

**Agent system (13 agents)**

- `Orchestrator` — conductor, gate controller, wave-based parallel dispatch, failure routing
- `Planner` — structured planning with idea interview, phased plans, Mermaid diagrams, semantic risk discovery across 7 non-functional risk categories
- `PlanAuditor` — adversarial plan audit, architecture and risk review
- `AssumptionVerifier` — assumption-fact confusion detection, mirage elimination
- `ExecutabilityVerifier` — cold-start plan executability simulation
- `CoreImplementer` — backend implementation with TDD enforcement
- `UIImplementer` — frontend implementation
- `PlatformEngineer` — CI/CD, containers, infrastructure, rollback contracts
- `CodeReviewer` — code review, safety gates, verdict contracts
- `Researcher` — evidence-first research with confidence scores and citations
- `CodeMapper` — read-only codebase discovery
- `TechnicalWriter` — documentation, diagrams, code-doc parity enforcement
- `BrowserTester` — E2E browser testing with health-first verification and accessibility audits

**Architecture**

- P.A.R.T contract architecture (Prompt → Archive → Resources → Tools) enforced across all agents
- Structured text outputs replacing raw JSON to conserve context tokens in delegation chains
- Wave-based parallel execution — Orchestrator dispatches independent phases in parallel
- Adversarial review pipeline — up to three independent reviewers before implementation (depth scales with complexity tier: TRIVIAL / SMALL / MEDIUM / LARGE)
- Failure taxonomy (`transient` / `fixable` / `needs_replan` / `escalate`) with deterministic retry and escalation routing
- Least-privilege tool grants — each agent's `tools:` frontmatter trimmed to minimum required by role
- Semantic risk discovery — 7 non-functional risk categories evaluated before research delegation
- Batch approval per execution wave, per-phase approval for destructive operations
- `NEEDS_INPUT` clarification routing from subagents through Orchestrator to user via `askQuestions`

**Governance and contracts**

- JSON Schema contracts for all agent outputs in `schemas/`
- Governance policies in `docs/agent-engineering/`: PART-SPEC, RELIABILITY-GATES, CLARIFICATION-POLICY, TOOL-ROUTING, SCORING-SPEC, MIGRATION-CORE-FIRST, PROMPT-BEHAVIOR-CONTRACT
- Canonical tool grants in `governance/agent-grants.json`
- Agent roster and complexity tier definitions in `plans/project-context.md`

**Skill library**

- 7 domain-specific skill patterns: Testing, Error Handling, Security, Performance, Completeness, Integration, Idea-to-Prompt
- Skill index at `skills/index.md`

**Eval suite (302 checks)**

- Pass 1: Schema validity (Ajv strict mode, JSON Schema 2020-12)
- Pass 2–3: Scenario integrity and cross-scenario structural regression (179 structural checks)
- Pass 4: P.A.R.T section order enforcement
- Pass 4b: Clarification trigger and tool routing section validation
- Pass 5: Skill library registration integrity
- Pass 6: Synthetic rename negative-path checks
- Pass 7: Prompt behavior contract behavioral regression (74 checks across 9 agents)
- Pass 8: Orchestration handoff contract regression (49 checks)
**Comment on lines +53 to +62 (Copilot AI, Apr 15, 2026)**

The changelog claims the eval suite has "302 checks" with a specific breakdown, but evals/README.md currently documents a different total. Consider either aligning these numbers with the authoritative source or avoiding fixed counts in the changelog to prevent staleness.

Suggested change:

```diff
-**Eval suite (302 checks)**
-
-- Pass 1: Schema validity (Ajv strict mode, JSON Schema 2020-12)
-- Pass 2–3: Scenario integrity and cross-scenario structural regression (179 structural checks)
-- Pass 4: P.A.R.T section order enforcement
-- Pass 4b: Clarification trigger and tool routing section validation
-- Pass 5: Skill library registration integrity
-- Pass 6: Synthetic rename negative-path checks
-- Pass 7: Prompt behavior contract behavioral regression (74 checks across 9 agents)
-- Pass 8: Orchestration handoff contract regression (49 checks)
+**Eval suite**
+
+- Pass 1: Schema validity (Ajv strict mode, JSON Schema 2020-12)
+- Pass 2–3: Scenario integrity and cross-scenario structural regression
+- Pass 4: P.A.R.T section order enforcement
+- Pass 4b: Clarification trigger and tool routing section validation
+- Pass 5: Skill library registration integrity
+- Pass 6: Synthetic rename negative-path checks
+- Pass 7: Prompt behavior contract behavioral regression across agent prompts
+- Pass 8: Orchestration handoff contract regression
```
- F7/F8: Complexity tier and reference integrity enforcement
- Warm cache for fast repeated structural runs

**CI**

- GitHub Actions workflow running the full eval suite on every push and pull request to `master`
**CONTRIBUTING.md** (105 additions, 0 deletions)
# Contributing to ControlFlow

Thank you for your interest in contributing! This guide covers the key contribution paths.

## Table of Contents

- [Running the eval suite](#running-the-eval-suite)
- [Adding a new agent](#adding-a-new-agent)
- [Editing an existing agent](#editing-an-existing-agent)
- [Adding skills](#adding-skills)
- [Proposing changes](#proposing-changes)
- [Code of conduct](#code-of-conduct)

---

## Running the eval suite

The eval suite validates schema compliance, P.A.R.T contract structure, tool grant consistency, behavioral invariants, and orchestration handoff discipline across all 13 agents — without invoking live agents.

```bash
cd evals
npm install
npm test
```

**Copilot AI, Apr 15, 2026**

The contributor instructions use npm install, while CI uses npm ci. Using npm ci locally (with the committed lockfile) better matches CI's deterministic dependency resolution and reduces "works locally but not in CI" issues.

Suggested change:

```diff
-npm install
+npm ci
```

All 302 checks must pass before any PR can be merged. The suite runs fully offline.
**Copilot AI, Apr 15, 2026**

This section states "All 302 checks must pass…", but evals/README.md currently documents a different total (283). To avoid documentation drift, consider removing the hardcoded number (e.g., "All eval checks must pass") or updating both files to match the authoritative count.

Suggested change:

```diff
-All 302 checks must pass before any PR can be merged. The suite runs fully offline.
+All eval checks must pass before any PR can be merged. The suite runs fully offline.
```

For a faster structural-only pass:

```bash
npm run test:structural
```

For behavioral and orchestration regressions only:

```bash
npm run test:behavior
```

---

## Adding a new agent

1. **Create the agent file** at repo root: `<Name>.agent.md` or `<Name>-subagent.agent.md`.

2. **Follow P.A.R.T structure** — every agent file must have exactly these top-level sections in order:
- `## Prompt` — mission, scope, deterministic output contracts, Non-Negotiable Rules
- `## Archive` — memory policies, context compaction rules
- `## Resources` — file references loaded on-demand
- `## Tools` — allowed/disallowed tools with routing rules

See `docs/agent-engineering/PART-SPEC.md` for the full specification.

3. **Create a JSON Schema contract** in `schemas/<name>-output.schema.json`. Schema files serve as documentation contracts and eval references.

4. **Add eval scenarios** in `evals/scenarios/` that cover:
- At least one happy-path execution
- `ABSTAIN` / `NEEDS_INPUT` / failure classification behavior
- Tool routing compliance if the agent uses external tools

5. **Register the agent in governance files**:
- Add it to `governance/agent-grants.json` with its canonical tool grants.
- Add it to `plans/project-context.md` (agent roster table).

6. **Update `README.md`**:
- Add a row to the appropriate agent table (Primary Agents or Specialized Subagents).
- Update the agent count badge if you bump past 13.

7. **Run the full eval suite** and fix any failures before opening a PR.
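The section-order rule in step 2 (the one eval Pass 4 enforces) can be pictured as a small standalone check. A minimal sketch: the four section names come from this guide, while the helper functions and the skeleton string below are illustrative assumptions, not the eval suite's actual code.

```javascript
// Minimal sketch of the P.A.R.T section-order rule that Pass 4 enforces.
// The four section names come from the contributing guide; the helpers
// and the skeleton string are illustrative, not the real eval code.
const PART_ORDER = ["Prompt", "Archive", "Resources", "Tools"];

function topLevelSections(markdown) {
  // Collect "## Heading" lines in document order.
  return markdown
    .split("\n")
    .filter((line) => line.startsWith("## "))
    .map((line) => line.slice(3).trim());
}

function isValidPart(markdown) {
  const sections = topLevelSections(markdown);
  return (
    sections.length === PART_ORDER.length &&
    sections.every((name, i) => name === PART_ORDER[i])
  );
}

// A hypothetical skeleton agent file in the expected shape.
const skeleton = [
  "## Prompt",
  "Mission, output contracts, Non-Negotiable Rules.",
  "## Archive",
  "## Resources",
  "## Tools",
].join("\n");

console.log(isValidPart(skeleton)); // true
console.log(isValidPart("## Tools\n## Prompt")); // false
```

Checking exact ordered equality (rather than mere presence) is what makes the rule deterministic: a file with the right sections in the wrong order still fails.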

---

## Editing an existing agent

1. Read the current agent file carefully. Understand the Non-Negotiable Rules, clarification contract, and tool routing section before making changes.
2. Run `cd evals && npm test` **before and after** your edit to confirm no regressions.
3. If you change output contracts (status values, required fields), update the corresponding schema in `schemas/` and any eval scenarios that assert those fields.
4. If you change tool grants in frontmatter, update `governance/agent-grants.json` to match — the eval suite enforces consistency between the two.
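The frontmatter/grants consistency in point 4 boils down to a two-way set comparison. A minimal sketch, assuming both the `tools:` frontmatter and the `agent-grants.json` entry reduce to plain string arrays (the real file formats may differ):

```javascript
// Illustrative two-way comparison between an agent's `tools:` frontmatter
// and its entry in governance/agent-grants.json. Both inputs are assumed
// to be plain string arrays; this is not the eval suite's real parser.
function grantMismatches(frontmatterTools, canonicalGrants) {
  const declared = new Set(frontmatterTools);
  const granted = new Set(canonicalGrants);
  return {
    // Declared in frontmatter but missing from agent-grants.json.
    notInGrantsFile: [...declared].filter((t) => !granted.has(t)),
    // Granted canonically but absent from the frontmatter.
    notInFrontmatter: [...granted].filter((t) => !declared.has(t)),
  };
}

// Hypothetical tool names: a stray "edit" grant in the frontmatter of a
// read-only agent should surface as a mismatch to reconcile.
const result = grantMismatches(["read", "search", "edit"], ["read", "search"]);
console.log(result.notInGrantsFile); // [ 'edit' ]
console.log(result.notInFrontmatter); // []
```

Either non-empty list means the two sources have drifted, which is exactly the condition the eval suite flags.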

---

## Adding skills

Skills are reusable domain pattern snippets that Planner selects per phase and implementation agents load at execution time. They live in `skills/patterns/*.md`.

1. Create `skills/patterns/<topic>.md` following the style of existing patterns.
2. Register the new file in `skills/index.md`.
3. Run `npm test` — Pass 5 validates that every `skills/patterns/` file is registered in the index and every index entry resolves to a real file.
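The Pass 5 invariant is a two-way containment check between the pattern files on disk and the index entries. A sketch under the assumption that `skills/index.md` registers patterns as markdown links; the actual index format may differ, and the real suite reads the repository directly rather than in-memory strings.

```javascript
// Sketch of the Pass 5 invariant: every skills/patterns/ file is
// registered in skills/index.md, and every index entry resolves to a
// real file. The link-based index format assumed here is illustrative.
function extractIndexedPaths(indexMarkdown) {
  // Pull link targets like (skills/patterns/testing.md) out of the index.
  const matches = indexMarkdown.match(/\(skills\/patterns\/[^)]+\.md\)/g) || [];
  return matches.map((m) => m.slice(1, -1));
}

function indexIntegrity(patternFiles, indexMarkdown) {
  const indexed = new Set(extractIndexedPaths(indexMarkdown));
  const files = new Set(patternFiles);
  return {
    // On disk but never registered in the index.
    unregistered: patternFiles.filter((f) => !indexed.has(f)),
    // Registered in the index but pointing at no real file.
    dangling: [...indexed].filter((e) => !files.has(e)),
  };
}

// Hypothetical example: security.md exists on disk but is not indexed.
const indexMd = "- [Testing](skills/patterns/testing.md)";
const onDisk = ["skills/patterns/testing.md", "skills/patterns/security.md"];
console.log(indexIntegrity(onDisk, indexMd).unregistered); // [ 'skills/patterns/security.md' ]
console.log(indexIntegrity(onDisk, indexMd).dangling); // []
```

Both directions matter: an unregistered pattern is invisible to Planner, and a dangling entry breaks load-at-execution-time resolution.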

---

## Proposing changes

- **Bug reports and feature requests:** Open a GitHub Issue describing the problem or proposal clearly.
- **Pull requests:** Fork the repository, create a feature branch, and open a PR against `master`.
- Every PR must pass `cd evals && npm test`.
- Describe what you changed and why in the PR description.
- Reference any related Issues.
- **Breaking changes:** Changes to shared governance files (`governance/`, `schemas/`, `.github/copilot-instructions.md`) affect all agents — test thoroughly and call this out explicitly in the PR description.

---

## Code of conduct

Be respectful and constructive. This project follows the [Contributor Covenant](https://www.contributor-covenant.org/) v2.1.
**README.md** (22 additions, 0 deletions)
# ControlFlow

[![CI](https://github.com/Smithbox-ai/ControlFlow/actions/workflows/ci.yml/badge.svg)](https://github.com/Smithbox-ai/ControlFlow/actions/workflows/ci.yml)
![Agents](https://img.shields.io/badge/agents-13-blue)
![Eval Checks](https://img.shields.io/badge/eval%20checks-302-brightgreen)
**Copilot AI, Apr 15, 2026**

The "Eval Checks" badge hardcodes 302, but the repository's evals/README.md currently states a different total (283). Hardcoding the number risks the badge and docs drifting out of sync; consider removing the fixed count or deriving it from a single authoritative source and updating all references together.

Suggested change:

```diff
-![Eval Checks](https://img.shields.io/badge/eval%20checks-302-brightgreen)
+![Eval Checks](https://img.shields.io/badge/eval%20checks-passing-brightgreen)
```
![License](https://img.shields.io/badge/license-MIT-green)

A multi-agent orchestration system for VS Code Copilot. ControlFlow replaces single-agent workflows with a coordinated team of 13 specialized agents governed by deterministic **P.A.R.T contracts** (Prompt → Archive → Resources → Tools), structured text outputs, and reliability gates.

## How It Works

**Turn any vague idea into working code in three steps:**

```
1. @Planner "Add OAuth login with Google"
→ Idea interview → phased plan → Mermaid architecture diagram

2. Approve the plan

3. @Orchestrator (runs automatically)
→ PlanAuditor reviews → CoreImplementer + TechnicalWriter execute in parallel
→ CodeReviewer gates each phase → done
```

Each agent operates within strict P.A.R.T contracts — deterministic status outputs, least-privilege tool grants, and explicit failure classification — so you get predictable, auditable results instead of unpredictable single-agent sprawl.

## Key Features

- **Context-Efficient Output** — agents return structured text summaries instead of raw JSON, conserving context tokens across delegation chains.