Multi-Agent AI Coding Orchestrator
Unify Claude Code, Gemini CLI, Antigravity, and GitHub Copilot into a single autonomous build system.
Modern AI coding assistants are powerful individually, but they're even better together. Forge lets you dispatch tasks to multiple AI agents simultaneously, orchestrate collaboration between them, and run full autonomous builds with verification loops. Think of it as a CI/CD pipeline for AI-generated code, where agents plan, code, test, review, and fix until everything passes.
- 5 agents, 1 CLI – Claude (Sonnet/Opus/Haiku), Gemini CLI, Antigravity (Pro/Flash), and GitHub Copilot
- 6 orchestration modes – Single, parallel, chain, review, consensus, and swarm patterns
- Duo pipeline – Planner + coder collaborate through plan → code → verify → review → fix cycles
- Autonomous builds – Iterative code generation with real verification, error routing, and rollback
- Persistent memory – Cross-run learning: agents remember what worked and avoid repeating failures
- Dashboard & benchmarks – HTML dashboard with Chart.js, 5 standard benchmarks, A/B prompt testing
- Plugin system – Hook-based architecture for custom verification, scoring, and post-processing
- Cost tracking – Per-agent cost, token counts, and budget caps
- Installation
- Quick Start
- Orchestration Modes
- Duo Pipeline
- Autonomous Build
- Dashboard & Analytics
- Benchmark Suite
- A/B Testing
- Plugin System
- Persistent Memory
- Project Templates
- Configuration
- Architecture
- Test Suite
- Contributing
- License
Requires Python 3.11 or later.
```shell
git clone https://github.com/Artaeon/forge-ai.git
cd forge-ai
python3 -m venv .venv
source .venv/bin/activate
pip install -e .
```

At least one AI agent CLI or API key must be configured:
| Agent | Setup | Docs |
|---|---|---|
| Claude Code | `npm install -g @anthropic-ai/claude-code` | claude.ai/code |
| Antigravity | `pip install google-genai` + set `GOOGLE_API_KEY` | ai.google.dev |
| Gemini CLI | `npm install -g @google/gemini-cli` | github.com/google/gemini-cli |
| GitHub Copilot | `gh extension install github/gh-copilot` | docs.github.com/copilot |
Verify your setup:

```shell
forge config
forge agents
```

```shell
forge run "Write a Python fibonacci function"
forge run -a claude-opus "Design a database schema for a blog"
forge run -a antigravity-pro "Implement a binary search tree"
forge run -a antigravity-flash "Explain what asyncio does in three sentences"

forge run --all "Implement a binary search" --best
forge duo "Build a Flask REST API with auth and tests" --new my-api
```

Forge supports six orchestration patterns that control how agents interact:
| Mode | Description | How It Works |
|---|---|---|
| `single` | One agent, one shot | Direct dispatch to a single agent |
| `parallel` | All agents independently | Dispatch to all, auto-select best result |
| `chain` | Sequential pipeline | Each agent refines the previous output (A → B → C) |
| `review` | Produce, critique, refine | Three-round cycle with different agent roles |
| `consensus` | All produce, judge synthesizes | All agents produce, a judge combines the best parts |
| `swarm` | Decompose into subtasks | A planner splits the task, assigns subtasks to best-fit agents |
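The `parallel` mode amounts to a fan-out/fan-in over concurrent dispatches. A minimal sketch of that pattern with asyncio, using toy stand-ins for the agent adapters and a placeholder `score` heuristic (neither is Forge's actual API):

```python
import asyncio

# Hypothetical stand-ins for real agent adapters.
async def run_agent(name: str, task: str) -> dict:
    await asyncio.sleep(0)  # a real adapter would shell out to the agent CLI here
    return {"agent": name, "output": f"{name}: solution for {task!r}"}

def score(result: dict) -> int:
    # Placeholder heuristic; Forge's aggregator scores structure, tests, and docs.
    return len(result["output"])

async def parallel_mode(agents: list[str], task: str) -> dict:
    # Fan out: dispatch the same task to every agent concurrently...
    results = await asyncio.gather(*(run_agent(a, task) for a in agents))
    # ...then fan in: auto-select the best-scoring result.
    return max(results, key=score)

best = asyncio.run(parallel_mode(["claude-sonnet", "gemini"], "binary search"))
print(best["agent"])
```

The other modes reuse the same dispatch primitive with different topologies (sequential for `chain`, an extra judge dispatch for `consensus`).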
```shell
# Chain: fast agent drafts, strong agent polishes
forge run --mode chain -a claude-haiku -a claude-sonnet "Implement a linked list"

# Review: produce → critique → refine
forge run --mode review -a claude-sonnet -a claude-opus "Write a secure auth module"

# Consensus: all agents produce, judge picks best parts
forge run --all --mode consensus "Write a caching layer"

# Swarm: planner assigns subtasks to best agents
forge run --all --mode swarm "Build a full CRUD application"
```

The duo pipeline is Forge's flagship feature – a collaborative build loop where a planner and coder agent work together through structured phases:
```
┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐
│  PLAN   │ → │  CODE   │ → │ VERIFY  │ → │ REVIEW  │ → │   FIX   │
│ (gemini)│   │ (claude)│   │  (auto) │   │ (gemini)│   │ (claude)│
└─────────┘   └─────────┘   └─────────┘   └─────────┘   └────┬────┘
     ▲                                                       │
     └───────────────────────────────────────────────────────┘
               (repeat until approved or max rounds)
```
1. SCAFFOLD – Auto-detect project type, create skeleton files and git repo
2. PLAN – Planner creates README, architecture overview, and file manifest
3. CODE – Coder implements all files from the plan with full context
4. VERIFY – Run build + lint + tests, capture real error output and stack traces
5. REVIEW – Planner reviews code against verification results
6. FIX – Coder fixes issues using real stack traces (not hallucinated errors)
7. Repeat 4–6 until the reviewer approves or max rounds are reached
8. FINAL – Auto-commit, quality scoring, and persistent memory extraction
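The phase sequence above reduces to plain control flow. A sketch with illustrative stub phase functions (these are not Forge's internals):

```python
# Toy phase functions standing in for the real planner/coder dispatches.
def plan(objective):        return f"plan for {objective}"
def code(p):                return {"plan": p, "files": []}
def verify(result):         return {"passed": "app.py" in result["files"]}
def review(result, report): return "approved" if report["passed"] else "add app.py"
def fix(result, verdict):   result["files"].append("app.py"); return result

def duo(objective: str, max_rounds: int = 3) -> str:
    result = code(plan(objective))        # PLAN, then CODE
    for _ in range(max_rounds):
        report = verify(result)           # VERIFY: build + lint + tests
        verdict = review(result, report)  # REVIEW against real results
        if verdict == "approved":
            return verdict
        result = fix(result, verdict)     # FIX using real error output
    return "max rounds reached"

print(duo("todo app"))
```

The key design point this illustrates: `verify` produces real evidence, and `review`/`fix` only ever act on that evidence rather than on the agents' own claims.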
```shell
# Default: Gemini plans, Claude codes
forge duo "Build a todo app with user auth"

# Custom agent assignment
forge duo "Build a REST API" --planner gemini --coder claude-sonnet

# Interactive mode: pause after each phase for user review
forge duo "Create a CLI tool" --interactive

# Create a new project from scratch
forge duo "Build a Flask API" --new my-api

# Resume an interrupted build
forge duo "Build a todo app" --resume

# Control iteration depth
forge duo "Build a game" --rounds 5 --timeout 120
```

- Context compaction – Only the most relevant file chunks are sent to agents, minimizing token waste
- Error classification – Failures are categorized (syntax, dependency, logic, architecture) for the optimal retry strategy
- Auto-escalation – After 3 consecutive failures, Forge escalates to a stronger model
- Rollback protection – If a fix makes things worse, Forge automatically rolls back to the last good state
- Auto dependency install – Detects `ModuleNotFoundError` / `Cannot find module` and installs missing packages
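Error classification of this kind can be approximated by pattern-matching captured stderr. A sketch, with hypothetical category patterns and retry routes (the real rules live in `errors.py`):

```python
import re

# Map error categories to stderr patterns (illustrative, not Forge's exact rules).
PATTERNS = {
    "dependency": re.compile(r"ModuleNotFoundError|Cannot find module"),
    "syntax": re.compile(r"SyntaxError|IndentationError"),
    "logic": re.compile(r"AssertionError|FAILED"),
}

def classify(stderr: str) -> str:
    for category, pattern in PATTERNS.items():
        if pattern.search(stderr):
            return category
    return "architecture"  # fallback: structural problems need a re-plan

def retry_strategy(category: str) -> str:
    # Route each category to a different recovery action.
    return {
        "dependency": "install missing package, re-run verification",
        "syntax": "feed traceback back to the same agent",
        "logic": "feed failing test output back to the same agent",
        "architecture": "escalate to a stronger model",
    }[category]

print(classify("ModuleNotFoundError: No module named 'flask'"))
```

A dependency failure is the cheapest to recover (install and retry), which is why it gets its own category rather than being fed back to the model.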
The single-agent build pipeline operates in agentic mode: the AI creates and modifies files directly on disk, iterating until all verification commands pass.
```shell
# Create a new project from scratch
forge build --new my-api "Create a FastAPI REST API with authentication"

# Build in the current directory with custom test command
forge build "Add unit tests for all modules" --test-cmd "python -m pytest"

# Auto-commit successful iterations
forge build "Refactor to TypeScript" --auto-commit

# Initialize from a template, then build
forge init flask-api --dir ./my-app
forge build --dir ./my-app "Add user registration with email verification"
```

Each iteration follows this sequence:
- Context gathering – Scans the workspace for file tree, git status, and framework detection
- Agent dispatch – Sends the objective with full workspace context in agentic mode (file write access)
- File tracking – Detects created and modified files
- Dependency installation – Auto-installs from `requirements.txt`, `package.json`, etc.
- Verification – Runs test and lint commands (auto-detected if not specified)
- Error classification – Categorizes failures and routes to the optimal retry strategy
- Rollback protection – Reverts regressions to the last known good state
- Retry or escalate – Feeds errors back; escalates to a stronger model after repeated failures
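The retry-and-escalate portion of this sequence can be sketched as a loop with injected `dispatch` and `verify` callables (the names and the escalation target are illustrative, not Forge's actual API):

```python
# Sketch of the iteration loop; dispatch/verify are injected so the control
# flow can be demonstrated without real agents.
def build(objective, dispatch, verify, max_iterations=10, escalate_after=3):
    failures, agent = 0, "claude-sonnet"
    for iteration in range(1, max_iterations + 1):
        dispatch(agent, objective)   # agent writes files in agentic mode
        errors = verify()            # run tests + lint, capture real output
        if not errors:
            return {"iterations": iteration, "agent": agent}
        failures += 1
        if failures >= escalate_after:
            agent = "claude-opus"    # repeated failures: escalate model
    raise RuntimeError("build did not converge")

# Usage with stubs: verification fails three times, then passes.
outcomes = iter([["E1"], ["E2"], ["E3"], []])
result = build("add tests", dispatch=lambda a, o: None,
               verify=lambda: next(outcomes))
print(result)  # the fourth iteration succeeds on the escalated model
```
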
Forge generates a self-contained HTML dashboard with Chart.js visualizations showing pipeline performance over time.
```shell
# Generate and open the dashboard
forge dashboard --open
```

The dashboard includes:
- Stats cards – Total runs, average quality score, approval rate, total cost
- Quality trend chart – Score progression with cost overlay
- Run history table – Every run with objective, agents, grade, score, duration, and cost
Forge includes 5 standard benchmark objectives for reproducible quality measurement:
| Benchmark | Description |
|---|---|
| CLI Todo App | Python CLI with Click, JSON storage, colored output |
| REST API | Flask API with CRUD, SQLite, validation |
| Python Library | Text metrics library with clean API and tests |
| Terminal Game | Number guessing game with difficulty levels and persistence |
| MCP Server | Model Context Protocol server with filesystem tools |
```shell
# List available benchmarks
forge benchmark --list

# Run the full benchmark suite
forge benchmark --run standard

# View historical benchmark results
forge benchmark --history
```

Compare different prompt strategies to find which produces higher-quality code:
| Variant | Strategy |
|---|---|
| `default` | Standard duo pipeline prompts (baseline) |
| `strict` | Emphasize strict typing and error handling |
| `tdd` | Test-driven development – write tests first |
| `minimal` | Lean code with no unnecessary abstractions |
| `production` | Production-grade with logging, configs, and docs |
```shell
# Compare two prompt variants
forge duo "Build a CLI tool" --variant-a default --variant-b tdd
```

A/B test results include per-variant quality scores, timing, cost, and a declared winner.
Forge's plugin system lets you extend the pipeline with custom hooks for verification, scoring, and post-processing.
Create a `.forge/plugins/my_plugin.py` file in your project:

```python
from forge.build.plugins import ForgePlugin

class MyPlugin(ForgePlugin):
    @property
    def name(self) -> str:
        return "my-plugin"

    def extra_verify_commands(self, working_dir: str) -> list[str]:
        return ["python -m mypy . --strict"]

    def extra_scoring_rules(self, working_dir: str) -> list[tuple[str, int]]:
        # Return (message, score_adjustment) pairs
        return [("MyPy strict mode passed", +5)]

    def on_pipeline_start(self, objective: str, working_dir: str) -> None:
        print(f"Starting build for: {objective}")

plugin = MyPlugin()
```

- SecurityCheckPlugin – Scans for hardcoded secrets and insecure patterns in source files
| Hook | When | Use Case |
|---|---|---|
| `on_pipeline_start` | Before the pipeline runs | Setup, notifications |
| `on_plan` / `on_code` / `on_verify` / `on_review` | After each phase | Modify phase output |
| `extra_verify_commands` | During VERIFY | Add custom checks |
| `extra_scoring_rules` | During scoring | Custom quality rules |
| `custom_template_files` | During scaffolding | Add boilerplate files |
| `on_pipeline_end` | After the pipeline completes | Cleanup, reporting |
Forge learns from every run. Patterns, failures, and strategies are persisted in `.forge-memory.json` and automatically injected into future agent prompts:

```
LEARNINGS FROM PREVIOUS RUNS:
[success] Created app.py, requirements.txt, tests/ successfully
[failure] Avoid: dependency – ModuleNotFoundError flask
[strategy] Flask apps need requirements.txt with flask>=3.0
```
The memory system:
- Deduplicates – Repeated patterns boost confidence instead of creating duplicates
- Relevance scoring – Only learnings relevant to the current objective are included
- Auto-extraction – Learnings are automatically extracted from both successful and failed runs
- Cross-session – Persists across pipeline invocations
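The deduplication and relevance steps can be sketched with a toy in-memory model; the entry fields here are assumptions for illustration, not the actual `.forge-memory.json` schema:

```python
def remember(memory: list[dict], kind: str, text: str) -> None:
    # Deduplicate: a repeated pattern boosts confidence instead of duplicating.
    for entry in memory:
        if entry["kind"] == kind and entry["text"] == text:
            entry["confidence"] += 1
            return
    memory.append({"kind": kind, "text": text, "confidence": 1})

def relevant(memory: list[dict], objective: str) -> list[dict]:
    # Crude relevance filter: keep learnings sharing a keyword with the objective.
    words = set(objective.lower().split())
    return [e for e in memory if words & set(e["text"].lower().split())]

memory: list[dict] = []
remember(memory, "strategy", "flask apps need requirements.txt")
remember(memory, "strategy", "flask apps need requirements.txt")
print(memory[0]["confidence"])  # boosted, not duplicated
print(relevant(memory, "build a flask api"))
```

Filtering by relevance before prompt injection is what keeps the learnings from bloating the context window as memory grows.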
Quickly scaffold new projects with built-in templates:
```shell
# List available templates
forge init --list

# Create from template
forge init flask-api --dir ./my-app
forge init fastapi --dir ./my-app
forge init cli-tool --dir ./my-app
```

| Template | Includes |
|---|---|
| `flask-api` | Flask REST API with config, routes, and pytest |
| `fastapi` | FastAPI with async routes, Pydantic models, and tests |
| `cli-tool` | Python CLI with Click, commands, and help text |
| `nextjs` | Next.js app (manual setup) |
Forge reads configuration from `forge.yaml`, searching upward from the working directory. Fallback: `~/.config/forge/forge.yaml`.
```yaml
global:
  timeout: 120
  max_parallel: 5
  auto_commit: false
  max_build_iterations: 10

agents:
  claude-sonnet:
    enabled: true
    agent_type: claude
    command: claude
    model: sonnet
    max_budget_usd: 1.0
  claude-opus:
    enabled: true
    agent_type: claude
    command: claude
    model: opus
    max_budget_usd: 5.0
  claude-haiku:
    enabled: true
    agent_type: claude
    command: claude
    model: haiku
    max_budget_usd: 0.25
  antigravity-pro:
    enabled: true
    agent_type: antigravity
    model: gemini-2.5-pro
  antigravity-flash:
    enabled: true
    agent_type: antigravity
    model: gemini-2.5-flash
  gemini:
    enabled: true
    agent_type: gemini
    command: gemini
  copilot:
    enabled: true
    agent_type: copilot
    command: gh

workspace:
  default_dir: "."
  create_git: true
  projects_root: "~/Projects"

build:
  test_commands:
    - "python -m pytest"
  lint_commands:
    - "python -m ruff check ."
```

| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `true` | Whether the agent is active |
| `agent_type` | string | – | Backend: `claude`, `gemini`, `antigravity`, or `copilot` |
| `command` | string | – | CLI binary name |
| `model` | string | – | Model variant (e.g. `sonnet`, `opus`, `gemini-2.5-pro`) |
| `max_budget_usd` | float | – | Per-request cost cap in USD |
| `skip_permissions` | bool | `false` | Skip file-write permission prompts |
| `extra_args` | list | `[]` | Additional CLI arguments |
```
forge/
  cli.py           CLI entry point (Click)
  config.py        Configuration loading and validation (Pydantic)
  engine.py        Agent lifecycle and parallel dispatch (asyncio)
  aggregator.py    Result scoring and comparison
  orchestrate.py   Inter-agent communication patterns (6 modes)
  agents/
    base.py        Agent protocol and data models
    claude.py      Claude Code adapter (print + agentic modes)
    antigravity.py Antigravity adapter (Google GenAI SDK, streaming)
    gemini.py      Gemini CLI adapter (native file writing)
    copilot.py     GitHub Copilot adapter
  build/
    duo.py         Duo pipeline orchestrator
    pipeline.py    Single-agent autonomous build loop
    context.py     Workspace-aware context gathering
    compact.py     Smart file chunking and context windowing
    memory.py      Session + persistent cross-run memory
    testing.py     Smart test generation and detection
    errors.py      Error classification and routing
    templates.py   Project templates for scaffolding
    scoring.py     Quality scoring (structure/code/tests/docs, 100-point scale)
    validate.py    Project validation gate
    depfix.py      Auto-resolve missing dependencies
    resume.py      Save/restore pipeline state for resumability
    benchmark.py   Standard benchmark objectives and runner
    ab_test.py     A/B testing for prompts and agent combos
    dashboard.py   HTML dashboard with Chart.js
    plugins.py     Plugin registry with hook-based extension
    phases/
      dispatch.py  Agent dispatch with spinner, retry, and timeout
      plan.py      PLAN phase logic
      code.py      CODE phase logic
      verify.py    VERIFY phase logic (build + lint + tests)
      review.py    REVIEW + FIX phase logic
  tui/
    panels.py      Terminal UI components (Rich)
```
- Modular phases – Each pipeline phase is a standalone module, independently testable
- Specific error handling – Zero `except Exception`; every handler catches specific exception types
- Context efficiency – Smart file chunking and context windowing minimize token waste
- Graceful degradation – Missing agents are detected at startup; the system works with whatever's available
- Persistent learning – Cross-run memory ensures agents improve over time
Forge includes a comprehensive test suite with 148 tests:
```shell
# Run all tests
python -m pytest tests/ -q

# Run with coverage
python -m pytest tests/ --cov=forge --cov-report=term-missing
```

| Test File | Tests | Coverage Area |
|---|---|---|
| `test_agents.py` | 108 | Agent adapters, dispatch, file extraction, cost estimation |
| `test_build.py` | 40 | Scoring, context, compaction, memory, errors, depfix |
GitHub Actions CI runs on every push:

- Linting – Ruff checks across the codebase
- Multi-version testing – Python 3.11, 3.12, and 3.13
- Smoke tests – CLI commands verified functional
Forge tracks per-agent costs as reported by the underlying CLIs. Claude Code costs are deducted from your Anthropic subscription (Pro or Max plan). The `max_budget_usd` configuration acts as a safety cap per individual request.
| Agent | Pricing Model |
|---|---|
| Claude Code | Anthropic subscription (Pro/Max plan) |
| Antigravity | Google AI API (pay per token) |
| Gemini CLI | Google account (free tier available) |
| Copilot | GitHub Copilot subscription |
Contributions are welcome! Please open an issue to discuss proposed changes before submitting a pull request.
```shell
# Setup development environment
git clone https://github.com/Artaeon/forge-ai.git
cd forge-ai
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

# Run tests
python -m pytest tests/ -q

# Lint
python -m ruff check forge/
```

- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit with descriptive messages
- Open a pull request against `main`
This project is licensed under the MIT License. See LICENSE for details.
Built by Artaeon
Making AI agents work together, so you don't have to.




