A GAN-inspired three-agent harness that separates planning, building, and evaluation into distinct AI agents with distinct contexts. The evaluator's job is to break what the generator builds -- creating adversarial tension that drives quality far beyond what a single agent can achieve. Built with both the Claude Agent SDK and Codex SDK so you can run the same architecture on either platform.
Based on Anthropic's engineering article, *Harness Design for Long-Running Application Development*.
Most AI coding agents fail on complex tasks not because the model is bad, but because nobody separated the work into specialized roles. A single agent that plans, builds, and evaluates its own work will reliably praise its own mediocre output. This is called self-evaluation bias, and it's the quiet killer of ambitious AI coding projects.
This project implements the fix: three agents, each with a focused job and its own context window.
| Agent | Role | Analogy |
|---|---|---|
| Planner | Expands a short prompt into a full product spec with sprints | Product manager |
| Generator | Builds one feature at a time, commits to git | Software engineer |
| Evaluator | Actively tries to break what the generator built, scores ruthlessly | Adversarial QA |
The evaluator doesn't just review code -- it's an adversary. It runs the application, probes for failures, tests edge cases the generator didn't think of, and scores each criterion on a 1-10 scale with a hard pass threshold. If any criterion fails, the sprint goes back to the generator with detailed, unforgiving feedback. The generator has to fight its way past the evaluator to advance. This adversarial pressure is what turns AI-generated code from "looks right" into "actually works."
- Bun runtime installed
- Claude CLI authenticated (`claude auth login`)
- Codex CLI authenticated (`codex auth login`)
```bash
git clone https://github.com/coleam00/adversarial-dev.git
cd adversarial-dev
bun install
```

Run the Claude harness:

```bash
bun run claude-harness/index.ts "Build a personal task manager with a REST API, interactive dashboard with charts, task categories, priority levels, due dates, and search functionality"
```

Or pass a detailed prompt from a file:

```bash
bun run claude-harness/index.ts --file prompt.md
```

Run the Codex harness with the same prompt:

```bash
bun run codex-harness/index.ts "Build a personal task manager with a REST API, interactive dashboard with charts, task categories, priority levels, due dates, and search functionality"
```

Both harnesses write their output to `workspace/claude/` and `workspace/codex/` respectively. The built application lives in `workspace/{sdk}/app/`.
Defaults are in `shared/config.ts`:

| Setting | Default | Description |
|---|---|---|
| `maxSprints` | `10` | Maximum number of sprints |
| `maxRetriesPerSprint` | `3` | Max evaluation retries before failing a sprint |
| `passThreshold` | `7` | Minimum score (out of 10) for each criterion |
| `CLAUDE_MODEL` | `claude-sonnet-4-6` | Model for the Claude harness |
| `CODEX_MODEL` | `gpt-5.4` | Model for the Codex harness |
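For reference, here is a minimal sketch of what `shared/config.ts` plausibly looks like -- the values and env-var names come from the table above, but the export shape and fallback logic are assumptions, not the repo's actual code:

```typescript
// shared/config.ts -- illustrative sketch only; the real file may differ.
// Defaults mirror the table above; env vars override the model choices.
export const config = {
  maxSprints: 10,          // maximum number of sprints per run
  maxRetriesPerSprint: 3,  // evaluation retries before a sprint fails
  passThreshold: 7,        // minimum per-criterion score out of 10
  claudeModel: process.env.CLAUDE_MODEL ?? "claude-sonnet-4-6",
  codexModel: process.env.CODEX_MODEL ?? "gpt-5.4",
};
```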
When you run a harness, here's what happens step by step:
1. **Plan.** The planner expands your short prompt into a comprehensive product specification -- features organized into sprints, a design language, and tech stack decisions -- written to `spec.md`.
2. **Negotiate the contract.** The generator proposes what it will build and how success should be measured. The evaluator reviews the criteria, making them more specific, adding edge cases, and raising the bar. They iterate until both lock in, and the contract is saved as JSON.
3. **Build.** The generator reads the spec and contract, then implements features one at a time with a git commit after each. It has full access to create files, run commands, install dependencies, and test code.
4. **Evaluate.** The evaluator reads the contract criteria, examines the code, runs the application, and tries to break it, scoring each criterion on a 1-10 scale. If every criterion passes (score >= 7/10), the sprint survives. If any fail, detailed feedback goes back to the generator -- file paths, line numbers, and exact failure descriptions.
5. **Retry or advance.** The generator reads the adversarial feedback, decides whether to refine or pivot, and rebuilds. This cycles up to 3 times per sprint, as sketched in the code below. If a sprint can't survive the evaluator after all retries, the harness stops.
Once all sprints pass, you have a working application built incrementally with quality gates at every step -- every feature tested by an agent whose job was to break it.
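Condensed into code, the per-sprint loop looks roughly like the sketch below. Every identifier in it (`Sprint`, `negotiateContract`, `runGenerator`, `runEvaluator`, `writeFeedback`) is an illustrative stand-in, not the repo's actual API:

```typescript
// Per-sprint control flow reconstructed from the steps above; all
// function names are hypothetical stand-ins for the real harness code.
import { config } from "../shared/config";

interface Sprint { number: number; features: string[] }
interface CriterionScore { criterion: string; score: number; feedback: string }
interface EvaluationResult { scores: CriterionScore[] }

declare function negotiateContract(s: Sprint): Promise<object>;
declare function runGenerator(s: Sprint, contract: object): Promise<void>;
declare function runEvaluator(s: Sprint, contract: object): Promise<EvaluationResult>;
declare function writeFeedback(sprint: number, round: number, r: EvaluationResult): Promise<void>;

// A sprint passes only if every criterion clears the threshold.
const passed = (r: EvaluationResult) =>
  r.scores.every((s) => s.score >= config.passThreshold);

async function runSprint(sprint: Sprint): Promise<boolean> {
  const contract = await negotiateContract(sprint);      // lock in "done" first
  for (let round = 1; round <= config.maxRetriesPerSprint; round++) {
    await runGenerator(sprint, contract);                // build + git commit
    const result = await runEvaluator(sprint, contract); // attack and score 1-10
    if (passed(result)) return true;                     // sprint survives
    await writeFeedback(sprint.number, round, result);   // feedback/sprint-{n}-round-{m}.json
  }
  return false; // the evaluator won after all retries; the harness stops
}
```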
```
User Prompt (1-4 sentences)
      |
      v
+-----------+
|  PLANNER  | --> writes spec.md (features, sprints, design language)
+-----------+
      |
      v  (for each sprint)
+----------------------+
| CONTRACT NEGOTIATION |  Generator proposes criteria,
|  Generator <-> Eval  |  Evaluator tightens the screws,
+----------------------+  both lock in "done"
      |
      v
+-----------+    fail + feedback    +------------+
| GENERATOR | <-------------------- |  EVALUATOR |
|  (build)  | --------------------> |  (attack)  |
+-----------+    implementation     +------------+
      |                                    |
      v  pass                              |
 Next Sprint <-----------------------------+
```
Before any code is written, the generator and evaluator negotiate a sprint contract: a JSON document defining exactly what "done" means. Each criterion is specific and testable -- not "works well" but "PUT /frames/reorder returns 200 and reorders frames in the database."
The evaluator uses contract negotiation to set traps -- adding edge cases, tightening thresholds, and demanding specifics that force the generator to build robust code from the start. This is directly from Anthropic's approach. They found that JSON contracts work better than markdown because models are less likely to tamper with structured JSON.
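For illustration, a negotiated contract might look like the following. The schema here is a guess based on this README (only the criterion wording is taken from the example above), expressed as TypeScript for clarity:

```typescript
// Hypothetical shape for contracts/sprint-{n}.json; the real schema may differ.
interface SprintContract {
  sprint: number;
  criteria: Array<{
    id: string;
    description: string;  // specific and testable, never "works well"
    verification: string; // what the evaluator will actually do to check it
  }>;
}

const sprint1Contract: SprintContract = {
  sprint: 1,
  criteria: [
    {
      id: "frames-reorder",
      description:
        "PUT /frames/reorder returns 200 and reorders frames in the database",
      verification:
        "Call the endpoint, then re-fetch the frames and assert the new order",
    },
  ],
};
```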
Agents communicate through files, not shared conversation history. This keeps each agent's context focused on its role:
- `spec.md` -- Product specification from the planner
- `contracts/sprint-{n}.json` -- Sprint contracts
- `feedback/sprint-{n}-round-{m}.json` -- Evaluator feedback per attempt
- `progress.json` -- Harness state tracking
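In the same spirit, the feedback and progress files plausibly carry shapes like these -- field names are assumptions inferred from this README, not the actual contents of `shared/types.ts`:

```typescript
// Assumed shapes for feedback/sprint-{n}-round-{m}.json and progress.json.
interface CriterionFeedback {
  criterion: string;
  score: number;  // 1-10; passes at >= passThreshold
  detail: string; // file paths, line numbers, exact failure description
}

interface Progress {
  currentSprint: number;
  completedSprints: number[];
  status: "planning" | "building" | "evaluating" | "done" | "failed";
}
```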
This architecture is inspired by Generative Adversarial Networks (GANs), where a generator creates outputs and a discriminator tries to reject them, iterating until quality emerges from the tension between the two.
| GANs | This Harness |
|---|---|
| Generator vs. discriminator | Generator vs. evaluator |
| Gradient descent | Hard pass/fail thresholds |
| Two networks | Three agents (adds planner) |
| Continuous training | Sprint-based iteration |
| Zero-sum game | Asymmetric adversarial -- evaluator tries to break, generator tries to survive |
The core insight is the same: separate generation from evaluation, then pit them against each other. A generator that evaluates its own work converges on mediocrity. A separate evaluator with the explicit mandate to find failures creates the adversarial pressure that forces quality upward. The generator doesn't just build -- it builds knowing an adversary is waiting.
We're at an inflection point. In 2025, the focus was on making individual agents smarter. In 2026, the focus has shifted to harness design -- the scaffolding around agents that makes them reliable.
Here's the key principle from Anthropic's article:
"Every component in a harness encodes an assumption about what the model can't do on its own."
As models improve, harnesses simplify. When Opus 4.5 shipped, Anthropic removed context resets from their harness because the model could maintain coherence natively. When Opus 4.6 shipped with 1M tokens, they removed sprint decomposition entirely because the model could sustain coherent work across two-hour builds.
But the frontier doesn't shrink -- it moves. Better models make previous scaffolding unnecessary while opening new possibilities for harnesses that achieve more complex tasks. The pattern of separating planning, building, and evaluation is durable even as the implementation details evolve.
Two principles that matter most:
- Separate evaluation from generation. Don't let the agent grade its own homework.
- Define "done" before you start. Sprint contracts are how you turn vibing into engineering.
```
adversarial-dev/
├── shared/              # Shared types, config, prompts, utilities
│   ├── types.ts         # TypeScript interfaces
│   ├── config.ts        # Model and threshold defaults
│   ├── prompts.ts       # Agent system prompts (identical for both SDKs)
│   ├── logger.ts        # Colored console output
│   └── files.ts         # File I/O for specs, contracts, feedback
├── claude-harness/      # Claude Agent SDK implementation
│   ├── index.ts         # CLI entry point
│   ├── harness.ts       # Orchestration loop
│   ├── planner.ts       # Planner agent
│   ├── generator.ts     # Generator agent
│   └── evaluator.ts     # Evaluator agent
├── codex-harness/       # Codex SDK implementation
│   ├── index.ts         # CLI entry point
│   ├── harness.ts       # Orchestration loop
│   ├── planner.ts       # Planner agent
│   ├── generator.ts     # Generator agent
│   └── evaluator.ts     # Evaluator agent
└── workspace/           # Runtime output (gitignored)
    ├── claude/          # Claude harness working directory
    └── codex/           # Codex harness working directory
```
Both harnesses share the same prompts, types, and orchestration flow. The only differences are the SDK-specific agent implementations -- `query()` async generators for the Claude Agent SDK, threads for the Codex SDK.
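At the SDK level the difference boils down to something like this. It is heavily simplified -- option objects and result shapes are trimmed -- so treat it as a sketch rather than working integration code:

```typescript
// Rough sketch of the two invocation styles; see each SDK's docs for
// the full option and message types.
import { query } from "@anthropic-ai/claude-agent-sdk";
import { Codex } from "@openai/codex-sdk";

// Claude Agent SDK: query() returns an async generator of messages.
async function runClaudeAgent(prompt: string) {
  for await (const message of query({ prompt })) {
    if (message.type === "result") return message; // final result message
  }
}

// Codex SDK: create a thread, then run turns on it.
async function runCodexAgent(prompt: string) {
  const codex = new Codex();
  const thread = codex.startThread();
  return await thread.run(prompt); // resolves when the turn completes
}
```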