30 changes: 26 additions & 4 deletions README.md
@@ -51,9 +51,22 @@ bun run claude-harness/index.ts --file prompt.md
```bash
bun run codex-harness/index.ts "Build a personal task manager with a REST API, interactive dashboard with charts, task categories, priority levels, due dates, and search functionality"
```

Both harnesses write their output to `workspace/claude/` and `workspace/codex/` respectively. The built application lives in `workspace/{sdk}/app/`.

### Run the Mixed Harness (Claude generates, GPT-5.4 evaluates)

```bash
bun run mixed-harness/index.ts "Build a REST API with authentication"
```

### Run the Gemini Harness (Claude generates, Gemini 3.1 Pro evaluates)

```bash
GEMINI_API_KEY=your-key bun run gemini-harness/index.ts --file prompt.md
```

The mixed and Gemini harnesses write to `workspace/mixed/` and `workspace/gemini/` respectively. Set `HARNESS_LOG_DIR` to customize where conversation logs are saved (defaults to `./logs`).

## Configuration

Defaults are in `shared/config.ts`:
@@ -63,7 +76,7 @@ Defaults are in `shared/config.ts`:
| `maxSprints` | 10 | Maximum number of sprints |
| `maxRetriesPerSprint` | 3 | Max evaluation retries before failing a sprint |
| `passThreshold` | 7 | Minimum score (out of 10) for each criterion |
- | `CLAUDE_MODEL` | `claude-sonnet-4-6` | Model for Claude harness |
+ | `CLAUDE_MODEL` | `claude-opus-4-6` | Model for Claude harness |
| `CODEX_MODEL` | `gpt-5.4` | Model for Codex harness |
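For reference, a minimal sketch of what `shared/config.ts` might contain, assuming the defaults in the table above (the field names below are illustrative, not necessarily the file's actual shape):

```typescript
// Hypothetical sketch of shared/config.ts; field names are illustrative.
export const config = {
  maxSprints: 10, // maximum number of sprints
  maxRetriesPerSprint: 3, // evaluation retries before failing a sprint
  passThreshold: 7, // minimum score (out of 10) per criterion
  claudeModel: process.env.CLAUDE_MODEL ?? "claude-opus-4-6",
  codexModel: process.env.CODEX_MODEL ?? "gpt-5.4",
};
```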

## How It Works
@@ -180,9 +193,18 @@ adversarial-dev/
│ ├── planner.ts # Planner agent
│ ├── generator.ts # Generator agent
│ └── evaluator.ts # Evaluator agent
├── mixed-harness/ # Claude generator + Codex GPT-5.4 evaluator
│ └── index.ts, harness.ts, planner.ts, generator.ts, evaluator.ts
├── gemini-harness/ # Claude generator + Gemini 3.1 Pro evaluator (sandboxed tools)
│ └── index.ts, harness.ts, planner.ts, generator.ts, evaluator.ts
├── tests/ # Test suites (29 tests)
│ ├── mixed-harness.test.ts
│ └── conversation-logger.test.ts
└── workspace/ # Runtime output (gitignored)
- ├── claude/ # Claude harness working directory
- └── codex/ # Codex harness working directory
+ ├── claude/
+ ├── codex/
+ ├── mixed/
+ └── gemini/
```

Both harnesses share the same prompts, types, and orchestration flow. The only differences are the SDK-specific agent implementations: `query()` async generators for Claude, `Codex` threads for Codex.
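A hypothetical sketch of that split: each SDK adapter reduces to "stream messages, keep the last one as the agent's answer," whatever the transport looks like. The interface and message shape below are assumptions for illustration, not the repo's actual types:

```typescript
// Hypothetical shared shape: each SDK adapter yields messages, and the
// harness drains the stream, keeping the final message as the answer.
interface AgentMessage {
  text: string;
}

async function drainMessages(
  stream: AsyncGenerator<AgentMessage>,
): Promise<string> {
  let finalText = "";
  for await (const message of stream) {
    finalText = message.text; // keep only the last message
  }
  return finalText;
}
```

The same `drainMessages` helper could then wrap a Claude `query()` generator or an adapter around a Codex thread, keeping the orchestration code identical.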
96 changes: 96 additions & 0 deletions RESULTS.md
@@ -0,0 +1,96 @@
# Battle Report: First Multi-Model Harness Runs

**Date:** 2026-04-02
**Target:** brane-code SSE streaming fix (Codex proxy buffering bug)
**Prompt:** `brane-streaming-fix.md` — wire SSE streaming into brane-code's Codex proxy

## Scoreboard

| Harness | Generator | Evaluator | Result | Time | Sprints | Attempts |
|---------|-----------|-----------|--------|------|---------|----------|
| claude-harness | Claude Opus 4.6 | Claude Opus 4.6 | **5/5 PASSED** | 53.4 min | 5 | 6 total (S3 needed 2) |
| codex-harness | GPT-5.4 | GPT-5.4 | **0/1 FAILED** | 59.6 min | 0 of 1 | 4 (all failed) |
| mixed-harness | Claude Opus 4.6 | GPT-5.4 | **In Progress** | 60+ min | Sprint 4 (11/13 passing) | Still running |
| gemini-harness | Claude Opus 4.6 | Gemini 3.1 Pro | **5/5 PASSED** | 50.7 min | 5 | 6 total (S4 needed 2) |

## Key Findings

### Claude vs Itself (5/5 PASSED, 53.4 min)

The baseline: Opus generated and evaluated its own code. All 5 sprints passed, with Sprint 3 needing a retry. The final sprint covered performance criteria (constant memory, no event loop blocking, time-to-first-token), all of which passed with scores of 7-8/10.

**Concern:** Self-evaluation may be sycophantic. The same model may not catch its own blind spots.

### Codex Alone (0/1 FAILED, 59.6 min)

GPT-5.4 as both generator and evaluator. **Never wrote any code.** After 4 attempts, Sprint 1 scored 1/10 on all 15 criteria. The evaluator found:

> "No incremental stream path exists in the submitted app."
> "The app cannot be started, so first-token behavior is untestable."
> "Starting the expected app entrypoint failed with MODULE_NOT_FOUND."

**Root cause:** Codex CLI has context window limits (~272K), sandbox constraints (~200 files, ~10MB), and an auto-compression bug that causes it to lose context on long-running tasks. It couldn't sustain multi-sprint autonomous coding.

### Mixed: Opus Generates, GPT-5.4 Evaluates (Sprint 4, 11/13)

The adversarial matchup. Claude Opus generates code, GPT-5.4 rips it apart. This is where the GAN-inspired approach shines — zero sycophancy.

Sprint 4 (attempt 2) scored 11/13 criteria passing. GPT-5.4 caught real issues:

- **`http_401_403_token_refresh` (5/10):** "Only partially implemented. Expected one real OAuth refresh and one retry with a refreshed token."
- **`repl_token_display_nonzero` (2/10):** "This fails in the shipped app. After a completed streamed response, the REPL/CLI display shows zero tokens."

These are legitimate bugs that the Claude self-evaluation missed entirely.

### Gemini Evaluates Opus (5/5 PASSED, 50.7 min)

Claude Opus generates, Gemini 3.1 Pro evaluates using tool calling (readFile, runCommand, listFiles). Gemini was thorough — it ran the test suite (`bun test`: 132 pass, 26 pass across multiple suites) and read source files before scoring.

Sprint 4 needed 2 attempts. Sprint 5 scored perfect 10/10 across all 14 criteria. Gemini's evaluations were detailed and evidence-based, citing specific file paths and line numbers.

**Notable:** Gemini was a tougher evaluator than Claude's self-eval (Sprint 4 failed its first attempt) but more generous on final scores (10/10 vs 7-8/10), reflecting a different evaluation style: more binary pass/fail thinking.

## What the Adversarial Approach Caught

Bugs found by cross-model evaluation that self-evaluation missed:

1. **OAuth token refresh incomplete** — GPT-5.4 evaluator flagged partial implementation
2. **REPL token display showing zero** — GPT-5.4 caught display layer bug
3. **Division-by-zero in renegotiation logic** — CodeRabbit CLI review
4. **Dead branch in renegotiation** — CodeRabbit CLI review
5. **Symlink sandbox escape** — Code review agent
6. **`node -e` arbitrary code execution** — Security review agent
7. **`find -exec` subprocess spawn** — Security review agent

## Structural Improvements Made

Before running the harnesses, we hardened the original codebase:

1. **Iterative contract negotiation** — 3 rounds of generator/evaluator back-and-forth instead of single-shot
2. **Fail-closed contract parsing** — Throws on malformed JSON instead of falling back to defaults
3. **Mid-sprint renegotiation** — Triggers when avgScore < 4 or all criteria failing
4. **Gemini evaluator sandbox** — Command allowlisting, path confinement with realpath(), git read-only, find -exec blocking
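Items 2 and 3 can be sketched as follows, assuming criteria are scored 1-10 against the `passThreshold` of 7 (function and field names here are illustrative, not the repo's actual code):

```typescript
// Illustrative names; the repo's actual parsing/renegotiation code may differ.
interface CriterionScore {
  name: string;
  score: number; // 1-10
}

// Fail closed: malformed contract JSON throws instead of
// silently falling back to default criteria.
function parseContract(raw: string): unknown {
  return JSON.parse(raw); // SyntaxError propagates to the caller
}

// Mid-sprint renegotiation trigger: average score below 4,
// or every criterion under the pass threshold.
function shouldRenegotiate(
  scores: CriterionScore[],
  passThreshold = 7,
): boolean {
  const avg = scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
  const allFailing = scores.every((s) => s.score < passThreshold);
  return avg < 4 || allFailing;
}
```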

## Harness Architecture

```
Planner (Claude Opus)
|
spec.md (product spec)
|
Contract Negotiation (3 rounds)
/ \
Generator Evaluator
(Claude Opus) (varies by harness)
| |
builds code scores 1-10
| |
+--- retry loop -----+
(max 3 attempts per sprint)
```
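The retry loop at the bottom of the diagram can be sketched as below. This is a simplification: the real harness also threads evaluator feedback back into the next generation attempt.

```typescript
// Simplified sketch of the per-sprint retry loop (max 3 attempts).
async function runSprint(
  generate: () => Promise<void>,
  evaluate: () => Promise<{ passed: boolean }>,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await generate();
    const result = await evaluate();
    if (result.passed) return true; // sprint complete
  }
  return false; // sprint failed after maxAttempts
}
```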

## Test Coverage

29 tests across 2 suites:
- `mixed-harness.test.ts` — 22 tests (parseContract, renegotiation triggers, parseEvalResult, negotiation rounds)
- `conversation-logger.test.ts` — 7 tests (entry logging, markdown format, JSONL validity, disk save)
81 changes: 81 additions & 0 deletions bun.lock


2 changes: 1 addition & 1 deletion claude-harness/evaluator.ts
@@ -93,7 +93,7 @@ function parseEvalResult(
for (const candidate of candidates) {
try {
const parsed = JSON.parse(candidate) as EvalResult;
-      if (parsed.feedback && Array.isArray(parsed.feedback)) {
+      if (parsed.feedback && Array.isArray(parsed.feedback) && parsed.feedback.length > 0) {
// Recalculate passed based on threshold
parsed.passed = parsed.feedback.every((f) => f.score >= passThreshold);
return parsed;