30 changes: 26 additions & 4 deletions README.md
@@ -51,9 +51,22 @@ bun run claude-harness/index.ts --file prompt.md
```bash
bun run codex-harness/index.ts "Build a personal task manager with a REST API, interactive dashboard with charts, task categories, priority levels, due dates, and search functionality"
```

Both harnesses write their output to `workspace/claude/` and `workspace/codex/` respectively. The built application lives in `workspace/{sdk}/app/`.

### Run the Mixed Harness (Claude generates, GPT-5.4 evaluates)

```bash
bun run mixed-harness/index.ts "Build a REST API with authentication"
```

### Run the Gemini Harness (Claude generates, Gemini 3.1 Pro evaluates)

```bash
GEMINI_API_KEY=your-key bun run gemini-harness/index.ts --file prompt.md
```

The mixed and Gemini harnesses write to `workspace/mixed/` and `workspace/gemini/` respectively. Set `HARNESS_LOG_DIR` to customize where conversation logs are saved (defaults to `./logs`).

## Configuration

Defaults are in `shared/config.ts`:
@@ -63,7 +76,7 @@ Defaults are in `shared/config.ts`:
| `maxSprints` | 10 | Maximum number of sprints |
| `maxRetriesPerSprint` | 3 | Max evaluation retries before failing a sprint |
| `passThreshold` | 7 | Minimum score (out of 10) for each criterion |
- | `CLAUDE_MODEL` | `claude-sonnet-4-6` | Model for Claude harness |
+ | `CLAUDE_MODEL` | `claude-opus-4-6` | Model for Claude harness |
| `CODEX_MODEL` | `gpt-5.4` | Model for Codex harness |
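For reference, a minimal sketch of what `shared/config.ts` might contain, assuming the defaults in the table above (the field names below are illustrative, not necessarily the file's actual shape):

```typescript
// Hypothetical sketch of shared/config.ts; field names are illustrative.
export const config = {
  maxSprints: 10, // maximum number of sprints
  maxRetriesPerSprint: 3, // evaluation retries before failing a sprint
  passThreshold: 7, // minimum score (out of 10) per criterion
  claudeModel: process.env.CLAUDE_MODEL ?? "claude-opus-4-6",
  codexModel: process.env.CODEX_MODEL ?? "gpt-5.4",
};
```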

## How It Works
@@ -180,9 +193,18 @@ adversarial-dev/
│ ├── planner.ts # Planner agent
│ ├── generator.ts # Generator agent
│ └── evaluator.ts # Evaluator agent
├── mixed-harness/ # Claude generator + Codex GPT-5.4 evaluator
│ └── index.ts, harness.ts, planner.ts, generator.ts, evaluator.ts
├── gemini-harness/ # Claude generator + Gemini 3.1 Pro evaluator (sandboxed tools)
│ └── index.ts, harness.ts, planner.ts, generator.ts, evaluator.ts
├── tests/ # Test suites (29 tests)
│ ├── mixed-harness.test.ts
│ └── conversation-logger.test.ts
└── workspace/ # Runtime output (gitignored)
- ├── claude/ # Claude harness working directory
- └── codex/ # Codex harness working directory
+ ├── claude/
+ ├── codex/
+ ├── mixed/
+ └── gemini/
```

Both harnesses share the same prompts, types, and orchestration flow. The only differences are the SDK-specific agent implementations: `query()` async generators for Claude, `Codex` threads for Codex.
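A hypothetical sketch of that split: each SDK adapter reduces to "stream messages, keep the last one as the agent's answer," whatever the transport looks like. The interface and message shape below are assumptions for illustration, not the repo's actual types:

```typescript
// Hypothetical shared shape: each SDK adapter yields messages, and the
// harness drains the stream, keeping the final message as the answer.
interface AgentMessage {
  text: string;
}

async function drainMessages(
  stream: AsyncGenerator<AgentMessage>,
): Promise<string> {
  let finalText = "";
  for await (const message of stream) {
    finalText = message.text; // keep only the last message
  }
  return finalText;
}
```

The same `drainMessages` helper could then wrap a Claude `query()` generator or an adapter around a Codex thread, keeping the orchestration code identical.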
96 changes: 96 additions & 0 deletions RESULTS.md
@@ -0,0 +1,96 @@
# Battle Report: First Multi-Model Harness Runs

**Date:** 2026-04-02
**Target:** brane-code SSE streaming fix (Codex proxy buffering bug)
**Prompt:** `brane-streaming-fix.md` — wire SSE streaming into brane-code's Codex proxy

## Scoreboard

| Harness | Generator | Evaluator | Result | Time | Sprints | Attempts |
|---------|-----------|-----------|--------|------|---------|----------|
| claude-harness | Claude Opus 4.6 | Claude Opus 4.6 | **5/5 PASSED** | 53.4 min | 5 | 6 total (S3 needed 2) |
| codex-harness | GPT-5.4 | GPT-5.4 | **0/1 FAILED** | 59.6 min | 0 of 1 | 4 (all failed) |
| mixed-harness | Claude Opus 4.6 | GPT-5.4 | **In Progress** | 60+ min | Sprint 4 (11/13 passing) | Still running |
| gemini-harness | Claude Opus 4.6 | Gemini 3.1 Pro | **5/5 PASSED** | 50.7 min | 5 | 6 total (S4 needed 2) |

## Key Findings

### Claude vs Itself (5/5 PASSED, 53.4 min)

The baseline: Opus generated and evaluated its own code. All 5 sprints passed, with Sprint 3 needing a retry. The final sprint covered performance criteria (constant memory, no event loop blocking, time-to-first-token), all of which passed with scores of 7-8/10.

**Concern:** Self-evaluation may be sycophantic. The same model may not catch its own blind spots.

### Codex Alone (0/1 FAILED, 59.6 min)

GPT-5.4 as both generator and evaluator. **Never wrote any code.** After 4 attempts, Sprint 1 scored 1/10 on all 15 criteria. The evaluator found:

> "No incremental stream path exists in the submitted app."
> "The app cannot be started, so first-token behavior is untestable."
> "Starting the expected app entrypoint failed with MODULE_NOT_FOUND."

**Root cause:** Codex CLI has context window limits (~272K), sandbox constraints (~200 files, ~10MB), and an auto-compression bug that causes it to lose context on long-running tasks. It couldn't sustain multi-sprint autonomous coding.

### Mixed: Opus Generates, GPT-5.4 Evaluates (Sprint 4, 11/13)

The adversarial matchup. Claude Opus generates code, GPT-5.4 rips it apart. This is where the GAN-inspired approach shines — zero sycophancy.

Sprint 4 (attempt 2) scored 11/13 criteria passing. GPT-5.4 caught real issues:

- **`http_401_403_token_refresh` (5/10):** "Only partially implemented. Expected one real OAuth refresh and one retry with a refreshed token."
- **`repl_token_display_nonzero` (2/10):** "This fails in the shipped app. After a completed streamed response, the REPL/CLI display shows zero tokens."

These are legitimate bugs that the Claude self-evaluation missed entirely.

### Gemini Evaluates Opus (5/5 PASSED, 50.7 min)

Claude Opus generates, Gemini 3.1 Pro evaluates using tool calling (readFile, runCommand, listFiles). Gemini was thorough — it ran the test suite (`bun test`: 132 pass, 26 pass across multiple suites) and read source files before scoring.

Sprint 4 needed 2 attempts. Sprint 5 scored perfect 10/10 across all 14 criteria. Gemini's evaluations were detailed and evidence-based, citing specific file paths and line numbers.

**Notable:** Gemini was a tougher evaluator than Claude's self-eval (Sprint 4 failed its first attempt) but more generous on final scores (10/10 vs 7-8/10), reflecting a different evaluation style: more binary pass/fail thinking.

## What the Adversarial Approach Caught

Bugs found by cross-model evaluation that self-evaluation missed:

1. **OAuth token refresh incomplete** — GPT-5.4 evaluator flagged partial implementation
2. **REPL token display showing zero** — GPT-5.4 caught display layer bug
3. **Division-by-zero in renegotiation logic** — CodeRabbit CLI review
4. **Dead branch in renegotiation** — CodeRabbit CLI review
5. **Symlink sandbox escape** — Code review agent
6. **`node -e` arbitrary code execution** — Security review agent
7. **`find -exec` subprocess spawn** — Security review agent

## Structural Improvements Made

Before running the harnesses, we hardened the original codebase:

1. **Iterative contract negotiation** — 3 rounds of generator/evaluator back-and-forth instead of single-shot
2. **Fail-closed contract parsing** — Throws on malformed JSON instead of falling back to defaults
3. **Mid-sprint renegotiation** — Triggers when avgScore < 4 or all criteria failing
4. **Gemini evaluator sandbox** — Command allowlisting, path confinement with realpath(), git read-only, find -exec blocking
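Items 2 and 3 can be sketched as follows, assuming criteria are scored 1-10 against the `passThreshold` of 7 (function and field names here are illustrative, not the repo's actual code):

```typescript
// Illustrative names; the repo's actual parsing/renegotiation code may differ.
interface CriterionScore {
  name: string;
  score: number; // 1-10
}

// Fail closed: malformed contract JSON throws instead of
// silently falling back to default criteria.
function parseContract(raw: string): unknown {
  return JSON.parse(raw); // SyntaxError propagates to the caller
}

// Mid-sprint renegotiation trigger: average score below 4,
// or every criterion under the pass threshold.
function shouldRenegotiate(
  scores: CriterionScore[],
  passThreshold = 7,
): boolean {
  const avg = scores.reduce((sum, s) => sum + s.score, 0) / scores.length;
  const allFailing = scores.every((s) => s.score < passThreshold);
  return avg < 4 || allFailing;
}
```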

## Harness Architecture

```
Planner (Claude Opus)
|
spec.md (product spec)
|
Contract Negotiation (3 rounds)
/ \
Generator Evaluator
(Claude Opus) (varies by harness)
| |
builds code scores 1-10
| |
+--- retry loop -----+
(max 3 attempts per sprint)
```
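The retry loop at the bottom of the diagram can be sketched as below. This is a simplification: the real harness also threads evaluator feedback back into the next generation attempt.

```typescript
// Simplified sketch of the per-sprint retry loop (max 3 attempts).
async function runSprint(
  generate: () => Promise<void>,
  evaluate: () => Promise<{ passed: boolean }>,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await generate();
    const result = await evaluate();
    if (result.passed) return true; // sprint complete
  }
  return false; // sprint failed after maxAttempts
}
```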

## Test Coverage

29 tests across 2 suites:
- `mixed-harness.test.ts` — 22 tests (parseContract, renegotiation triggers, parseEvalResult, negotiation rounds)
- `conversation-logger.test.ts` — 7 tests (entry logging, markdown format, JSONL validity, disk save)
81 changes: 81 additions & 0 deletions bun.lock


2 changes: 1 addition & 1 deletion claude-harness/evaluator.ts
@@ -93,7 +93,7 @@ function parseEvalResult(
for (const candidate of candidates) {
try {
const parsed = JSON.parse(candidate) as EvalResult;
-      if (parsed.feedback && Array.isArray(parsed.feedback)) {
+      if (parsed.feedback && Array.isArray(parsed.feedback) && parsed.feedback.length > 0) {
// Recalculate passed based on threshold
parsed.passed = parsed.feedback.every((f) => f.score >= passThreshold);
return parsed;