diff --git a/README.md b/README.md
index ce60a9b..143faa6 100644
--- a/README.md
+++ b/README.md
@@ -51,9 +51,22 @@ bun run claude-harness/index.ts --file prompt.md
 ```bash
 bun run codex-harness/index.ts "Build a personal task manager with a REST API, interactive dashboard with charts, task categories, priority levels, due dates, and search functionality"
 ```
-
 Both harnesses write their output to `workspace/claude/` and `workspace/codex/` respectively. The built application lives in `workspace/{sdk}/app/`.
 
+### Run the Mixed Harness (Claude generates, GPT-5.4 evaluates)
+
+```bash
+bun run mixed-harness/index.ts "Build a REST API with authentication"
+```
+
+### Run the Gemini Harness (Claude generates, Gemini 3.1 Pro evaluates)
+
+```bash
+GEMINI_API_KEY=your-key bun run gemini-harness/index.ts --file prompt.md
+```
+
+Mixed and Gemini harnesses write to `workspace/mixed/` and `workspace/gemini/` respectively. Set `HARNESS_LOG_DIR` to customize where conversation logs are saved (defaults to `./logs`).
+
 ## Configuration
 
 Defaults are in `shared/config.ts`:
@@ -63,7 +76,7 @@ Defaults are in `shared/config.ts`:
 | `maxSprints` | 10 | Maximum number of sprints |
 | `maxRetriesPerSprint` | 3 | Max evaluation retries before failing a sprint |
 | `passThreshold` | 7 | Minimum score (out of 10) for each criterion |
-| `CLAUDE_MODEL` | `claude-sonnet-4-6` | Model for Claude harness |
+| `CLAUDE_MODEL` | `claude-opus-4-6` | Model for Claude harness |
 | `CODEX_MODEL` | `gpt-5.4` | Model for Codex harness |
 
 ## How It Works
@@ -180,9 +193,18 @@ adversarial-dev/
 │   ├── planner.ts              # Planner agent
 │   ├── generator.ts            # Generator agent
 │   └── evaluator.ts            # Evaluator agent
+├── mixed-harness/              # Claude generator + Codex GPT-5.4 evaluator
+│   ├── index.ts, harness.ts, planner.ts, generator.ts, evaluator.ts
+├── gemini-harness/             # Claude generator + Gemini 3.1 Pro evaluator (sandboxed tools)
+│   ├── index.ts, harness.ts, planner.ts, generator.ts, evaluator.ts
+├── tests/                      # Test suites (29 tests)
+│   ├── mixed-harness.test.ts
+│   └── conversation-logger.test.ts
 └── workspace/                  # Runtime output (gitignored)
-    ├── claude/                 # Claude harness working directory
-    └── codex/                  # Codex harness working directory
+    ├── claude/
+    ├── codex/
+    ├── mixed/
+    └── gemini/
 ```
 
 Both harnesses share the same prompts, types, and orchestration flow. The only differences are the SDK-specific agent implementations -- `query()` async generators for Claude, `Codex` threads for Codex.
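Review note on the `passThreshold` row in the README table above: the evaluator hunks later in this diff recompute `passed` as `feedback.every((f) => f.score >= passThreshold)` and now also require a non-empty feedback array. The gate can be sketched roughly as below. This is a minimal illustration, not the repo's code: `CriterionScore`, `sprintPasses`, and `failingCriteria` are hypothetical names, and the real `EvalResult` type may be shaped differently.

```typescript
// Hypothetical shape for one evaluator score; the repo's real EvalResult type may differ.
interface CriterionScore {
  criterion: string;
  score: number; // 1-10 scale, per the README
}

// Default taken from the README configuration table.
const DEFAULT_PASS_THRESHOLD = 7;

// A sprint passes only when feedback is non-empty and every score meets the threshold.
// The non-empty check matters because [].every(...) is true in JavaScript.
function sprintPasses(
  feedback: CriterionScore[],
  passThreshold: number = DEFAULT_PASS_THRESHOLD,
): boolean {
  return feedback.length > 0 && feedback.every((f) => f.score >= passThreshold);
}

// Names of the criteria that fall below the threshold, e.g. for retry feedback.
function failingCriteria(
  feedback: CriterionScore[],
  passThreshold: number = DEFAULT_PASS_THRESHOLD,
): string[] {
  return feedback.filter((f) => f.score < passThreshold).map((f) => f.criterion);
}
```

With the default threshold, scores of 7 and 8 pass, and a single 5 fails the sprint. In the actual hunks an empty feedback array makes the parser skip that candidate; the sketch simplifies this to a failed gate so that an empty evaluation can never pass vacuously.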
diff --git a/RESULTS.md b/RESULTS.md
new file mode 100644
index 0000000..4858a24
--- /dev/null
+++ b/RESULTS.md
@@ -0,0 +1,96 @@
+# Battle Report: First Multi-Model Harness Runs
+
+**Date:** 2026-04-02
+**Target:** brane-code SSE streaming fix (Codex proxy buffering bug)
+**Prompt:** `brane-streaming-fix.md` — wire SSE streaming into brane-code's Codex proxy
+
+## Scoreboard
+
+| Harness | Generator | Evaluator | Result | Time | Sprints | Attempts |
+|---------|-----------|-----------|--------|------|---------|----------|
+| claude-harness | Claude Opus 4.6 | Claude Opus 4.6 | **5/5 PASSED** | 53.4 min | 5 | 6 total (S3 needed 2) |
+| codex-harness | GPT-5.4 | GPT-5.4 | **0/1 FAILED** | 59.6 min | 0 of 1 | 4 (all failed) |
+| mixed-harness | Claude Opus 4.6 | GPT-5.4 | **In Progress** | 60+ min | Sprint 4 (11/13 passing) | Still running |
+| gemini-harness | Claude Opus 4.6 | Gemini 3.1 Pro | **5/5 PASSED** | 50.7 min | 5 | 6 total (S4 needed 2) |
+
+## Key Findings
+
+### Claude vs Itself (5/5 PASSED, 53.4 min)
+
+Self-evaluation with the same model. Opus generated and evaluated its own code. All 5 sprints passed, with Sprint 3 needing a retry. Final sprint covered performance criteria (constant memory, no event loop blocking, time-to-first-token) — all passed with 7-8/10 scores.
+
+**Concern:** Self-evaluation may be sycophantic. The same model may not catch its own blind spots.
+
+### Codex Alone (0/1 FAILED, 59.6 min)
+
+GPT-5.4 as both generator and evaluator. **Never wrote any code.** After 4 attempts, Sprint 1 scored 1/10 on all 15 criteria. The evaluator found:
+
+> "No incremental stream path exists in the submitted app."
+> "The app cannot be started, so first-token behavior is untestable."
+> "Starting the expected app entrypoint failed with MODULE_NOT_FOUND."
+
+**Root cause:** Codex CLI has context window limits (~272K), sandbox constraints (~200 files, ~10MB), and an auto-compression bug that causes it to lose context on long-running tasks. It couldn't sustain multi-sprint autonomous coding.
+
+### Mixed: Opus Generates, GPT-5.4 Evaluates (Sprint 4, 11/13)
+
+The adversarial matchup. Claude Opus generates code, GPT-5.4 rips it apart. This is where the GAN-inspired approach shines — zero sycophancy.
+
+Sprint 4 (attempt 2) scored 11/13 criteria passing. GPT-5.4 caught real issues:
+
+- **`http_401_403_token_refresh` (5/10):** "Only partially implemented. Expected one real OAuth refresh and one retry with a refreshed token."
+- **`repl_token_display_nonzero` (2/10):** "This fails in the shipped app. After a completed streamed response, the REPL/CLI display shows zero tokens."
+
+These are legitimate bugs that the Claude self-evaluation missed entirely.
+
+### Gemini Evaluates Opus (5/5 PASSED, 50.7 min)
+
+Claude Opus generates, Gemini 3.1 Pro evaluates using tool calling (readFile, runCommand, listFiles). Gemini was thorough — it ran the test suite (`bun test`: 132 pass, 26 pass across multiple suites) and read source files before scoring.
+
+Sprint 4 needed 2 attempts. Sprint 5 scored perfect 10/10 across all 14 criteria. Gemini's evaluations were detailed and evidence-based, citing specific file paths and line numbers.
+
+**Notable:** Gemini was a tougher evaluator than Claude self-eval (Sprint 4 failed first attempt) but more generous on final scores (10/10 vs 7-8/10). Different evaluation style — more binary pass/fail thinking.
+
+## What the Adversarial Approach Caught
+
+Bugs found by cross-model evaluation that self-evaluation missed:
+
+1. **OAuth token refresh incomplete** — GPT-5.4 evaluator flagged partial implementation
+2. **REPL token display showing zero** — GPT-5.4 caught display layer bug
+3. **Division-by-zero in renegotiation logic** — CodeRabbit CLI review
+4. **Dead branch in renegotiation** — CodeRabbit CLI review
+5. **Symlink sandbox escape** — Code review agent
+6. **`node -e` arbitrary code execution** — Security review agent
+7. **`find -exec` subprocess spawn** — Security review agent
+
+## Structural Improvements Made
+
+Before running the harnesses, we hardened the original codebase:
+
+1. **Iterative contract negotiation** — 3 rounds of generator/evaluator back-and-forth instead of single-shot
+2. **Fail-closed contract parsing** — Throws on malformed JSON instead of falling back to defaults
+3. **Mid-sprint renegotiation** — Triggers when avgScore < 4 or all criteria failing
+4. **Gemini evaluator sandbox** — Command allowlisting, path confinement with realpath(), git read-only, find -exec blocking
+
+## Harness Architecture
+
+```
+         Planner (Claude Opus)
+                 |
+       spec.md (product spec)
+                 |
+    Contract Negotiation (3 rounds)
+          /              \
+     Generator        Evaluator
+   (Claude Opus)  (varies by harness)
+         |                |
+    builds code      scores 1-10
+         |                |
+         +--- retry loop -----+
+       (max 3 retries per sprint)
+```
+
+## Test Coverage
+
+29 tests across 2 suites:
+- `mixed-harness.test.ts` — 22 tests (parseContract, renegotiation triggers, parseEvalResult, negotiation rounds)
+- `conversation-logger.test.ts` — 7 tests (entry logging, markdown format, JSONL validity, disk save)
diff --git a/bun.lock b/bun.lock
index 8e4935d..f1af5c7 100644
--- a/bun.lock
+++ b/bun.lock
@@ -6,6 +6,7 @@
       "name": "adversarial-dev",
       "dependencies": {
         "@anthropic-ai/claude-agent-sdk": "^0.2.85",
+        "@google/genai": "^1.48.0",
         "@openai/codex-sdk": "^0.117.0",
       },
       "devDependencies": {
@@ -19,6 +20,8 @@
   "packages": {
     "@anthropic-ai/claude-agent-sdk": ["@anthropic-ai/claude-agent-sdk@0.2.85", "", { "optionalDependencies": { "@img/sharp-darwin-arm64": "^0.34.2", "@img/sharp-darwin-x64": "^0.34.2", "@img/sharp-linux-arm": "^0.34.2", "@img/sharp-linux-arm64": "^0.34.2", "@img/sharp-linux-x64": "^0.34.2", "@img/sharp-linuxmusl-arm64": "^0.34.2", "@img/sharp-linuxmusl-x64": "^0.34.2", "@img/sharp-win32-arm64": "^0.34.2", "@img/sharp-win32-x64": "^0.34.2" }, "peerDependencies": { "zod": "^4.0.0" } },
"sha512-/ohKLtP1zy6aWXLW/9KTYBveJPEtAfdO96qiP1Cl5S7LgVq/qRDUl7AUw5YGrBaK6YWHEE/rfMQZGwP/i5zIvQ=="], + "@google/genai": ["@google/genai@1.48.0", "", { "dependencies": { "google-auth-library": "^10.3.0", "p-retry": "^4.6.2", "protobufjs": "^7.5.4", "ws": "^8.18.0" }, "peerDependencies": { "@modelcontextprotocol/sdk": "^1.25.2" }, "optionalPeers": ["@modelcontextprotocol/sdk"] }, "sha512-plonYK4ML2PrxsRD9SeqmFt76eREWkQdPCglOA6aYDzL1AAbE+7PUnT54SvpWGfws13L0AZEqGSpL7+1IPnTxQ=="], + "@img/sharp-darwin-arm64": ["@img/sharp-darwin-arm64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-arm64": "1.2.4" }, "os": "darwin", "cpu": "arm64" }, "sha512-imtQ3WMJXbMY4fxb/Ndp6HBTNVtWCUI0WdobyheGf5+ad6xX8VIDO8u2xE4qc/fr08CKG/7dDseFtn6M6g/r3w=="], "@img/sharp-darwin-x64": ["@img/sharp-darwin-x64@0.34.5", "", { "optionalDependencies": { "@img/sharp-libvips-darwin-x64": "1.2.4" }, "os": "darwin", "cpu": "x64" }, "sha512-YNEFAF/4KQ/PeW0N+r+aVVsoIY0/qxxikF2SWdp+NRkmMB7y9LBZAVqQ4yhGCm/H3H270OSykqmQMKLBhBJDEw=="], @@ -67,16 +70,94 @@ "@openai/codex-win32-x64": ["@openai/codex@0.117.0-win32-x64", "", { "os": "win32", "cpu": "x64" }, "sha512-ByedNwSlHJ4aE2++fBaUcaqbQsmx2dZS6mhrnv2SqbTY0saRFE2BT1R64fClt8TwXwMsQQn1uvkxjzU4aEhRcg=="], + "@protobufjs/aspromise": ["@protobufjs/aspromise@1.1.2", "", {}, "sha512-j+gKExEuLmKwvz3OgROXtrJ2UG2x8Ch2YZUxahh+s1F2HZ+wAceUNLkvy6zKCPVRkU++ZWQrdxsUeQXmcg4uoQ=="], + + "@protobufjs/base64": ["@protobufjs/base64@1.1.2", "", {}, "sha512-AZkcAA5vnN/v4PDqKyMR5lx7hZttPDgClv83E//FMNhR2TMcLUhfRUBHCmSl0oi9zMgDDqRUJkSxO3wm85+XLg=="], + + "@protobufjs/codegen": ["@protobufjs/codegen@2.0.4", "", {}, "sha512-YyFaikqM5sH0ziFZCN3xDC7zeGaB/d0IUb9CATugHWbd1FRFwWwt4ld4OYMPWu5a3Xe01mGAULCdqhMlPl29Jg=="], + + "@protobufjs/eventemitter": ["@protobufjs/eventemitter@1.1.0", "", {}, "sha512-j9ednRT81vYJ9OfVuXG6ERSTdEL1xVsNgqpkxMsbIabzSo3goCjDIveeGv5d03om39ML71RdmrGNjG5SReBP/Q=="], + + "@protobufjs/fetch": ["@protobufjs/fetch@1.1.0", "", { "dependencies": { 
"@protobufjs/aspromise": "^1.1.1", "@protobufjs/inquire": "^1.1.0" } }, "sha512-lljVXpqXebpsijW71PZaCYeIcE5on1w5DlQy5WH6GLbFryLUrBD4932W/E2BSpfRJWseIL4v/KPgBFxDOIdKpQ=="], + + "@protobufjs/float": ["@protobufjs/float@1.0.2", "", {}, "sha512-Ddb+kVXlXst9d+R9PfTIxh1EdNkgoRe5tOX6t01f1lYWOvJnSPDBlG241QLzcyPdoNTsblLUdujGSE4RzrTZGQ=="], + + "@protobufjs/inquire": ["@protobufjs/inquire@1.1.0", "", {}, "sha512-kdSefcPdruJiFMVSbn801t4vFK7KB/5gd2fYvrxhuJYg8ILrmn9SKSX2tZdV6V+ksulWqS7aXjBcRXl3wHoD9Q=="], + + "@protobufjs/path": ["@protobufjs/path@1.1.2", "", {}, "sha512-6JOcJ5Tm08dOHAbdR3GrvP+yUUfkjG5ePsHYczMFLq3ZmMkAD98cDgcT2iA1lJ9NVwFd4tH/iSSoe44YWkltEA=="], + + "@protobufjs/pool": ["@protobufjs/pool@1.1.0", "", {}, "sha512-0kELaGSIDBKvcgS4zkjz1PeddatrjYcmMWOlAuAPwAeccUrPHdUqo/J6LiymHHEiJT5NrF1UVwxY14f+fy4WQw=="], + + "@protobufjs/utf8": ["@protobufjs/utf8@1.1.0", "", {}, "sha512-Vvn3zZrhQZkkBE8LSuW3em98c0FwgO4nxzv6OdSxPKJIEKY2bGbHn+mhGIPerzI4twdxaP8/0+06HBpwf345Lw=="], + "@types/bun": ["@types/bun@1.3.11", "", { "dependencies": { "bun-types": "1.3.11" } }, "sha512-5vPne5QvtpjGpsGYXiFyycfpDF2ECyPcTSsFBMa0fraoxiQyMJ3SmuQIGhzPg2WJuWxVBoxWJ2kClYTcw/4fAg=="], "@types/node": ["@types/node@25.5.0", "", { "dependencies": { "undici-types": "~7.18.0" } }, "sha512-jp2P3tQMSxWugkCUKLRPVUpGaL5MVFwF8RDuSRztfwgN1wmqJeMSbKlnEtQqU8UrhTmzEmZdu2I6v2dpp7XIxw=="], + "@types/retry": ["@types/retry@0.12.0", "", {}, "sha512-wWKOClTTiizcZhXnPY4wikVAwmdYHp8q6DmC+EJUzAMsycb7HB32Kh9RN4+0gExjmPmZSAQjgURXIGATPegAvA=="], + + "agent-base": ["agent-base@7.1.4", "", {}, "sha512-MnA+YT8fwfJPgBx3m60MNqakm30XOkyIoH1y6huTQvC0PwZG7ki8NacLBcrPbNoo8vEZy7Jpuk7+jMO+CUovTQ=="], + + "base64-js": ["base64-js@1.5.1", "", {}, "sha512-AKpaYlHn8t4SVbOHCy+b5+KKgvR4vrsD8vbvrbiQJps7fKDTkjkDry6ji0rUJjC0kzbNePLwzxq8iypo41qeWA=="], + + "bignumber.js": ["bignumber.js@9.3.1", "", {}, "sha512-Ko0uX15oIUS7wJ3Rb30Fs6SkVbLmPBAKdlm7q9+ak9bbIeFf0MwuBsQV6z7+X768/cHsfg+WlysDWJcmthjsjQ=="], + + "buffer-equal-constant-time": 
["buffer-equal-constant-time@1.0.1", "", {}, "sha512-zRpUiDwd/xk6ADqPMATG8vc9VPrkck7T07OIx0gnjmJAnHnTVXNQG3vfvWNuiZIkwu9KrKdA1iJKfsfTVxE6NA=="], + "bun-types": ["bun-types@1.3.11", "", { "dependencies": { "@types/node": "*" } }, "sha512-1KGPpoxQWl9f6wcZh57LvrPIInQMn2TQ7jsgxqpRzg+l0QPOFvJVH7HmvHo/AiPgwXy+/Thf6Ov3EdVn1vOabg=="], + "data-uri-to-buffer": ["data-uri-to-buffer@4.0.1", "", {}, "sha512-0R9ikRb668HB7QDxT1vkpuUBtqc53YyAwMwGeUFKRojY/NWKvdZ+9UYtRfGmhqNbRkTSVpMbmyhXipFFv2cb/A=="], + + "debug": ["debug@4.4.3", "", { "dependencies": { "ms": "^2.1.3" } }, "sha512-RGwwWnwQvkVfavKVt22FGLw+xYSdzARwm0ru6DhTVA3umU5hZc28V3kO4stgYryrTlLpuvgI9GiijltAjNbcqA=="], + + "ecdsa-sig-formatter": ["ecdsa-sig-formatter@1.0.11", "", { "dependencies": { "safe-buffer": "^5.0.1" } }, "sha512-nagl3RYrbNv6kQkeJIpt6NJZy8twLB/2vtz6yN9Z4vRKHN4/QZJIEbqohALSgwKdnksuY3k5Addp5lg8sVoVcQ=="], + + "extend": ["extend@3.0.2", "", {}, "sha512-fjquC59cD7CyW6urNXK0FBufkZcoiGG80wTuPujX590cB5Ttln20E2UB4S/WARVqhXffZl2LNgS+gQdPIIim/g=="], + + "fetch-blob": ["fetch-blob@3.2.0", "", { "dependencies": { "node-domexception": "^1.0.0", "web-streams-polyfill": "^3.0.3" } }, "sha512-7yAQpD2UMJzLi1Dqv7qFYnPbaPx7ZfFK6PiIxQ4PfkGPyNyl2Ugx+a/umUonmKqjhM4DnfbMvdX6otXq83soQQ=="], + + "formdata-polyfill": ["formdata-polyfill@4.0.10", "", { "dependencies": { "fetch-blob": "^3.1.2" } }, "sha512-buewHzMvYL29jdeQTVILecSaZKnt/RJWjoZCF5OW60Z67/GmSLBkOFM7qh1PI3zFNtJbaZL5eQu1vLfazOwj4g=="], + + "gaxios": ["gaxios@7.1.4", "", { "dependencies": { "extend": "^3.0.2", "https-proxy-agent": "^7.0.1", "node-fetch": "^3.3.2" } }, "sha512-bTIgTsM2bWn3XklZISBTQX7ZSddGW+IO3bMdGaemHZ3tbqExMENHLx6kKZ/KlejgrMtj8q7wBItt51yegqalrA=="], + + "gcp-metadata": ["gcp-metadata@8.1.2", "", { "dependencies": { "gaxios": "^7.0.0", "google-logging-utils": "^1.0.0", "json-bigint": "^1.0.0" } }, "sha512-zV/5HKTfCeKWnxG0Dmrw51hEWFGfcF2xiXqcA3+J90WDuP0SvoiSO5ORvcBsifmx/FoIjgQN3oNOGaQ5PhLFkg=="], + + "google-auth-library": ["google-auth-library@10.6.2", "", { 
"dependencies": { "base64-js": "^1.3.0", "ecdsa-sig-formatter": "^1.0.11", "gaxios": "^7.1.4", "gcp-metadata": "8.1.2", "google-logging-utils": "1.1.3", "jws": "^4.0.0" } }, "sha512-e27Z6EThmVNNvtYASwQxose/G57rkRuaRbQyxM2bvYLLX/GqWZ5chWq2EBoUchJbCc57eC9ArzO5wMsEmWftCw=="], + + "google-logging-utils": ["google-logging-utils@1.1.3", "", {}, "sha512-eAmLkjDjAFCVXg7A1unxHsLf961m6y17QFqXqAXGj/gVkKFrEICfStRfwUlGNfeCEjNRa32JEWOUTlYXPyyKvA=="], + + "https-proxy-agent": ["https-proxy-agent@7.0.6", "", { "dependencies": { "agent-base": "^7.1.2", "debug": "4" } }, "sha512-vK9P5/iUfdl95AI+JVyUuIcVtd4ofvtrOr3HNtM2yxC9bnMbEdp3x01OhQNnjb8IJYi38VlTE3mBXwcfvywuSw=="], + + "json-bigint": ["json-bigint@1.0.0", "", { "dependencies": { "bignumber.js": "^9.0.0" } }, "sha512-SiPv/8VpZuWbvLSMtTDU8hEfrZWg/mH/nV/b4o0CYbSxu1UIQPLdwKOCIyLQX+VIPO5vrLX3i8qtqFyhdPSUSQ=="], + + "jwa": ["jwa@2.0.1", "", { "dependencies": { "buffer-equal-constant-time": "^1.0.1", "ecdsa-sig-formatter": "1.0.11", "safe-buffer": "^5.0.1" } }, "sha512-hRF04fqJIP8Abbkq5NKGN0Bbr3JxlQ+qhZufXVr0DvujKy93ZCbXZMHDL4EOtodSbCWxOqR8MS1tXA5hwqCXDg=="], + + "jws": ["jws@4.0.1", "", { "dependencies": { "jwa": "^2.0.1", "safe-buffer": "^5.0.1" } }, "sha512-EKI/M/yqPncGUUh44xz0PxSidXFr/+r0pA70+gIYhjv+et7yxM+s29Y+VGDkovRofQem0fs7Uvf4+YmAdyRduA=="], + + "long": ["long@5.3.2", "", {}, "sha512-mNAgZ1GmyNhD7AuqnTG3/VQ26o760+ZYBPKjPvugO8+nLbYfX6TVpJPseBvopbdY+qpZ/lKUnmEc1LeZYS3QAA=="], + + "ms": ["ms@2.1.3", "", {}, "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA=="], + + "node-domexception": ["node-domexception@1.0.0", "", {}, "sha512-/jKZoMpw0F8GRwl4/eLROPA3cfcXtLApP0QzLmUT/HuPCZWyB7IY9ZrMeKw2O/nFIqPQB3PVM9aYm0F312AXDQ=="], + + "node-fetch": ["node-fetch@3.3.2", "", { "dependencies": { "data-uri-to-buffer": "^4.0.0", "fetch-blob": "^3.1.4", "formdata-polyfill": "^4.0.10" } }, "sha512-dRB78srN/l6gqWulah9SrxeYnxeddIG30+GOqK/9OlLVyLg3HPnr6SqOWTWOXKRwC2eGYCkZ59NNuSgvSrpgOA=="], + + "p-retry": 
["p-retry@4.6.2", "", { "dependencies": { "@types/retry": "0.12.0", "retry": "^0.13.1" } }, "sha512-312Id396EbJdvRONlngUx0NydfrIQ5lsYu0znKVUzVvArzEIt08V1qhtyESbGVd1FGX7UKtiFp5uwKZdM8wIuQ=="], + + "protobufjs": ["protobufjs@7.5.4", "", { "dependencies": { "@protobufjs/aspromise": "^1.1.2", "@protobufjs/base64": "^1.1.2", "@protobufjs/codegen": "^2.0.4", "@protobufjs/eventemitter": "^1.1.0", "@protobufjs/fetch": "^1.1.0", "@protobufjs/float": "^1.0.2", "@protobufjs/inquire": "^1.1.0", "@protobufjs/path": "^1.1.2", "@protobufjs/pool": "^1.1.0", "@protobufjs/utf8": "^1.1.0", "@types/node": ">=13.7.0", "long": "^5.0.0" } }, "sha512-CvexbZtbov6jW2eXAvLukXjXUW1TzFaivC46BpWc/3BpcCysb5Vffu+B3XHMm8lVEuy2Mm4XGex8hBSg1yapPg=="], + + "retry": ["retry@0.13.1", "", {}, "sha512-XQBQ3I8W1Cge0Seh+6gjj03LbmRFWuoszgK9ooCpwYIrhhoO80pfq4cUkU5DkknwfOfFteRwlZ56PYOGYyFWdg=="], + + "safe-buffer": ["safe-buffer@5.2.1", "", {}, "sha512-rp3So07KcdmmKbGvgaNxQSJr7bGVSVk5S9Eq1F+ppbRo70+YeaDxkw5Dd8NPN+GD6bjnYm2VuPuCXmpuYvmCXQ=="], + "typescript": ["typescript@6.0.2", "", { "bin": { "tsc": "bin/tsc", "tsserver": "bin/tsserver" } }, "sha512-bGdAIrZ0wiGDo5l8c++HWtbaNCWTS4UTv7RaTH/ThVIgjkveJt83m74bBHMJkuCbslY8ixgLBVZJIOiQlQTjfQ=="], "undici-types": ["undici-types@7.18.2", "", {}, "sha512-AsuCzffGHJybSaRrmr5eHr81mwJU3kjw6M+uprWvCXiNeN9SOGwQ3Jn8jb8m3Z6izVgknn1R0FTCEAP2QrLY/w=="], + "web-streams-polyfill": ["web-streams-polyfill@3.3.3", "", {}, "sha512-d2JWLCivmZYTSIoge9MsgFCZrt571BikcWGYkjC1khllbTeDlGqZ2D8vD8E/lJa8WGWbb7Plm8/XJYV7IJHZZw=="], + + "ws": ["ws@8.20.0", "", { "peerDependencies": { "bufferutil": "^4.0.1", "utf-8-validate": ">=5.0.2" }, "optionalPeers": ["bufferutil", "utf-8-validate"] }, "sha512-sAt8BhgNbzCtgGbt2OxmpuryO63ZoDk/sqaB/znQm94T4fCEsy/yV+7CdC1kJhOU9lboAEU7R3kquuycDoibVA=="], + "zod": ["zod@4.3.6", "", {}, "sha512-rftlrkhHZOcjDwkGlnUtZZkvaPHCsDATp4pGpuOOMDaTdDDXF91wuVDJoWoPsKX/3YPQ5fHuF3STjcYyKr+Qhg=="], } } diff --git a/claude-harness/evaluator.ts b/claude-harness/evaluator.ts 
index 77730b2..ad36293 100644 --- a/claude-harness/evaluator.ts +++ b/claude-harness/evaluator.ts @@ -93,7 +93,7 @@ function parseEvalResult( for (const candidate of candidates) { try { const parsed = JSON.parse(candidate) as EvalResult; - if (parsed.feedback && Array.isArray(parsed.feedback)) { + if (parsed.feedback && Array.isArray(parsed.feedback) && parsed.feedback.length > 0) { // Recalculate passed based on threshold parsed.passed = parsed.feedback.every((f) => f.score >= passThreshold); return parsed; diff --git a/claude-harness/harness.ts b/claude-harness/harness.ts index d947913..dac4093 100644 --- a/claude-harness/harness.ts +++ b/claude-harness/harness.ts @@ -10,7 +10,6 @@ import { writeSpec, readSpec, writeContract, - readContract, writeFeedback, writeProgress, } from "../shared/files.ts"; @@ -87,7 +86,22 @@ export async function runHarness(config: HarnessConfig): Promise await writeProgress(config.workDir, progress); log("HARNESS", "Negotiating sprint contract..."); - const contract = await negotiateContract(config.workDir, spec, sprint); + let contract: SprintContract; + let negotiationAttempts = 0; + const maxNegotiationAttempts = 2; + while (true) { + try { + contract = await negotiateContract(config.workDir, spec, sprint); + break; + } catch (e) { + negotiationAttempts++; + if (negotiationAttempts >= maxNegotiationAttempts) { + logError("HARNESS", `Contract negotiation failed after ${negotiationAttempts} attempts: ${e}`); + throw e; + } + log("HARNESS", `Contract negotiation produced invalid output, retrying (${negotiationAttempts}/${maxNegotiationAttempts})...`); + } + } await writeContract(config.workDir, contract); log("HARNESS", `Contract agreed: ${contract.criteria.length} criteria for ${contract.features.length} features`); @@ -121,6 +135,28 @@ export async function runHarness(config: HarnessConfig): Promise if (retry < config.maxRetriesPerSprint) { log("HARNESS", `Sprint ${sprint} failed attempt ${attempts}, retrying...`); + + // Check if we 
should renegotiate criteria + if (retry >= 1 && lastEval && lastEval.feedback.length > 0) { + const avgScore = lastEval.feedback.reduce((sum, f) => sum + f.score, 0) / lastEval.feedback.length; + const allFailing = lastEval.feedback.every(f => f.score < (contract.criteria.find(c => c.name === f.criterion)?.threshold ?? 7)); + + // Renegotiate if average score is very low or all criteria are failing + if (allFailing || avgScore < 4) { + if (allFailing) { + log("HARNESS", `All criteria failing (avg score: ${avgScore.toFixed(1)}), renegotiating contract...`); + } else { + log("HARNESS", `Low average score (${avgScore.toFixed(1)}), renegotiating contract...`); + } + try { + contract = await negotiateContract(config.workDir, spec, sprint); + await writeContract(config.workDir, contract); + log("HARNESS", `Renegotiated contract: ${contract.criteria.length} criteria for ${contract.features.length} features`); + } catch (e) { + logError("HARNESS", `Renegotiation failed, continuing with current contract: ${e}`); + } + } + } } else { logError("HARNESS", `Sprint ${sprint} FAILED after ${attempts} attempts`); } @@ -161,60 +197,97 @@ async function negotiateContract( spec: string, sprintNumber: number, ): Promise { - // Generator proposes contract - const proposalPrompt = `## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\nPropose a sprint contract for this sprint.`; - - const proposalOptions: Options = { - cwd: workDir, - systemPrompt: CONTRACT_NEGOTIATION_GENERATOR_PROMPT, - permissionMode: "bypassPermissions", - allowDangerouslySkipPermissions: true, - tools: ["Read"], - model: CLAUDE_MODEL, - maxTurns: 10, - persistSession: false, - }; - + const maxRounds = 3; + let round = 0; let proposalText = ""; - for await (const msg of query({ prompt: proposalPrompt, options: proposalOptions })) { - if (msg.type === "assistant") { - const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; - for (const block of message.message.content) { 
- if (block.type === "text" && block.text) { - proposalText += block.text; + let reviewText = ""; + let approved = false; + + while (round < maxRounds && !approved) { + round++; + log("HARNESS", `Contract negotiation round ${round}/${maxRounds}`); + + // Generator proposes or counter-proposes + let generatorPrompt: string; + if (round === 1) { + // First round: initial proposal + generatorPrompt = `## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\nPropose a sprint contract for this sprint.`; + } else { + // Subsequent rounds: counter-propose based on evaluator feedback + generatorPrompt = `## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\n## Evaluator Feedback\n\nThe evaluator reviewed the contract and provided this feedback:\n\n${reviewText}\n\nPlease revise the contract based on this feedback. If the evaluator approved, output "APPROVED". Otherwise, output a revised contract.`; + } + + const proposalOptions: Options = { + cwd: workDir, + systemPrompt: CONTRACT_NEGOTIATION_GENERATOR_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read"], + model: CLAUDE_MODEL, + maxTurns: 10, + persistSession: false, + }; + + proposalText = ""; + for await (const msg of query({ prompt: generatorPrompt, options: proposalOptions })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + proposalText += block.text; + } } } } - } - // Evaluator reviews contract - const reviewPrompt = `## Proposed Sprint Contract\n\n${proposalText}\n\nReview this contract.`; - - const reviewOptions: Options = { - cwd: workDir, - systemPrompt: CONTRACT_NEGOTIATION_EVALUATOR_PROMPT, - permissionMode: "bypassPermissions", - allowDangerouslySkipPermissions: true, - tools: ["Read"], - model: CLAUDE_MODEL, - maxTurns: 10, - persistSession: false, - }; + // 
Check if generator approved (only in subsequent rounds) + if (round > 1 && proposalText.trim() === "APPROVED") { + approved = true; + log("HARNESS", "Generator accepted evaluator revisions, contract finalized"); + break; + } - let reviewText = ""; - for await (const msg of query({ prompt: reviewPrompt, options: reviewOptions })) { - if (msg.type === "assistant") { - const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; - for (const block of message.message.content) { - if (block.type === "text" && block.text) { - reviewText += block.text; + // Evaluator reviews contract + const reviewPrompt = `## Proposed Sprint Contract\n\n${proposalText}\n\nReview this contract.`; + + const reviewOptions: Options = { + cwd: workDir, + systemPrompt: CONTRACT_NEGOTIATION_EVALUATOR_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read"], + model: CLAUDE_MODEL, + maxTurns: 10, + persistSession: false, + }; + + reviewText = ""; + for await (const msg of query({ prompt: reviewPrompt, options: reviewOptions })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + reviewText += block.text; + } } } } + + // Check if evaluator approved + if (reviewText.trim().toUpperCase().startsWith("APPROVED")) { + approved = true; + log("HARNESS", `Contract approved by evaluator in round ${round}`); + break; + } + + // If not approved and we have reached max rounds, take evaluator version as final + if (round >= maxRounds) { + log("HARNESS", `Max negotiation rounds (${maxRounds}) reached, using evaluator version`); + } } - // Parse the final contract (either the proposal if approved, or the revised version) - const contractSource = reviewText.trim() === "APPROVED" ? 
proposalText : reviewText; + // Parse the final contract (either proposal if approved, or evaluator version) + const contractSource = reviewText.trim().toUpperCase().startsWith("APPROVED") ? proposalText : reviewText; return parseContract(contractSource, sprintNumber); } @@ -241,28 +314,5 @@ function parseContract(text: string, sprintNumber: number): SprintContract { } } - { - logError("HARNESS", "Failed to parse contract JSON, creating default"); - return { - sprintNumber, - features: [`Sprint ${sprintNumber} features`], - criteria: [ - { - name: "basic_functionality", - description: "Core features for this sprint are implemented and working", - threshold: 7, - }, - { - name: "code_quality", - description: "Code is clean, well-structured, and follows best practices", - threshold: 7, - }, - { - name: "error_handling", - description: "Errors are handled gracefully with appropriate user feedback", - threshold: 7, - }, - ], - }; - } + throw new Error(`Contract negotiation produced unparseable output. 
Raw text: ${text.slice(0, 200)}`); } diff --git a/codex-harness/evaluator.ts b/codex-harness/evaluator.ts index 4ae40fb..a935f8d 100644 --- a/codex-harness/evaluator.ts +++ b/codex-harness/evaluator.ts @@ -79,7 +79,7 @@ function parseEvalResult( for (const candidate of candidates) { try { const parsed = JSON.parse(candidate) as EvalResult; - if (parsed.feedback && Array.isArray(parsed.feedback)) { + if (parsed.feedback && Array.isArray(parsed.feedback) && parsed.feedback.length > 0) { parsed.passed = parsed.feedback.every((f) => f.score >= passThreshold); return parsed; } diff --git a/codex-harness/harness.ts b/codex-harness/harness.ts index 105c64d..53ef923 100644 --- a/codex-harness/harness.ts +++ b/codex-harness/harness.ts @@ -86,7 +86,22 @@ export async function runHarness(config: HarnessConfig): Promise await writeProgress(config.workDir, progress); log("HARNESS", "Negotiating sprint contract..."); - const contract = await negotiateContract(config.workDir, spec, sprint); + let contract: SprintContract; + let negotiationAttempts = 0; + const maxNegotiationAttempts = 2; + while (true) { + try { + contract = await negotiateContract(config.workDir, spec, sprint); + break; + } catch (e) { + negotiationAttempts++; + if (negotiationAttempts >= maxNegotiationAttempts) { + logError("HARNESS", `Contract negotiation failed after ${negotiationAttempts} attempts: ${e}`); + throw e; + } + log("HARNESS", `Contract negotiation produced invalid output, retrying (${negotiationAttempts}/${maxNegotiationAttempts})...`); + } + } await writeContract(config.workDir, contract); log("HARNESS", `Contract agreed: ${contract.criteria.length} criteria for ${contract.features.length} features`); @@ -120,6 +135,28 @@ export async function runHarness(config: HarnessConfig): Promise if (retry < config.maxRetriesPerSprint) { log("HARNESS", `Sprint ${sprint} failed attempt ${attempts}, retrying...`); + + // Check if we should renegotiate criteria + if (retry >= 1 && lastEval && 
lastEval.feedback.length > 0) { + const avgScore = lastEval.feedback.reduce((sum, f) => sum + f.score, 0) / lastEval.feedback.length; + const allFailing = lastEval.feedback.every(f => f.score < (contract.criteria.find(c => c.name === f.criterion)?.threshold ?? 7)); + + // Renegotiate if average score is very low or all criteria are failing + if (allFailing || avgScore < 4) { + if (allFailing) { + log("HARNESS", `All criteria failing (avg score: ${avgScore.toFixed(1)}), renegotiating contract...`); + } else { + log("HARNESS", `Low average score (${avgScore.toFixed(1)}), renegotiating contract...`); + } + try { + contract = await negotiateContract(config.workDir, spec, sprint); + await writeContract(config.workDir, contract); + log("HARNESS", `Renegotiated contract: ${contract.criteria.length} criteria for ${contract.features.length} features`); + } catch (e) { + logError("HARNESS", `Renegotiation failed, continuing with current contract: ${e}`); + } + } + } } else { logError("HARNESS", `Sprint ${sprint} FAILED after ${attempts} attempts`); } @@ -159,37 +196,74 @@ async function negotiateContract( spec: string, sprintNumber: number, ): Promise { - const codex = new Codex(); + const maxRounds = 3; + let round = 0; + let proposalText = ""; + let reviewText = ""; + let approved = false; + + while (round < maxRounds && !approved) { + round++; + log("HARNESS", `Contract negotiation round ${round}/${maxRounds}`); + + const codex = new Codex(); + + // Generator proposes or counter-proposes + let generatorPrompt: string; + if (round === 1) { + // First round: initial proposal + generatorPrompt = `${CONTRACT_NEGOTIATION_GENERATOR_PROMPT}\n\n---\n\n## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\nPropose a sprint contract for this sprint.`; + } else { + // Subsequent rounds: counter-propose based on evaluator feedback + generatorPrompt = `${CONTRACT_NEGOTIATION_GENERATOR_PROMPT}\n\n---\n\n## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\n## 
Evaluator Feedback\n\nThe evaluator reviewed the contract and provided this feedback:\n\n${reviewText}\n\nPlease revise the contract based on this feedback. If the evaluator approved, output "APPROVED". Otherwise, output a revised contract.`; + } - // Generator proposes contract - const proposalPrompt = `${CONTRACT_NEGOTIATION_GENERATOR_PROMPT}\n\n---\n\n## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\nPropose a sprint contract for this sprint.`; + const proposalThread = codex.startThread({ + workingDirectory: workDir, + sandboxMode: "danger-full-access", + networkAccessEnabled: CODEX_NETWORK_ACCESS, + approvalPolicy: "never", + model: CODEX_MODEL, + }); - const proposalThread = codex.startThread({ - workingDirectory: workDir, - sandboxMode: "danger-full-access", - networkAccessEnabled: CODEX_NETWORK_ACCESS, - approvalPolicy: "never", - model: CODEX_MODEL, - }); + const proposalTurn = await proposalThread.run(generatorPrompt); + proposalText = proposalTurn.finalResponse ?? ""; - const proposalTurn = await proposalThread.run(proposalPrompt); - const proposalText = proposalTurn.finalResponse ?? 
""; + // Check if generator approved (only in subsequent rounds) + if (round > 1 && proposalText.trim() === "APPROVED") { + approved = true; + log("HARNESS", "Generator accepted evaluator revisions, contract finalized"); + break; + } - // Evaluator reviews contract - const reviewPrompt = `${CONTRACT_NEGOTIATION_EVALUATOR_PROMPT}\n\n---\n\n## Proposed Sprint Contract\n\n${proposalText}\n\nReview this contract.`; + // Evaluator reviews contract + const reviewPrompt = `${CONTRACT_NEGOTIATION_EVALUATOR_PROMPT}\n\n---\n\n## Proposed Sprint Contract\n\n${proposalText}\n\nReview this contract.`; - const reviewThread = codex.startThread({ - workingDirectory: workDir, - sandboxMode: "danger-full-access", - networkAccessEnabled: CODEX_NETWORK_ACCESS, - approvalPolicy: "never", - model: CODEX_MODEL, - }); + const reviewThread = codex.startThread({ + workingDirectory: workDir, + sandboxMode: "danger-full-access", + networkAccessEnabled: CODEX_NETWORK_ACCESS, + approvalPolicy: "never", + model: CODEX_MODEL, + }); + + const reviewTurn = await reviewThread.run(reviewPrompt); + reviewText = reviewTurn.finalResponse ?? ""; + + // Check if evaluator approved + if (reviewText.trim().toUpperCase().startsWith("APPROVED")) { + approved = true; + log("HARNESS", `Contract approved by evaluator in round ${round}`); + break; + } - const reviewTurn = await reviewThread.run(reviewPrompt); - const reviewText = reviewTurn.finalResponse ?? ""; + // If not approved and we have reached max rounds, take evaluator version as final + if (round >= maxRounds) { + log("HARNESS", `Max negotiation rounds (${maxRounds}) reached, using evaluator version`); + } + } - const contractSource = reviewText.trim() === "APPROVED" ? proposalText : reviewText; + const contractSource = reviewText.trim().toUpperCase().startsWith("APPROVED") ? 
proposalText : reviewText; return parseContract(contractSource, sprintNumber); } @@ -216,28 +290,5 @@ function parseContract(text: string, sprintNumber: number): SprintContract { } } - { - logError("HARNESS", "Failed to parse contract JSON, creating default"); - return { - sprintNumber, - features: [`Sprint ${sprintNumber} features`], - criteria: [ - { - name: "basic_functionality", - description: "Core features for this sprint are implemented and working", - threshold: 7, - }, - { - name: "code_quality", - description: "Code is clean, well-structured, and follows best practices", - threshold: 7, - }, - { - name: "error_handling", - description: "Errors are handled gracefully with appropriate user feedback", - threshold: 7, - }, - ], - }; - } + throw new Error(`Contract negotiation produced unparseable output. Raw text: ${text.slice(0, 200)}`); } diff --git a/examples/gemini-run-excerpt.md b/examples/gemini-run-excerpt.md new file mode 100644 index 0000000..a1eadda --- /dev/null +++ b/examples/gemini-run-excerpt.md @@ -0,0 +1,150 @@ +# Adversarial Dev Harness — Conversation Log +Session: 2026-04-02T14-19-12 +Duration: 50.7 minutes +Entries: 675 + +--- + +### ⚙️ HARNESS (system) — 💬 System +*14:19:12* + +Gemini Harness started — Generator: Claude (claude-opus-4-6), Evaluator: Gemini (gemini-3.1-pro-preview) + +--- + +### 📋 PLANNER (claude-opus-4-6) — **→ Prompt** +*14:19:12* + +
+Show full content (3748 chars) + +IMPORTANT: Your working directory is /home/player3vsgpt/Desktop/Projects/adversarial-dev-hardening/workspace/gemini. All files you create (including spec.md) MUST be written inside this directory. Do NOT write files anywhere else. + +## Task: Wire SSE Streaming into brane-code's Codex Proxy + +brane-code is a fork of Claude Code that routes API calls to OpenAI's Codex/GPT-5.4 backend. The REPL works and first message round-trip works, but second+ messages hang because the response is buffered instead of streamed. + +### The Bug + +`src/providers/openai/index.ts` function `sendToCodex()` at line 79 does: +```typescript +const body = await resp.text() // BUFFERS ENTIRE SSE RESPONSE +``` + +This needs to become a streaming SSE parser that reads `resp.body` as a ReadableStream and yields Anthropic-shaped stream events (`message_start`, `content_block_delta`, `message_stop`) as SSE chunks arrive from the Codex API. + +### What the Codex API Returns (SSE format) + +POST to `https://chatgpt.com/backend-api/codex/responses` with `stream: true` returns SSE lines: +- `data: {"type": "response.output_text.delta", "delta": "chunk of text"}` +- `data: {"type": "response.completed", "response": {"output_text": "full text"}}` +- `data: [DONE]` + +### What queryModel Expects (Anthropic stream events) + +`src/services/api/claude.ts` at line 1991 does `for await (const part of stream)` where each `part` is a `BetaRawMessageStreamEvent`: +- `{type: 'message_start', message: {id, model, role, usage}}` +- `{type: 'content_block_start', index: 0, content_block: {type: 'text', text: ''}}` +- `{type: 'content_block_delta', index: 0, delta: {type: 'text_delta', text: 'chunk'}}` +- `{type: 'content_block_stop', index: 0}` +- `{type: 'message_delta', delta: {stop_reason: 'end_turn'}, usage: {output_tokens: N}}` +- `{type: 'message_stop'}` + +### Implementation Plan + +1. 
Create `src/providers/openai/stream.ts` (~120 lines) — SSE parser that yields Anthropic-shaped events +2. Wire into `src/services/api/claude.ts` at the OpenAI intercept point (~line 1820) — replace the `sendToCodex()` call with `streamFromCodex()` that yields into the existing stream processing loop +3. Remove the dead code: the `sendToCodex()` call in claude.ts that passes phantom `apiKey`/`accountId` params +4. Remove the redundant JS Proxy intercept in `src/services/api/client.ts` if the queryModel-level intercept handles everything +5. Test: verify multi-turn conversation works, tokens stream in real-time, second message doesn't hang + +### Reference: Nightfox streaming (READ ONLY — do not modify) + +Nightfox at `/home/player3vsgpt/Desktop/Projects/nightfox/src/providers/openai-provider.ts` lines 330-390 already handles OpenAI streaming: +- Uses `@openai/agents` SDK `run()` with `stream: true` +- Iterates with `for await (const event of streamed)` +- Event type `raw_model_stream_event` with `event.data.type === 'output_text_delta'` and `event.data.delta` for text chunks +- Lines 687-714: `handleStreamEvent()` extracts text deltas + +### Key Files in This Repo + +- `src/providers/openai/index.ts` — current sendToCodex() that needs streaming +- `src/providers/openai/auth.ts` — OAuth token management (getValidToken) +- `src/services/api/claude.ts` — queryModel() stream consumer (~line 1820-2060) +- `src/services/api/client.ts` — JS Proxy intercept (~line 101-151) +- `src/main.tsx` — BRANE_OPENAI_MODE fast-path (~line 2250) + +### Tech Stack + +- TypeScript, Bun runtime +- No new dependencies — use native fetch + ReadableStream +- Keep Anthropic SDK types as internal wire format + +### What NOT To Do + +- Do NOT modify Nightfox files — read-only reference +- Do NOT add `@openai/agents` as a dependency +- Do NOT change the Anthropic SDK type system +- Do NOT touch auth logic — OAuth refresh already works + +
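The event mapping described in the task above can be sketched roughly as follows. This is an illustration and not part of the logged session: `translateCodexSse`, `StreamEvent`, and the placeholder `id` and `usage` values are assumptions, and the sketch works on already-decoded text chunks instead of reading `resp.body` directly.

```typescript
// Illustrative sketch only. Function and type names are hypothetical; the event
// shapes are taken from the task description, not from the real brane-code source.
type StreamEvent = { type: string; [key: string]: unknown };

function* translateCodexSse(chunks: Iterable<string>, model: string): Generator<StreamEvent> {
  // Emit the opening events first so a consumer always sees message_start before any text.
  yield {
    type: "message_start",
    // id and usage values are placeholders (assumptions), not real API data.
    message: { id: "msg_sketch", model, role: "assistant", usage: { input_tokens: 0, output_tokens: 0 } },
  };
  yield { type: "content_block_start", index: 0, content_block: { type: "text", text: "" } };

  let buffer = "";
  outer: for (const chunk of chunks) {
    buffer += chunk;
    const lines = buffer.split(/\r?\n/);
    buffer = lines.pop() ?? ""; // the last piece may be a partial line; keep it for the next chunk

    for (const line of lines) {
      if (!line.startsWith("data:")) continue; // skip blank separators and non-data fields
      const payload = line.slice("data:".length).trim();
      if (payload === "[DONE]") break outer;

      let parsed: unknown;
      try {
        parsed = JSON.parse(payload);
      } catch {
        continue; // ignore payloads that are not valid JSON
      }
      if (typeof parsed !== "object" || parsed === null) continue;

      const event = parsed as { type?: unknown; delta?: unknown };
      if (event.type === "response.output_text.delta" && typeof event.delta === "string") {
        yield { type: "content_block_delta", index: 0, delta: { type: "text_delta", text: event.delta } };
      } else if (event.type === "response.completed") {
        break outer;
      }
    }
  }

  // Close the message. This sketch also closes when the input ends without a terminal
  // marker; the SSE lines described above carry no token count, so output_tokens stays 0.
  yield { type: "content_block_stop", index: 0 };
  yield { type: "message_delta", delta: { stop_reason: "end_turn" }, usage: { output_tokens: 0 } };
  yield { type: "message_stop" };
}
```

In a real proxy the chunks would presumably come from `resp.body` decoded incrementally (for example with `TextDecoder` in streaming mode) inside an async generator. The buffering step is the part that matters: an SSE line can be split across network chunks, so only complete lines are parsed and the remainder is carried forward.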
+ +--- + +### 📋 PLANNER (claude-opus-4-6) — 🔧 Tool Call +*14:19:19* +> Tool: `Read` + +
+Read input + +``` +{"file_path":"/home/player3vsgpt/Desktop/Projects/adversarial-dev-hardening/workspace/gemini/src/providers/openai/index.ts"} +``` +
+ +--- + +### 📋 PLANNER (claude-opus-4-6) — 🔧 Tool Call +*14:19:20* +> Tool: `Read` + +
+Read input + +``` +{"file_path":"/home/player3vsgpt/Desktop/Projects/adversarial-dev-hardening/workspace/gemini/src/services/api/claude.ts","offset":1800,"limit":280} +``` +
+ +--- + +### 📋 PLANNER (claude-opus-4-6) — 🔧 Tool Call +*14:19:21* +> Tool: `Read` + +
+Read input + +``` +{"file_path":"/home/player3vsgpt/Desktop/Projects/adversarial-dev-hardening/workspace/gemini/src/services/api/client.ts","offset":90,"limit":80} +``` +
+ +--- + +### 📋 PLANNER (claude-opus-4-6) — 🔧 Tool Call +*14:19:22* +> Tool: `Read` + +
+Read input + +``` +{"file_path":"/home/player3vsgpt/Desktop/Projects/nightfox/src/providers/openai-provider.ts","offset":320,"limit":80} +``` +
+ +--- + +### 📋 PLANNER (claude-opus-4-6) — 🔧 Tool Call diff --git a/gemini-harness/evaluator.ts b/gemini-harness/evaluator.ts new file mode 100644 index 0000000..8674652 --- /dev/null +++ b/gemini-harness/evaluator.ts @@ -0,0 +1,258 @@ +import { GoogleGenAI, Type } from "@google/genai"; +import type { Tool } from "@google/genai"; +import { readFile, readdir, realpath } from "fs/promises"; +import { execFileSync } from "child_process"; +import { resolve } from "path"; +import { EVALUATOR_SYSTEM_PROMPT } from "../shared/prompts.ts"; +import { GEMINI_MODEL, GEMINI_API_KEY } from "../shared/config.ts"; +import { log, logError } from "../shared/logger.ts"; +import type { SprintContract, EvalResult } from "../shared/types.ts"; +import type { ConversationLogger } from "../shared/conversation-logger.ts"; + +export async function runEvaluator( + workDir: string, + contract: SprintContract, + passThreshold: number, + clog: ConversationLogger, +): Promise<EvalResult> { + const sprint = contract.sprintNumber; + log("EVALUATOR", `[Gemini/${GEMINI_MODEL}] Evaluating sprint ${sprint} against ${contract.criteria.length} criteria`); + + const taskPrompt = `## Sprint Contract to Evaluate Against + +${JSON.stringify(contract, null, 2)} + +## Pass Threshold + +Each criterion must score at least ${passThreshold}/10 to pass. + +## Instructions + +Examine the application in the \`app/\` directory. Read the code, run it if possible, and score each criterion. 
Output ONLY the JSON evaluation object.`; + + clog.prompt("EVALUATOR", GEMINI_MODEL, taskPrompt, { sprint }); + + const startMs = Date.now(); + + const ai = new GoogleGenAI({ apiKey: GEMINI_API_KEY }); + + const tools: Tool[] = [ + { + functionDeclarations: [ + { + name: "readFile", + description: "Read a file from the workspace", + parameters: { + type: Type.OBJECT, + properties: { + path: { type: Type.STRING, description: "Path to the file to read" }, + }, + required: ["path"], + }, + }, + { + name: "runCommand", + description: "Run a shell command in the workspace", + parameters: { + type: Type.OBJECT, + properties: { + command: { type: Type.STRING, description: "Shell command to execute" }, + }, + required: ["command"], + }, + }, + { + name: "listFiles", + description: "List files in a directory", + parameters: { + type: Type.OBJECT, + properties: { + path: { type: Type.STRING, description: "Directory path to list" }, + }, + required: ["path"], + }, + }, + ], + }, + ]; + + const chat = ai.chats.create({ + model: GEMINI_MODEL, + config: { + systemInstruction: EVALUATOR_SYSTEM_PROMPT, + tools, + }, + }); + + let response = await chat.sendMessage({ message: taskPrompt }); + + // Handle tool calls in a loop + while (response.functionCalls && response.functionCalls.length > 0) { + const toolResults: Array<{ functionResponse: { id: string; name: string; response: { result: string } } }> = []; + + for (const call of response.functionCalls) { + let result: string; + const callName = call.name ?? ""; + const callArgs = (call.args ?? {}) as Record<string, string>; + + log("EVALUATOR", ` Tool: ${callName}`); + clog.toolCall("EVALUATOR", GEMINI_MODEL, callName, JSON.stringify(callArgs).slice(0, 500)); + + if (callName === "readFile") { + const filePath = resolve(workDir, callArgs.path ?? 
""); + try { + const real = await realpath(filePath); + if (!real.startsWith(await realpath(workDir))) { + result = "Error: path outside workspace"; + } else { + result = await readFile(real, "utf-8"); + } + } catch (err) { + result = `Error reading file: ${err instanceof Error ? err.message : String(err)}`; + } + } else if (callName === "runCommand") { + // Read-only commands safe for evaluation. No code execution (node/bun/npm). + const ALLOWED_CMDS = ["ls", "cat", "grep", "head", "tail", "wc", "diff", "find", "tsc"]; + const GIT_READ_ONLY = ["log", "status", "diff", "show", "ls-files", "rev-parse"]; + const command = callArgs.command ?? ""; + const parts = command.trim().split(/\s+/); + const bin = parts[0] ?? ""; + const args = parts.slice(1); + + // Special handling for git: only allow read-only subcommands + if (bin === "git") { + const subCmd = args[0] ?? ""; + if (!GIT_READ_ONLY.includes(subCmd)) { + result = `Error: git subcommand '${subCmd}' not allowed. Allowed: ${GIT_READ_ONLY.join(", ")}`; + } else { + try { + result = execFileSync("git", args, { cwd: workDir, timeout: 30000 }).toString(); + } catch (err) { + result = `Command error: ${err instanceof Error ? err.message : String(err)}`; + } + } + } else if (!ALLOWED_CMDS.includes(bin)) { + result = `Error: command '${bin}' not allowed. Allowed: git, ${ALLOWED_CMDS.join(", ")}`; + } else { + // Block dangerous flags: find -exec/-execdir, and absolute paths outside workspace + const BLOCKED_FLAGS = ["-exec", "-execdir", "-delete", "-fls", "-fprint"]; + const resolvedWorkDir = resolve(workDir); + const hasDangerousFlag = args.some(a => BLOCKED_FLAGS.includes(a)); + const hasEscape = args.some(a => a.startsWith("/") && !a.startsWith(resolvedWorkDir)); + if (hasDangerousFlag) { + result = "Error: dangerous flag detected (e.g. -exec). 
Not allowed in sandbox."; + } else if (hasEscape) { + result = "Error: command arguments reference paths outside workspace"; + } else { + try { + result = execFileSync(bin, args, { cwd: workDir, timeout: 30000 }).toString(); + } catch (err) { + result = `Command error: ${err instanceof Error ? err.message : String(err)}`; + } + } + } + } else if (callName === "listFiles") { + const dirPath = resolve(workDir, callArgs.path ?? "."); + try { + const real = await realpath(dirPath); + if (!real.startsWith(await realpath(workDir))) { + result = "Error: path outside workspace"; + } else { + const entries = await readdir(real, { recursive: true }); + result = entries.slice(0, 50).join("\n"); + } + } catch (err) { + result = `Error listing files: ${err instanceof Error ? err.message : String(err)}`; + } + } else { + result = `Unknown tool: ${callName}`; + } + + clog.toolResult("EVALUATOR", GEMINI_MODEL, callName, result.slice(0, 500)); + + toolResults.push({ + functionResponse: { + id: call.id ?? callName, + name: callName, + response: { result }, + }, + }); + } + + // Send tool results back + response = await chat.sendMessage({ + message: toolResults.map((r) => ({ functionResponse: r.functionResponse })), + }); + } + + // Extract final text response + const evaluationText = response.text ?? ""; + const durationMs = Date.now() - startMs; + + log("EVALUATOR", `Evaluation complete for sprint ${sprint}`); + + const evalResult = parseEvalResult(evaluationText, contract, passThreshold); + + // Build scores map for logging + const scoresMap: Record<string, number> = {}; + for (const f of evalResult.feedback) { + scoresMap[f.criterion] = f.score; + } + + clog.response("EVALUATOR", GEMINI_MODEL, evaluationText, { + sprint, + duration_ms: durationMs, + scores: scoresMap, + }); + + const passedCount = evalResult.feedback.filter((f) => f.score >= passThreshold).length; + const totalCount = evalResult.feedback.length; + const verdict = evalResult.passed ? 
"PASSED" : "FAILED"; + log("EVALUATOR", `Sprint ${sprint}: ${verdict} (${passedCount}/${totalCount} criteria passed)`); + + for (const item of evalResult.feedback) { + const status = item.score >= passThreshold ? "\x1b[32mPASS\x1b[0m" : "\x1b[31mFAIL\x1b[0m"; + log("EVALUATOR", ` [${status}] ${item.criterion}: ${item.score}/10 - ${item.details.slice(0, 100)}`); + } + + return evalResult; +} + +function parseEvalResult( + response: string, + contract: SprintContract, + passThreshold: number, +): EvalResult { + const candidates: string[] = []; + const codeBlocks = [...response.matchAll(/```(?:json)?\s*([\s\S]*?)```/g)]; + for (const match of codeBlocks.reverse()) { + if (match[1]) candidates.push(match[1].trim()); + } + const braceMatch = response.match(/\{[\s\S]*"passed"[\s\S]*"feedback"[\s\S]*\}/); + if (braceMatch) candidates.push(braceMatch[0]); + candidates.push(response.trim()); + + for (const candidate of candidates) { + try { + const parsed = JSON.parse(candidate) as EvalResult; + if (parsed.feedback && Array.isArray(parsed.feedback) && parsed.feedback.length > 0) { + parsed.passed = parsed.feedback.every((f) => f.score >= passThreshold); + return parsed; + } + } catch { + // Try next candidate + } + } + + logError("EVALUATOR", "Failed to parse evaluation JSON from any extraction strategy"); + return { + passed: false, + scores: {}, + feedback: contract.criteria.map((c) => ({ + criterion: c.name, + score: 0, + details: "Evaluator failed to produce parseable output", + })), + overallSummary: "Evaluation parsing failed. 
Raw response: " + response.slice(0, 500), + }; +} diff --git a/gemini-harness/generator.ts b/gemini-harness/generator.ts new file mode 100644 index 0000000..5b1a5ba --- /dev/null +++ b/gemini-harness/generator.ts @@ -0,0 +1,74 @@ +import { query, type Options } from "@anthropic-ai/claude-agent-sdk"; +import { GENERATOR_SYSTEM_PROMPT } from "../shared/prompts.ts"; +import { CLAUDE_MODEL, CLAUDE_MAX_TURNS } from "../shared/config.ts"; +import { log } from "../shared/logger.ts"; +import type { SprintContract, EvalResult } from "../shared/types.ts"; +import type { ConversationLogger } from "../shared/conversation-logger.ts"; + +export async function runGenerator( + workDir: string, + spec: string, + contract: SprintContract, + clog: ConversationLogger, + previousFeedback?: EvalResult, +): Promise<{ response: string; sessionId?: string }> { + const sprint = contract.sprintNumber; + const attempt = previousFeedback ? "retry" : "initial"; + log("GENERATOR", `[Claude/${CLAUDE_MODEL}] Sprint ${sprint} (${attempt}) - Building: ${contract.features.join(", ")}`); + + let prompt = `IMPORTANT: Your working directory is ${workDir}. All code MUST be created inside ${workDir}/app/. Do NOT create files outside of ${workDir}.\n\n## Product Spec\n\n${spec}\n\n## Sprint Contract\n\n${JSON.stringify(contract, null, 2)}`; + + if (previousFeedback) { + prompt += `\n\n## Evaluation Feedback (MUST ADDRESS)\n\n${JSON.stringify(previousFeedback, null, 2)}`; + prompt += `\n\nThe previous attempt failed evaluation. Address every issue in the feedback above.`; + } else { + prompt += `\n\nImplement the features listed in this sprint contract. Work in the \`app/\` directory.`; + } + + clog.prompt("GENERATOR", CLAUDE_MODEL, prompt, { sprint, attempt: previousFeedback ? 
2 : 1 }); + + const options: Options = { + cwd: workDir, + systemPrompt: GENERATOR_SYSTEM_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read", "Write", "Edit", "Bash", "Glob", "Grep"], + model: CLAUDE_MODEL, + maxTurns: CLAUDE_MAX_TURNS, + persistSession: true, + }; + + let fullResponse = ""; + let sessionId: string | undefined; + const startMs = Date.now(); + + for await (const msg of query({ prompt, options })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string; name?: string; input?: any }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + fullResponse += block.text; + } else if (block.type === "tool_use" && block.name) { + log("GENERATOR", ` Tool: ${block.name}`); + clog.toolCall("GENERATOR", CLAUDE_MODEL, block.name, JSON.stringify(block.input ?? {}).slice(0, 500)); + } + } + } else if (msg.type === "result") { + const result = msg as { session_id?: string }; + sessionId = result.session_id; + log("GENERATOR", `Sprint ${sprint} build complete (session: ${sessionId?.slice(0, 8)}...)`); + } + } + + const durationMs = Date.now() - startMs; + clog.response("GENERATOR", CLAUDE_MODEL, fullResponse || "(tools only, no text output)", { + sprint, + duration_ms: durationMs, + }); + + if (!fullResponse) { + log("GENERATOR", `Sprint ${sprint} completed (agent used tools only, no text output)`); + } + + return { response: fullResponse, sessionId }; +} diff --git a/gemini-harness/harness.ts b/gemini-harness/harness.ts new file mode 100644 index 0000000..9bafb4b --- /dev/null +++ b/gemini-harness/harness.ts @@ -0,0 +1,327 @@ +import { query, type Options } from "@anthropic-ai/claude-agent-sdk"; +import { + CONTRACT_NEGOTIATION_GENERATOR_PROMPT, + CONTRACT_NEGOTIATION_EVALUATOR_PROMPT, +} from "../shared/prompts.ts"; +import { CLAUDE_MODEL, GEMINI_MODEL } from "../shared/config.ts"; +import { log, logError, 
logDivider } from "../shared/logger.ts"; +import { ConversationLogger } from "../shared/conversation-logger.ts"; +import { + initWorkspace, + writeSpec, + readSpec, + writeContract, + writeFeedback, + writeProgress, +} from "../shared/files.ts"; +import type { + HarnessConfig, + SprintContract, + EvalResult, + HarnessProgress, + SprintResult, + HarnessResult, +} from "../shared/types.ts"; +import { runPlanner } from "./planner.ts"; +import { runGenerator } from "./generator.ts"; +import { runEvaluator } from "./evaluator.ts"; + +export async function runHarness(config: HarnessConfig & { logDir?: string }): Promise<HarnessResult> { + const startTime = Date.now(); + const results: SprintResult[] = []; + const logDir = config.logDir || "./logs"; + const clog = new ConversationLogger(logDir); + clog.system(`Gemini Harness started — Generator: Claude (${CLAUDE_MODEL}), Evaluator: Gemini (${GEMINI_MODEL})`); + log("HARNESS", "ADVERSARIAL DEV - Gemini Harness (Claude Generator + Gemini Evaluator)"); + log("HARNESS", `Work directory: ${config.workDir}`); + log("HARNESS", `Max sprints: ${config.maxSprints} | Max retries: ${config.maxRetriesPerSprint} | Threshold: ${config.passThreshold}/10`); + + await initWorkspace(config.workDir); + + // Phase 1: Planning (Claude) + logDivider(); + log("HARNESS", "PHASE 1: PLANNING (Claude Opus)"); + logDivider(); + + const progress: HarnessProgress = { + status: "planning", + currentSprint: 0, + totalSprints: 0, + completedSprints: 0, + retryCount: 0, + }; + await writeProgress(config.workDir, progress); + + const plannerResponse = await runPlanner(config.userPrompt, config.workDir, clog); + + let spec: string; + try { + spec = await readSpec(config.workDir); + } catch { + log("HARNESS", "Planner returned spec as text, writing to spec.md"); + await writeSpec(config.workDir, plannerResponse); + spec = plannerResponse; + } + + // Parse sprint count from spec + const sprintNumbers = Array.from(spec.matchAll(/sprint\s+(\d+)/gi)) + .map((m) => 
parseInt(m[1]!, 10)) + .filter((n) => n > 0 && n <= config.maxSprints); + const totalSprints = sprintNumbers.length > 0 + ? Math.min(Math.max(...sprintNumbers), config.maxSprints) + : 3; + + progress.totalSprints = totalSprints; + log("HARNESS", `Planner produced ${totalSprints} sprints`); + + // Phase 2-4: Sprint Loop + for (let sprint = 1; sprint <= totalSprints; sprint++) { + logDivider(); + log("HARNESS", `SPRINT ${sprint}/${totalSprints}`); + logDivider(); + + // Phase 2: Contract Negotiation (Claude proposes, Claude reviews — same model for contract alignment) + progress.status = "negotiating"; + progress.currentSprint = sprint; + progress.retryCount = 0; + await writeProgress(config.workDir, progress); + + log("HARNESS", "Negotiating sprint contract..."); + let contract: SprintContract; + let negotiationAttempts = 0; + const maxNegotiationAttempts = 2; + while (true) { + try { + contract = await negotiateContract(config.workDir, spec, sprint, clog); + break; + } catch (e) { + negotiationAttempts++; + if (negotiationAttempts >= maxNegotiationAttempts) { + logError("HARNESS", `Contract negotiation failed after ${negotiationAttempts} attempts: ${e}`); + throw e; + } + log("HARNESS", `Contract negotiation produced invalid output, retrying (${negotiationAttempts}/${maxNegotiationAttempts})...`); + } + } + await writeContract(config.workDir, contract); + log("HARNESS", `Contract agreed: ${contract.criteria.length} criteria for ${contract.features.length} features`); + + // Phase 3-4: Build (Claude) -> Evaluate (Gemini) Loop + let passed = false; + let lastEval: EvalResult | undefined; + let attempts = 0; + + for (let retry = 0; retry <= config.maxRetriesPerSprint; retry++) { + attempts = retry + 1; + + // Build (Claude Opus) + log("HARNESS", `--- BUILD ATTEMPT ${attempts} (Claude Opus) ---`); + progress.status = "building"; + progress.retryCount = retry; + await writeProgress(config.workDir, progress); + + await runGenerator(config.workDir, spec, contract, clog, 
lastEval); + + // Evaluate (Gemini 3.1 Pro) + log("HARNESS", `--- EVALUATION (Gemini Evaluator) ---`); + progress.status = "evaluating"; + await writeProgress(config.workDir, progress); + + lastEval = await runEvaluator(config.workDir, contract, config.passThreshold, clog); + await writeFeedback(config.workDir, sprint, retry, lastEval); + + if (lastEval.passed) { + passed = true; + log("HARNESS", `Sprint ${sprint} PASSED on attempt ${attempts}`); + break; + } + + if (retry < config.maxRetriesPerSprint) { + log("HARNESS", `Sprint ${sprint} failed attempt ${attempts}, retrying...`); + + // Check if we should renegotiate criteria + if (retry >= 1 && lastEval && lastEval.feedback.length > 0) { + const avgScore = lastEval.feedback.reduce((sum, f) => sum + f.score, 0) / lastEval.feedback.length; + const allFailing = lastEval.feedback.every(f => f.score < (contract.criteria.find(c => c.name === f.criterion)?.threshold ?? 7)); + + // Renegotiate if average score is very low or all criteria are failing + if (allFailing || avgScore < 4) { + if (allFailing) { + log("HARNESS", `All criteria failing (avg score: ${avgScore.toFixed(1)}), renegotiating contract...`); + } else { + log("HARNESS", `Low average score (${avgScore.toFixed(1)}), renegotiating contract...`); + } + try { + contract = await negotiateContract(config.workDir, spec, sprint, clog); + await writeContract(config.workDir, contract); + log("HARNESS", `Renegotiated contract: ${contract.criteria.length} criteria for ${contract.features.length} features`); + } catch (e) { + logError("HARNESS", `Renegotiation failed, continuing with current contract: ${e}`); + } + } + } + } else { + logError("HARNESS", `Sprint ${sprint} FAILED after ${attempts} attempts`); + } + } + + results.push({ + sprintNumber: sprint, + passed, + attempts, + evalResult: lastEval, + }); + + if (passed) { + progress.completedSprints++; + } else { + progress.status = "failed"; + await writeProgress(config.workDir, progress); + logError("HARNESS", 
`Harness stopped: sprint ${sprint} could not pass evaluation`); + break; + } + } + + // Final status + const allPassed = results.every((r) => r.passed); + progress.status = allPassed ? "complete" : "failed"; + await writeProgress(config.workDir, progress); + + const totalDuration = Date.now() - startTime; + logDivider(); + log("HARNESS", `Harness ${allPassed ? "COMPLETED" : "FAILED"} in ${(totalDuration / 1000 / 60).toFixed(1)} minutes`); + log("HARNESS", `Sprints: ${results.filter((r) => r.passed).length}/${results.length} passed`); + + // Save conversation log + clog.system(`Harness ${allPassed ? "COMPLETED" : "FAILED"} — ${results.filter(r => r.passed).length}/${results.length} sprints passed in ${(totalDuration / 1000 / 60).toFixed(1)} min`); + const { mdPath, jsonlPath } = await clog.save(); + log("HARNESS", `Conversation log saved: ${mdPath}`); + log("HARNESS", `JSONL log saved: ${jsonlPath}`); + + return { success: allPassed, sprints: results, totalDurationMs: totalDuration }; +} + +async function negotiateContract( + workDir: string, + spec: string, + sprintNumber: number, + clog: ConversationLogger, +): Promise<SprintContract> { + const maxRounds = 3; + let round = 0; + let proposalText = ""; + let reviewText = ""; + let approved = false; + + while (round < maxRounds && !approved) { + round++; + log("HARNESS", `Contract negotiation round ${round}/${maxRounds}`); + clog.system(`Contract negotiation round ${round}/${maxRounds} for sprint ${sprintNumber}`); + + // Generator proposes or counter-proposes + let generatorPrompt: string; + if (round === 1) { + generatorPrompt = `## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\nPropose a sprint contract for this sprint.`; + } else { + generatorPrompt = `## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\n## Evaluator Feedback\n\nThe evaluator reviewed the contract and provided this feedback:\n\n${reviewText}\n\nPlease revise the contract based on this feedback. 
If the evaluator approved, output "APPROVED". Otherwise, output a revised contract.`; + } + + const proposalOptions: Options = { + cwd: workDir, + systemPrompt: CONTRACT_NEGOTIATION_GENERATOR_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read"], + model: CLAUDE_MODEL, + maxTurns: 10, + persistSession: false, + }; + + proposalText = ""; + for await (const msg of query({ prompt: generatorPrompt, options: proposalOptions })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + proposalText += block.text; + } + } + } + } + + clog.response("CONTRACT_GEN", CLAUDE_MODEL, proposalText, { sprint: sprintNumber, round }); + // Check if generator approved (only in subsequent rounds) + if (round > 1 && proposalText.trim() === "APPROVED") { + approved = true; + log("HARNESS", "Generator accepted evaluator revisions, contract finalized"); + break; + } + + // Evaluator reviews contract + const reviewPrompt = `## Proposed Sprint Contract\n\n${proposalText}\n\nReview this contract.`; + + const reviewOptions: Options = { + cwd: workDir, + systemPrompt: CONTRACT_NEGOTIATION_EVALUATOR_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read"], + model: CLAUDE_MODEL, + maxTurns: 10, + persistSession: false, + }; + + reviewText = ""; + for await (const msg of query({ prompt: reviewPrompt, options: reviewOptions })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + reviewText += block.text; + } + } + } + } + + clog.response("CONTRACT_EVAL", CLAUDE_MODEL, reviewText, { sprint: sprintNumber, round }); + + // Check if evaluator approved + if 
(reviewText.trim().toUpperCase().startsWith("APPROVED")) { + approved = true; + log("HARNESS", `Contract approved by evaluator in round ${round}`); + break; + } + + if (round >= maxRounds) { + log("HARNESS", `Max negotiation rounds (${maxRounds}) reached, using evaluator version`); + } + } + + const contractSource = reviewText.trim().toUpperCase().startsWith("APPROVED") ? proposalText : reviewText; + return parseContract(contractSource, sprintNumber); +} + +function parseContract(text: string, sprintNumber: number): SprintContract { + const candidates: string[] = []; + const codeBlocks = [...text.matchAll(/```(?:json)?\s*([\s\S]*?)```/g)]; + for (const match of codeBlocks.reverse()) { + if (match[1]) candidates.push(match[1].trim()); + } + const braceMatch = text.match(/\{[\s\S]*"criteria"[\s\S]*\}/); + if (braceMatch) candidates.push(braceMatch[0]); + candidates.push(text.trim()); + + for (const candidate of candidates) { + try { + const parsed = JSON.parse(candidate) as SprintContract; + if (parsed.criteria && Array.isArray(parsed.criteria)) { + parsed.sprintNumber = sprintNumber; + return parsed; + } + } catch { + // Try next candidate + } + } + + throw new Error(`Contract negotiation produced unparseable output. 
Raw text: ${text.slice(0, 200)}`); +} diff --git a/gemini-harness/index.ts b/gemini-harness/index.ts new file mode 100644 index 0000000..9dd9e1b --- /dev/null +++ b/gemini-harness/index.ts @@ -0,0 +1,63 @@ +import { resolve } from "path"; +import { readFile } from "fs/promises"; +import { runHarness } from "./harness.ts"; +import { DEFAULT_CONFIG } from "../shared/config.ts"; +import { log, logError, logDivider } from "../shared/logger.ts"; +import type { HarnessConfig } from "../shared/types.ts"; + +let userPrompt: string | undefined; + +const arg = process.argv[2]; +if (arg === "--file" || arg === "-f") { + const filePath = process.argv[3]; + if (!filePath) { + console.error("Error: --file requires a path argument"); + process.exit(1); + } + userPrompt = await readFile(resolve(filePath), "utf-8"); +} else { + userPrompt = arg; +} + +if (!userPrompt) { + console.error("Usage: bun run gemini-harness/index.ts "); + console.error(' bun run gemini-harness/index.ts --file '); + console.error('Example: bun run gemini-harness/index.ts "Build a task manager with REST API and dashboard"'); + process.exit(1); +} + +const config = { + ...DEFAULT_CONFIG, + userPrompt, + workDir: resolve("workspace/gemini"), + logDir: resolve(process.env.HARNESS_LOG_DIR || "./logs"), +}; + +logDivider(); +log("HARNESS", "ADVERSARIAL DEV - Gemini Harness (Claude Opus Generator + Gemini 3.1 Pro Evaluator)"); +log("HARNESS", `Prompt: "${userPrompt}"`); +logDivider(); + +try { + const result = await runHarness(config); + + logDivider(); + if (result.success) { + log("HARNESS", "All sprints completed successfully!"); + } else { + logError("HARNESS", "Harness completed with failures."); + } + + log("HARNESS", `Total time: ${(result.totalDurationMs / 1000 / 60).toFixed(1)} minutes`); + log("HARNESS", `Sprints passed: ${result.sprints.filter((s) => s.passed).length}/${result.sprints.length}`); + + for (const sprint of result.sprints) { + const status = sprint.passed ? 
"\x1b[32mPASS\x1b[0m" : "\x1b[31mFAIL\x1b[0m"; + log("HARNESS", ` Sprint ${sprint.sprintNumber}: [${status}] (${sprint.attempts} attempts)`); + } + + process.exit(result.success ? 0 : 1); +} catch (error) { + logError("HARNESS", `Fatal error: ${error instanceof Error ? error.message : String(error)}`); + process.exit(1); +} diff --git a/gemini-harness/planner.ts b/gemini-harness/planner.ts new file mode 100644 index 0000000..f561569 --- /dev/null +++ b/gemini-harness/planner.ts @@ -0,0 +1,69 @@ +import { query, type Options } from "@anthropic-ai/claude-agent-sdk"; +import { readFile } from "fs/promises"; +import { join } from "path"; +import { PLANNER_SYSTEM_PROMPT } from "../shared/prompts.ts"; +import { CLAUDE_MODEL, CLAUDE_MAX_TURNS } from "../shared/config.ts"; +import { log, logError } from "../shared/logger.ts"; +import type { ConversationLogger } from "../shared/conversation-logger.ts"; + +export async function runPlanner(userPrompt: string, workDir: string, clog: ConversationLogger): Promise { + log("PLANNER", `[Claude/${CLAUDE_MODEL}] Starting planning for: "${userPrompt.slice(0, 80)}..."`); + + const fullPrompt = `IMPORTANT: Your working directory is ${workDir}. All files you create (including spec.md) MUST be written inside this directory. 
Do NOT write files anywhere else.\n\n${userPrompt}`; + + clog.prompt("PLANNER", CLAUDE_MODEL, fullPrompt); + + const options: Options = { + cwd: workDir, + systemPrompt: PLANNER_SYSTEM_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read", "Write"], + model: CLAUDE_MODEL, + maxTurns: CLAUDE_MAX_TURNS, + persistSession: false, + }; + + let fullResponse = ""; + let completed = false; + const startMs = Date.now(); + + for await (const msg of query({ prompt: fullPrompt, options })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string; name?: string; input?: any }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + fullResponse += block.text; + } else if (block.type === "tool_use" && block.name) { + clog.toolCall("PLANNER", CLAUDE_MODEL, block.name, JSON.stringify(block.input ?? {}).slice(0, 500)); + } + } + } else if (msg.type === "result") { + completed = true; + log("PLANNER", "Planning complete"); + } + } + + if (!completed) { + clog.error("PLANNER", "Planner query did not complete"); + logError("PLANNER", "Planner query did not complete"); + throw new Error("Planner failed to produce output"); + } + + if (!fullResponse) { + try { + fullResponse = await readFile(join(workDir, "spec.md"), "utf-8"); + log("PLANNER", "Read spec from file written by planner agent"); + } catch { + clog.error("PLANNER", "No text response and no spec.md on disk"); + logError("PLANNER", "No text response and no spec.md on disk"); + throw new Error("Planner completed but produced no spec"); + } + } + + const durationMs = Date.now() - startMs; + clog.response("PLANNER", CLAUDE_MODEL, fullResponse, { duration_ms: durationMs }); + + log("PLANNER", "Product specification generated"); + return fullResponse; +} diff --git a/mixed-harness/evaluator.ts b/mixed-harness/evaluator.ts new file mode 100644 index 0000000..e9c3d5b --- 
/dev/null +++ b/mixed-harness/evaluator.ts @@ -0,0 +1,113 @@ +import { Codex } from "@openai/codex-sdk"; +import { EVALUATOR_SYSTEM_PROMPT } from "../shared/prompts.ts"; +import { CODEX_MODEL, CODEX_NETWORK_ACCESS } from "../shared/config.ts"; +import { log, logError } from "../shared/logger.ts"; +import type { SprintContract, EvalResult } from "../shared/types.ts"; +import type { ConversationLogger } from "../shared/conversation-logger.ts"; + +export async function runEvaluator( + workDir: string, + contract: SprintContract, + passThreshold: number, + clog: ConversationLogger, +): Promise<EvalResult> { + const sprint = contract.sprintNumber; + log("EVALUATOR", `[Codex/${CODEX_MODEL}] Evaluating sprint ${sprint} against ${contract.criteria.length} criteria`); + + const taskPrompt = `## Sprint Contract to Evaluate Against + +${JSON.stringify(contract, null, 2)} + +## Pass Threshold + +Each criterion must score at least ${passThreshold}/10 to pass. + +## Instructions + +Examine the application in the \`app/\` directory. Read the code, run it if possible, and score each criterion. Output ONLY the JSON evaluation object.`; + + const fullPrompt = `${EVALUATOR_SYSTEM_PROMPT}\n\n---\n\n${taskPrompt}`; + + clog.prompt("EVALUATOR", CODEX_MODEL, taskPrompt, { sprint }); + + const startMs = Date.now(); + const codex = new Codex(); + const thread = codex.startThread({ + workingDirectory: workDir, + sandboxMode: "danger-full-access", + networkAccessEnabled: CODEX_NETWORK_ACCESS, + approvalPolicy: "never", + model: CODEX_MODEL, + }); + + const turn = await thread.run(fullPrompt); + const response = turn.finalResponse ?? 
""; + const durationMs = Date.now() - startMs; + + log("EVALUATOR", `Evaluation complete for sprint ${sprint}`); + + const evalResult = parseEvalResult(response, contract, passThreshold); + + // Build scores map for logging + const scoresMap: Record = {}; + for (const f of evalResult.feedback) { + scoresMap[f.criterion] = f.score; + } + + clog.response("EVALUATOR", CODEX_MODEL, response, { + sprint, + duration_ms: durationMs, + scores: scoresMap, + }); + + const passedCount = evalResult.feedback.filter((f) => f.score >= passThreshold).length; + const totalCount = evalResult.feedback.length; + const verdict = evalResult.passed ? "PASSED" : "FAILED"; + log("EVALUATOR", `Sprint ${sprint}: ${verdict} (${passedCount}/${totalCount} criteria passed)`); + + for (const item of evalResult.feedback) { + const status = item.score >= passThreshold ? "\x1b[32mPASS\x1b[0m" : "\x1b[31mFAIL\x1b[0m"; + log("EVALUATOR", ` [${status}] ${item.criterion}: ${item.score}/10 - ${item.details.slice(0, 100)}`); + } + + return evalResult; +} + +function parseEvalResult( + response: string, + contract: SprintContract, + passThreshold: number, +): EvalResult { + const candidates: string[] = []; + const codeBlocks = [...response.matchAll(/```(?:json)?\s*([\s\S]*?)```/g)]; + for (const match of codeBlocks.reverse()) { + if (match[1]) candidates.push(match[1].trim()); + } + const braceMatch = response.match(/\{[\s\S]*"passed"[\s\S]*"feedback"[\s\S]*\}/); + if (braceMatch) candidates.push(braceMatch[0]); + candidates.push(response.trim()); + + for (const candidate of candidates) { + try { + const parsed = JSON.parse(candidate) as EvalResult; + if (parsed.feedback && Array.isArray(parsed.feedback) && parsed.feedback.length > 0) { + parsed.passed = parsed.feedback.every((f) => f.score >= passThreshold); + return parsed; + } + } catch { + // Try next candidate + } + } + + logError("EVALUATOR", "Failed to parse evaluation JSON from any extraction strategy"); + return { + passed: false, + scores: {}, + 
feedback: contract.criteria.map((c) => ({ + criterion: c.name, + score: 0, + details: "Evaluator failed to produce parseable output", + })), + overallSummary: "Evaluation parsing failed. Raw response: " + response.slice(0, 500), + }; +} diff --git a/mixed-harness/generator.ts b/mixed-harness/generator.ts new file mode 100644 index 0000000..5b1a5ba --- /dev/null +++ b/mixed-harness/generator.ts @@ -0,0 +1,74 @@ +import { query, type Options } from "@anthropic-ai/claude-agent-sdk"; +import { GENERATOR_SYSTEM_PROMPT } from "../shared/prompts.ts"; +import { CLAUDE_MODEL, CLAUDE_MAX_TURNS } from "../shared/config.ts"; +import { log } from "../shared/logger.ts"; +import type { SprintContract, EvalResult } from "../shared/types.ts"; +import type { ConversationLogger } from "../shared/conversation-logger.ts"; + +export async function runGenerator( + workDir: string, + spec: string, + contract: SprintContract, + clog: ConversationLogger, + previousFeedback?: EvalResult, +): Promise<{ response: string; sessionId?: string }> { + const sprint = contract.sprintNumber; + const attempt = previousFeedback ? "retry" : "initial"; + log("GENERATOR", `[Claude/${CLAUDE_MODEL}] Sprint ${sprint} (${attempt}) - Building: ${contract.features.join(", ")}`); + + let prompt = `IMPORTANT: Your working directory is ${workDir}. All code MUST be created inside ${workDir}/app/. Do NOT create files outside of ${workDir}.\n\n## Product Spec\n\n${spec}\n\n## Sprint Contract\n\n${JSON.stringify(contract, null, 2)}`; + + if (previousFeedback) { + prompt += `\n\n## Evaluation Feedback (MUST ADDRESS)\n\n${JSON.stringify(previousFeedback, null, 2)}`; + prompt += `\n\nThe previous attempt failed evaluation. Address every issue in the feedback above.`; + } else { + prompt += `\n\nImplement the features listed in this sprint contract. Work in the \`app/\` directory.`; + } + + clog.prompt("GENERATOR", CLAUDE_MODEL, prompt, { sprint, attempt: previousFeedback ? 
2 : 1 }); + + const options: Options = { + cwd: workDir, + systemPrompt: GENERATOR_SYSTEM_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read", "Write", "Edit", "Bash", "Glob", "Grep"], + model: CLAUDE_MODEL, + maxTurns: CLAUDE_MAX_TURNS, + persistSession: true, + }; + + let fullResponse = ""; + let sessionId: string | undefined; + const startMs = Date.now(); + + for await (const msg of query({ prompt, options })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string; name?: string; input?: any }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + fullResponse += block.text; + } else if (block.type === "tool_use" && block.name) { + log("GENERATOR", ` Tool: ${block.name}`); + clog.toolCall("GENERATOR", CLAUDE_MODEL, block.name, JSON.stringify(block.input ?? {}).slice(0, 500)); + } + } + } else if (msg.type === "result") { + const result = msg as { session_id?: string }; + sessionId = result.session_id; + log("GENERATOR", `Sprint ${sprint} build complete (session: ${sessionId?.slice(0, 8)}...)`); + } + } + + const durationMs = Date.now() - startMs; + clog.response("GENERATOR", CLAUDE_MODEL, fullResponse || "(tools only, no text output)", { + sprint, + duration_ms: durationMs, + }); + + if (!fullResponse) { + log("GENERATOR", `Sprint ${sprint} completed (agent used tools only, no text output)`); + } + + return { response: fullResponse, sessionId }; +} diff --git a/mixed-harness/harness.ts b/mixed-harness/harness.ts new file mode 100644 index 0000000..bab0e2d --- /dev/null +++ b/mixed-harness/harness.ts @@ -0,0 +1,327 @@ +import { query, type Options } from "@anthropic-ai/claude-agent-sdk"; +import { + CONTRACT_NEGOTIATION_GENERATOR_PROMPT, + CONTRACT_NEGOTIATION_EVALUATOR_PROMPT, +} from "../shared/prompts.ts"; +import { CLAUDE_MODEL } from "../shared/config.ts"; +import { log, logError, logDivider } 
from "../shared/logger.ts"; +import { ConversationLogger } from "../shared/conversation-logger.ts"; +import { + initWorkspace, + writeSpec, + readSpec, + writeContract, + writeFeedback, + writeProgress, +} from "../shared/files.ts"; +import type { + HarnessConfig, + SprintContract, + EvalResult, + HarnessProgress, + SprintResult, + HarnessResult, +} from "../shared/types.ts"; +import { runPlanner } from "./planner.ts"; +import { runGenerator } from "./generator.ts"; +import { runEvaluator } from "./evaluator.ts"; + +export async function runHarness(config: HarnessConfig & { logDir?: string }): Promise { + const startTime = Date.now(); + const results: SprintResult[] = []; + const logDir = config.logDir || "./logs"; + const clog = new ConversationLogger(logDir); + clog.system(`Mixed Harness started — Generator: Claude (${CLAUDE_MODEL}), Evaluator: Codex (gpt-5.4)`); + log("HARNESS", "ADVERSARIAL DEV - Mixed Harness (Claude Generator + Codex Evaluator)"); + log("HARNESS", `Work directory: ${config.workDir}`); + log("HARNESS", `Max sprints: ${config.maxSprints} | Max retries: ${config.maxRetriesPerSprint} | Threshold: ${config.passThreshold}/10`); + + await initWorkspace(config.workDir); + + // Phase 1: Planning (Claude) + logDivider(); + log("HARNESS", "PHASE 1: PLANNING (Claude Opus)"); + logDivider(); + + const progress: HarnessProgress = { + status: "planning", + currentSprint: 0, + totalSprints: 0, + completedSprints: 0, + retryCount: 0, + }; + await writeProgress(config.workDir, progress); + + const plannerResponse = await runPlanner(config.userPrompt, config.workDir, clog); + + let spec: string; + try { + spec = await readSpec(config.workDir); + } catch { + log("HARNESS", "Planner returned spec as text, writing to spec.md"); + await writeSpec(config.workDir, plannerResponse); + spec = plannerResponse; + } + + // Parse sprint count from spec + const sprintNumbers = Array.from(spec.matchAll(/sprint\s+(\d+)/gi)) + .map((m) => parseInt(m[1]!, 10)) + .filter((n) => 
n > 0 && n <= config.maxSprints); + const totalSprints = sprintNumbers.length > 0 + ? Math.min(Math.max(...sprintNumbers), config.maxSprints) + : 3; + + progress.totalSprints = totalSprints; + log("HARNESS", `Planner produced ${totalSprints} sprints`); + + // Phase 2-4: Sprint Loop + for (let sprint = 1; sprint <= totalSprints; sprint++) { + logDivider(); + log("HARNESS", `SPRINT ${sprint}/${totalSprints}`); + logDivider(); + + // Phase 2: Contract Negotiation (Claude proposes, Claude reviews — same model for contract alignment) + progress.status = "negotiating"; + progress.currentSprint = sprint; + progress.retryCount = 0; + await writeProgress(config.workDir, progress); + + log("HARNESS", "Negotiating sprint contract..."); + let contract: SprintContract; + let negotiationAttempts = 0; + const maxNegotiationAttempts = 2; + while (true) { + try { + contract = await negotiateContract(config.workDir, spec, sprint, clog); + break; + } catch (e) { + negotiationAttempts++; + if (negotiationAttempts >= maxNegotiationAttempts) { + logError("HARNESS", `Contract negotiation failed after ${negotiationAttempts} attempts: ${e}`); + throw e; + } + log("HARNESS", `Contract negotiation produced invalid output, retrying (${negotiationAttempts}/${maxNegotiationAttempts})...`); + } + } + await writeContract(config.workDir, contract); + log("HARNESS", `Contract agreed: ${contract.criteria.length} criteria for ${contract.features.length} features`); + + // Phase 3-4: Build (Claude) → Evaluate (Codex) Loop + let passed = false; + let lastEval: EvalResult | undefined; + let attempts = 0; + + for (let retry = 0; retry <= config.maxRetriesPerSprint; retry++) { + attempts = retry + 1; + + // Build (Claude Opus) + log("HARNESS", `--- BUILD ATTEMPT ${attempts} (Claude Opus) ---`); + progress.status = "building"; + progress.retryCount = retry; + await writeProgress(config.workDir, progress); + + await runGenerator(config.workDir, spec, contract, clog, lastEval); + + // Evaluate (Codex 
GPT-5.4) + log("HARNESS", `--- EVALUATION (Codex GPT-5.4) ---`); + progress.status = "evaluating"; + await writeProgress(config.workDir, progress); + + lastEval = await runEvaluator(config.workDir, contract, config.passThreshold, clog); + await writeFeedback(config.workDir, sprint, retry, lastEval); + + if (lastEval.passed) { + passed = true; + log("HARNESS", `Sprint ${sprint} PASSED on attempt ${attempts}`); + break; + } + + if (retry < config.maxRetriesPerSprint) { + log("HARNESS", `Sprint ${sprint} failed attempt ${attempts}, retrying...`); + + // Check if we should renegotiate criteria + if (retry >= 1 && lastEval && lastEval.feedback.length > 0) { + const avgScore = lastEval.feedback.reduce((sum, f) => sum + f.score, 0) / lastEval.feedback.length; + const allFailing = lastEval.feedback.every(f => f.score < (contract.criteria.find(c => c.name === f.criterion)?.threshold ?? 7)); + + // Renegotiate if average score is very low or all criteria are failing + if (allFailing || avgScore < 4) { + if (allFailing) { + log("HARNESS", `All criteria failing (avg score: ${avgScore.toFixed(1)}), renegotiating contract...`); + } else { + log("HARNESS", `Low average score (${avgScore.toFixed(1)}), renegotiating contract...`); + } + try { + contract = await negotiateContract(config.workDir, spec, sprint, clog); + await writeContract(config.workDir, contract); + log("HARNESS", `Renegotiated contract: ${contract.criteria.length} criteria for ${contract.features.length} features`); + } catch (e) { + logError("HARNESS", `Renegotiation failed, continuing with current contract: ${e}`); + } + } + } + } else { + logError("HARNESS", `Sprint ${sprint} FAILED after ${attempts} attempts`); + } + } + + results.push({ + sprintNumber: sprint, + passed, + attempts, + evalResult: lastEval, + }); + + if (passed) { + progress.completedSprints++; + } else { + progress.status = "failed"; + await writeProgress(config.workDir, progress); + logError("HARNESS", `Harness stopped: sprint ${sprint} could 
not pass evaluation`); + break; + } + } + + // Final status + const allPassed = results.every((r) => r.passed); + progress.status = allPassed ? "complete" : "failed"; + await writeProgress(config.workDir, progress); + + const totalDuration = Date.now() - startTime; + logDivider(); + log("HARNESS", `Harness ${allPassed ? "COMPLETED" : "FAILED"} in ${(totalDuration / 1000 / 60).toFixed(1)} minutes`); + log("HARNESS", `Sprints: ${results.filter((r) => r.passed).length}/${results.length} passed`); + + // Save conversation log + clog.system(`Harness ${allPassed ? "COMPLETED" : "FAILED"} — ${results.filter(r => r.passed).length}/${results.length} sprints passed in ${(totalDuration / 1000 / 60).toFixed(1)} min`); + const { mdPath, jsonlPath } = await clog.save(); + log("HARNESS", `Conversation log saved: ${mdPath}`); + log("HARNESS", `JSONL log saved: ${jsonlPath}`); + + return { success: allPassed, sprints: results, totalDurationMs: totalDuration }; +} + +async function negotiateContract( + workDir: string, + spec: string, + sprintNumber: number, + clog: ConversationLogger, +): Promise { + const maxRounds = 3; + let round = 0; + let proposalText = ""; + let reviewText = ""; + let approved = false; + + while (round < maxRounds && !approved) { + round++; + log("HARNESS", `Contract negotiation round ${round}/${maxRounds}`); + clog.system(`Contract negotiation round ${round}/${maxRounds} for sprint ${sprintNumber}`); + + // Generator proposes or counter-proposes + let generatorPrompt: string; + if (round === 1) { + generatorPrompt = `## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\nPropose a sprint contract for this sprint.`; + } else { + generatorPrompt = `## Product Spec\n\n${spec}\n\n## Sprint Number: ${sprintNumber}\n\n## Evaluator Feedback\n\nThe evaluator reviewed the contract and provided this feedback:\n\n${reviewText}\n\nPlease revise the contract based on this feedback. If the evaluator approved, output "APPROVED". 
Otherwise, output a revised contract.`; + } + + const proposalOptions: Options = { + cwd: workDir, + systemPrompt: CONTRACT_NEGOTIATION_GENERATOR_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read"], + model: CLAUDE_MODEL, + maxTurns: 10, + persistSession: false, + }; + + proposalText = ""; + for await (const msg of query({ prompt: generatorPrompt, options: proposalOptions })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + proposalText += block.text; + } + } + } + } + + clog.response("CONTRACT_GEN", CLAUDE_MODEL, proposalText, { sprint: sprintNumber, round }); + // Check if generator approved (only in subsequent rounds) + if (round > 1 && proposalText.trim() === "APPROVED") { + approved = true; + log("HARNESS", "Generator accepted evaluator revisions, contract finalized"); + break; + } + + // Evaluator reviews contract + const reviewPrompt = `## Proposed Sprint Contract\n\n${proposalText}\n\nReview this contract.`; + + const reviewOptions: Options = { + cwd: workDir, + systemPrompt: CONTRACT_NEGOTIATION_EVALUATOR_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read"], + model: CLAUDE_MODEL, + maxTurns: 10, + persistSession: false, + }; + + reviewText = ""; + for await (const msg of query({ prompt: reviewPrompt, options: reviewOptions })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + reviewText += block.text; + } + } + } + } + + clog.response("CONTRACT_EVAL", CLAUDE_MODEL, reviewText, { sprint: sprintNumber, round }); + + // Check if evaluator approved + if (reviewText.trim().toUpperCase().startsWith("APPROVED")) { + 
approved = true; + log("HARNESS", `Contract approved by evaluator in round ${round}`); + break; + } + + if (round >= maxRounds) { + log("HARNESS", `Max negotiation rounds (${maxRounds}) reached, using evaluator version`); + } + } + + const contractSource = reviewText.trim().toUpperCase().startsWith("APPROVED") ? proposalText : reviewText; + return parseContract(contractSource, sprintNumber); +} + +function parseContract(text: string, sprintNumber: number): SprintContract { + const candidates: string[] = []; + const codeBlocks = [...text.matchAll(/```(?:json)?\s*([\s\S]*?)```/g)]; + for (const match of codeBlocks.reverse()) { + if (match[1]) candidates.push(match[1].trim()); + } + const braceMatch = text.match(/\{[\s\S]*"criteria"[\s\S]*\}/); + if (braceMatch) candidates.push(braceMatch[0]); + candidates.push(text.trim()); + + for (const candidate of candidates) { + try { + const parsed = JSON.parse(candidate) as SprintContract; + if (parsed.criteria && Array.isArray(parsed.criteria)) { + parsed.sprintNumber = sprintNumber; + return parsed; + } + } catch { + // Try next candidate + } + } + + throw new Error(`Contract negotiation produced unparseable output. 
Raw text: ${text.slice(0, 200)}`); +} diff --git a/mixed-harness/index.ts b/mixed-harness/index.ts new file mode 100644 index 0000000..814c328 --- /dev/null +++ b/mixed-harness/index.ts @@ -0,0 +1,63 @@ +import { resolve } from "path"; +import { readFile } from "fs/promises"; +import { runHarness } from "./harness.ts"; +import { DEFAULT_CONFIG } from "../shared/config.ts"; +import { log, logError, logDivider } from "../shared/logger.ts"; +import type { HarnessConfig } from "../shared/types.ts"; + +let userPrompt: string | undefined; + +const arg = process.argv[2]; +if (arg === "--file" || arg === "-f") { + const filePath = process.argv[3]; + if (!filePath) { + console.error("Error: --file requires a path argument"); + process.exit(1); + } + userPrompt = await readFile(resolve(filePath), "utf-8"); +} else { + userPrompt = arg; +} + +if (!userPrompt) { + console.error("Usage: bun run mixed-harness/index.ts "); + console.error(' bun run mixed-harness/index.ts --file '); + console.error('Example: bun run mixed-harness/index.ts "Build a task manager with REST API and dashboard"'); + process.exit(1); +} + +const config = { + ...DEFAULT_CONFIG, + userPrompt, + workDir: resolve("workspace/mixed"), + logDir: resolve(process.env.HARNESS_LOG_DIR || "./logs"), +}; + +logDivider(); +log("HARNESS", "ADVERSARIAL DEV - Mixed Harness (Claude Opus Generator + Codex GPT-5.4 Evaluator)"); +log("HARNESS", `Prompt: "${userPrompt}"`); +logDivider(); + +try { + const result = await runHarness(config); + + logDivider(); + if (result.success) { + log("HARNESS", "All sprints completed successfully!"); + } else { + logError("HARNESS", "Harness completed with failures."); + } + + log("HARNESS", `Total time: ${(result.totalDurationMs / 1000 / 60).toFixed(1)} minutes`); + log("HARNESS", `Sprints passed: ${result.sprints.filter((s) => s.passed).length}/${result.sprints.length}`); + + for (const sprint of result.sprints) { + const status = sprint.passed ? 
"\x1b[32mPASS\x1b[0m" : "\x1b[31mFAIL\x1b[0m"; + log("HARNESS", ` Sprint ${sprint.sprintNumber}: [${status}] (${sprint.attempts} attempts)`); + } + + process.exit(result.success ? 0 : 1); +} catch (error) { + logError("HARNESS", `Fatal error: ${error instanceof Error ? error.message : String(error)}`); + process.exit(1); +} diff --git a/mixed-harness/planner.ts b/mixed-harness/planner.ts new file mode 100644 index 0000000..f561569 --- /dev/null +++ b/mixed-harness/planner.ts @@ -0,0 +1,69 @@ +import { query, type Options } from "@anthropic-ai/claude-agent-sdk"; +import { readFile } from "fs/promises"; +import { join } from "path"; +import { PLANNER_SYSTEM_PROMPT } from "../shared/prompts.ts"; +import { CLAUDE_MODEL, CLAUDE_MAX_TURNS } from "../shared/config.ts"; +import { log, logError } from "../shared/logger.ts"; +import type { ConversationLogger } from "../shared/conversation-logger.ts"; + +export async function runPlanner(userPrompt: string, workDir: string, clog: ConversationLogger): Promise { + log("PLANNER", `[Claude/${CLAUDE_MODEL}] Starting planning for: "${userPrompt.slice(0, 80)}..."`); + + const fullPrompt = `IMPORTANT: Your working directory is ${workDir}. All files you create (including spec.md) MUST be written inside this directory. 
Do NOT write files anywhere else.\n\n${userPrompt}`; + + clog.prompt("PLANNER", CLAUDE_MODEL, fullPrompt); + + const options: Options = { + cwd: workDir, + systemPrompt: PLANNER_SYSTEM_PROMPT, + permissionMode: "bypassPermissions", + allowDangerouslySkipPermissions: true, + tools: ["Read", "Write"], + model: CLAUDE_MODEL, + maxTurns: CLAUDE_MAX_TURNS, + persistSession: false, + }; + + let fullResponse = ""; + let completed = false; + const startMs = Date.now(); + + for await (const msg of query({ prompt: fullPrompt, options })) { + if (msg.type === "assistant") { + const message = msg as { message: { content: Array<{ type: string; text?: string; name?: string; input?: any }> } }; + for (const block of message.message.content) { + if (block.type === "text" && block.text) { + fullResponse += block.text; + } else if (block.type === "tool_use" && block.name) { + clog.toolCall("PLANNER", CLAUDE_MODEL, block.name, JSON.stringify(block.input ?? {}).slice(0, 500)); + } + } + } else if (msg.type === "result") { + completed = true; + log("PLANNER", "Planning complete"); + } + } + + if (!completed) { + clog.error("PLANNER", "Planner query did not complete"); + logError("PLANNER", "Planner query did not complete"); + throw new Error("Planner failed to produce output"); + } + + if (!fullResponse) { + try { + fullResponse = await readFile(join(workDir, "spec.md"), "utf-8"); + log("PLANNER", "Read spec from file written by planner agent"); + } catch { + clog.error("PLANNER", "No text response and no spec.md on disk"); + logError("PLANNER", "No text response and no spec.md on disk"); + throw new Error("Planner completed but produced no spec"); + } + } + + const durationMs = Date.now() - startMs; + clog.response("PLANNER", CLAUDE_MODEL, fullResponse, { duration_ms: durationMs }); + + log("PLANNER", "Product specification generated"); + return fullResponse; +} diff --git a/package.json b/package.json index 260e36a..0b23bee 100644 --- a/package.json +++ b/package.json @@ -11,6 +11,7 
@@ }, "dependencies": { "@anthropic-ai/claude-agent-sdk": "^0.2.85", + "@google/genai": "^1.48.0", "@openai/codex-sdk": "^0.117.0" } } diff --git a/shared/config.ts b/shared/config.ts index 821c963..f3c1e0c 100644 --- a/shared/config.ts +++ b/shared/config.ts @@ -6,8 +6,11 @@ export const DEFAULT_CONFIG: Omit = { passThreshold: 7, }; -export const CLAUDE_MODEL = "claude-sonnet-4-6"; +export const CLAUDE_MODEL = "claude-opus-4-6"; export const CODEX_MODEL = "gpt-5.4"; export const CLAUDE_MAX_TURNS = 50; export const CODEX_NETWORK_ACCESS = true; + +export const GEMINI_MODEL = "gemini-3.1-pro-preview"; +export const GEMINI_API_KEY = process.env.GEMINI_API_KEY ?? ""; diff --git a/shared/conversation-logger.ts b/shared/conversation-logger.ts new file mode 100644 index 0000000..639f287 --- /dev/null +++ b/shared/conversation-logger.ts @@ -0,0 +1,188 @@ +import { writeFile, mkdir } from "fs/promises"; +import { join } from "path"; + +type AgentRole = "PLANNER" | "GENERATOR" | "EVALUATOR" | "HARNESS" | "CONTRACT_GEN" | "CONTRACT_EVAL"; +type MessageType = "prompt" | "response" | "tool_call" | "tool_result" | "system" | "error"; + +interface ConversationEntry { + timestamp: string; + role: AgentRole; + model: string; + type: MessageType; + content: string; + metadata?: { + sprint?: number; + attempt?: number; + round?: number; + toolName?: string; + duration_ms?: number; + scores?: Record<string, number>; + }; +} + +class ConversationLogger { + private entries: ConversationEntry[] = []; + private logDir: string; + private sessionId: string; + private startTime: number; + + constructor(logDir: string) { + this.logDir = logDir; + this.sessionId = new Date().toISOString().replace(/[:.]/g, "-").slice(0, 19); + this.startTime = Date.now(); + } + + log(role: AgentRole, model: string, type: MessageType, content: string, metadata?: ConversationEntry["metadata"]) { + this.entries.push({ + timestamp: new Date().toISOString(), + role, + model, + type, + content, + metadata, + }); + } + + prompt(role: 
AgentRole, model: string, content: string, metadata?: ConversationEntry["metadata"]) { + this.log(role, model, "prompt", content, metadata); + } + + response(role: AgentRole, model: string, content: string, metadata?: ConversationEntry["metadata"]) { + this.log(role, model, "response", content, metadata); + } + + toolCall(role: AgentRole, model: string, toolName: string, input: string) { + this.log(role, model, "tool_call", input, { toolName }); + } + + toolResult(role: AgentRole, model: string, toolName: string, output: string) { + this.log(role, model, "tool_result", output, { toolName }); + } + + system(message: string) { + this.log("HARNESS", "system", "system", message); + } + + error(role: AgentRole, message: string) { + this.log(role, "system", "error", message); + } + + /** + * Render as a beautiful markdown conversation log. + * Reads like a chat — who said what, with tool calls collapsed. + */ + toMarkdown(): string { + const elapsed = ((Date.now() - this.startTime) / 1000 / 60).toFixed(1); + const lines: string[] = []; + + lines.push(`# Adversarial Dev Harness — Conversation Log`); + lines.push(`Session: ${this.sessionId}`); + lines.push(`Duration: ${elapsed} minutes`); + lines.push(`Entries: ${this.entries.length}`); + lines.push(``); + lines.push(`---`); + lines.push(``); + + const roleEmoji: Record = { + PLANNER: "📋", + GENERATOR: "🔨", + EVALUATOR: "🔍", + HARNESS: "⚙️", + CONTRACT_GEN: "📝", + CONTRACT_EVAL: "✅", + }; + + const typeStyle: Record = { + prompt: "**→ Prompt**", + response: "**← Response**", + tool_call: "🔧 Tool Call", + tool_result: "📤 Tool Result", + system: "💬 System", + error: "❌ Error", + }; + + for (const entry of this.entries) { + const emoji = roleEmoji[entry.role] || "❓"; + const time = entry.timestamp.slice(11, 19); + const style = typeStyle[entry.type] || entry.type; + + // Header line + lines.push(`### ${emoji} ${entry.role} (${entry.model}) — ${style}`); + lines.push(`*${time}*`); + + // Metadata badges + if (entry.metadata) { 
+ const badges: string[] = []; + if (entry.metadata.sprint !== undefined) badges.push(`Sprint ${entry.metadata.sprint}`); + if (entry.metadata.attempt !== undefined) badges.push(`Attempt ${entry.metadata.attempt}`); + if (entry.metadata.round !== undefined) badges.push(`Round ${entry.metadata.round}`); + if (entry.metadata.toolName) badges.push(`Tool: \`${entry.metadata.toolName}\``); + if (entry.metadata.duration_ms !== undefined) badges.push(`${entry.metadata.duration_ms}ms`); + if (badges.length > 0) { + lines.push(`> ${badges.join(" | ")}`); + } + if (entry.metadata.scores) { + lines.push(`> Scores: ${Object.entries(entry.metadata.scores).map(([k, v]) => `${k}=${v}`).join(", ")}`); + } + } + + lines.push(``); + + // Content — tool calls get collapsed + if (entry.type === "tool_call" || entry.type === "tool_result") { + lines.push(`
<details>`); + lines.push(`<summary>${entry.metadata?.toolName || "tool"} ${entry.type === "tool_call" ? "input" : "output"}</summary>`); + lines.push(``); + lines.push("```"); + lines.push(entry.content.slice(0, 2000)); + if (entry.content.length > 2000) lines.push(`... (${entry.content.length} chars total)`); + lines.push("```"); + lines.push(`</details>`); + } else if (entry.content.length > 500) { + // Long content gets a collapsible too + lines.push(`<details>`); + lines.push(`<summary>Show full content (${entry.content.length} chars)</summary>`); + lines.push(``); + lines.push(entry.content); + lines.push(`</details>
`); + } else { + lines.push(entry.content); + } + + lines.push(``); + lines.push(`---`); + lines.push(``); + } + + return lines.join("\n"); + } + + /** + * Export as JSONL for programmatic analysis. + */ + toJsonl(): string { + if (this.entries.length === 0) return ""; + return this.entries.map(e => JSON.stringify(e)).join("\n") + "\n"; + } + + /** + * Save both markdown and JSONL to the log directory. + */ + async save(): Promise<{ mdPath: string; jsonlPath: string }> { + await mkdir(this.logDir, { recursive: true }); + + const mdPath = join(this.logDir, `${this.sessionId}.md`); + const jsonlPath = join(this.logDir, `${this.sessionId}.jsonl`); + + await writeFile(mdPath, this.toMarkdown(), "utf-8"); + await writeFile(jsonlPath, this.toJsonl(), "utf-8"); + + return { mdPath, jsonlPath }; + } + + getEntryCount(): number { + return this.entries.length; + } +} + +export { ConversationLogger, type ConversationEntry, type AgentRole, type MessageType }; diff --git a/tests/conversation-logger.test.ts b/tests/conversation-logger.test.ts new file mode 100644 index 0000000..d595a02 --- /dev/null +++ b/tests/conversation-logger.test.ts @@ -0,0 +1,109 @@ +import { describe, test, expect } from "bun:test"; +import { ConversationLogger } from "../shared/conversation-logger.ts"; +import { existsSync } from "fs"; +import { rm } from "fs/promises"; +import { join } from "path"; + +describe("ConversationLogger", () => { + test("logs entries with correct fields", () => { + const logger = new ConversationLogger("/tmp/test-logs"); + logger.prompt("GENERATOR", "claude-opus-4-6", "Build the auth module", { sprint: 1 }); + logger.response("GENERATOR", "claude-opus-4-6", "I'll create src/auth.ts...", { sprint: 1 }); + logger.toolCall("GENERATOR", "claude-opus-4-6", "Write", '{"path": "src/auth.ts"}'); + expect(logger.getEntryCount()).toBe(3); + }); + + test("toMarkdown produces readable output", () => { + const logger = new ConversationLogger("/tmp/test-logs"); + logger.system("Starting 
harness"); + logger.prompt("PLANNER", "claude-opus-4-6", "Plan the sprints"); + logger.response("PLANNER", "claude-opus-4-6", "Sprint 1: Auth\nSprint 2: API"); + logger.prompt("GENERATOR", "claude-opus-4-6", "Build sprint 1", { sprint: 1, attempt: 1 }); + logger.response("GENERATOR", "claude-opus-4-6", "Created auth module"); + logger.toolCall("GENERATOR", "claude-opus-4-6", "Write", "src/auth.ts contents..."); + logger.prompt("EVALUATOR", "gpt-5.4", "Evaluate sprint 1", { sprint: 1 }); + logger.response("EVALUATOR", "gpt-5.4", '{"passed": false}', { sprint: 1, scores: { auth: 5, quality: 7 } }); + + const md = logger.toMarkdown(); + + // Structure checks + expect(md).toContain("# Adversarial Dev Harness"); + expect(md).toContain("PLANNER"); + expect(md).toContain("GENERATOR"); + expect(md).toContain("EVALUATOR"); + expect(md).toContain("claude-opus-4-6"); + expect(md).toContain("gpt-5.4"); + // Tool calls should be in collapsible + expect(md).toContain("<details>"); + expect(md).toContain("Write"); + // Scores badge + expect(md).toContain("auth=5"); + expect(md).toContain("quality=7"); + // Sprint/attempt badges + expect(md).toContain("Sprint 1"); + expect(md).toContain("Attempt 1"); + }); + + test("toJsonl produces valid JSONL", () => { + const logger = new ConversationLogger("/tmp/test-logs"); + logger.system("test"); + logger.prompt("GENERATOR", "opus", "hello"); + logger.error("EVALUATOR", "parse failed"); + + const jsonl = logger.toJsonl(); + const lines = jsonl.trim().split("\n"); + expect(lines).toHaveLength(3); + + for (const line of lines) { + const parsed = JSON.parse(line); + expect(parsed.timestamp).toBeDefined(); + expect(parsed.role).toBeDefined(); + expect(parsed.type).toBeDefined(); + expect(parsed.content).toBeDefined(); + } + }); + + test("save writes files to disk", async () => { + const testDir = "/tmp/adversarial-test-logs-" + Date.now(); + const logger = new ConversationLogger(testDir); + logger.system("test save"); + logger.prompt("GENERATOR", "opus", "do the thing"); + + const { mdPath, jsonlPath } = await logger.save(); + + expect(existsSync(mdPath)).toBe(true); + expect(existsSync(jsonlPath)).toBe(true); + + // Cleanup + await rm(testDir, { recursive: true }); + }); + + test("long content gets collapsed in markdown", () => { + const logger = new ConversationLogger("/tmp/test-logs"); + const longContent = "x".repeat(600); + logger.response("GENERATOR", "opus", longContent); + + const md = logger.toMarkdown(); + expect(md).toContain("<details>
"); + expect(md).toContain("600 chars"); + }); + + test("tool calls show tool name in summary", () => { + const logger = new ConversationLogger("/tmp/test-logs"); + logger.toolCall("GENERATOR", "opus", "Bash", "ls -la"); + logger.toolResult("GENERATOR", "opus", "Bash", "total 42\n-rw-r--r-- 1 ..."); + + const md = logger.toMarkdown(); + expect(md).toContain("Bash input"); + expect(md).toContain("Bash output"); + }); + + test("metadata badges render correctly", () => { + const logger = new ConversationLogger("/tmp/test-logs"); + logger.prompt("CONTRACT_GEN", "opus", "propose contract", { sprint: 2, round: 3 }); + + const md = logger.toMarkdown(); + expect(md).toContain("Sprint 2"); + expect(md).toContain("Round 3"); + }); +}); diff --git a/tests/mixed-harness.test.ts b/tests/mixed-harness.test.ts new file mode 100644 index 0000000..3d82e70 --- /dev/null +++ b/tests/mixed-harness.test.ts @@ -0,0 +1,234 @@ +import { describe, test, expect } from "bun:test"; + +// ============================================================ +// parseContract tests — fail-closed behavior +// ============================================================ +describe("parseContract", () => { + function parseContract(text: string, sprintNumber: number) { + const candidates: string[] = []; + const codeBlocks = [...text.matchAll(/```(?:json)?\s*([\s\S]*?)```/g)]; + for (const match of codeBlocks.reverse()) { + if (match[1]) candidates.push(match[1].trim()); + } + const braceMatch = text.match(/\{[\s\S]*"criteria"[\s\S]*\}/); + if (braceMatch) candidates.push(braceMatch[0]); + candidates.push(text.trim()); + + for (const candidate of candidates) { + try { + const parsed = JSON.parse(candidate) as any; + if (parsed.criteria && Array.isArray(parsed.criteria)) { + parsed.sprintNumber = sprintNumber; + return parsed; + } + } catch { /* next */ } + } + throw new Error(`Contract negotiation produced unparseable output. 
Raw text: ${text.slice(0, 200)}`); + } + + test("parses valid JSON from code block", () => { + const text = '```json\n{"features": ["auth"], "criteria": [{"name": "login", "description": "works", "threshold": 7}]}\n```'; + const result = parseContract(text, 1); + expect(result.sprintNumber).toBe(1); + expect(result.criteria).toHaveLength(1); + expect(result.criteria[0].name).toBe("login"); + }); + + test("parses raw JSON without code block", () => { + const text = '{"features": ["auth"], "criteria": [{"name": "login", "description": "works", "threshold": 7}]}'; + const result = parseContract(text, 2); + expect(result.sprintNumber).toBe(2); + }); + + test("extracts JSON from surrounding prose", () => { + const text = 'Here is the contract:\n{"features": ["db"], "criteria": [{"name": "schema", "description": "valid", "threshold": 7}]}\nDone.'; + const result = parseContract(text, 3); + expect(result.criteria[0].name).toBe("schema"); + }); + + test("THROWS on garbage (fail-closed)", () => { + expect(() => parseContract("not json lol", 1)).toThrow("unparseable"); + }); + + test("THROWS on JSON without criteria field", () => { + expect(() => parseContract('{"features": ["auth"]}', 1)).toThrow("unparseable"); + }); + + test("THROWS on empty string", () => { + expect(() => parseContract("", 1)).toThrow("unparseable"); + }); + + test("THROWS on criteria as string not array", () => { + expect(() => parseContract('{"criteria": "nope"}', 1)).toThrow("unparseable"); + }); + + test("prefers last code block when multiple exist", () => { + const text = '```json\n{"features":["old"],"criteria":[{"name":"wrong","description":"x","threshold":5}]}\n```\n```json\n{"features":["new"],"criteria":[{"name":"right","description":"y","threshold":7}]}\n```'; + const result = parseContract(text, 1); + expect(result.features).toEqual(["new"]); + }); + + test("overwrites sprintNumber", () => { + const text = '{"sprintNumber": 999, "features": ["x"], "criteria": [{"name": "a", "description": 
"b", "threshold": 7}]}'; + const result = parseContract(text, 5); + expect(result.sprintNumber).toBe(5); + }); +}); + +// ============================================================ +// Renegotiation trigger logic +// ============================================================ +describe("renegotiation trigger", () => { + function shouldRenegotiate( + feedback: Array<{ criterion: string; score: number }>, + criteria: Array<{ name: string; threshold: number }>, + ): { trigger: boolean; reason: string } { + if (feedback.length === 0) return { trigger: false, reason: "empty feedback" }; + const avgScore = feedback.reduce((s, f) => s + f.score, 0) / feedback.length; + const allFailing = feedback.every(f => f.score < (criteria.find(c => c.name === f.criterion)?.threshold ?? 7)); + if (allFailing) return { trigger: true, reason: "all failing" }; + if (avgScore < 4) return { trigger: true, reason: "low avg" }; + return { trigger: false, reason: "ok" }; + } + + test("triggers when ALL criteria below threshold", () => { + const r = shouldRenegotiate( + [{ criterion: "a", score: 3 }, { criterion: "b", score: 5 }], + [{ name: "a", threshold: 7 }, { name: "b", threshold: 7 }], + ); + expect(r.trigger).toBe(true); + expect(r.reason).toBe("all failing"); + }); + + test("triggers when avg < 4", () => { + const r = shouldRenegotiate( + [{ criterion: "a", score: 2 }, { criterion: "b", score: 3 }, { criterion: "c", score: 4 }], + [{ name: "a", threshold: 7 }, { name: "b", threshold: 7 }, { name: "c", threshold: 3 }], + ); + expect(r.trigger).toBe(true); + expect(r.reason).toBe("low avg"); + }); + + test("does NOT trigger when some pass", () => { + const r = shouldRenegotiate( + [{ criterion: "a", score: 8 }, { criterion: "b", score: 5 }], + [{ name: "a", threshold: 7 }, { name: "b", threshold: 7 }], + ); + expect(r.trigger).toBe(false); + }); + + test("safe on empty feedback (no division by zero)", () => { + const r = shouldRenegotiate([], []); + expect(r.trigger).toBe(false); + 
expect(r.reason).toBe("empty feedback"); + }); + + test("uses per-criterion threshold", () => { + const r = shouldRenegotiate( + [{ criterion: "easy", score: 4 }, { criterion: "hard", score: 4 }], + [{ name: "easy", threshold: 3 }, { name: "hard", threshold: 5 }], + ); + // easy passes (4>=3), so not allFailing. avg=4, not < 4. + expect(r.trigger).toBe(false); + }); + + test("defaults to threshold 7 for unknown criterion", () => { + const r = shouldRenegotiate( + [{ criterion: "mystery", score: 5 }], + [], + ); + // 5 < 7 default → allFailing + expect(r.trigger).toBe(true); + }); +}); + +// ============================================================ +// parseEvalResult tests +// ============================================================ +describe("parseEvalResult", () => { + function parseEvalResult(response: string, passThreshold: number) { + const candidates: string[] = []; + const codeBlocks = [...response.matchAll(/```(?:json)?\s*([\s\S]*?)```/g)]; + for (const match of codeBlocks.reverse()) { + if (match[1]) candidates.push(match[1].trim()); + } + const braceMatch = response.match(/\{[\s\S]*"passed"[\s\S]*"feedback"[\s\S]*\}/); + if (braceMatch) candidates.push(braceMatch[0]); + candidates.push(response.trim()); + for (const candidate of candidates) { + try { + const parsed = JSON.parse(candidate) as any; + if (parsed.feedback && Array.isArray(parsed.feedback)) { + parsed.passed = parsed.feedback.every((f: any) => f.score >= passThreshold); + return parsed; + } + } catch { /* next */ } + } + return null; + } + + test("recalculates passed based on threshold", () => { + const json = JSON.stringify({ + passed: true, + feedback: [{ criterion: "a", score: 8, details: "ok" }, { criterion: "b", score: 6, details: "meh" }], + }); + const r = parseEvalResult(json, 7); + expect(r!.passed).toBe(false); // b=6 < 7 + }); + + test("marks passed when all meet threshold", () => { + const json = JSON.stringify({ + passed: false, + feedback: [{ criterion: "a", score: 9, 
details: "great" }, { criterion: "b", score: 7, details: "ok" }], + }); + const r = parseEvalResult(json, 7); + expect(r!.passed).toBe(true); + }); + + test("extracts from markdown code block", () => { + const resp = '```json\n{"feedback":[{"criterion":"x","score":5,"details":"bad"}]}\n```'; + const r = parseEvalResult(resp, 7); + expect(r).not.toBeNull(); + expect(r!.passed).toBe(false); + }); + + test("returns null on garbage", () => { + expect(parseEvalResult("lol", 7)).toBeNull(); + }); +}); + +// ============================================================ +// Negotiation round logic +// ============================================================ +describe("negotiation rounds", () => { + test("APPROVED on round 1 = 1 round total", () => { + let rounds = 0, approved = false; + while (rounds < 3 && !approved) { + rounds++; + if ("APPROVED" === "APPROVED") { approved = true; } + } + expect(rounds).toBe(1); + expect(approved).toBe(true); + }); + + test("never approved = exactly 3 rounds", () => { + let rounds = 0, approved = false; + while (rounds < 3 && !approved) { + rounds++; + if ("revised json" === "APPROVED") { approved = true; } + } + expect(rounds).toBe(3); + expect(approved).toBe(false); + }); + + test("generator accepts in round 2 = 2 rounds", () => { + let rounds = 0, approved = false; + while (rounds < 3 && !approved) { + rounds++; + if (rounds > 1 && "APPROVED" === "APPROVED") { approved = true; break; } + // evaluator reviews... + } + expect(rounds).toBe(2); + expect(approved).toBe(true); + }); +});
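
Review note: the diff describes `toJsonl()` as an export "for programmatic analysis" but does not show a consumer. The following is a minimal sketch of one, assuming each JSONL line is an object carrying at least the `role`, `type`, and `content` fields that the `toJsonl` test asserts on. The `LoggedEntry` interface and `countByRole` helper are hypothetical names introduced here for illustration; they are not part of this change.

```typescript
// Hypothetical consumer of the JSONL log; not part of the diff.
// Assumption: every line parses to an object with a string `role` field.
interface LoggedEntry {
  role: string;
  type: string;
  content: string;
}

// Tally how many log entries each agent role produced.
function countByRole(jsonl: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const line of jsonl.split("\n")) {
    // toJsonl() terminates its output with a newline, so skip blank lines.
    if (line.trim() === "") continue;
    const entry = JSON.parse(line) as LoggedEntry;
    counts[entry.role] = (counts[entry.role] ?? 0) + 1;
  }
  return counts;
}

// Sample input shaped like the entries checked in the toJsonl test.
const sample =
  [
    { role: "HARNESS", type: "system", content: "test" },
    { role: "GENERATOR", type: "prompt", content: "hello" },
    { role: "GENERATOR", type: "response", content: "hi" },
  ]
    .map((e) => JSON.stringify(e))
    .join("\n") + "\n";

const counts = countByRole(sample);
console.log(counts);
```

Splitting on newlines and skipping blanks mirrors how the test itself reads the output (`jsonl.trim().split("\n")`), so an empty log (for which `toJsonl()` returns `""`) yields an empty tally rather than a parse error.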