diff --git a/CLAUDE.md b/CLAUDE.md index b089ffe..1603b00 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -1,26 +1,20 @@ # benchflow -Multi-turn agent benchmarking with ACP. +Multi-turn agent benchmarking with ACP. Docs in [`docs/`](./docs/). -Docs: `docs/quickstart.md`, `docs/cli-reference.md`, `docs/api-reference.md`, `docs/task-authoring.md`, `docs/use-cases.md`, `docs/progressive-disclosure.md`. - -## Setup - -Requires Python 3.12+. Uses `uv`. +## Setup + test ```bash uv venv -p 3.12 .venv && uv pip install -e ".[dev]" -``` - -## Test - -```bash .venv/bin/python -m pytest tests/ .venv/bin/ty check src/ +ruff check . ``` ## Conventions -- **Don't rewrite passing tests.** Updating a test because the code it covers changed shape is fine; rewriting one to match new behavior without understanding why it was written is not. No tautological tests (dataclass reads, stdlib behavior, "does it construct"). -- **Test new regressions against `main` first.** A test that passes on buggy `main` pins the bug instead of preventing it. Name the commit/PR it guards. -- **Human review before main.** Commit freely on a feature branch, open a PR. Never push to `main` directly, never force-push it. Self-approval doesn't count — request an independent reviewer. +- **Don't rewrite passing tests** to match new behavior. Update for shape changes, not for semantic changes you don't understand. No tautological tests. +- **Regression tests must name the PR/commit they guard** in the docstring (e.g. `Guards the fix from PR #198 against the regression introduced by PR #193`). +- **Human review before `main`.** PRs only. No force-pushes to `main`. Self-approval doesn't count. +- **Trunk-based:** branch off `main`, PR back to `main`. No long-lived release branches. +- **Releases:** bump `pyproject.toml` to the stable version, tag `v` on main, push tag (CI publishes to PyPI), then bump main to the next `.dev0`. diff --git a/README.md b/README.md index e56d455..1b5f421 100644 --- a/README.md +++ b/README.md @@ -2,7 +2,7 @@

BenchFlow

Multi-turn agent benchmarking — Scene-based lifecycle for any ACP agent

- PyPI + PyPI Discord @@ -11,12 +11,12 @@ ## What -BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It supports single-agent, multi-agent, and multi-turn evaluation patterns through a Scene-based lifecycle. +BenchFlow runs AI agents against benchmark tasks in sandboxed environments. Single-agent, multi-agent, and multi-round patterns share one Scene-based lifecycle. -- **Any ACP agent** — Gemini CLI, Claude, Codex, OpenClaw, Pi, or your own -- **Multi-scene trials** — skill generation → solve, coder → reviewer → revision +- **Any ACP agent** — Gemini CLI, Claude Code, Codex, OpenCode, OpenClaw, Pi, or your own +- **Single + multi + progressive** — single-agent / multi-agent (coder + reviewer, simulated user) / multi-round with a Python `BaseUser` callback - **Cloud sandboxes** — Daytona backend for parallel execution at scale -- **YAML-driven** — same task folder, different trial configs for ablation +- **Hardened verifier** — defaults block BenchJack/Meerkat-style reward-hacking; tasks opt out per-feature ## Install @@ -24,173 +24,50 @@ BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It s uv tool install benchflow ``` -Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). For cloud sandboxes, set `DAYTONA_API_KEY`. +Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). Set `DAYTONA_API_KEY` for cloud sandboxes; export the relevant agent API key (`GEMINI_API_KEY`, `ANTHROPIC_API_KEY`, etc.) or run `claude login` / `codex --login` for subscription auth. -## Quick Start +## Documentation -### CLI +Start with [Getting started](./docs/getting-started.md), then [Concepts](./docs/concepts.md) for the mental model. Then by goal: -```bash -# Run a single task with Gemini -bench eval create -t tasks/my-task -a gemini -m gemini-3.1-flash-lite-preview -e daytona - -# Run from YAML config (batch, concurrent) -bench eval create -f benchmarks/tb2-gemini-baseline.yaml - -# List agents -bench agent list - -# Check task validity -bench tasks check tasks/my-task -``` - -### Python - -```python -import benchflow as bf -from benchflow.trial import TrialConfig, Scene, Role, Turn - -# Simplest: one agent, one task -result = await bf.run("gemini", task_path="tasks/my-task", model="gemini-3.1-flash-lite-preview") -print(result.rewards) # {"reward": 1.0} - -# Scene-based: skill-gen → solve (BYOS pattern) -config = TrialConfig( - task_path=Path("tasks/my-task"), - scenes=[ - Scene(name="skill-gen", - roles=[Role("gen", "gemini", "gemini-3.1-flash-lite-preview")], - turns=[Turn("gen", "Analyze the task and write a skill to /app/generated-skill.md")]), - Scene(name="solve", - roles=[Role("solver", "gemini", "gemini-3.1-flash-lite-preview")], - turns=[Turn("solver")]), # None prompt = use instruction.md - ], - environment="daytona", -) -result = await bf.run(config) - -# Multi-agent: coder + reviewer -config = TrialConfig( - task_path=Path("tasks/my-task"), - scenes=[ - Scene(name="review-loop", - roles=[ - Role("coder", "gemini", "gemini-3.1-flash-lite-preview"), - Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"), - ], - turns=[ - Turn("coder", "Solve the task. Write to /app/.outbox/reviewer.json when done."), - Turn("reviewer", "Review the coder's work. Write feedback to /app/.outbox/coder.json."), - Turn("coder", "Read the reviewer's feedback and revise your solution."), - ]), - ], - environment="daytona", -) -result = await bf.run(config) -``` - -### YAML Trial Config - -```yaml -# trial-baseline.yaml -task_dir: .ref/terminal-bench-2 -agent: gemini -model: gemini-3.1-flash-lite-preview -environment: daytona -concurrency: 89 - -# trial-byos.yaml (same tasks, different config) -task_dir: .ref/terminal-bench-2 -scenes: - - name: skill-gen - roles: [{name: gen, agent: gemini, model: gemini-3.1-flash-lite-preview}] - turns: [{role: gen, prompt: "Generate a skill for this task..."}] - - name: solve - roles: [{name: solver, agent: gemini, model: gemini-3.1-flash-lite-preview}] -``` - -## CLI Reference - -``` -bench agent list List registered agents -bench agent show Agent details + conformance status - -bench eval create Create + run evaluation (returns job-id) -bench eval list List completed evaluations - -bench skills eval Evaluate skill via evals.json +| If you want to… | Read | +|------------------|------| +| Run an eval on an existing task | [Getting started](./docs/getting-started.md) | +| Understand Trial / Scene / Role / Verifier | [Concepts](./docs/concepts.md) | +| Author a new task | [Task authoring](./docs/task-authoring.md) | +| Multi-agent: coder + reviewer, simulated user, BYOS, stateful envs | [Use cases](./docs/use-cases.md) | +| Multi-round single-agent (progressive disclosure, oracle access) | [Progressive disclosure](./docs/progressive-disclosure.md) | +| Skill evaluation (when the artifact is a skill, not a workspace) | [Skill eval](./docs/skill-eval.md) | +| Understand the security model | [Sandbox hardening](./docs/sandbox-hardening.md) | +| CLI flags + commands | [CLI reference](./docs/reference/cli.md) | +| Python API surface | [Python API reference](./docs/reference/python-api.md) | -bench tasks init Scaffold new task -bench tasks check Validate task (--rubric for custom) +Notebooks and runnable example scripts: [`examples/`](./examples/). -bench train create Reward-based training sweep +## Featured -bench environment create Spin up sandbox from task dir -bench environment list List active sandboxes -``` +- **Progressive disclosure on SWE-bench Pro** — the `BaseUser` abstraction drives a multi-round trial: terse round-0 prompt → failing-test hints → full spec. 5/5 oracle on Daytona, runnable demo at [`examples/swebench_pro_progressive_disclosure.ipynb`](./examples/swebench_pro_progressive_disclosure.ipynb). Also benchflow's [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) parity answer for the no-second-LLM case. See [Progressive disclosure](./docs/progressive-disclosure.md). -## Terminology +## Research artifacts -| Term | Definition | Example | -|------|-----------|---------| -| **Turn** | One prompt in one ACP session — one role acts | Coder writes a regex | -| **Multi-turn** | Same role, multiple turns | Self-review: agent → agent | -| **Round** | One A→B exchange between different roles | Coder → Reviewer | -| **Multi-round** | Different roles exchanging turns | Coder → Reviewer → Coder | -| **Scene** | Interaction region with roles + turns | A code-review scene | -| **Trial** | Sequence of scenes in a shared sandbox | Skill-gen → Solve | +Two runnable labs validate the security story: -**Inter-role messaging:** In multi-role scenes, agents communicate via outbox files. -An agent writes `/app/.outbox/{recipient}.json` with `{"to": "role", "content": "..."}`. -The scheduler reads these after each turn and injects the message into the next role's prompt. +- [`labs/benchjack-sandbox-hardening/`](./labs/benchjack-sandbox-hardening/) — end-to-end demo that 0.2.1+ blocks three [BenchJack](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) exploits that flip 0.2.0's reward from 0.0 to 1.0. +- [`labs/reward-hack-matrix/`](./labs/reward-hack-matrix/) — full reward-hack sweep across real benchmarks comparing 0.2.0 vs 0.2.2. -## Architecture +## Audience -``` -Trial = sequence of Scenes in a shared sandbox -Scene = Roles + Turns (one interaction region) -Role = agent + model -Turn = one prompt for one role - -bf.run(config) - → Trial.create(config) - → trial.setup() # resolve config, create env object - → trial.start() # spin up sandbox, upload task files - → for scene in config.scenes: - → trial._run_scene(scene) # connect/execute/disconnect per role - → setup /app/.outbox/ # (multi-role scenes only) - → for turn in scene.turns: - → read outbox → inject messages into prompt - → connect as role → execute → disconnect - → trial.verify() # run verifier, score - → trial.cleanup() # stop sandbox -``` +- **Eval researchers / paper writers** → [Getting started](./docs/getting-started.md) → [Concepts](./docs/concepts.md) → [Use cases](./docs/use-cases.md) +- **Task authors** → [Task authoring](./docs/task-authoring.md) → [Sandbox hardening](./docs/sandbox-hardening.md) +- **Agent builders integrating with benchflow** → [Concepts](./docs/concepts.md) → [Python API reference](./docs/reference/python-api.md) → [`benchflow.agents.registry`](./src/benchflow/agents/registry.py) +- **Existing Harbor users migrating** → [Use cases — migration section](./docs/use-cases.md#migration-from-harbor) → [Progressive disclosure (Harbor #1316 parity)](./docs/progressive-disclosure.md#comparison-with-multi-agent-simulated-user-harbor-1316-parity) -## Registered Agents +## Contributing -| Agent | Command | Auth | -|-------|---------|------| -| `gemini` | `gemini --acp --yolo` | GOOGLE_API_KEY | -| `claude-agent-acp` | `claude-agent-acp` | ANTHROPIC_API_KEY | -| `codex-acp` | `codex-acp` | OPENAI_API_KEY | -| `openclaw` | `openclaw-acp-shim` | inferred from model | -| `pi-acp` | `pi-acp` | ANTHROPIC_API_KEY | +PRs welcome. Open against `main`. CI runs ruff + tests on every PR; please run `ruff check .` and `pytest tests/` locally first. -## Adding a Custom Agent +For a release: bump `pyproject.toml` to the next stable version, tag `v` on main, push the tag — CI publishes to PyPI. Then bump main to the next `.dev0`. -Any ACP-native agent works. Create `agent.toml`: +## License -```toml -name = "my-agent" -launch_cmd = "my-agent --acp" -install_cmd = "npm install -g my-agent" -requires_env = ["MY_API_KEY"] -``` - -## Development - -```bash -uv venv -p 3.12 .venv && uv pip install -e ".[dev]" -.venv/bin/python -m pytest tests/ # 580+ unit tests -.venv/bin/ty check src/ # type check -``` +Apache-2.0. diff --git a/docs/concepts.md b/docs/concepts.md new file mode 100644 index 0000000..50ae5ad --- /dev/null +++ b/docs/concepts.md @@ -0,0 +1,161 @@ +# Concepts + +The mental model for benchflow. Read once, then refer back from the how-tos. + +--- + +## The five primitives + +| Primitive | What it is | +|-----------|------------| +| **Task** | A directory on disk: `instruction.md` for the agent + `tests/` for the verifier + (optional) `solution/solve.sh` for oracle runs + `environment/Dockerfile` for the sandbox. Authored once, evaluated many times. | +| **Agent** | A registered ACP-speaking program (Claude Code, Gemini CLI, OpenCode, etc.). Identified by name (`"gemini"`, `"opencode"`) plus an optional model ID. | +| **Environment** | The sandbox where the agent runs and the verifier checks the result. Backed by Harbor — Docker locally, Daytona for cloud. | +| **Verifier** | The test runner that scores the trial. By default `pytest /tests/...` against the workspace the agent left behind. Outputs `rewards: {reward: float}`. | +| **Trial** | One agent run on one task. Holds the lifecycle (setup → start → install → execute → verify → cleanup). All higher-level primitives below are built on Trials. | + +--- + +## Trial lifecycle + +A `Trial` is decomposable: each phase is a callable method, you can either run them in sequence or invoke `Trial.run()` to execute all six in order. Multi-agent flows reuse phases (e.g. `connect` + `execute` + `disconnect` repeats per role). + +``` +┌──────────────────────────────────────────────────────────────┐ +│ Trial.run() │ +│ │ +│ setup() resolve config, create Harbor env handle │ +│ ↓ │ +│ start() start container, upload task files │ +│ ↓ │ +│ install_agent() install agent binary, write credentials, │ +│ set up sandbox user │ +│ ↓ │ +│ ┌─ connect_as(role) ◄─── multi-agent loops here │ +│ │ execute(prompts) each role's turn │ +│ └─ disconnect() │ +│ ↓ │ +│ verify() harden sandbox, run pytest, score │ +│ ↓ │ +│ cleanup() kill agent procs, stop container │ +└──────────────────────────────────────────────────────────────┘ +``` + +Each phase has a name, a clear contract, and is independently testable. `Trial.run()` is the convenience that calls them in order. + +```python +import benchflow as bf +from benchflow.trial import TrialConfig, Scene +from pathlib import Path + +config = TrialConfig( + task_path=Path("tasks/regex-log"), + scenes=[Scene.single(agent="gemini", model="gemini-3.1-pro-preview")], + environment="daytona", +) +result = await bf.run(config) # full lifecycle +print(result.rewards) # {'reward': 1.0} +``` + +--- + +## Scenes, Roles, Turns + +A **Scene** is one interaction region. Inside a Scene: +- **Roles** are the agents that participate (one or more). +- **Turns** are the prompt sequence — which Role acts when, and what they're told. +- All Roles share the same sandbox filesystem. + +Single-agent runs are a Scene with one Role and one Turn. Multi-agent patterns (coder + reviewer, simulated user + assistant) are Scenes with multiple Roles and ordered Turns. + +```python +Scene( + name="review-loop", + roles=[ + Role(name="coder", agent="opencode", model="anthropic/claude-sonnet-4-6"), + Role(name="reviewer", agent="gemini", model="gemini-3.1-pro-preview"), + ], + turns=[ + Turn(role="coder"), + Turn(role="reviewer", prompt="Read /app/ and write feedback to /app/.outbox/coder.json."), + Turn(role="coder", prompt="Read the reviewer's feedback and revise."), + ], +) +``` + +Roles communicate via **outbox files**: write JSON to `/app/.outbox/{recipient}.json` and the scheduler injects it into the next Turn's prompt. + +A Trial may have multiple Scenes — used for staged flows like "skill generation → solve" (BYOS / Bring Your Own Skill). Same sandbox, sequential Scenes. + +--- + +## The User abstraction (multi-round, single-agent) + +Sometimes you want the agent to take multiple turns guided not by another LLM but by a Python callback that watches what happened and decides what to say next. That's a **User**. + +A User is a `BaseUser` subclass (or `FunctionUser` wrapping a function) with two methods: +- `setup(instruction, solution)` — once, before round 0 +- `run(round, instruction, round_result) → str | None` — per round; return `None` to stop the loop + +Between rounds, benchflow runs `soft_verify()` (verifier without the destructive parts of full hardening), gives the user the round's `RoundResult` (trajectory, rewards, verifier output, tool count), and lets the user decide round N+1's prompt. + +The User is the lighter-weight alternative to a Scene with a simulated-user Role: no second LLM, no outbox protocol, just a Python function. Use it when the loop logic is rule-based (compress instruction → show test failures as hints → stop on pass). See [`progressive-disclosure.md`](./progressive-disclosure.md) for the full guide. + +--- + +## Verifier, sandbox, hardening + +Once the agent stops, the verifier runs. By default that's `pytest -c /dev/null --confcutdir=/tests --rootdir=/app -p no:cacheprovider /tests/test.sh` (or whatever the task's `tests/test.sh` does), against the workspace the agent left behind. + +Between agent and verifier, benchflow **hardens** the sandbox to prevent the agent from gaming the score: +- Kill any lingering agent processes +- Restore build-config files (setup.py, pyproject.toml, …) to their pre-agent snapshots +- Delete agent-injected `conftest.py`, `sitecustomize.py`, `.pth` files +- Lock the workspace to root, set restrictive PYTHONPATH/PATH for the verifier process +- Run pytest with plugin auto-discovery off, only allow plugins declared in `task.toml` + +This catches the BenchJack and Meerkat exploit families documented in [`labs/benchjack-sandbox-hardening/`](../labs/benchjack-sandbox-hardening/) and [`labs/reward-hack-matrix/`](../labs/reward-hack-matrix/). + +When a task ships a legitimate `conftest.py` (e.g. qutebrowser uses one to break a real circular import), the task opts out via `task.toml`: + +```toml +[verifier.hardening] +cleanup_conftests = false +``` + +See [`progressive-disclosure.md`](./progressive-disclosure.md#per-task-hardening-opt-outs) for the full opt-out list. + +--- + +## Multi-turn vs multi-round vs multi-scene + +Three different axes — easy to confuse, worth pinning down: + +| Axis | What changes | Example | +|------|--------------|---------| +| **Multi-turn** | Same Role, multiple prompts within one Scene. The ACP session persists; the agent has continuous memory. | One coder gets prompted twice: "fix the bug", then "now write a test". | +| **Multi-round** | Same Role, multiple `connect → execute → disconnect` cycles. New ACP session each round; sandbox state persists; a Python `User` callback decides each round's prompt. | Progressive disclosure on SWE-bench Pro: round 0 terse spec, round 1 hints with failing tests, round 2 full spec. | +| **Multi-scene** | Multiple Scenes in one Trial. Sandbox state persists; agent process and ACP session restart between Scenes. | BYOS: Scene 1 generates a skill, Scene 2 solves the task using it. | + +Single-agent simple runs use none of these. Pick the axis based on what state needs to persist (memory? sandbox? both?). + +--- + +## Trajectories and rewards + +Every agent action is captured as an event in the **trajectory** — tool calls, agent messages, agent thoughts. A `RunResult` has the full trajectory plus tool count, plus rewards from the verifier and any error. + +`rewards` is a dict produced by the task's verifier. Convention: `{"reward": float}` where 1.0 = pass, 0.0 = fail. Tasks may add additional metrics (e.g. `exact_match`, `partial_credit`). + +Trajectories are written to `///trajectory/acp_trajectory.jsonl`. Use them for replay, debugging, or training data. + +--- + +## Where to go next + +- [Getting started](./getting-started.md) — install, run your first eval. +- [Task authoring](./task-authoring.md) — write a task with `task.toml` + `tests/` + `solution/`. +- [Progressive disclosure](./progressive-disclosure.md) — the User abstraction; SWE-bench Pro case study. +- [Use cases](./use-cases.md) — multi-agent patterns (coder/reviewer, simulated user, BYOS, stateful environments). +- [CLI reference](./reference/cli.md), [Python API reference](./reference/python-api.md). +- [Skill evaluation](./skill-eval.md) — when the artifact is a skill, not a workspace. diff --git a/docs/getting-started.md b/docs/getting-started.md new file mode 100644 index 0000000..f6c9d86 --- /dev/null +++ b/docs/getting-started.md @@ -0,0 +1,125 @@ +# Getting started + +A 5-minute path from install to first eval. + +## Prerequisites + +- Python 3.12+ +- [`uv`](https://docs.astral.sh/uv/) (recommended) or `pip` +- Docker (for local sandboxes) and/or `DAYTONA_API_KEY` (for cloud sandboxes) +- An API key or subscription/OAuth auth for at least one agent (see below) + +## Install + +```bash +uv tool install benchflow +``` + +This gives you the `benchflow` (alias `bench`) CLI plus the Python SDK. To install for editable development: + +```bash +git clone https://github.com/benchflow-ai/benchflow +cd benchflow +uv venv -p 3.12 .venv && uv pip install -e ".[dev]" +``` + +## Auth: OAuth, long-lived token, or API key + +You don't need an API key if you're a Claude / Codex / Gemini subscriber. Three options, pick one per agent: + +### Option 1 — Subscription OAuth from host CLI login + +If you've logged into the agent's CLI on your host (`claude login`, `codex --login`, `gemini` interactive flow), benchflow picks up the credential file and copies it into the sandbox. No API key billing. + +| Agent | How to log in on the host | What benchflow detects | Replaces env var | +|-------|---------------------------|------------------------|------------------| +| `claude-agent-acp` | `claude login` (Claude Code CLI) | `~/.claude/.credentials.json` | `ANTHROPIC_API_KEY` | +| `codex-acp` | `codex --login` (Codex CLI) | `~/.codex/auth.json` | `OPENAI_API_KEY` | +| `gemini` | `gemini` (interactive login) | `~/.gemini/oauth_creds.json` | `GEMINI_API_KEY` | + +When benchflow finds the detect file, you'll see: + +``` +Using host subscription auth (no ANTHROPIC_API_KEY set) +``` + +### Option 2 — Long-lived OAuth token (CI / headless) + +For CI pipelines, scripts, or anywhere the host can't run an interactive browser login, generate a 1-year OAuth token with `claude setup-token` and export it: + +```bash +claude setup-token # walks you through browser auth, prints a token +export CLAUDE_CODE_OAUTH_TOKEN= +``` + +benchflow auto-inherits `CLAUDE_CODE_OAUTH_TOKEN` from your shell into the sandbox; the Claude CLI inside reads it directly. Same auth precedence as plain `claude` ([Anthropic docs](https://code.claude.com/docs/en/authentication#authentication-precedence)): API keys override OAuth tokens, so unset `ANTHROPIC_API_KEY` if you want the token to win. + +`claude setup-token` only authenticates Claude. Codex and Gemini do not have an equivalent today — use Option 1 (host login) or Option 3 (API key). + +### Option 3 — API key + +Set the API-key env var directly. Works with every agent: + +```bash +export ANTHROPIC_API_KEY=sk-ant-... +export OPENAI_API_KEY=sk-... +export GEMINI_API_KEY=... +export LLM_API_KEY=... # OpenHands / LiteLLM-compatible providers +``` + +benchflow auto-inherits well-known API key env vars from your shell into the sandbox. + +### Precedence + +If multiple credentials are set, benchflow / the agent CLI uses (high to low): cloud provider creds → `ANTHROPIC_AUTH_TOKEN` → `ANTHROPIC_API_KEY` → `apiKeyHelper` → `CLAUDE_CODE_OAUTH_TOKEN` → host subscription OAuth. To force a lower-priority option, unset the higher one in your shell before running. + +## Run your first eval + +```bash +# Single task with Gemini +GEMINI_API_KEY=... bench eval create -t .ref/terminal-bench-2/regex-log -a gemini \ + -m gemini-3.1-pro-preview -e docker + +# A whole batch with concurrency +GEMINI_API_KEY=... bench eval create -t .ref/terminal-bench-2 -a gemini \ + -m gemini-3.1-pro-preview -e daytona -c 32 + +# List the registered agents +bench agent list +``` + +`bench eval create -t ` runs once on a single task or, if the path contains multiple `task.toml`-bearing subdirectories, batches them. Results land under `jobs///` — `result.json` for the verifier output, `trajectory/acp_trajectory.jsonl` for the full agent trace. + +## Run from Python + +The CLI is a thin shim over the Python API. For programmatic use: + +```python +import benchflow as bf +from benchflow.trial import TrialConfig, Scene +from pathlib import Path + +config = TrialConfig( + task_path=Path(".ref/terminal-bench-2/regex-log"), + scenes=[Scene.single(agent="gemini", model="gemini-3.1-pro-preview")], + environment="docker", +) +result = await bf.run(config) +print(result.rewards) # {'reward': 1.0} +print(result.n_tool_calls) +``` + +`Trial` is decomposable — invoke each lifecycle phase individually for custom flows. See [Concepts: trial lifecycle](./concepts.md#trial-lifecycle). + +## What to read next + +| If you want to… | Read | +|------------------|------| +| Understand the model — Trial, Scene, Role, Verifier | [`concepts.md`](./concepts.md) | +| Author a task | [`task-authoring.md`](./task-authoring.md) | +| Run multi-agent patterns (coder/reviewer, simulated user, BYOS) | [`use-cases.md`](./use-cases.md) | +| Run multi-round single-agent (progressive disclosure) | [`progressive-disclosure.md`](./progressive-disclosure.md) | +| Evaluate skills, not tasks | [`skill-eval.md`](./skill-eval.md) | +| Understand the security model | [`sandbox-hardening.md`](./sandbox-hardening.md) | +| CLI flags + commands | [`reference/cli.md`](./reference/cli.md) | +| Python API surface | [`reference/python-api.md`](./reference/python-api.md) | diff --git a/docs/progressive-disclosure.md b/docs/progressive-disclosure.md index 13e3c78..5e27006 100644 --- a/docs/progressive-disclosure.md +++ b/docs/progressive-disclosure.md @@ -1,81 +1,152 @@ # Progressive Disclosure with `BaseUser` -A pattern for multi-round agent runs where a Python callback drives the loop, deciding what to tell the agent next based on what happened in the previous round. +## TL;DR -This is BenchFlow's lightweight alternative to multi-agent "user simulation" Scenes (see [use-cases.md](./use-cases.md#1-interactive-user-simulation-harbor-1316-equivalent)). Use a `BaseUser` callback when: +`BaseUser` is a Python callback that drives a benchflow trial across multiple rounds. Each round: the callback sees the previous verifier result and decides what to tell the agent next, or stops the loop. No second LLM, no outbox protocol — just a function that knows how to grade and hint. -- You need programmatic control over the loop (e.g. terse prompt → hints on test failure → stop on pass). -- You don't want to spin up a second LLM just to play the "user" role. -- Your "user" logic is rule-based or oracle-guided rather than open-ended. +It was built for the SWE-bench Pro progressive-disclosure use case: the dataset's instructions are long structured specs that overwhelm agents in a single turn. A `BaseUser` lets you compress the spec for round 0, watch which tests fail, then disclose hints from the spec on subsequent rounds — all driven by deterministic Python, not by another LLM acting as a "user." -For comparison: a Scene-based simulated user is another LLM with its own tool access, useful for nuanced feedback. A `BaseUser` is a sync/async Python function, useful for deterministic, scriptable progressive disclosure. - ---- - -## Why this exists - -This was built for [Josh's SWE-bench Pro use case](https://github.com/swe-bench-pro/swe-bench-pro): the dataset's instructions are long structured specs that overwhelm agents in a single turn. A `BaseUser` lets you compress the spec to a terse prompt for round 0, watch which tests fail, then disclose hints from the spec on subsequent rounds. - -It is also benchflow's parity answer to [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) for the no-second-LLM case — Harbor's proposal required a FastMCP sidecar; BenchFlow's `BaseUser` is in-process Python. - ---- - -## Quick start +It is also benchflow's parity answer to the [Harbor simulated-user proposal (#1316)](https://github.com/harbor-ai/harbor/issues/1316) for the no-second-LLM case. The Harbor proposal required a FastMCP sidecar container; benchflow's `BaseUser` is in-process Python. ```python -import asyncio -from pathlib import Path import benchflow as bf from benchflow import FunctionUser, RoundResult from benchflow.trial import TrialConfig, Scene +from pathlib import Path -def my_user(round: int, instruction: str, rr: RoundResult | None) -> str | None: +def progressive(round: int, instruction: str, rr: RoundResult | None) -> str | None: if round == 0: - # Round 0: terse prompt, no hints - return instruction.split("\n")[0] - if rr and rr.rewards and rr.rewards.get("reward", 0) >= 1.0: - return None # passed, stop + return instruction.split("\n")[0] # terse: first line only + if rr and (rr.rewards or {}).get("reward", 0) >= 1.0: + return None # passed, stop if round >= 3: - return None # cap at 3 rounds - # Otherwise: show the failing tests as a hint for next round + return None # cap at 3 rounds return ( - f"The previous attempt failed these tests:\n{rr.verifier_output}\n" - f"Here is the full spec for context:\n{instruction}" + f"Tests failed:\n{rr.verifier_output}\n\n" # show failures + spec + f"Full spec:\n{instruction}" ) config = TrialConfig( task_path=Path(".ref/swebenchpro/instance_flipt-io__flipt-..."), scenes=[Scene.single(agent="opencode", model="anthropic/claude-sonnet-4-6")], - user=FunctionUser(my_user), + user=FunctionUser(progressive), max_user_rounds=3, environment="daytona", ) -result = asyncio.run(bf.run(config)) +result = await bf.run(config) +``` + +--- + +## Case study: SWE-bench Pro + +SWE-bench Pro tasks ship long, structured `instruction.md` specs (typically 2-5KB) describing API requirements, test fixtures, and expected behaviors. Single-shot agents either drown in the spec or under-engineer because they bail before reading to the bottom. + +The SWE-bench Pro eval that motivated this feature wanted exactly this loop: + +``` +round 0 "Fix the bug described here: " + agent attempts → tests fail +round 1 "Tests failed. Here is the full requirements section: ." + agent retries → tests still fail +round 2 "Still failing. Here's the full original spec: " + agent makes final attempt +``` + +Rule-based, deterministic, and the "user" never needs to think — the disclosure schedule is fixed. Spinning up a second LLM to play the user role would (a) cost double, (b) introduce nondeterminism, and (c) require an outbox protocol the agent has to learn. + +### Validation (2026-04-25, 5 SWE-bench Pro tasks, Daytona, Gemini 3.1 Pro Preview) + +| Task | Oracle | Single-round baseline | 3-round progressive (final) | Per-round soft-verify | +|------|--------|-----------------------|------------------------------|------------------------| +| ansible | ✅ 1.0 | ✅ 1.0 (23 tools, 207s) | ✅ 1.0 (126 tools, 3 rounds) | 0.0 / 0.0 / 0.0 | +| flipt | ✅ 1.0 | ❌ 0.0 (61 tools, 1444s) | ❌ 0.0 (195 tools, 3 rounds) | 0.0 / 0.0 / 0.0 | +| openlibrary | ✅ 1.0 | ✅ 1.0 (32 tools, 340s) | ✅ 1.0 (82 tools, 3 rounds) | 0.0 / 0.0 / 0.0 | +| navidrome | ✅ 1.0 | (not tested) | ❌ 0.0 (145 tools, 3 rounds) | 0.0 / 0.0 / 0.0 | +| qutebrowser | ✅ 1.0 (with `cleanup_conftests=false`) | ❌ 0.0 (verifier broken pre-fix) | ✅ 1.0 (183 tools, 3 rounds) | 0.0 / 0.0 / 0.0 | + +What this run shows and doesn't show: + +- **The infrastructure works on real SWE-bench Pro tasks.** All 5 tasks completed 3 rounds end-to-end (after one retry on ansible/qutebrowser to clear intermittent flake). Round trajectories captured, soft_verify runs between rounds, BaseUser callback drives the loop. +- **3/5 hit the canonical reward** (ansible, openlibrary, qutebrowser). flipt and navidrome stayed at 0.0 across all three rounds — Gemini 3.1 Pro doesn't crack them with this hint schedule, and progressive disclosure didn't help. +- **Per-round soft-verify scored 0.0 even on tasks where the final hardened verify scored 1.0.** Soft-verify runs between rounds without the full hardening sequence (no workspace restore, no process kill so the sandbox stays alive), so its scoring can diverge from the final verifier. The user's hint schedule reacts to soft-verify, not the canonical reward — something to keep in mind when designing the loop. +- **First-run flake.** ansible's first run hit a transport EOF after 17min and qutebrowser timed out at 50min. Both succeeded on retry. v0.3.3 adds `agent_idle_timeout` (default 600s) and clearer EOF diagnostics so the next time a hang happens the failure is fast and actionable rather than silent. + +This is one model on one day, not a published comparison. The notebook at [`examples/swebench_pro_progressive_disclosure.ipynb`](../examples/swebench_pro_progressive_disclosure.ipynb) has the executable cells; raw aggregated results are at [`experiments/swebench-pro-progressive-results.json`](../experiments/swebench-pro-progressive-results.json). + +--- + +## Where it lives in the trial lifecycle + +`BaseUser` plugs into the existing `Trial` lifecycle ([concepts](./concepts.md#trial-lifecycle)) without changing any of the existing phases. When `TrialConfig.user` is set, `Trial._run_user_loop()` replaces the single-pass `connect → execute → disconnect` block with a per-round version: + +``` +setup() → start() → install_agent() + ↓ +[oracle setup if oracle_access=True: read /solution, hide it from agent] + ↓ +user.setup(instruction, solution) ← once + ↓ +┌─ user.run(round, instruction, rr) → str | None +│ │ None: break +│ ↓ +│ connect_as(role) +│ execute(prompts=[prompt]) +│ disconnect() +│ ↓ +│ soft_verify() ← partial hardening, sandbox stays alive +│ ↓ +│ build RoundResult, log, repeat +└─ │ + ↓ (loop ends when user returns None or max_user_rounds reached) +[oracle restore: mv /solution_oracle_backup → /solution for final verify] + ↓ +verify() ← full hardening, final reward + ↓ +cleanup() ``` +Multi-scene / multi-role configs are not compatible with `User` — the loop assumes one Scene with one Role. Setting both raises `ValueError`. + +--- + +## Soft-verify and full-verify: two different verifiers + +Between rounds, benchflow needs to score the agent's progress so the user can react. But the final, end-of-trial verifier does destructive things (kills the agent, restores the workspace, chowns to root) that would prevent the next round from running. So benchflow runs **two** verifier passes: + +| | Soft-verify (between rounds) | Full-verify (end of trial) | +|---|---|---| +| Kills agent processes | ❌ no | ✅ yes | +| Restores workspace from snapshot | ❌ no | ✅ optional, task-driven | +| Purges agent-injected `conftest.py`, `sitecustomize.py`, `.pth` | ✅ yes | ✅ yes | +| Locks down PATH/PYTHONPATH | ✅ yes | ✅ yes | +| `chmod 777 /logs/verifier` | ✅ yes (so non-root verifier can write) | n/a (root) | +| Runs verifier | ✅ yes | ✅ yes | +| Result | feeds `RoundResult.rewards` | the trial's final score | + +Soft-verify is intentionally weaker than full-verify — losing some score-gaming protection in exchange for keeping the sandbox alive. The cleanup step still purges agent-injected hook files (`CLEANUP_CMD`), so an agent can't plant a `conftest.py` that flips the round score. + --- ## API ### `BaseUser` -Subclass and override `run()`. Optionally override `setup()` for one-time initialization. - ```python from benchflow import BaseUser, RoundResult class MyUser(BaseUser): async def setup(self, instruction: str, solution: str | None = None) -> None: - """Called once before the first round. + """Called once before round 0. instruction — the original task instruction (from instruction.md) - solution — the gold answer if oracle_access=True, else None + solution — gold answer if oracle_access=True, else None """ - self.spec_lines = instruction.split("\n") - self.gold = solution # only set if oracle_access=True + self.spec = instruction + self.gold = solution async def run( self, @@ -83,13 +154,13 @@ class MyUser(BaseUser): instruction: str, round_result: RoundResult | None = None, ) -> str | None: - """Produce the next prompt, or None to stop the loop. + """Return the next prompt, or None to stop. - round — 0-indexed round number - instruction — the original task instruction - round_result — None on round 0; previous round's outcome on subsequent rounds + round — 0-indexed + instruction — original task instruction (unchanged each round) + round_result — None on round 0; previous round's outcome on subsequent rounds """ - ... # return prompt str or None + ... ``` ### `RoundResult` @@ -99,24 +170,24 @@ Dataclass passed to `run()` from round 1 onward. ```python @dataclass class RoundResult: - round: int # 0-indexed - trajectory: list[dict] # ACP events from this round only - rewards: dict[str, Any] | None # verifier rewards (None if verifier crashed) - verifier_output: str | None # raw verifier stdout/log content - verifier_error: str | None # exception message if verifier failed - n_tool_calls: int # tool calls in this round + round: int # 0-indexed + trajectory: list[dict] # ACP events from this round only + rewards: dict | None # verifier rewards (None if verifier crashed) + verifier_output: str | None # raw verifier stdout/log + verifier_error: str | None # exception message if verifier failed + n_tool_calls: int # tool calls in this round ``` ### `PassthroughUser` -Sends the instruction unchanged on round 0, stops on round 1. Backward-compatible single-round behavior. +Sends the instruction unchanged on round 0, stops on round 1. Use it as the explicit single-round-equivalent. ### `FunctionUser` -Wraps a plain function as a `BaseUser`. Sync and async both supported (via `inspect.isawaitable`). +Wraps a plain function as a `BaseUser`. Sync or async — uses `inspect.isawaitable` to detect. ```python -def fn(round, instruction, rr): return None if round > 0 else instruction +def fn(round, instruction, rr): ... user = FunctionUser(fn) async def afn(round, instruction, rr): ... @@ -127,88 +198,84 @@ user = FunctionUser(afn) ```python user: BaseUser | None = None # the callback -max_user_rounds: int = 5 # hard cap on rounds (loop stops earlier if user returns None) +max_user_rounds: int = 5 # cap on rounds (loop also stops when user returns None) oracle_access: bool = False # expose gold solution to user.setup() ``` -A `User` requires a single-scene, single-role config. Multi-scene or multi-role configs raise `ValueError`. - --- ## Oracle access -When `oracle_access=True`, the trial: +When `oracle_access=True`: -1. Reads `/solution/solve.sh` before the agent starts and passes its content to `user.setup(instruction, solution=...)`. -2. Moves `/solution` → `/solution_oracle_backup` so the agent cannot read it during its rounds. -3. Temporarily restores `/solution` for `soft_verify()` between rounds (and re-hides it). -4. Restores `/solution` permanently before the final `verify()`. +1. Before round 0, the trial reads `/solution/solve.sh` and passes its contents to `user.setup(instruction, solution=...)`. +2. The trial moves `/solution` → `/solution_oracle_backup` so the agent can't read it during its rounds. +3. Between rounds, soft-verify temporarily restores `/solution` (some verifiers consult it) then re-hides it. +4. Before the final `verify()`, the trial permanently restores `/solution`. -Step 4 is wrapped in a `try/finally`, so if a round throws, the restore still runs. +Step 4 is wrapped in `try/finally` against the user loop: if a round throws, the restore still runs. -> ⚠️ Setting `oracle_access=True` without a `User` is a misconfiguration — the solution stays exposed to the agent for the entire trial. BenchFlow logs a `WARNING` at setup time when this happens. +> ⚠️ Setting `oracle_access=True` *without* a `User` is a misconfiguration — the solution stays exposed to the agent for the entire trial. benchflow logs a `WARNING` at setup time when this happens. Use cases for oracle access: -- Dataset generation: have the user generate optimal prompts based on knowing the answer. -- Curriculum learning: progressively reveal hints from the gold solution. -- Research: study how much oracle information is needed for an agent to succeed. +- **Dataset generation** — the user has the answer, generates an optimal prompt for the agent +- **Curriculum learning** — progressively reveal pieces of the solution +- **Research** — measure how much oracle information is required for an agent to succeed --- ## Per-task hardening opt-outs -The verifier's pre-run cleanup deletes `conftest.py` files outside `/tests/` to prevent agent reward-hacking. Some tasks (e.g. qutebrowser) ship legitimate `conftest.py` that sets up Python's import order to break a real circular dependency. The default cleanup deletes them, breaking pytest collection. +The verifier's pre-run cleanup deletes `conftest.py` outside `/tests/` to prevent reward-hacking. Some tasks (qutebrowser) ship legitimate `conftest.py` files that fix real circular imports — deleting them breaks pytest collection. -Tasks declare opt-outs in `task.toml`: +Tasks opt out in `task.toml`: ```toml -[verifier] -timeout_sec = 3000 - [verifier.hardening] cleanup_conftests = false ``` -Available flags (all default `true` — secure-by-default): +| Flag | Default | Effect when `false` | +|------|---------|---------------------| +| `cleanup_conftests` | `true` | Don't delete `conftest.py` outside `/tests/` before verify | -| Flag | Effect when `false` | -|------|---------------------| -| `cleanup_conftests` | Don't delete `conftest.py` outside `/tests/` before verify | +`sitecustomize.py`, `.pth` files, and `*.py` in `/tmp` always get cleaned — they have no legitimate use in a test artifact and disabling them broadens the attack surface beyond what real-world tasks need. -Other cleanup steps (`sitecustomize.py`, `.pth` files, `*.py` in `/tmp`) always run — they have no legitimate use case in repo source trees and broaden the attack surface if disabled. - -Unknown keys in `[verifier.hardening]` are logged as warnings and ignored. String values for boolean flags are rejected (must be TOML `true` / `false`). +Unknown keys in `[verifier.hardening]` are warned and ignored. String values for boolean flags are rejected. --- ## Failure modes -The user loop catches exceptions from `user.run()` and logs them as the trial error, breaking out of the loop: +The user loop catches exceptions from `user.run()` and stops, with the exception message stored in `Trial._error`: -```python +``` [User] round 2: prompt='Try again, focusing on...' -ERROR: user.run() failed at round 2: KeyError: 'spec_section' +ERROR user.run() failed at round 2: KeyError: 'spec_section' ``` -`soft_verify()` between rounds catches its own timeouts and crashes — they surface as `RoundResult.verifier_error`, not as trial-level failures. The next round still runs; the user sees the error and decides whether to continue. +`soft_verify()` between rounds catches its own timeouts and crashes — they surface as `RoundResult.verifier_error`, not as a trial-level failure. The next round still runs and the user can decide what to do. -Trajectory and tool counts are sliced per round from `Trial._trajectory`. The session counters reset on `disconnect()` between rounds, so each round's `RoundResult.trajectory` and `n_tool_calls` reflect only that round's events. +Trajectory and tool counts are sliced per round from `Trial._trajectory`. The session counters reset on `disconnect()`, so each round's `RoundResult.trajectory` and `n_tool_calls` reflect only that round's events, not cumulative. --- -## Worked example +## Comparison with multi-agent simulated user (Harbor #1316 parity) -See [`examples/swebench_pro_progressive_disclosure.ipynb`](../examples/swebench_pro_progressive_disclosure.ipynb) for a 5-task SWE-bench Pro comparison: oracle vs single-round baseline vs 3-round progressive disclosure on flipt and openlibrary. +benchflow has two patterns for multi-round agent runs. Both are functionally at parity with [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) — neither requires a FastMCP sidecar. -For a minimal end-to-end script, see [`examples/user_dogfood.py`](../examples/user_dogfood.py). +| Pattern | What "user" is | When to use | +|---------|---------------|-------------| +| **`BaseUser` callback (this doc)** | Python function in the scheduler process | Programmatic, deterministic, rule-based. No second LLM. Cheap. Best for progressive disclosure, curriculum, scripted hints. | +| **Multi-role Scene with simulated-user role** ([use-cases §1](./use-cases.md#1-interactive-user-simulation-harbor-1316-equivalent)) | Another LLM with full tool access | Open-ended, conversational. The "user" can read files, check outputs, give nuanced feedback. Best when the user's behavior must itself be adaptive or LLM-quality. | ---- +The two coexist. Choose based on whether your "user" needs to think (Scene-based) or just decide (`BaseUser`). For the SWE-bench Pro use case, the disclosure schedule is fixed, the grading is the verifier, and there's nothing for a second LLM to add — `BaseUser` wins on cost and determinism. -## Comparison with multi-agent Scene-based user simulation +--- -| Pattern | When to use | -|---------|-------------| -| `BaseUser` callback (this doc) | Programmatic, rule-based, deterministic. No second LLM. Cheap. | -| Multi-role Scene with simulated-user role ([use-cases.md §1](./use-cases.md#1-interactive-user-simulation-harbor-1316-equivalent)) | Open-ended, conversational. The "user" is another LLM with full tool access. Better for nuanced human-like interaction. | +## Worked examples -Both patterns coexist. Choose `BaseUser` for the lighter-weight case; choose Scenes when you actually want a second agent in the loop. +- [`examples/swebench_pro_progressive_disclosure.ipynb`](../examples/swebench_pro_progressive_disclosure.ipynb) — the SWE-bench Pro case study, executable end-to-end with the latest oracle/baseline data. +- [`examples/swebench_pro_user_dogfood.py`](../examples/swebench_pro_user_dogfood.py) — runnable script for any of the 5 SWE-bench Pro tasks. `--task flipt --max-rounds 3`. +- [`examples/user_dogfood.py`](../examples/user_dogfood.py) — minimal regex-log task with `FunctionUser`, useful as a starting template. +- [`experiments/swebench_pro_oracle_and_baseline.py`](../experiments/swebench_pro_oracle_and_baseline.py) — the oracle-validation + baseline experiment script that produced the table above. diff --git a/docs/quickstart.md b/docs/quickstart.md deleted file mode 100644 index 502fc94..0000000 --- a/docs/quickstart.md +++ /dev/null @@ -1,122 +0,0 @@ -# Quickstart - -Get a benchmark result in under 5 minutes. - -## Prerequisites - -- Python 3.12+ and [uv](https://docs.astral.sh/uv/) -- A Daytona API key (`DAYTONA_API_KEY`) for cloud sandboxes -- An agent API key (e.g. `GEMINI_API_KEY` for Gemini) - -## Install - -```bash -uv tool install benchflow -``` - -## Run your first evaluation - -```bash -# Set credentials -export DAYTONA_API_KEY="dtn_..." -export GEMINI_API_KEY="AIza..." - -# Run one TB2 task with Gemini -bench eval create \ - -t .ref/terminal-bench-2/regex-log \ - -a gemini \ - -m gemini-3.1-flash-lite-preview \ - -e daytona \ - --sandbox-setup-timeout 300 -``` - -BenchFlow will: -1. Download Terminal-Bench-2 tasks (first run only) -2. Spin up a Daytona sandbox -3. Install the Gemini CLI agent -4. Send the task instruction via ACP -5. Run the verifier -6. Print the reward (0.0 or 1.0) - -## Run a full benchmark - -```bash -# 89 TB2 tasks, 64 concurrent -bench eval create -f benchmarks/tb2-gemini-baseline.yaml -``` - -Example YAML config: -```yaml -task_dir: .ref/terminal-bench-2 -agent: gemini -model: gemini-3.1-flash-lite-preview -environment: daytona -concurrency: 64 -max_retries: 2 -sandbox_setup_timeout: 300 -``` - -## Python API - -```python -import benchflow as bf - -# One-liner -result = await bf.run("gemini", task_path="tasks/regex-log", model="gemini-3.1-flash-lite-preview") -print(f"reward={result.rewards}") - -# With Trial for more control -from benchflow.trial import Trial, TrialConfig, Scene - -config = TrialConfig( - task_path=Path("tasks/regex-log"), - scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")], - environment="daytona", - sandbox_setup_timeout=300, -) -trial = await Trial.create(config) -result = await trial.run() -``` - -If you are using the `Agent + Environment` path directly, pass the timeout through `RuntimeConfig`: - -```python -from benchflow.runtime import Agent, Environment, Runtime, RuntimeConfig - -agent = Agent("gemini", model="gemini-3.1-flash-lite-preview") -env = Environment.from_task("tasks/regex-log", backend="daytona") -runtime = Runtime(env, agent, config=RuntimeConfig(sandbox_setup_timeout=300)) -result = await runtime.execute() -``` - -## Multi-agent (reviewer pattern) - -```python -from benchflow.trial import TrialConfig, Scene, Role, Turn - -config = TrialConfig( - task_path=Path("tasks/regex-log"), - scenes=[ - Scene(name="coder-reviewer", - roles=[ - Role("coder", "gemini", "gemini-3.1-flash-lite-preview"), - Role("reviewer", "gemini", "gemini-3.1-flash-lite-preview"), - ], - turns=[ - Turn("coder"), - Turn("reviewer", "Review the code in /app/. Write feedback to /app/.outbox/coder.json"), - Turn("coder", "Read reviewer feedback and fix issues."), - ]), - ], - environment="daytona", - sandbox_setup_timeout=300, -) -result = await bf.run(config) -``` - -## Next steps - -- [CLI Reference](cli-reference.md) — all commands -- [Task Authoring](task-authoring.md) — create your own tasks -- [API Reference](api-reference.md) — Trial/Scene API details -- [Skill Eval Guide](skill-eval-guide.md) — evaluate agent skills diff --git a/docs/cli-reference.md b/docs/reference/cli.md similarity index 100% rename from docs/cli-reference.md rename to docs/reference/cli.md diff --git a/docs/api-reference.md b/docs/reference/python-api.md similarity index 98% rename from docs/api-reference.md rename to docs/reference/python-api.md index 00325e6..841d6cc 100644 --- a/docs/api-reference.md +++ b/docs/reference/python-api.md @@ -227,7 +227,7 @@ The Scene API in 0.3 covers coder-reviewer and multi-turn patterns. It does **no - **Per-round verification** — `verify()` runs once after all scenes complete, not between rounds. - **Inter-round trajectory inspection** — a "user" role cannot read the agent's trajectory between turns. -These are tracked for 0.4. See the [Harbor PR #1462 mapping](docs/notebooks/scene-patterns.ipynb) for details. +These are tracked for 0.4. See the [Harbor PR #1462 mapping](../examples/scene-patterns.ipynb) for details. ## YAML Trial Configs diff --git a/docs/sandbox-hardening.md b/docs/sandbox-hardening.md new file mode 100644 index 0000000..b9f2392 --- /dev/null +++ b/docs/sandbox-hardening.md @@ -0,0 +1,82 @@ +# Sandbox hardening + +## Why + +Frontier agents will reward-hack any verifier they can. [BenchJack](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) (Berkeley RDI, April 2026) showed that ~100% of tasks across eight major AI agent benchmarks are solvable by exploiting the *verifier's runtime* rather than the agent's intelligence: pytest `conftest.py` hooks, planted PATH binaries, leaked answer keys, `.pth` injection, `sitecustomize.py` shadowing. [Meerkat](https://debugml.github.io/cheating-agents/) (Penn, same week) confirmed it from the trace side — frontier models discover and use these shortcuts in the wild. + +benchflow's verifier hardening blocks these patterns by default. Tasks opt out individually when their legitimate test setup requires a hook the cleanup would otherwise delete (e.g. qutebrowser's `conftest.py` resolves a real circular import). + +## What gets hardened + +Between the agent's last action and the verifier's first command, `harden_before_verify()` (in `src/benchflow/_sandbox.py`) runs this sequence: + +1. **Kill sandbox-user processes** — `pkill -u `, twice with a sleep, so the agent can't keep writing during teardown. +2. **Wipe `/logs/verifier`** — fresh, root-owned, mode 0777 (so a non-root verifier can write its own results). +3. **Restore build-config files** from a pre-agent snapshot — `setup.py`, `pyproject.toml`, `setup.cfg`, `tox.ini`, `noxfile.py`, `hatch.toml`, `flit.ini`, `MANIFEST.in`, `requirements.txt`, `requirements-dev.txt`, `Makefile`. Closes the editable-install hijack vector. +4. **Optionally restore the entire workspace** from `/testbed_verify` snapshot — opt-in per task because some tasks score workspace edits. +5. **Purge symlinks whose realpath escapes the workspace** — agent can't redirect to `/tmp/evil.py`. +6. **Purge `__pycache__` trees** that didn't exist in the pre-agent baseline. +7. **chown workspace to root** — belt-and-suspenders against zombie sandbox-user processes. +8. **`CLEANUP_CMD`** — delete `conftest.py` outside `/tests/`, `*.py` from `/tmp`/`/var/tmp`, `sitecustomize.py`/`usercustomize.py`/`*.pth` from writable `sys.path` entries. +9. **Set `VERIFIER_ENV`** — pinned `PATH`, `PYTEST_ADDOPTS=-c /dev/null --confcutdir=/tests --rootdir=/app -p no:cacheprovider`, `PYTHONPATH=""`, `PYTHONDONTWRITEBYTECODE=1`, plugin auto-discovery off. +10. **Discover allowed pytest plugins** — `importlib.metadata` scan inside the container, then merge with task-declared `pytest_plugins` from `task.toml`. Anything not in the allow-list is blocked. + +The verifier then runs against this hardened workspace. + +## Per-task opt-outs + +Tasks declare opt-outs in `task.toml`: + +```toml +[verifier.hardening] +cleanup_conftests = false +``` + +| Flag | Default | Effect when `false` | +|------|---------|---------------------| +| `cleanup_conftests` | `true` | Don't delete `conftest.py` outside `/tests/` before verify | + +Other cleanup steps (`sitecustomize.py`, `.pth`, `/tmp` `*.py`) always run — they have no legitimate use in a test artifact and disabling them would broaden the attack surface beyond what real tasks need. + +Unknown keys in `[verifier.hardening]` are warned and ignored. String values for boolean flags are rejected. + +See [`progressive-disclosure.md`](./progressive-disclosure.md#per-task-hardening-opt-outs) for the qutebrowser case study (legitimate `conftest.py` for circular-import fix). + +## Validation: the labs + +Two runnable research artifacts demonstrate the hardening's effectiveness: + +### `labs/benchjack-sandbox-hardening/` + +End-to-end demo that benchflow 0.2.1+ blocks three BenchJack-style exploits that work against 0.2.0. Each exploit flips the reward from 0.0 to 1.0 under 0.2.0 — under hardened benchflow, the same exploits fail and reward stays 0.0. + +Patterns covered: +- `pattern1_conftest_hook/` — agent plants `conftest.py` with a pytest_runtest_makereport hook that overwrites the report +- `pattern2_answer_lookup/` — agent reads `/solution/` directly +- `pattern7_pth_injection/` — agent installs a `.pth` file that imports a sitecustomize-style shadow + +One-command repro: `cd labs/benchjack-sandbox-hardening && ./run_comparison.py`. Output is `comparison.ipynb`-style diff: 0.2.0 reward=1.0 / hardened reward=0.0 per pattern. + +### `labs/reward-hack-matrix/` + +Full reward-hack sweep across real benchmark tasks comparing 0.2.0 vs 0.2.2. Scope: 8 major benchmarks, ~100 tasks, multiple exploit families. Output: a sweep JSON (`sweep_0.2.0_vs_0.2.2.json`) showing the hardening rate by exploit class. + +Run with `cd labs/reward-hack-matrix && python run_matrix.py`. The lab README documents the methodology and per-task results. + +## Threat model and known gaps + +Benchflow's hardening assumes: +- The Docker image and pre-installed packages are trusted (they're built before the agent runs). +- The agent runs as a non-root sandbox user (`sandbox_user`) — when set to `None`, hardening still runs but the agent can defeat lockdown by being root. Don't run untrusted agents as root. +- The verifier itself is trusted code shipped with the task; benchflow only hardens *what the agent can do to the verifier's environment*. + +Known residual risk: +- An agent with sustained access can poison `__pycache__` for files that exist in the baseline (those caches aren't deleted because some tasks diff workspace against `/testbed_verify`). Mitigated by the workspace chown but not eliminated. +- Tasks that don't ship a build-config snapshot can still be hijacked via `setup.py` edits. Snapshot is automatic for declared filenames — task authors don't need to opt in. + +## Related + +- [`labs/benchjack-sandbox-hardening/README.md`](../labs/benchjack-sandbox-hardening/README.md) — full BenchJack pattern catalog and repro instructions. +- [`labs/reward-hack-matrix/README.md`](../labs/reward-hack-matrix/README.md) — methodology, exploit taxonomy, sweep results. +- [`progressive-disclosure.md`](./progressive-disclosure.md) — soft-verify (the relaxed hardening used between rounds in multi-round trials). +- [`task-authoring.md`](./task-authoring.md) — the `task.toml` schema including `[verifier.hardening]` opt-outs. diff --git a/docs/skill-eval-guide.md b/docs/skill-eval.md similarity index 100% rename from docs/skill-eval-guide.md rename to docs/skill-eval.md diff --git a/docs/notebooks/coder-reviewer-demo.py b/examples/coder-reviewer-demo.py similarity index 97% rename from docs/notebooks/coder-reviewer-demo.py rename to examples/coder-reviewer-demo.py index 010b492..8c3ba62 100644 --- a/docs/notebooks/coder-reviewer-demo.py +++ b/examples/coder-reviewer-demo.py @@ -12,8 +12,8 @@ - A Harbor-format task directory (e.g. .ref/terminal-bench-2/regex-log) Usage: - python docs/notebooks/coder-reviewer-demo.py --task .ref/terminal-bench-2/regex-log - python docs/notebooks/coder-reviewer-demo.py --task .ref/terminal-bench-2/regex-log --env docker + python examples/coder-reviewer-demo.py --task .ref/terminal-bench-2/regex-log + python examples/coder-reviewer-demo.py --task .ref/terminal-bench-2/regex-log --env docker Terminology: - Turn: One prompt → one ACP session (one role acts) diff --git a/docs/notebooks/nanofirm-task/environment/Dockerfile b/examples/nanofirm-task/environment/Dockerfile similarity index 100% rename from docs/notebooks/nanofirm-task/environment/Dockerfile rename to examples/nanofirm-task/environment/Dockerfile diff --git a/docs/notebooks/nanofirm-task/environment/contract.md b/examples/nanofirm-task/environment/contract.md similarity index 100% rename from docs/notebooks/nanofirm-task/environment/contract.md rename to examples/nanofirm-task/environment/contract.md diff --git a/docs/notebooks/nanofirm-task/instruction.md b/examples/nanofirm-task/instruction.md similarity index 100% rename from docs/notebooks/nanofirm-task/instruction.md rename to examples/nanofirm-task/instruction.md diff --git a/docs/notebooks/nanofirm-task/solution/solve.sh b/examples/nanofirm-task/solution/solve.sh similarity index 100% rename from docs/notebooks/nanofirm-task/solution/solve.sh rename to examples/nanofirm-task/solution/solve.sh diff --git a/docs/notebooks/nanofirm-task/task.toml b/examples/nanofirm-task/task.toml similarity index 100% rename from docs/notebooks/nanofirm-task/task.toml rename to examples/nanofirm-task/task.toml diff --git a/docs/notebooks/nanofirm-task/tests/evaluate.py b/examples/nanofirm-task/tests/evaluate.py similarity index 100% rename from docs/notebooks/nanofirm-task/tests/evaluate.py rename to examples/nanofirm-task/tests/evaluate.py diff --git a/docs/notebooks/nanofirm-task/tests/test.sh b/examples/nanofirm-task/tests/test.sh similarity index 100% rename from docs/notebooks/nanofirm-task/tests/test.sh rename to examples/nanofirm-task/tests/test.sh diff --git a/docs/notebooks/scene-patterns.ipynb b/examples/scene-patterns.ipynb similarity index 100% rename from docs/notebooks/scene-patterns.ipynb rename to examples/scene-patterns.ipynb diff --git a/docs/notebooks/scene-patterns.md b/examples/scene-patterns.md similarity index 99% rename from docs/notebooks/scene-patterns.md rename to examples/scene-patterns.md index 91e64b9..e6719b2 100644 --- a/docs/notebooks/scene-patterns.md +++ b/examples/scene-patterns.md @@ -107,7 +107,7 @@ Each pattern is a TrialConfig change — same API, same verifier, same trajector ```bash pip install google-generativeai export GEMINI_API_KEY="AIza..." -python docs/notebooks/scene-patterns.py +python examples/scene-patterns.py ``` The script constructs the contract inline and runs all 4 patterns with actual LLM calls. No Docker or Daytona needed — it demonstrates the interaction patterns directly. diff --git a/docs/notebooks/scene-patterns.py b/examples/scene-patterns.py similarity index 99% rename from docs/notebooks/scene-patterns.py rename to examples/scene-patterns.py index 961bd09..88a92a6 100644 --- a/docs/notebooks/scene-patterns.py +++ b/examples/scene-patterns.py @@ -14,7 +14,7 @@ Run: export GEMINI_API_KEY="AIza..." - python docs/notebooks/scene-patterns.py + python examples/scene-patterns.py """ import os diff --git a/examples/swebench_pro_progressive_disclosure.ipynb b/examples/swebench_pro_progressive_disclosure.ipynb index 3b6fe2c..c3978dd 100644 --- a/examples/swebench_pro_progressive_disclosure.ipynb +++ b/examples/swebench_pro_progressive_disclosure.ipynb @@ -12,7 +12,7 @@ "**Baseline agent:** Gemini 3.1 Pro Preview, single-round.\n", "**Progressive:** `BaseUser` callback, up to 3 rounds, hints disclosed on test failure.\n", "\n", - "Built for [Josh's GitHub/Microsoft SWE-bench Pro use case](https://github.com/swe-bench-pro/swe-bench-pro). Parity answer to [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) for the no-second-LLM case — see [`docs/progressive-disclosure.md`](../docs/progressive-disclosure.md).\n", + "Built for the SWE-bench Pro progressive-disclosure use case. Parity answer to [Harbor #1316](https://github.com/harbor-ai/harbor/issues/1316) for the no-second-LLM case — see [`docs/progressive-disclosure.md`](../docs/progressive-disclosure.md).\n", "\n", "## Setup history (2026-04-24)\n", "\n", @@ -36,10 +36,10 @@ "execution_count": 1, "metadata": { "execution": { - "iopub.execute_input": "2026-04-25T01:10:06.729067Z", - "iopub.status.busy": "2026-04-25T01:10:06.728964Z", - "iopub.status.idle": "2026-04-25T01:10:10.646349Z", - "shell.execute_reply": "2026-04-25T01:10:10.645629Z" + "iopub.execute_input": "2026-04-25T17:18:52.654931Z", + "iopub.status.busy": "2026-04-25T17:18:52.654778Z", + "iopub.status.idle": "2026-04-25T17:18:52.927296Z", + "shell.execute_reply": "2026-04-25T17:18:52.926906Z" } }, "outputs": [ @@ -82,10 +82,10 @@ "execution_count": 2, "metadata": { "execution": { - "iopub.execute_input": "2026-04-25T01:10:10.667007Z", - "iopub.status.busy": "2026-04-25T01:10:10.666654Z", - "iopub.status.idle": "2026-04-25T01:10:10.670952Z", - "shell.execute_reply": "2026-04-25T01:10:10.670403Z" + "iopub.execute_input": "2026-04-25T17:18:52.945618Z", + "iopub.status.busy": "2026-04-25T17:18:52.945499Z", + "iopub.status.idle": "2026-04-25T17:18:52.949997Z", + "shell.execute_reply": "2026-04-25T17:18:52.949446Z" } }, "outputs": [ @@ -138,10 +138,10 @@ "execution_count": 3, "metadata": { "execution": { - "iopub.execute_input": "2026-04-25T01:10:10.672135Z", - "iopub.status.busy": "2026-04-25T01:10:10.672040Z", - "iopub.status.idle": "2026-04-25T01:10:10.676265Z", - "shell.execute_reply": "2026-04-25T01:10:10.675540Z" + "iopub.execute_input": "2026-04-25T17:18:52.951031Z", + "iopub.status.busy": "2026-04-25T17:18:52.950929Z", + "iopub.status.idle": "2026-04-25T17:18:52.955473Z", + "shell.execute_reply": "2026-04-25T17:18:52.954919Z" } }, "outputs": [ @@ -200,10 +200,10 @@ "execution_count": 4, "metadata": { "execution": { - "iopub.execute_input": "2026-04-25T01:10:10.677398Z", - "iopub.status.busy": "2026-04-25T01:10:10.677287Z", - "iopub.status.idle": "2026-04-25T01:10:10.681221Z", - "shell.execute_reply": "2026-04-25T01:10:10.680559Z" + "iopub.execute_input": "2026-04-25T17:18:52.956429Z", + "iopub.status.busy": "2026-04-25T17:18:52.956330Z", + "iopub.status.idle": "2026-04-25T17:18:52.959689Z", + "shell.execute_reply": "2026-04-25T17:18:52.959351Z" } }, "outputs": [ @@ -238,18 +238,18 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 5. Progressive disclosure: where it should help\n", + "## 5. Progressive disclosure: 5-task run with Gemini 3.1 Pro\n", "\n", - "**flipt** is the interesting case: oracle passes (gold `solve.sh` works) but Gemini's single-round baseline fails after 61 tool calls and 24 minutes. The agent had the full spec but didn't solve it. Hypothesis: progressive disclosure with failing-test feedback could let the agent course-correct.\n", + "Three-round progressive disclosure (terse → failing tests + half spec → full spec) on all 5 oracle-passing SWE-bench Pro tasks. Daytona backend, Gemini 3.1 Pro Preview as the agent.\n", "\n", - "**openlibrary** is the regression test: baseline already passes, so progressive disclosure should also pass. If progressive breaks something the baseline solves, that's a bug in the user loop.\n", + "**What we expected:**\n", + "- **flipt** — baseline failed at 24min/61 tools; progressive should give the agent failure feedback to course-correct\n", + "- **openlibrary** — baseline already passed; progressive should not regress\n", + "- **ansible / navidrome / qutebrowser** — see whether progressive helps tasks where baseline numbers weren't run yet\n", "\n", - "Run the script:\n", - "```bash\n", - "GEMINI_API_KEY=... python examples/swebench_pro_user_dogfood.py --task flipt --max-rounds 3\n", - "```\n", + "**What actually happened:** 3/5 tasks reached final reward 1.0 (ansible, openlibrary, qutebrowser). flipt and navidrome stayed at 0.0 — Gemini 3.1 Pro didn't crack them with this hint schedule. ansible and qutebrowser flaked on first run (transport EOF / 50min timeout) and succeeded on retry; v0.3.3 adds `agent_idle_timeout` and clearer EOF diagnostics so future hangs fail fast and clearly.\n", "\n", - "Results land in `/tmp/swebench-pro-jobs/progressive///`." + "A subtle finding: per-round soft_verify scored 0.0 across all rounds even on tasks where the final hardened verify reported 1.0. Soft_verify intentionally skips workspace restore + process kill so the sandbox stays alive, so its score can diverge from the canonical verifier. Designing a User callback's stop condition needs to account for this — reading `RoundResult.verifier_output` (raw test output) is more reliable than the rewards dict alone." ] }, { @@ -257,10 +257,10 @@ "execution_count": 5, "metadata": { "execution": { - "iopub.execute_input": "2026-04-25T01:10:10.682639Z", - "iopub.status.busy": "2026-04-25T01:10:10.682546Z", - "iopub.status.idle": "2026-04-25T01:10:10.686008Z", - "shell.execute_reply": "2026-04-25T01:10:10.685533Z" + "iopub.execute_input": "2026-04-25T17:18:52.960619Z", + "iopub.status.busy": "2026-04-25T17:18:52.960540Z", + "iopub.status.idle": "2026-04-25T17:18:52.966081Z", + "shell.execute_reply": "2026-04-25T17:18:52.965596Z" } }, "outputs": [ @@ -268,41 +268,42 @@ "name": "stdout", "output_type": "stream", "text": [ - "Latest trial: /tmp/swebench-pro-jobs/progressive/2026-04-24__18-07-33/instance_flipt-io__flipt-02e21636c58e86c51119b63e0fb5ca7b813b07b1__0684d916\n", - " (no user_rounds.jsonl yet — run is in progress or failed before any round)\n" + "Task Final Tools Time(s) Rounds reward Error\n", + "------------------------------------------------------------------------------------------\n", + "ansible 1.0 126 938.1 0.0 / 0.0 / 0.0 \n", + "flipt 0.0 195 3373.8 0.0 / 0.0 / 0.0 \n", + "openlibrary 1.0 82 707.0 0.0 / 0.0 / 0.0 \n", + "navidrome 0.0 145 1514.3 0.0 / 0.0 / 0.0 \n", + "qutebrowser 1.0 183 1478.9 0.0 / 0.0 / 0.0 \n", + "\n", + "3/5 progressive runs passed final verify, 0 hit infra errors\n" ] } ], "source": [ - "# Once you have a progressive run, point at its trial dir and load round results.\n", - "# Each round's outcome is logged to user_rounds.jsonl.\n", - "\n", "import json\n", "\n", - "PROGRESSIVE_ROOT = Path('/tmp/swebench-pro-jobs/progressive')\n", - "\n", - "if PROGRESSIVE_ROOT.exists():\n", - " runs = sorted(PROGRESSIVE_ROOT.iterdir())\n", - " if runs:\n", - " latest_job = runs[-1]\n", - " instances = list(latest_job.iterdir())\n", - " if instances:\n", - " trial = instances[0]\n", - " print(f'Latest trial: {trial}')\n", - " rounds_log = trial / 'user_rounds.jsonl'\n", - " if rounds_log.exists():\n", - " rounds = [json.loads(l) for l in rounds_log.read_text().splitlines()]\n", - " print(f'\\n{\"Round\":<7} {\"Reward\":<10} {\"Tools\":<7} Verifier error')\n", - " print('-' * 60)\n", - " for r in rounds:\n", - " score = (r.get('rewards') or {}).get('reward', '?')\n", - " err = (r.get('verifier_error') or '')[:30]\n", - " print(f'{r[\"round\"]:<7} {score!s:<10} {r[\"n_tool_calls\"]:<7} {err}')\n", - " else:\n", - " print(' (no user_rounds.jsonl yet — run is in progress or failed before any round)')\n", - "else:\n", - " print(f'No progressive runs at {PROGRESSIVE_ROOT} yet.')\n", - " print('Run examples/swebench_pro_user_dogfood.py to generate one.')" + "# Aggregated per-run results — committed alongside this notebook so the\n", + "# tables are reproducible without re-running Daytona.\n", + "results_path = Path('experiments/swebench-pro-progressive-results.json')\n", + "results = json.load(open(results_path))\n", + "\n", + "print(f'{\"Task\":<14} {\"Final\":<8} {\"Tools\":<8} {\"Time(s)\":<10} {\"Rounds reward\":<22} {\"Error\"}')\n", + "print('-' * 90)\n", + "for task in ['ansible', 'flipt', 'openlibrary', 'navidrome', 'qutebrowser']:\n", + " r = results.get(task, {})\n", + " final = r.get('final_reward')\n", + " final_str = f'{final}' if final is not None else '—'\n", + " tools = r.get('tool_calls', 0)\n", + " elapsed = r.get('elapsed_s', 0)\n", + " rounds = r.get('rounds', [])\n", + " rounds_str = ' / '.join(f'{x[\"reward\"]}' for x in rounds) if rounds else '(no rounds)'\n", + " err = (r.get('error') or '')[:30]\n", + " print(f'{task:<14} {final_str:<8} {tools:<8} {elapsed:<10.1f} {rounds_str:<22} {err}')\n", + "\n", + "passed = sum(1 for t, r in results.items() if r.get('final_reward') == 1.0)\n", + "errored = sum(1 for t, r in results.items() if r.get('error'))\n", + "print(f'\\n{passed}/{len(results)} progressive runs passed final verify, {errored} hit infra errors')" ] }, { diff --git a/examples/swebench_pro_user_dogfood.py b/examples/swebench_pro_user_dogfood.py index 997ccc8..4fa7cc6 100644 --- a/examples/swebench_pro_user_dogfood.py +++ b/examples/swebench_pro_user_dogfood.py @@ -1,7 +1,7 @@ """Dogfood: SWE-bench Pro progressive disclosure with BaseUser. Demonstrates the BaseUser abstraction on a SWE-bench Pro task — the original -motivation for this feature (Josh's GitHub/Microsoft use case). +motivation for this feature. The user: Round 0: terse problem description (one sentence from the spec). @@ -127,8 +127,6 @@ async def main(): print(f" Rewards: {result.rewards}") print(f" Tool calls: {result.n_tool_calls}") print(f" Error: {result.error}") - if result.trial_dir: - print(f" Trial dir: {result.trial_dir}") if __name__ == "__main__": diff --git a/experiments/swebench-pro-progressive-results.json b/experiments/swebench-pro-progressive-results.json new file mode 100644 index 0000000..4b36bfa --- /dev/null +++ b/experiments/swebench-pro-progressive-results.json @@ -0,0 +1,117 @@ +{ + "ansible": { + "final_reward": 1.0, + "tool_calls": 126, + "error": null, + "elapsed_s": 938.1, + "rounds": [ + { + "round": 0, + "reward": 0.0, + "tools": 38 + }, + { + "round": 1, + "reward": 0.0, + "tools": 47 + }, + { + "round": 2, + "reward": 0.0, + "tools": 41 + } + ] + }, + "flipt": { + "final_reward": 0.0, + "tool_calls": 195, + "error": null, + "elapsed_s": 3373.8, + "rounds": [ + { + "round": 0, + "reward": 0.0, + "tools": 32 + }, + { + "round": 1, + "reward": 0.0, + "tools": 92 + }, + { + "round": 2, + "reward": 0.0, + "tools": 71 + } + ] + }, + "openlibrary": { + "final_reward": 1.0, + "tool_calls": 82, + "error": null, + "elapsed_s": 707.0, + "rounds": [ + { + "round": 0, + "reward": 0.0, + "tools": 18 + }, + { + "round": 1, + "reward": 0.0, + "tools": 29 + }, + { + "round": 2, + "reward": 0.0, + "tools": 35 + } + ] + }, + "navidrome": { + "final_reward": 0.0, + "tool_calls": 145, + "error": null, + "elapsed_s": 1514.3, + "rounds": [ + { + "round": 0, + "reward": 0.0, + "tools": 47 + }, + { + "round": 1, + "reward": 0.0, + "tools": 66 + }, + { + "round": 2, + "reward": 0.0, + "tools": 32 + } + ] + }, + "qutebrowser": { + "final_reward": 1.0, + "tool_calls": 183, + "error": null, + "elapsed_s": 1478.9, + "rounds": [ + { + "round": 0, + "reward": 0.0, + "tools": 89 + }, + { + "round": 1, + "reward": 0.0, + "tools": 40 + }, + { + "round": 2, + "reward": 0.0, + "tools": 54 + } + ] + } +} \ No newline at end of file diff --git a/src/benchflow/_acp_run.py b/src/benchflow/_acp_run.py index 58b95be..4e4acc7 100644 --- a/src/benchflow/_acp_run.py +++ b/src/benchflow/_acp_run.py @@ -118,14 +118,18 @@ async def connect_acp( for attempt in range(_ACP_CONNECT_MAX_RETRIES + 1): if attempt > 0: delay = _ACP_CONNECT_BASE_DELAY * (2 ** (attempt - 1)) - logger.info(f"ACP connect retry {attempt}/{_ACP_CONNECT_MAX_RETRIES} after {delay:.0f}s") + logger.info( + f"ACP connect retry {attempt}/{_ACP_CONNECT_MAX_RETRIES} after {delay:.0f}s" + ) await asyncio.sleep(delay) try: if environment == "docker": live_proc = DockerProcess.from_harbor_env(env) else: - is_dind = hasattr(env, "_strategy") and hasattr(env._strategy, "_compose_cmd") + is_dind = hasattr(env, "_strategy") and hasattr( + env._strategy, "_compose_cmd" + ) if is_dind: live_proc = await DaytonaPtyProcess.from_harbor_env(env) logger.info("Using PTY transport for DinD compose task") @@ -144,10 +148,14 @@ async def connect_acp( await acp_client.connect() init_result = await asyncio.wait_for(acp_client.initialize(), timeout=60) - agent_name = init_result.agent_info.name if init_result.agent_info else agent + agent_name = ( + init_result.agent_info.name if init_result.agent_info else agent + ) logger.info(f"ACP agent: {agent_name}") - session = await asyncio.wait_for(acp_client.session_new(cwd=agent_cwd), timeout=60) + session = await asyncio.wait_for( + acp_client.session_new(cwd=agent_cwd), timeout=60 + ) logger.info(f"Session: {session.session_id}") break except ConnectionError as e: @@ -176,7 +184,9 @@ async def connect_acp( except Exception as e: logger.warning(f"Failed to set model via ACP: {e}") elif model: - logger.info(f"Skipping ACP set_model for {agent} — launch/env config owns model selection") + logger.info( + f"Skipping ACP set_model for {agent} — launch/env config owns model selection" + ) return acp_client, session, agent_name @@ -186,17 +196,102 @@ async def execute_prompts( session, prompts: list[str], timeout: int, + idle_timeout: int | None = None, ) -> tuple[list[dict], int]: - """Send prompts via ACP and capture trajectory. Return (trajectory, n_tool_calls).""" + """Send prompts via ACP and capture trajectory. Return (trajectory, n_tool_calls). + + timeout — wall-clock budget for each prompt (full agent budget). + idle_timeout — abort the prompt if no tool call or message arrives for + this many seconds. Catches agents that hung silently while + the agent process is still alive (e.g. gemini-cli not + responding). None disables idle detection. + """ for i, prompt in enumerate(prompts): - logger.info(f"Prompt {i + 1}/{len(prompts)}: {(prompt or '')[:80]}...") - prompt_result = await asyncio.wait_for( - acp_client.prompt(prompt), - timeout=timeout, + logger.info( + f"Prompt {i + 1}/{len(prompts)}: {(prompt or '')[:80]}..." ) + if idle_timeout is None: + prompt_result = await asyncio.wait_for( + acp_client.prompt(prompt), + timeout=timeout, + ) + else: + prompt_result = await _prompt_with_idle_watchdog( + acp_client, session, prompt, timeout, idle_timeout + ) logger.info( f" → {prompt_result.stop_reason.value}, " f"{len(session.tool_calls)} total tool calls" ) trajectory = _capture_session_trajectory(session) return trajectory, len(session.tool_calls) + + +async def _prompt_with_idle_watchdog( + acp_client: ACPClient, + session, + prompt: str, + timeout: int, + idle_timeout: int, +): + """Run acp_client.prompt() with both a wall-clock and an idle watchdog. + + The watchdog polls session.tool_calls every few seconds and aborts if no + progress was made in idle_timeout. This catches agents that hung silently + while the local process is still alive (no output to stdout, no tool calls + appended). + """ + + def _activity_count() -> int: + # Match the docstring contract: idle = no tool call AND no message + # AND no thought. Sum all three so streamed text resets the timer. + return ( + len(session.tool_calls) + + len(session.message_chunks) + + len(session.thought_chunks) + ) + + prompt_task = asyncio.create_task(acp_client.prompt(prompt)) + last_progress = asyncio.get_event_loop().time() + last_count = _activity_count() + # poll_interval considers BOTH idle_timeout and wall-clock timeout so that + # short overall budgets don't overshoot (e.g. timeout=30s with default + # poll_interval=30s could overshoot 100%). Cap at 30s, floor at 1s. + poll_interval = max(1, min(30, idle_timeout // 4, max(1, timeout // 4))) + deadline = last_progress + timeout + + try: + while not prompt_task.done(): + await asyncio.sleep(poll_interval) + # Re-check done() after the sleep — the prompt may have completed + # during the poll interval. Without this, we'd cancel an already- + # completed task and discard a successful result. + if prompt_task.done(): + break + now = asyncio.get_event_loop().time() + cur_count = _activity_count() + if cur_count > last_count: + last_progress = now + last_count = cur_count + if now - last_progress >= idle_timeout: + raise TimeoutError( + f"Agent idle for {idle_timeout}s with no new tool call, " + f"message, or thought " + f"(last activity {int(now - last_progress)}s ago, " + f"{len(session.tool_calls)} tool calls so far)" + ) + if now > deadline: + raise TimeoutError( + f"Agent prompt exceeded wall-clock budget {timeout}s" + ) + + return prompt_task.result() + finally: + # Always cancel + drain the prompt task on exit, including the + # external-cancellation path (CancelledError from sleep). Without this + # an outer cancel leaks the prompt task — it keeps running in the + # background until Trial.cleanup() eventually kills the agent process. + if not prompt_task.done(): + prompt_task.cancel() + with contextlib.suppress(BaseException): + await prompt_task diff --git a/src/benchflow/_agent_env.py b/src/benchflow/_agent_env.py index 0229b17..d2e67d4 100644 --- a/src/benchflow/_agent_env.py +++ b/src/benchflow/_agent_env.py @@ -82,7 +82,10 @@ def auto_inherit_env(agent_env: dict[str, str]) -> None: if "GEMINI_API_KEY" in agent_env and "GOOGLE_API_KEY" not in agent_env: agent_env["GOOGLE_API_KEY"] = agent_env["GEMINI_API_KEY"] # Mirror GEMINI_API_KEY as GOOGLE_GENERATIVE_AI_API_KEY (opencode/models.dev convention) - if "GEMINI_API_KEY" in agent_env and "GOOGLE_GENERATIVE_AI_API_KEY" not in agent_env: + if ( + "GEMINI_API_KEY" in agent_env + and "GOOGLE_GENERATIVE_AI_API_KEY" not in agent_env + ): agent_env["GOOGLE_GENERATIVE_AI_API_KEY"] = agent_env["GEMINI_API_KEY"] # CLAUDE_CODE_OAUTH_TOKEN is a separate auth path — Claude CLI reads it # directly. Don't map to ANTHROPIC_API_KEY (different auth mechanism). diff --git a/src/benchflow/_agent_setup.py b/src/benchflow/_agent_setup.py index 0c04a5a..2ef8fd7 100644 --- a/src/benchflow/_agent_setup.py +++ b/src/benchflow/_agent_setup.py @@ -38,14 +38,12 @@ def _skill_link_cmd(source: str, dest: str) -> str: parent = shlex.quote(str(Path(dest).parent)) q_source = shlex.quote(source) q_dest = shlex.quote(dest) - return ( - f"mkdir -p {parent} && " - f"rm -rf {q_dest} && " - f"ln -sfn {q_source} {q_dest}" - ) + return f"mkdir -p {parent} && rm -rf {q_dest} && ln -sfn {q_source} {q_dest}" -async def _link_skill_paths(env, source: str, skill_paths: list[str], home: str, cwd: str) -> int: +async def _link_skill_paths( + env, source: str, skill_paths: list[str], home: str, cwd: str +) -> int: """Link one shared skills tree into each configured discovery path.""" parts = [] for sp in skill_paths: @@ -160,6 +158,4 @@ async def deploy_skills( agent_cwd, ) if count: - logger.info( - f"Skills distributed to {count} paths for {agent_cfg.name}" - ) + logger.info(f"Skills distributed to {count} paths for {agent_cfg.name}") diff --git a/src/benchflow/_env_setup.py b/src/benchflow/_env_setup.py index 9800ffc..99567f2 100644 --- a/src/benchflow/_env_setup.py +++ b/src/benchflow/_env_setup.py @@ -20,7 +20,9 @@ # build) instead of erroring out. Override via env if running on a paid tier. _DAYTONA_MAX_CPUS = int(os.environ.get("BENCHFLOW_DAYTONA_MAX_CPUS", "4")) _DAYTONA_MAX_MEMORY_MB = int(os.environ.get("BENCHFLOW_DAYTONA_MAX_MEMORY_MB", "8192")) -_DAYTONA_MAX_STORAGE_MB = int(os.environ.get("BENCHFLOW_DAYTONA_MAX_STORAGE_MB", "10240")) +_DAYTONA_MAX_STORAGE_MB = int( + os.environ.get("BENCHFLOW_DAYTONA_MAX_STORAGE_MB", "10240") +) # Directories to ignore when copying deps _IGNORE_DIRS = { diff --git a/src/benchflow/_sandbox.py b/src/benchflow/_sandbox.py index 7294a3d..592b7ad 100644 --- a/src/benchflow/_sandbox.py +++ b/src/benchflow/_sandbox.py @@ -480,7 +480,8 @@ async def _discover_pytest_plugin_flags(env, task: "Task") -> str: try: result = await env.exec( f"python3 -c {shlex.quote(_DISCOVER_PYTEST_PLUGINS_SCRIPT)}", - user="root", timeout_sec=15, + user="root", + timeout_sec=15, ) if result.stderr: logger.debug(f"Plugin discovery stderr: {result.stderr.strip()}") @@ -563,7 +564,8 @@ async def _trusted_verifier_path( async def _trusted_verifier_pythonpath( - env, sandbox_user: str | None, + env, + sandbox_user: str | None, ) -> str: """Return filtered PYTHONPATH preserving only trusted image entries. @@ -571,7 +573,9 @@ async def _trusted_verifier_pythonpath( block the workspace — it is already importable via CWD/pytest and is chowned to root before verification. """ - pp_result = await env.exec("printenv PYTHONPATH 2>/dev/null || true", user="root", timeout_sec=10) + pp_result = await env.exec( + "printenv PYTHONPATH 2>/dev/null || true", user="root", timeout_sec=10 + ) raw_pp = (pp_result.stdout or "").strip() if not raw_pp: return "" @@ -630,9 +634,7 @@ def _read_hardening_config(task_dir: "Path | str | None") -> dict[str, bool]: if k in result and isinstance(v, bool): result[k] = v else: - logger.warning( - f"task.toml [verifier.hardening] unknown/invalid: {k}={v!r}" - ) + logger.warning(f"task.toml [verifier.hardening] unknown/invalid: {k}={v!r}") return result diff --git a/src/benchflow/_snapshot.py b/src/benchflow/_snapshot.py index 4cb786e..1cabc33 100644 --- a/src/benchflow/_snapshot.py +++ b/src/benchflow/_snapshot.py @@ -27,8 +27,11 @@ async def snapshot(env, name: str, workspace: str = "/app") -> str: in trial metadata / rewards.jsonl. """ import re - if not re.match(r'^[a-zA-Z0-9_-]+$', name): - raise ValueError(f"Snapshot name must be alphanumeric/dash/underscore, got: {name!r}") + + if not re.match(r"^[a-zA-Z0-9_-]+$", name): + raise ValueError( + f"Snapshot name must be alphanumeric/dash/underscore, got: {name!r}" + ) await env.exec(f"mkdir -p {_SNAP_DIR}") snap_path = f"{_SNAP_DIR}/{name}.tar.gz" result = await env.exec( diff --git a/src/benchflow/acp/client.py b/src/benchflow/acp/client.py index 57a55eb..c4561b3 100644 --- a/src/benchflow/acp/client.py +++ b/src/benchflow/acp/client.py @@ -98,7 +98,6 @@ async def _read_until_response(self, request_id: int) -> dict[str, Any]: ) return msg.get("result", {}) - # It's a notification (no id) if "method" in msg and "id" not in msg: try: diff --git a/src/benchflow/agents/registry.py b/src/benchflow/agents/registry.py index 83bf2c1..6fcc28b 100644 --- a/src/benchflow/agents/registry.py +++ b/src/benchflow/agents/registry.py @@ -347,7 +347,7 @@ class AgentConfig: home_dirs=[".openhands"], install_cmd=( "export DEBIAN_FRONTEND=noninteractive && " - "export PATH=\"$HOME/.local/bin:$PATH\" && " + 'export PATH="$HOME/.local/bin:$PATH" && ' "( command -v curl >/dev/null 2>&1 || " " ( apt-get update -qq && " " apt-get install -y -qq curl ca-certificates >/dev/null 2>&1 ) ) && " @@ -355,14 +355,14 @@ class AgentConfig: " UV_OK=0; " " if command -v uv >/dev/null 2>&1; then " " UV_VER=$(uv --version 2>/dev/null | awk '{print $2}'); " - " if [ -n \"$UV_VER\" ] && " - " [ \"$(printf '%s\\n' 0.11.6 \"$UV_VER\" | sort -V | head -n1)\" = \"0.11.6\" ]; then " + ' if [ -n "$UV_VER" ] && ' + ' [ "$(printf \'%s\\n\' 0.11.6 "$UV_VER" | sort -V | head -n1)" = "0.11.6" ]; then ' " UV_OK=1; " " fi; " " fi; " - " if [ \"$UV_OK\" = 0 ]; then " + ' if [ "$UV_OK" = 0 ]; then ' " curl -LsSf https://astral.sh/uv/install.sh | sh >/dev/null 2>&1 && " - " export PATH=\"$HOME/.local/bin:$PATH\"; " + ' export PATH="$HOME/.local/bin:$PATH"; ' " fi && " " ( uv tool list 2>/dev/null | grep -q '^openhands\\b' || " " uv tool install openhands --python 3.12 >/dev/null 2>&1 || " @@ -373,15 +373,15 @@ class AgentConfig: "/root/.local/share/uv /root/.local/share/uv/tools 2>/dev/null; " # Seed config so OpenHands ACP auth check passes before env override. "mkdir -p ~/.openhands && " - "echo '{\"llm\":{\"model\":\"placeholder\",\"api_key\":\"placeholder\"}}' " + 'echo \'{"llm":{"model":"placeholder","api_key":"placeholder"}}\' ' "> ~/.openhands/agent_settings.json && " "command -v openhands >/dev/null 2>&1" ), launch_cmd=( - "export PATH=\"$HOME/.local/bin:$PATH\" && " + 'export PATH="$HOME/.local/bin:$PATH" && ' "mkdir -p ~/.openhands && " - "printf '{\"llm\":{\"model\":\"%s\",\"api_key\":\"%s\"}}' " - "\"$LLM_MODEL\" \"$LLM_API_KEY\" > ~/.openhands/agent_settings.json && " + 'printf \'{"llm":{"model":"%s","api_key":"%s"}}\' ' + '"$LLM_MODEL" "$LLM_API_KEY" > ~/.openhands/agent_settings.json && ' "openhands acp --always-approve --override-with-envs" ), protocol="acp", diff --git a/src/benchflow/cli/main.py b/src/benchflow/cli/main.py index 9c9b7f2..f5078e4 100644 --- a/src/benchflow/cli/main.py +++ b/src/benchflow/cli/main.py @@ -770,8 +770,13 @@ def eval_create( config = TrialConfig( task_path=tasks_dir, - scenes=[Scene.single(agent=agent, model=eff_model, - skills_dir=str(skills_dir) if skills_dir else None)], + scenes=[ + Scene.single( + agent=agent, + model=eff_model, + skills_dir=str(skills_dir) if skills_dir else None, + ) + ], environment=environment, sandbox_user=sandbox_user, sandbox_setup_timeout=sandbox_setup_timeout, diff --git a/src/benchflow/job.py b/src/benchflow/job.py index bd53a0b..eff370a 100644 --- a/src/benchflow/job.py +++ b/src/benchflow/job.py @@ -111,9 +111,7 @@ class RetryConfig: wait_multiplier: float = 2.0 min_wait_sec: float = 1.0 max_wait_sec: float = 30.0 - exclude_categories: set[str] = field( - default_factory=lambda: {"timeout"} - ) + exclude_categories: set[str] = field(default_factory=lambda: {"timeout"}) def should_retry(self, error: str | None) -> bool: """Check if an error is retryable.""" @@ -130,7 +128,7 @@ def should_retry(self, error: str | None) -> bool: def backoff_delay(self, attempt: int) -> float: """Exponential backoff delay for retry attempt.""" - delay = self.min_wait_sec * (self.wait_multiplier ** attempt) + delay = self.min_wait_sec * (self.wait_multiplier**attempt) return min(delay, self.max_wait_sec) @@ -439,7 +437,9 @@ async def _run_single_task(self, task_dir: Path, cfg: JobConfig) -> RunResult: trial = await Trial.create(trial_config) return await trial.run() - async def _run_single_task_legacy(self, task_dir: Path, cfg: JobConfig) -> RunResult: + async def _run_single_task_legacy( + self, task_dir: Path, cfg: JobConfig + ) -> RunResult: """SDK.run() path — used when _sdk is mocked in tests.""" return await self._sdk.run( task_path=task_dir, @@ -539,8 +539,11 @@ async def bounded(td: Path) -> tuple[str, RunResult]: async with sem: # Jitter start to avoid SSH connection storms at high concurrency import random + if cfg.concurrency > 16: - await asyncio.sleep(random.uniform(0, min(cfg.concurrency / 10, 10))) + await asyncio.sleep( + random.uniform(0, min(cfg.concurrency / 10, 10)) + ) result = await self._run_task(td) self._prune_docker() # Log result diff --git a/src/benchflow/process.py b/src/benchflow/process.py index 80c380b..44c1409 100644 --- a/src/benchflow/process.py +++ b/src/benchflow/process.py @@ -78,7 +78,22 @@ async def readline(self) -> bytes: except Exception: logger.debug("Could not read stderr from closed process") rc = self._process.returncode if self._process else None - msg = f"Process closed stdout (rc={rc})" + # Diagnose: rc=None with closed stdout usually means the *transport* + # died (SSH/Daytona idle sleep, container killed) while the local + # subprocess wrapper is still alive. rc set means the local process + # actually exited. Surfacing the distinction makes the failure + # actionable instead of cryptic. + pid = self._process.pid if self._process else None + if rc is None: + hint = ( + f"Local subprocess (pid={pid}) is still alive but its " + "stdout/transport closed. This usually means the remote " + "container or SSH session was killed (e.g. Daytona idle " + "sleep, agent hung with no output)." + ) + else: + hint = f"Local subprocess exited with rc={rc} before stdout closed." + msg = f"Process closed stdout (rc={rc}): {hint}" if stderr_text: msg += f"\nstderr: {stderr_text[:_DIAG_TRUNCATE]}" raise ConnectionError(msg) @@ -429,7 +444,11 @@ async def from_harbor_env(cls, env: Any) -> "DaytonaPtyProcess": f"{k}={shlex.quote(v)}" for k, v in strategy._compose_env_vars().items() ) compose_cmd_base = strategy._compose_cmd([]) - return cls(sandbox=sandbox, compose_cmd_prefix=compose_env, compose_cmd_base=compose_cmd_base) + return cls( + sandbox=sandbox, + compose_cmd_prefix=compose_env, + compose_cmd_base=compose_cmd_base, + ) async def _on_pty_data(self, data: bytes) -> None: self._partial += data @@ -460,7 +479,11 @@ async def start( await self._pty.wait_for_connection() logger.info(f"DaytonaPtyProcess: PTY connected (session={session_id})") - compose_parts = shlex.split(self._compose_cmd_base) if self._compose_cmd_base else ["docker", "compose"] + compose_parts = ( + shlex.split(self._compose_cmd_base) + if self._compose_cmd_base + else ["docker", "compose"] + ) exec_parts = [*compose_parts, "exec", "-i", "-T"] if cwd: exec_parts.extend(["-w", cwd]) @@ -469,7 +492,9 @@ async def start( env_file_cmd = "" if env: env_file_path = f"/tmp/.benchflow_env_{uuid.uuid4().hex[:16]}" - env_lines = "\n".join(f"export {k}={shlex.quote(v)}" for k, v in env.items()) + env_lines = "\n".join( + f"export {k}={shlex.quote(v)}" for k, v in env.items() + ) env_file_cmd = ( f"cat > {env_file_path} <<'__EOF__'\n{env_lines}\n__EOF__\n" f". {env_file_path} && rm -f {env_file_path} && " @@ -494,7 +519,9 @@ async def start( if marker in decoded: break except TimeoutError as e: - raise ConnectionError("DaytonaPtyProcess: timeout waiting for agent start marker") from e + raise ConnectionError( + "DaytonaPtyProcess: timeout waiting for agent start marker" + ) from e logger.info("DaytonaPtyProcess: marker seen, agent starting") diff --git a/src/benchflow/runtime.py b/src/benchflow/runtime.py index 69be399..711bdb1 100644 --- a/src/benchflow/runtime.py +++ b/src/benchflow/runtime.py @@ -256,11 +256,13 @@ async def execute(self) -> RuntimeResult: config = self.config trial_config = TrialConfig( task_path=self.env.task_path, - scenes=[Scene.single( - agent=self.agent.name, - model=self.agent.model, - skills_dir=config.skills_dir, - )], + scenes=[ + Scene.single( + agent=self.agent.name, + model=self.agent.model, + skills_dir=config.skills_dir, + ) + ], environment=self.env.backend, sandbox_user=config.sandbox_user, sandbox_locked_paths=config.sandbox_locked_paths, diff --git a/src/benchflow/skill_eval.py b/src/benchflow/skill_eval.py index 82e9686..81c3c78 100644 --- a/src/benchflow/skill_eval.py +++ b/src/benchflow/skill_eval.py @@ -244,7 +244,7 @@ def generate_tasks( f"[environment]\n" f"cpus = 1\n" f"memory_mb = 2048\n" - f'allow_internet = true\n' + f"allow_internet = true\n" ) # environment/ @@ -327,7 +327,12 @@ def _default_dockerfile(dataset: EvalDataset, with_skill: bool) -> str: ] # Forward judge API keys as ARG (build-time only, not persisted in image layers) - for key in ("GOOGLE_API_KEY", "GEMINI_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"): + for key in ( + "GOOGLE_API_KEY", + "GEMINI_API_KEY", + "ANTHROPIC_API_KEY", + "OPENAI_API_KEY", + ): val = os.environ.get(key) if val: lines += [f"ARG {key}", f"ENV {key}=${{{key}}}", ""] @@ -490,8 +495,14 @@ async def _run_job( import os from benchflow.job import Job, JobConfig, RetryConfig + judge_env = {} - for key in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY", "GEMINI_API_KEY"): + for key in ( + "ANTHROPIC_API_KEY", + "OPENAI_API_KEY", + "GOOGLE_API_KEY", + "GEMINI_API_KEY", + ): if os.environ.get(key): judge_env[key] = os.environ[key] diff --git a/src/benchflow/trial.py b/src/benchflow/trial.py index b255105..d09f568 100644 --- a/src/benchflow/trial.py +++ b/src/benchflow/trial.py @@ -141,6 +141,11 @@ class TrialConfig: jobs_dir: str | Path = "jobs" context_root: str | Path | None = None pre_agent_hooks: list | None = None + # Abort the prompt if no tool call arrives for this many seconds. + # Catches agents that hung silently while the local process is alive + # (e.g. gemini-cli not responding). None disables idle detection and + # falls back to the agent's wall-clock timeout (task.toml [agent]). + agent_idle_timeout: int | None = 600 # User-driven progressive-disclosure loop user: BaseUser | None = None @@ -168,7 +173,11 @@ def from_legacy( """Construct from flat SDK.run()-style args.""" return cls( task_path=task_path, - scenes=[Scene.single(agent=agent, model=model, prompts=prompts, skills_dir=skills_dir)], + scenes=[ + Scene.single( + agent=agent, model=model, prompts=prompts, skills_dir=skills_dir + ) + ], agent=agent, model=model, prompts=prompts, @@ -181,8 +190,14 @@ def effective_scenes(self) -> list[Scene]: """Scenes to execute — falls back to legacy fields if scenes is empty.""" if self.scenes: return self.scenes - return [Scene.single(agent=self.agent, model=self.model, prompts=self.prompts, - skills_dir=self.skills_dir)] + return [ + Scene.single( + agent=self.agent, + model=self.model, + prompts=self.prompts, + skills_dir=self.skills_dir, + ) + ] @property def primary_agent(self) -> str: @@ -299,9 +314,7 @@ async def setup(self) -> None: self._started_at, self._job_name, self._trial_name, - ) = SDK._init_trial( - cfg.task_path, cfg.job_name, cfg.trial_name, cfg.jobs_dir - ) + ) = SDK._init_trial(cfg.task_path, cfg.job_name, cfg.trial_name, cfg.jobs_dir) self._agent_env = resolve_agent_env( cfg.primary_agent, cfg.primary_model, cfg.agent_env @@ -316,6 +329,7 @@ async def setup(self) -> None: if cfg.context_root or cfg.skills_dir: import shutil import tempfile + tmp = Path(tempfile.mkdtemp(prefix="benchflow-task-")) shutil.copytree(cfg.task_path, tmp / cfg.task_path.name, dirs_exist_ok=True) effective_task_path = tmp / cfg.task_path.name @@ -327,8 +341,11 @@ async def setup(self) -> None: _inject_skills_into_dockerfile(effective_task_path, Path(cfg.skills_dir)) self._env = _create_environment( - cfg.environment, self._task, effective_task_path, - self._trial_name, self._trial_paths, + cfg.environment, + self._task, + effective_task_path, + self._trial_name, + self._trial_paths, ) self._timeout = int(self._task.config.agent.timeout_sec or 0) @@ -388,19 +405,23 @@ async def install_agent(self) -> None: timeout_sec=cfg.sandbox_setup_timeout, ) await _snapshot_build_config(self._env, workspace=self._agent_cwd) - await _seed_verifier_workspace(self._env, workspace=self._agent_cwd, sandbox_user=cfg.sandbox_user) + await _seed_verifier_workspace( + self._env, workspace=self._agent_cwd, sandbox_user=cfg.sandbox_user + ) await lockdown_paths(self._env, self._effective_locked) self._phase = "installed" return agent_name = cfg.primary_agent - self._agent_cfg = await install_agent( - self._env, agent_name, self._trial_dir - ) + self._agent_cfg = await install_agent(self._env, agent_name, self._trial_dir) cred_home = f"/home/{cfg.sandbox_user}" if cfg.sandbox_user else "/root" await write_credential_files( - self._env, agent_name, self._agent_env, - self._agent_cfg, cfg.primary_model, cred_home, + self._env, + agent_name, + self._agent_env, + self._agent_cfg, + cfg.primary_model, + cred_home, ) if self._agent_env.get("_BENCHFLOW_SUBSCRIPTION_AUTH"): await upload_subscription_auth(self._env, agent_name, cred_home) @@ -413,11 +434,18 @@ async def install_agent(self) -> None: timeout_sec=cfg.sandbox_setup_timeout, ) await _snapshot_build_config(self._env, workspace=self._agent_cwd) - await _seed_verifier_workspace(self._env, workspace=self._agent_cwd, sandbox_user=cfg.sandbox_user) + await _seed_verifier_workspace( + self._env, workspace=self._agent_cwd, sandbox_user=cfg.sandbox_user + ) await deploy_skills( - self._env, cfg.task_path, cfg.skills_dir, - self._agent_cfg, cfg.sandbox_user, self._agent_cwd, self._task, + self._env, + cfg.task_path, + cfg.skills_dir, + self._agent_cfg, + cfg.sandbox_user, + self._agent_cwd, + self._task, ) await lockdown_paths(self._env, self._effective_locked) @@ -483,12 +511,13 @@ async def execute(self, prompts: list[str] | None = None) -> tuple[list[dict], i self._session, effective_prompts, self._timeout, + idle_timeout=self._config.agent_idle_timeout, ) # trajectory and n_tool_calls are cumulative for this session. # Compute the delta since last execute() on this session. new_tools = n_tool_calls - prev_session_tools - new_events = trajectory[getattr(self, "_session_traj_count", 0):] + new_events = trajectory[getattr(self, "_session_traj_count", 0) :] self._session_tool_count = n_tool_calls self._session_traj_count = len(trajectory) @@ -520,10 +549,15 @@ async def verify(self) -> dict | None: ) from benchflow.sdk import SDK + sdk = SDK() self._rewards, self._verifier_error = await sdk._verify( - self._env, self._task, self._trial_paths, self._timing, - sandbox_user=cfg.sandbox_user, workspace=self._agent_cwd, + self._env, + self._task, + self._trial_paths, + self._timing, + sandbox_user=cfg.sandbox_user, + workspace=self._agent_cwd, ) self._phase = "verified" @@ -548,15 +582,14 @@ async def soft_verify(self) -> tuple[dict | None, str | None, str | None]: # Clean verifier output dir — chmod 777 so non-root verifier processes can write await self._env.exec( "rm -rf /logs/verifier && mkdir -p /logs/verifier && chmod 777 /logs/verifier", - user="root", timeout_sec=10, + user="root", + timeout_sec=10, ) # Purge agent-injected conftest/sitecustomize/.pth without # killing processes or restoring workspace. # Honor per-task [verifier.hardening] opt-outs from task.toml. hardening = _read_hardening_config(getattr(self._task, "task_dir", None)) - await self._env.exec( - _build_cleanup_cmd(hardening), user="root", timeout_sec=10 - ) + await self._env.exec(_build_cleanup_cmd(hardening), user="root", timeout_sec=10) rewards = None verifier_output = None @@ -617,6 +650,7 @@ async def cleanup(self) -> None: if hasattr(self, "_task_tmp") and self._task_tmp: import shutil + shutil.rmtree(self._task_tmp, ignore_errors=True) self._phase = "cleaned" @@ -638,12 +672,15 @@ async def run(self) -> RunResult: await self.install_agent() # git safe.directory needed for SWE-bench tasks with sandbox_user import shlex + await self._env.exec( f"git config --global --add safe.directory " f"{shlex.quote(self._agent_cwd)} 2>/dev/null || true", - user="root", timeout_sec=10, + user="root", + timeout_sec=10, ) from benchflow.sdk import SDK + sdk = SDK() self._trajectory, self._agent_name = await sdk._run_oracle( self._env, cfg.task_path, self._timeout, sandbox_user=None @@ -660,13 +697,18 @@ async def run(self) -> RunResult: if cfg.oracle_access: await self._env.exec( "mv /solution_oracle_backup /solution 2>/dev/null || true", - user="root", timeout_sec=10, + user="root", + timeout_sec=10, ) await self.verify() - except TimeoutError: - self._error = f"Agent timed out after {self._timeout}s" + except TimeoutError as e: + # Preserve the watchdog's diagnostic message ("Agent idle for 600s + # with no new tool call ...") if it raised one. Fall back to the + # generic wall-clock message only when there's no detail. + detail = str(e).strip() + self._error = detail or f"Agent timed out after {self._timeout}s" logger.error(self._error) except ConnectionError as e: self._error = str(e) @@ -682,6 +724,7 @@ async def run(self) -> RunResult: if self._trial_dir is None: from benchflow.models import RunResult + return RunResult( task_name=self._config.task_path.name, error=self._error or "Setup failed before trial directory was created", @@ -703,7 +746,9 @@ async def _run_scene(self, scene: Scene) -> None: Inter-role messages are persisted to ``trial_dir/scene_messages.jsonl``. """ cfg = self._config - logger.info(f"[Scene] {scene.name} — {len(scene.turns)} turns, {len(scene.roles)} roles") + logger.info( + f"[Scene] {scene.name} — {len(scene.turns)} turns, {len(scene.roles)} roles" + ) role_map = {r.name: r for r in scene.roles} current_role: str | None = None @@ -755,13 +800,15 @@ async def _run_scene(self, scene: Scene) -> None: inbox.setdefault(recipient, []).append( f"**From {current_role}:** {content}" ) - scene_messages.append({ - "scene": scene.name, - "turn": turn_counter, - "sender": current_role, - "recipient": recipient, - "content": content, - }) + scene_messages.append( + { + "scene": scene.name, + "turn": turn_counter, + "sender": current_role, + "recipient": recipient, + "content": content, + } + ) if current_role is not None: await self.disconnect() @@ -778,9 +825,12 @@ async def _run_scene(self, scene: Scene) -> None: async def _read_scene_outbox(self, sender: str) -> list[tuple[str, str]]: """Read and clear outbox files left by *sender*. Returns [(recipient, content), ...].""" result = await self._env.exec( - f"ls {self._OUTBOX_DIR}/*.json 2>/dev/null || true", timeout_sec=10, + f"ls {self._OUTBOX_DIR}/*.json 2>/dev/null || true", + timeout_sec=10, ) - files = [f.strip() for f in (result.stdout or "").strip().splitlines() if f.strip()] + files = [ + f.strip() for f in (result.stdout or "").strip().splitlines() if f.strip() + ] messages: list[tuple[str, str]] = [] for fpath in files: quoted = shlex.quote(fpath) @@ -791,7 +841,9 @@ async def _read_scene_outbox(self, sender: str) -> list[tuple[str, str]]: content = data.get("content", "") if recipient and content: messages.append((recipient, content)) - logger.info(f"[Scene] outbox: {sender} → {recipient}: {content[:80]}") + logger.info( + f"[Scene] outbox: {sender} → {recipient}: {content[:80]}" + ) except json.JSONDecodeError: logger.warning(f"[Scene] invalid JSON in outbox: {fpath}") await self._env.exec(f"rm -f {quoted}", timeout_sec=10) @@ -821,8 +873,10 @@ async def _run_user_loop(self) -> None: ) role = scene.roles[0] - instruction = self._resolved_prompts[0] if self._resolved_prompts else ( - "Solve the task described in /app/instruction.md" + instruction = ( + self._resolved_prompts[0] + if self._resolved_prompts + else ("Solve the task described in /app/instruction.md") ) # Oracle access: read /solution before the agent runs, then remove it @@ -830,7 +884,8 @@ async def _run_user_loop(self) -> None: if cfg.oracle_access: cat = await self._env.exec( "cat /solution/solve.sh 2>/dev/null || true", - user="root", timeout_sec=10, + user="root", + timeout_sec=10, ) solution = (cat.stdout or "").strip() or None @@ -841,7 +896,8 @@ async def _run_user_loop(self) -> None: if cfg.oracle_access: await self._env.exec( "mv /solution /solution_oracle_backup 2>/dev/null || true", - user="root", timeout_sec=10, + user="root", + timeout_sec=10, ) round_result: RoundResult | None = None @@ -861,7 +917,8 @@ async def _run_user_loop(self) -> None: logger.info( f"[User] round {round_num}: prompt={prompt[:80]!r}..." - if len(prompt) > 80 else f"[User] round {round_num}: prompt={prompt!r}" + if len(prompt) > 80 + else f"[User] round {round_num}: prompt={prompt!r}" ) # Fresh ACP session each round — agent starts clean but sees @@ -875,7 +932,8 @@ async def _run_user_loop(self) -> None: round_trajectory = self._trajectory[traj_before:] round_tools = sum( - 1 for e in round_trajectory + 1 + for e in round_trajectory if isinstance(e, dict) and e.get("type") == "tool_call" ) @@ -885,7 +943,8 @@ async def _run_user_loop(self) -> None: if cfg.oracle_access: await self._env.exec( "mv /solution_oracle_backup /solution 2>/dev/null || true", - user="root", timeout_sec=10, + user="root", + timeout_sec=10, ) try: rewards, verifier_output, verifier_error = await self.soft_verify() @@ -893,7 +952,8 @@ async def _run_user_loop(self) -> None: if cfg.oracle_access: await self._env.exec( "mv /solution /solution_oracle_backup 2>/dev/null || true", - user="root", timeout_sec=10, + user="root", + timeout_sec=10, ) round_result = RoundResult( @@ -905,18 +965,19 @@ async def _run_user_loop(self) -> None: n_tool_calls=round_tools, ) - rounds_log.append({ - "round": round_num, - "prompt": prompt, - "rewards": rewards, - "verifier_error": verifier_error, - "n_tool_calls": round_tools, - "n_trajectory_events": len(round_trajectory), - }) + rounds_log.append( + { + "round": round_num, + "prompt": prompt, + "rewards": rewards, + "verifier_error": verifier_error, + "n_tool_calls": round_tools, + "n_trajectory_events": len(round_trajectory), + } + ) logger.info( - f"[User] round {round_num} done: " - f"rewards={rewards}, tools={round_tools}" + f"[User] round {round_num} done: rewards={rewards}, tools={round_tools}" ) # Persist round log @@ -941,7 +1002,8 @@ async def connect_as(self, role: Role) -> None: # Merge cfg.agent_env (config-level) with role.env (role-specific) so # provider creds from YAML reach the agent. role.env wins on overlap. agent_env = resolve_agent_env( - role.agent, role.model, + role.agent, + role.model, {**(cfg.agent_env or {}), **(role.env or {})}, ) @@ -949,8 +1011,12 @@ async def connect_as(self, role: Role) -> None: agent_cfg = await install_agent(self._env, role.agent, self._trial_dir) cred_home = f"/home/{cfg.sandbox_user}" if cfg.sandbox_user else "/root" await write_credential_files( - self._env, role.agent, agent_env, - agent_cfg, role.model, cred_home, + self._env, + role.agent, + agent_env, + agent_cfg, + role.model, + cred_home, ) if agent_env.get("_BENCHFLOW_SUBSCRIPTION_AUTH"): await upload_subscription_auth(self._env, role.agent, cred_home) @@ -981,7 +1047,11 @@ def _classify_acp_error(self, e: ACPError) -> str: from benchflow._agent_env import check_subscription_auth from benchflow.agents.registry import infer_env_key_for_model - key = infer_env_key_for_model(self._config.primary_model) if self._config.primary_model else None + key = ( + infer_env_key_for_model(self._config.primary_model) + if self._config.primary_model + else None + ) if key and check_subscription_auth(self._config.primary_agent, key): return ( f"{key} was rejected as invalid. " diff --git a/src/benchflow/trial_yaml.py b/src/benchflow/trial_yaml.py index 707978b..046c6a0 100644 --- a/src/benchflow/trial_yaml.py +++ b/src/benchflow/trial_yaml.py @@ -88,12 +88,14 @@ def trial_config_from_dict( prompts = [prompts_raw] else: prompts = [None] - scenes = [Scene.single( - agent=raw["agent"], - model=raw.get("model"), - prompts=prompts, - skills_dir=raw.get("skills_dir"), - )] + scenes = [ + Scene.single( + agent=raw["agent"], + model=raw.get("model"), + prompts=prompts, + skills_dir=raw.get("skills_dir"), + ) + ] else: raise ValueError("YAML must have either 'scenes' or 'agent' at top level") diff --git a/tests/test_agent_registry.py b/tests/test_agent_registry.py index ab06cf1..4aab17d 100644 --- a/tests/test_agent_registry.py +++ b/tests/test_agent_registry.py @@ -51,6 +51,7 @@ def test_openhands_normalizes_model(self): assert env["LLM_MODEL"] == "glm-5" + class TestOpenHandsConfig: def test_openhands_uses_agentskills_paths(self): cfg = AGENTS["openhands"] diff --git a/tests/test_agent_setup.py b/tests/test_agent_setup.py index 3b99ce6..e6d7dea 100644 --- a/tests/test_agent_setup.py +++ b/tests/test_agent_setup.py @@ -116,7 +116,10 @@ async def test_deploy_skills_falls_back_when_local_skills_dir_is_missing(tmp_pat env.exec.assert_awaited_once() distributed_link_cmd = env.exec.await_args.args[0] - assert "ln -sfn /opt/benchflow/skills /home/agent/.agents/skills" in distributed_link_cmd + assert ( + "ln -sfn /opt/benchflow/skills /home/agent/.agents/skills" + in distributed_link_cmd + ) assert "ln -sfn /opt/benchflow/skills /workspace/skills" in distributed_link_cmd assert "ln -sfn /skills /home/agent/.agents/skills" not in distributed_link_cmd assert "ln -sfn /skills /workspace/skills" not in distributed_link_cmd @@ -147,7 +150,9 @@ async def test_deploy_skills_raises_when_skill_linking_fails(tmp_path): @pytest.mark.asyncio -async def test_install_agent_writes_command_stdout_and_stderr_on_failure(tmp_path: Path): +async def test_install_agent_writes_command_stdout_and_stderr_on_failure( + tmp_path: Path, +): env = SimpleNamespace() env.exec = AsyncMock( side_effect=[ diff --git a/tests/test_connect_as_env.py b/tests/test_connect_as_env.py index 97ab423..45b7a7a 100644 --- a/tests/test_connect_as_env.py +++ b/tests/test_connect_as_env.py @@ -67,7 +67,9 @@ def fake_resolve(agent, model, env): await _mock_trial.connect_as(role) assert "BENCHFLOW_PROVIDER_BASE_URL" in captured["env"] - assert captured["env"]["BENCHFLOW_PROVIDER_BASE_URL"] == "http://localhost:8080/v1" + assert ( + captured["env"]["BENCHFLOW_PROVIDER_BASE_URL"] == "http://localhost:8080/v1" + ) @pytest.mark.asyncio async def test_role_env_overrides_config_env(self, _mock_trial): diff --git a/tests/test_oracle_chokepoint.py b/tests/test_oracle_chokepoint.py index a61333d..86c3b50 100644 --- a/tests/test_oracle_chokepoint.py +++ b/tests/test_oracle_chokepoint.py @@ -40,7 +40,10 @@ def test_cli_main_does_not_import_cli_eval(self): """cli/main.py must not import from cli/eval — they are separate.""" main_py = ( Path(__file__).resolve().parent.parent - / "src" / "benchflow" / "cli" / "main.py" + / "src" + / "benchflow" + / "cli" + / "main.py" ) text = main_py.read_text() assert "from benchflow.cli.eval" not in text @@ -138,19 +141,12 @@ def test_harbor_yaml_oracle_no_model(self, tmp_path: Path): self._make_task(tmp_path) config = tmp_path / "config.yaml" - config.write_text( - "agents:\n" - " - name: oracle\n" - "datasets:\n" - " - path: tasks\n" - ) + config.write_text("agents:\n - name: oracle\ndatasets:\n - path: tasks\n") job = Job.from_yaml(config) assert job._config.agent == "oracle" assert job._config.model is None - def test_native_yaml_non_oracle_keeps_default_when_omitted( - self, tmp_path: Path - ): + def test_native_yaml_non_oracle_keeps_default_when_omitted(self, tmp_path: Path): """Backwards-compat: omitting model for an LLM agent still gets DEFAULT_MODEL.""" from benchflow.job import DEFAULT_MODEL, Job @@ -193,9 +189,7 @@ def _strip_api_keys(self, monkeypatch): ): monkeypatch.delenv(k, raising=False) - def test_oracle_single_task_no_api_key_no_error( - self, tmp_path: Path, monkeypatch - ): + def test_oracle_single_task_no_api_key_no_error(self, tmp_path: Path, monkeypatch): """The bug: oracle + missing API key → ANTHROPIC_API_KEY ValueError.""" import asyncio diff --git a/tests/test_process.py b/tests/test_process.py index 68ea0a0..2f9628d 100644 --- a/tests/test_process.py +++ b/tests/test_process.py @@ -230,9 +230,7 @@ async def test_dind_env_file_path_does_not_use_shell_pid_expansion(self): from benchflow.process import DaytonaProcess sandbox = MagicMock() - sandbox.create_ssh_access = AsyncMock( - return_value=MagicMock(token="abc") - ) + sandbox.create_ssh_access = AsyncMock(return_value=MagicMock(token="abc")) proc = DaytonaProcess( sandbox=sandbox, is_dind=True, @@ -274,9 +272,7 @@ async def test_direct_sandbox_env_file_path_does_not_use_shell_pid_expansion(sel from benchflow.process import DaytonaProcess sandbox = MagicMock() - sandbox.create_ssh_access = AsyncMock( - return_value=MagicMock(token="abc") - ) + sandbox.create_ssh_access = AsyncMock(return_value=MagicMock(token="abc")) proc = DaytonaProcess(sandbox=sandbox, is_dind=False) captured = [] diff --git a/tests/test_sandbox_hardening.py b/tests/test_sandbox_hardening.py index 6396c4e..6536d0a 100644 --- a/tests/test_sandbox_hardening.py +++ b/tests/test_sandbox_hardening.py @@ -696,7 +696,9 @@ def side_effect(cmd, **kwargs): task = _make_task() await harden_before_verify(env, task, sandbox_user=None) - assert task.config.verifier.env["PYTEST_ADDOPTS"] == VERIFIER_ENV["PYTEST_ADDOPTS"] + assert ( + task.config.verifier.env["PYTEST_ADDOPTS"] == VERIFIER_ENV["PYTEST_ADDOPTS"] + ) @pytest.mark.asyncio async def test_distro_pip_env_ubuntu(self): @@ -1124,9 +1126,7 @@ def test_defaults_when_no_task_dir(self): def test_defaults_when_no_hardening_section(self, tmp_path): from benchflow._sandbox import HARDENING_DEFAULTS, _read_hardening_config - (tmp_path / "task.toml").write_text( - "[verifier]\ntimeout_sec = 60\n" - ) + (tmp_path / "task.toml").write_text("[verifier]\ntimeout_sec = 60\n") assert _read_hardening_config(tmp_path) == HARDENING_DEFAULTS def test_opt_out_cleanup_conftests(self, tmp_path): @@ -1141,9 +1141,7 @@ def test_opt_out_cleanup_conftests(self, tmp_path): def test_unknown_key_logged_not_applied(self, tmp_path, caplog): from benchflow._sandbox import HARDENING_DEFAULTS, _read_hardening_config - (tmp_path / "task.toml").write_text( - "[verifier.hardening]\nbogus_flag = true\n" - ) + (tmp_path / "task.toml").write_text("[verifier.hardening]\nbogus_flag = true\n") cfg = _read_hardening_config(tmp_path) assert cfg == HARDENING_DEFAULTS # bogus key ignored assert any("bogus_flag" in r.message for r in caplog.records) @@ -1184,9 +1182,11 @@ async def test_plugin_discovery_bad_json_graceful(self): """Malformed JSON from container plugin discovery falls back gracefully.""" from benchflow._sandbox import _discover_pytest_plugin_flags - env = _make_env(side_effect=lambda cmd, **kw: MagicMock( - stdout="not valid json", stderr="", exit_code=0 - )) + env = _make_env( + side_effect=lambda cmd, **kw: MagicMock( + stdout="not valid json", stderr="", exit_code=0 + ) + ) task = _make_task() flags = await _discover_pytest_plugin_flags(env, task) assert flags == "" diff --git a/tests/test_sandbox_setup.py b/tests/test_sandbox_setup.py index f81b392..43af941 100644 --- a/tests/test_sandbox_setup.py +++ b/tests/test_sandbox_setup.py @@ -10,7 +10,9 @@ from benchflow.agents.registry import get_sandbox_home_dirs -async def _run_setup_sandbox_user(*, sandbox_user: str = "agent", workspace: str = "/app"): +async def _run_setup_sandbox_user( + *, sandbox_user: str = "agent", workspace: str = "/app" +): env = MagicMock() env.exec = AsyncMock(return_value=MagicMock(stdout="", stderr="", exit_code=0)) @@ -46,7 +48,9 @@ async def test_setup_command_avoids_recursive_root_tool_copies(self): assert kwargs["timeout_sec"] == 120 @pytest.mark.asyncio - async def test_setup_command_still_creates_user_prepares_home_and_chowns_workspace(self): + async def test_setup_command_still_creates_user_prepares_home_and_chowns_workspace( + self, + ): """The non-copy setup contract still creates the user and grants access.""" cmd, _ = await _run_setup_sandbox_user() diff --git a/tests/test_scene_outbox_trial.py b/tests/test_scene_outbox_trial.py index 5b6f07a..01c9177 100644 --- a/tests/test_scene_outbox_trial.py +++ b/tests/test_scene_outbox_trial.py @@ -35,8 +35,11 @@ def __init__(self) -> None: async def exec(self, cmd: str, **kwargs) -> FakeExecResult: self._exec_log.append(cmd) if "rm -rf /app/.outbox" in cmd: - self._files = {k: v for k, v in self._files.items() - if not k.startswith("/app/.outbox/")} + self._files = { + k: v + for k, v in self._files.items() + if not k.startswith("/app/.outbox/") + } return FakeExecResult() if "ls /app/.outbox/" in cmd: files = [f for f in self._files if f.startswith("/app/.outbox/")] @@ -78,7 +81,9 @@ def coder_reviewer_scene() -> Scene: ], turns=[ Turn("coder"), - Turn("reviewer", "Review the code. Write feedback to /app/.outbox/coder.json"), + Turn( + "reviewer", "Review the code. Write feedback to /app/.outbox/coder.json" + ), Turn("coder", "Read feedback and fix issues."), ], ) @@ -133,7 +138,9 @@ async def fake_execute(prompts=None): assert len(outbox_cmds) == 0 -async def test_outbox_messages_injected_into_prompt(coder_reviewer_scene: Scene) -> None: +async def test_outbox_messages_injected_into_prompt( + coder_reviewer_scene: Scene, +) -> None: """Outbox messages from coder are injected into reviewer's prompt.""" trial = _make_trial(coder_reviewer_scene) prompts_received: list[tuple[str, list[str]]] = [] @@ -149,7 +156,9 @@ async def fake_execute(prompts=None): trial._env.stage_outbox("reviewer", "Please review my regex implementation") # Reviewer writes feedback to coder outbox on second turn elif call_count == 1: - trial._env.stage_outbox("coder", "Edge case: empty string input not handled") + trial._env.stage_outbox( + "coder", "Edge case: empty string input not handled" + ) call_count += 1 return [], 0 @@ -219,7 +228,9 @@ async def fake_execute(prompts=None): assert call_count == 3 -async def test_role_switching_connects_and_disconnects(coder_reviewer_scene: Scene) -> None: +async def test_role_switching_connects_and_disconnects( + coder_reviewer_scene: Scene, +) -> None: """Verify connect/disconnect happens on role switches.""" trial = _make_trial(coder_reviewer_scene) diff --git a/tests/test_skill_eval_dryrun.py b/tests/test_skill_eval_dryrun.py index 931a8e9..a2c55af 100644 --- a/tests/test_skill_eval_dryrun.py +++ b/tests/test_skill_eval_dryrun.py @@ -55,7 +55,10 @@ def mock_skill(tmp_path): json.dumps( { "skill_name": "mock-review", - "defaults": {"timeout_sec": 60, "judge_model": "claude-haiku-4-5-20251001"}, + "defaults": { + "timeout_sec": 60, + "judge_model": "claude-haiku-4-5-20251001", + }, "cases": [ { "id": "bug-001", @@ -143,10 +146,18 @@ def test_evaluator_configures_job_correctly(self, MockJob, mock_skill): """Verify SkillEvaluator passes correct config to Job.""" mock_job_instance = MockJob.return_value mock_job_instance.run = AsyncMock( - return_value=type("R", (), { - "passed": 1, "failed": 1, "errored": 0, "total": 2, - "score": 0.5, "elapsed_sec": 10, - })() + return_value=type( + "R", + (), + { + "passed": 1, + "failed": 1, + "errored": 0, + "total": 2, + "score": 0.5, + "elapsed_sec": 10, + }, + )() ) evaluator = SkillEvaluator(mock_skill) @@ -176,28 +187,48 @@ def test_gepa_export_roundtrip(self, mock_skill): agents=["claude-agent-acp"], case_results=[ CaseResult( - case_id="bug-001", agent="claude-agent-acp", - model="haiku", with_skill=True, reward=0.85, n_tool_calls=3, + case_id="bug-001", + agent="claude-agent-acp", + model="haiku", + with_skill=True, + reward=0.85, + n_tool_calls=3, ), CaseResult( - case_id="bug-001", agent="claude-agent-acp", - model="haiku", with_skill=False, reward=0.4, n_tool_calls=1, + case_id="bug-001", + agent="claude-agent-acp", + model="haiku", + with_skill=False, + reward=0.4, + n_tool_calls=1, ), CaseResult( - case_id="bug-002", agent="claude-agent-acp", - model="haiku", with_skill=True, reward=0.9, n_tool_calls=4, + case_id="bug-002", + agent="claude-agent-acp", + model="haiku", + with_skill=True, + reward=0.9, + n_tool_calls=4, ), CaseResult( - case_id="bug-002", agent="claude-agent-acp", - model="haiku", with_skill=False, reward=0.3, n_tool_calls=1, + case_id="bug-002", + agent="claude-agent-acp", + model="haiku", + with_skill=False, + reward=0.3, + n_tool_calls=1, ), ], agent_lifts=[ AgentLift( - agent="claude-agent-acp", model="haiku", - with_skill_score=0.875, baseline_score=0.35, - lift=0.525, n_cases=2, - with_skill_passed=2, baseline_passed=0, + agent="claude-agent-acp", + model="haiku", + with_skill_score=0.875, + baseline_score=0.35, + lift=0.525, + n_cases=2, + with_skill_passed=2, + baseline_passed=0, ), ], ) @@ -235,7 +266,14 @@ def test_cli_dryrun_loads_dataset(self, mock_skill): # Run with a non-existent agent to trigger early failure after dataset loads result = runner.invoke( app, - ["skills", "eval", str(mock_skill), "-a", "claude-agent-acp", "--no-baseline"], + [ + "skills", + "eval", + str(mock_skill), + "-a", + "claude-agent-acp", + "--no-baseline", + ], ) # Should get past dataset loading (prints skill name) assert "mock-review" in result.output or "2 cases" in result.output @@ -247,16 +285,24 @@ def test_summary_table_format(self): agents=["claude-agent-acp", "codex-acp"], agent_lifts=[ AgentLift( - agent="claude-agent-acp", model="haiku", - with_skill_score=0.85, baseline_score=0.40, - lift=0.45, n_cases=5, - with_skill_passed=4, baseline_passed=2, + agent="claude-agent-acp", + model="haiku", + with_skill_score=0.85, + baseline_score=0.40, + lift=0.45, + n_cases=5, + with_skill_passed=4, + baseline_passed=2, ), AgentLift( - agent="codex-acp", model="gpt-5.4", - with_skill_score=0.72, baseline_score=0.35, - lift=0.37, n_cases=5, - with_skill_passed=3, baseline_passed=1, + agent="codex-acp", + model="gpt-5.4", + with_skill_score=0.72, + baseline_score=0.35, + lift=0.37, + n_cases=5, + with_skill_passed=3, + baseline_passed=1, ), ], ) diff --git a/tests/test_trial_install_agent_timeout.py b/tests/test_trial_install_agent_timeout.py index 40db22c..1cfb128 100644 --- a/tests/test_trial_install_agent_timeout.py +++ b/tests/test_trial_install_agent_timeout.py @@ -63,9 +63,7 @@ async def test_install_agent_forwards_sandbox_setup_timeout( ) monkeypatch.setattr("benchflow.trial.deploy_skills", deploy_skills_mock) monkeypatch.setattr("benchflow.trial.lockdown_paths", lockdown_paths_mock) - monkeypatch.setattr( - "benchflow.trial.setup_sandbox_user", setup_sandbox_user_mock - ) + monkeypatch.setattr("benchflow.trial.setup_sandbox_user", setup_sandbox_user_mock) await trial.install_agent() diff --git a/tests/test_user.py b/tests/test_user.py index 8ebf1c4..96ccc0a 100644 --- a/tests/test_user.py +++ b/tests/test_user.py @@ -30,7 +30,6 @@ async def test_stops_after_first_round(self): assert result is None - class TestFunctionUser: @pytest.mark.asyncio async def test_sync_function(self): @@ -41,11 +40,16 @@ def my_fn(round: int, instruction: str, rr: RoundResult | None) -> str | None: user = FunctionUser(my_fn) assert await user.run(0, "Fix the authentication bug") == "terse: Fix the au" - assert await user.run(1, "Fix the authentication bug", RoundResult(round=0)) is None + assert ( + await user.run(1, "Fix the authentication bug", RoundResult(round=0)) + is None + ) @pytest.mark.asyncio async def test_async_function(self): - async def my_fn(round: int, instruction: str, rr: RoundResult | None) -> str | None: + async def my_fn( + round: int, instruction: str, rr: RoundResult | None + ) -> str | None: if round == 0: return instruction if rr and rr.rewards and rr.rewards.get("exact_match", 0) < 1.0: @@ -103,7 +107,10 @@ async def stop(self, **kwargs): def _make_user_trial( - user: BaseUser, max_rounds: int = 5, oracle: bool = False, tmp_path: Path | None = None, + user: BaseUser, + max_rounds: int = 5, + oracle: bool = False, + tmp_path: Path | None = None, ) -> Trial: config = TrialConfig( task_path=Path("tasks/fake"), @@ -121,15 +128,27 @@ def _make_user_trial( verifier_dir = trial_dir / "verifier" verifier_dir.mkdir(parents=True, exist_ok=True) trial._trial_paths = type("P", (), {"verifier_dir": verifier_dir})() - trial._task = type("T", (), { - "config": type("C", (), { - "verifier": type("V", (), { - "timeout_sec": 30, - "env": {}, - })(), - "agent": type("A", (), {"timeout_sec": 60})(), - })(), - })() + trial._task = type( + "T", + (), + { + "config": type( + "C", + (), + { + "verifier": type( + "V", + (), + { + "timeout_sec": 30, + "env": {}, + }, + )(), + "agent": type("A", (), {"timeout_sec": 60})(), + }, + )(), + }, + )() trial._agent_cwd = "/app" return trial @@ -161,15 +180,25 @@ async def test_user_loop_calls_setup_and_run(self): user = RecordingUser(max_rounds=1) trial = _make_user_trial(user, max_rounds=3) - with patch.object(trial, "connect_as", new_callable=AsyncMock), \ - patch.object(trial, "execute", new_callable=AsyncMock, return_value=([], 0)), \ - patch.object(trial, "disconnect", new_callable=AsyncMock), \ - patch.object(trial, "soft_verify", new_callable=AsyncMock, return_value=({"exact_match": 1.0}, None, None)): - + with ( + patch.object(trial, "connect_as", new_callable=AsyncMock), + patch.object( + trial, "execute", new_callable=AsyncMock, return_value=([], 0) + ), + patch.object(trial, "disconnect", new_callable=AsyncMock), + patch.object( + trial, + "soft_verify", + new_callable=AsyncMock, + return_value=({"exact_match": 1.0}, None, None), + ), + ): await trial._run_user_loop() assert len(user.setup_calls) == 1 - assert user.setup_calls[0][0] == "Solve the task described in /app/instruction.md" + assert ( + user.setup_calls[0][0] == "Solve the task described in /app/instruction.md" + ) assert len(user.run_calls) == 2 # round 0 → prompt, round 1 → None assert user.run_calls[0][0] == 0 # round number assert user.run_calls[1][0] == 1 @@ -179,11 +208,19 @@ async def test_user_loop_passes_round_result(self): user = RecordingUser(max_rounds=2) trial = _make_user_trial(user, max_rounds=5) - with patch.object(trial, "connect_as", new_callable=AsyncMock), \ - patch.object(trial, "execute", new_callable=AsyncMock, return_value=([], 0)), \ - patch.object(trial, "disconnect", new_callable=AsyncMock), \ - patch.object(trial, "soft_verify", new_callable=AsyncMock, return_value=({"exact_match": 0.5}, "1 failed", None)): - + with ( + patch.object(trial, "connect_as", new_callable=AsyncMock), + patch.object( + trial, "execute", new_callable=AsyncMock, return_value=([], 0) + ), + patch.object(trial, "disconnect", new_callable=AsyncMock), + patch.object( + trial, + "soft_verify", + new_callable=AsyncMock, + return_value=({"exact_match": 0.5}, "1 failed", None), + ), + ): await trial._run_user_loop() # First call has no round_result @@ -198,6 +235,7 @@ async def test_user_loop_passes_round_result(self): @pytest.mark.asyncio async def test_user_loop_respects_max_rounds(self): """User that never stops is capped by max_user_rounds.""" + def never_stop(r, instr, rr): return "keep going" @@ -205,16 +243,23 @@ def never_stop(r, instr, rr): trial = _make_user_trial(user, max_rounds=3) call_count = 0 + async def mock_execute(prompts=None): nonlocal call_count call_count += 1 return [], 0 - with patch.object(trial, "connect_as", new_callable=AsyncMock), \ - patch.object(trial, "execute", side_effect=mock_execute), \ - patch.object(trial, "disconnect", new_callable=AsyncMock), \ - patch.object(trial, "soft_verify", new_callable=AsyncMock, return_value=(None, None, None)): - + with ( + patch.object(trial, "connect_as", new_callable=AsyncMock), + patch.object(trial, "execute", side_effect=mock_execute), + patch.object(trial, "disconnect", new_callable=AsyncMock), + patch.object( + trial, + "soft_verify", + new_callable=AsyncMock, + return_value=(None, None, None), + ), + ): await trial._run_user_loop() assert call_count == 3 @@ -224,11 +269,19 @@ async def test_oracle_access(self): user = RecordingUser(max_rounds=0) trial = _make_user_trial(user, oracle=True) - with patch.object(trial, "connect_as", new_callable=AsyncMock), \ - patch.object(trial, "execute", new_callable=AsyncMock, return_value=([], 0)), \ - patch.object(trial, "disconnect", new_callable=AsyncMock), \ - patch.object(trial, "soft_verify", new_callable=AsyncMock, return_value=(None, None, None)): - + with ( + patch.object(trial, "connect_as", new_callable=AsyncMock), + patch.object( + trial, "execute", new_callable=AsyncMock, return_value=([], 0) + ), + patch.object(trial, "disconnect", new_callable=AsyncMock), + patch.object( + trial, + "soft_verify", + new_callable=AsyncMock, + return_value=(None, None, None), + ), + ): await trial._run_user_loop() assert user.setup_calls[0][1] == "gold answer here" @@ -238,11 +291,13 @@ async def test_multi_role_raises(self): user = RecordingUser() config = TrialConfig( task_path=Path("tasks/fake"), - scenes=[Scene( - name="multi", - roles=[Role("a", "gemini"), Role("b", "gemini")], - turns=[Turn("a"), Turn("b")], - )], + scenes=[ + Scene( + name="multi", + roles=[Role("a", "gemini"), Role("b", "gemini")], + turns=[Turn("a"), Turn("b")], + ) + ], user=user, ) trial = Trial(config) @@ -260,11 +315,19 @@ async def run(self, round, instruction, rr=None): trial = _make_user_trial(FailingUser()) - with patch.object(trial, "connect_as", new_callable=AsyncMock), \ - patch.object(trial, "execute", new_callable=AsyncMock, return_value=([], 0)), \ - patch.object(trial, "disconnect", new_callable=AsyncMock), \ - patch.object(trial, "soft_verify", new_callable=AsyncMock, return_value=(None, None, None)): - + with ( + patch.object(trial, "connect_as", new_callable=AsyncMock), + patch.object( + trial, "execute", new_callable=AsyncMock, return_value=([], 0) + ), + patch.object(trial, "disconnect", new_callable=AsyncMock), + patch.object( + trial, + "soft_verify", + new_callable=AsyncMock, + return_value=(None, None, None), + ), + ): await trial._run_user_loop() assert "user.run() failed" in trial._error @@ -275,8 +338,10 @@ class TestSoftVerify: async def test_soft_verify_timeout(self): trial = _make_user_trial(PassthroughUser()) - with patch("harbor.Verifier") as MockVerifier, \ - patch("benchflow._sandbox.CLEANUP_CMD", "true"): + with ( + patch("harbor.Verifier") as MockVerifier, + patch("benchflow._sandbox.CLEANUP_CMD", "true"), + ): mock_instance = MockVerifier.return_value mock_instance.verify = AsyncMock(side_effect=TimeoutError()) @@ -290,8 +355,10 @@ async def test_soft_verify_timeout(self): async def test_soft_verify_crash(self): trial = _make_user_trial(PassthroughUser()) - with patch("harbor.Verifier") as MockVerifier, \ - patch("benchflow._sandbox.CLEANUP_CMD", "true"): + with ( + patch("harbor.Verifier") as MockVerifier, + patch("benchflow._sandbox.CLEANUP_CMD", "true"), + ): mock_instance = MockVerifier.return_value mock_instance.verify = AsyncMock(side_effect=RuntimeError("boom")) @@ -307,8 +374,10 @@ async def test_soft_verify_success(self): mock_result = type("VR", (), {"rewards": {"exact_match": 1.0}})() - with patch("harbor.Verifier") as MockVerifier, \ - patch("benchflow._sandbox.CLEANUP_CMD", "true"): + with ( + patch("harbor.Verifier") as MockVerifier, + patch("benchflow._sandbox.CLEANUP_CMD", "true"), + ): mock_instance = MockVerifier.return_value mock_instance.verify = AsyncMock(return_value=mock_result) @@ -323,9 +392,13 @@ async def test_soft_verify_runs_cleanup_cmd(self): mock_result = type("VR", (), {"rewards": {}})() - with patch("harbor.Verifier") as MockVerifier, \ - patch("benchflow._sandbox._build_cleanup_cmd", - return_value="echo cleanup_sentinel"): + with ( + patch("harbor.Verifier") as MockVerifier, + patch( + "benchflow._sandbox._build_cleanup_cmd", + return_value="echo cleanup_sentinel", + ), + ): mock_instance = MockVerifier.return_value mock_instance.verify = AsyncMock(return_value=mock_result)