Merged
28 commits
8540fda
release: benchflow 0.3.0
xdotli Apr 21, 2026
d6497f3
fix: openhands install — uv tool install or pip install openhands-ai …
xdotli Apr 22, 2026
d3345dd
release: benchflow 0.3.1 — fix openhands install
xdotli Apr 22, 2026
3ee1ade
fix: openhands install — bootstrap curl + uv in bare Ubuntu sandboxes
xdotli Apr 22, 2026
21b356d
fix: openhands install — set PATH before command -v check
xdotli Apr 22, 2026
6fbd320
fix: OpenHands agent — sandbox launch, auth, model, and cwd
xdotli Apr 22, 2026
046567c
feat: wire outbox messaging into Trial._run_scene() for multi-role sc…
xdotli Apr 22, 2026
a0dbe8e
fix: address Devin review — shell injection + outbox dir ownership
xdotli Apr 22, 2026
53b910c
fix: address Codex critical review — persistence, agent install, hone…
xdotli Apr 22, 2026
0fbfe6d
feat: notebook with real execution outputs from Daytona runs
xdotli Apr 22, 2026
a911c51
fix: address Devin review round 2 — credentials + disconnect for non-…
xdotli Apr 23, 2026
0b76a41
fix: clamp Daytona storage_mb to configurable max (10 GB default)
xdotli Apr 23, 2026
faeb0d2
Merge pull request #185 from benchflow-ai/fix/daytona-storage-clamp
xdotli Apr 23, 2026
94c05cf
Merge pull request #179 from benchflow-ai/feat/scene-outbox-messaging
xdotli Apr 23, 2026
cdccac7
Fix DinD compose exec missing project flags (#188)
xdotli Apr 23, 2026
ea1c728
release: benchflow 0.3.2 — Daytona DinD fix + storage clamp
xdotli Apr 23, 2026
d813dc2
Fix DinD compose ACP: use Daytona PTY WebSocket for live agent pipes …
xdotli Apr 25, 2026
e1cc115
merge: main → dev-0.3 (release prep for v0.3.2) (#195)
xdotli Apr 25, 2026
a582c0b
chore: clean up ruff lint debt across repo (#197)
xdotli Apr 25, 2026
9fd2863
fix: merge cfg.agent_env into connect_as() env resolution (#191)
EYH0602 Apr 25, 2026
871bd21
Fix/openhands sandbox launch (#182)
AmyTao Apr 25, 2026
d4d61ae
docs: use uv tool install instead of pip install (#176)
xdotli Apr 25, 2026
1fccf70
feat: wire sandbox_setup_timeout through all configs (#180)
EYH0602 Apr 25, 2026
e66537c
fix: stop copying root tool installs into sandbox home (#181)
EYH0602 Apr 25, 2026
1762123
feat: BaseUser abstraction + per-task verifier hardening opt-outs (#194)
xdotli Apr 25, 2026
5c583ed
Infra fixes for SkillsBench Apr 2026 trials: PTY timeout + Daytona lo…
xdotli Apr 25, 2026
46d5eda
fix: env-file path mismatch in DinD compose mode (#198)
xdotli Apr 25, 2026
e77e190
Merge remote-tracking branch 'origin/main' into release-0.3.2-rebase
xdotli Apr 25, 2026
4 changes: 2 additions & 2 deletions .claude/skills/benchflow/SKILL.md
@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

1. Check if benchflow is installed: `pip show benchflow`
1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -199,7 +199,7 @@ asyncio.run(main())
## Setup

```bash
pip install benchflow # or: pip install -e . (from source)
uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

1. Check if benchflow is installed: `pip show benchflow`
1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
## Setup

```bash
pip install benchflow # or: pip install -e . (from source)
uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

1. Check if benchflow is installed: `pip show benchflow`
1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
## Setup

```bash
pip install benchflow # or: pip install -e . (from source)
uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -2,7 +2,7 @@

Multi-turn agent benchmarking with ACP.

Docs: `docs/quickstart.md`, `docs/cli-reference.md`, `docs/api-reference.md`, `docs/task-authoring.md`, `docs/use-cases.md`.
Docs: `docs/quickstart.md`, `docs/cli-reference.md`, `docs/api-reference.md`, `docs/task-authoring.md`, `docs/use-cases.md`, `docs/progressive-disclosure.md`.

## Setup

23 changes: 21 additions & 2 deletions README.md
@@ -21,10 +21,10 @@ BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It s
## Install

```bash
pip install benchflow==0.3.0a3
uv tool install benchflow
```

Requires Python 3.12+. For cloud sandboxes, set `DAYTONA_API_KEY`.
Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). For cloud sandboxes, set `DAYTONA_API_KEY`.

## Quick Start

@@ -129,6 +129,21 @@ bench environment create Spin up sandbox from task dir
bench environment list List active sandboxes
```

## Terminology

| Term | Definition | Example |
|------|-----------|---------|
| **Turn** | One prompt in one ACP session — one role acts | Coder writes a regex |
| **Multi-turn** | Same role, multiple turns | Self-review: agent → agent |
| **Round** | One A→B exchange between different roles | Coder → Reviewer |
| **Multi-round** | Different roles exchanging turns | Coder → Reviewer → Coder |
| **Scene** | Interaction region with roles + turns | A code-review scene |
| **Trial** | Sequence of scenes in a shared sandbox | Skill-gen → Solve |

**Inter-role messaging:** In multi-role scenes, agents communicate via outbox files.
An agent writes `/app/.outbox/{recipient}.json` with `{"to": "role", "content": "..."}`.
The scheduler reads these after each turn and injects the message into the next role's prompt.
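A minimal sketch of this handoff (only the `/app/.outbox/{recipient}.json` path and the `{"to", "content"}` shape come from the scheme above; the helper names and the consume-on-read behavior are assumptions about the scheduler):

```python
import json
from pathlib import Path


def send_message(outbox: Path, recipient: str, content: str) -> Path:
    """Agent side: drop a message the scheduler will deliver after this turn."""
    outbox.mkdir(parents=True, exist_ok=True)
    path = outbox / f"{recipient}.json"
    path.write_text(json.dumps({"to": recipient, "content": content}))
    return path


def drain_messages(outbox: Path, recipient: str) -> list[str]:
    """Scheduler side: collect pending messages for a role, consuming each file."""
    path = outbox / f"{recipient}.json"
    if not path.exists():
        return []
    payload = json.loads(path.read_text())
    path.unlink()  # consume so the message is injected into only one prompt
    return [payload["content"]]
```

In a real trial the outbox lives at `/app/.outbox/` inside the sandbox; a drained message is prepended to the recipient's next prompt.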

## Architecture

```
bf.run(config)
→ trial.start() # spin up sandbox, upload task files
→ for scene in config.scenes:
→ trial._run_scene(scene) # connect/execute/disconnect per role
→ setup /app/.outbox/ # (multi-role scenes only)
→ for turn in scene.turns:
→ read outbox → inject messages into prompt
→ connect as role → execute → disconnect
→ trial.verify() # run verifier, score
→ trial.cleanup() # stop sandbox
```
2 changes: 1 addition & 1 deletion benchmarks/followup-bench/runner.py
@@ -28,7 +28,7 @@
from benchflow._acp_run import connect_acp, execute_prompts
from benchflow._agent_setup import install_agent
from benchflow._scene import Role, Scene
from benchflow.agents.registry import AGENTS, AGENT_LAUNCH
from benchflow.agents.registry import AGENT_LAUNCH, AGENTS
from benchflow.runtime import Environment

logger = logging.getLogger(__name__)
78 changes: 69 additions & 9 deletions docs/api-reference.md
@@ -5,7 +5,7 @@ The Trial/Scene API is the primary way to run agent benchmarks programmatically.
## Install

```bash
pip install benchflow==0.3.0a3
uv tool install benchflow
```

## Quick Start
@@ -34,6 +34,7 @@ config = TrialConfig(
task_path=Path("tasks/my-task"),
scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
environment="daytona",
sandbox_setup_timeout=120,
)

# Multi-scene BYOS (skill-gen → solve)
@@ -46,9 +47,13 @@
turns=[Turn("solver")]),
],
environment="daytona",
sandbox_setup_timeout=120,
)
```

Set `sandbox_setup_timeout` when sandbox user setup needs more than the default 120 seconds.
The same field is also available on `JobConfig` and `RuntimeConfig`.

### Scene

One interaction region — roles take turns executing prompts.
@@ -57,7 +62,9 @@
# Single-role shortcut
scene = Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")

# Multi-role with turn order
# Multi-role with turn order (coder-reviewer pattern)
# Agents communicate via outbox: write /app/.outbox/{recipient}.json
# Scheduler reads outbox after each turn, injects into next role's prompt
scene = Scene(
name="coder-reviewer",
roles=[
Expand All @@ -66,8 +73,9 @@ scene = Scene(
],
turns=[
Turn("coder"), # None prompt = instruction.md
Turn("reviewer", "Review..."),
Turn("coder", "Fix issues..."),
Turn("reviewer", "Review the code. Write feedback to "
'/app/.outbox/coder.json as {"to":"coder","content":"..."}'),
Turn("coder", "Fix the issues."), # reviewer's feedback auto-injected
],
)
```
@@ -95,6 +103,20 @@
await trial.verify()
await trial.cleanup()
```

### RuntimeConfig

Runtime-level configuration for the `Agent + Environment` execution path.

```python
from benchflow.runtime import Agent, Environment, Runtime, RuntimeConfig

config = RuntimeConfig(sandbox_setup_timeout=300)
agent = Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = Environment.from_task("tasks/X", backend="daytona")
runtime = Runtime(env, agent, config=config)
result = await runtime.execute()
```

### bf.run()

Convenience function — multiple calling conventions:
@@ -108,10 +130,16 @@
# 2. Agent + Environment (0.3 style)
agent = bf.Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = bf.Environment.from_task("tasks/X", backend="daytona")
result = await bf.run(agent, env)
runtime_config = bf.RuntimeConfig(sandbox_setup_timeout=300)
result = await bf.run(agent, env, runtime_config)

# 3. String shortcut (simplest)
result = await bf.run("gemini", task_path="tasks/X", model="gemini-3.1-flash-lite-preview")
result = await bf.run(
"gemini",
task_path="tasks/X",
model="gemini-3.1-flash-lite-preview",
config=bf.RuntimeConfig(sandbox_setup_timeout=300),
)
```

## Trial Lifecycle
Expand All @@ -122,17 +150,38 @@ Trial.run()
├─ setup() — resolve config, create env object
├─ start() — spin up sandbox, upload task files, start services
├─ install_agent() — install agent binary, credentials, sandbox user
│ (sandbox user setup: create non-root user, prepare
│ small config/auth dirs, chown the workspace — no
│ recursive copy of /root tool trees; agent binaries
│ must live on shared prefixes like /usr/local/bin)
├─ for scene in scenes:
│ └─ _run_scene(scene)
│ ├─ connect_as(role) — open ACP session for this role
│ ├─ execute(prompts) — send prompts, collect trajectory
│ └─ disconnect() — kill agent process, clean up
│ ├─ setup /app/.outbox/ — (multi-role scenes only)
│ └─ for turn in scene.turns:
│ ├─ read outbox — inject messages into prompt
│ ├─ connect_as(role) — open ACP session for this role
│ ├─ execute(prompts) — send prompts, collect trajectory
│ └─ disconnect() — kill agent process, clean up
├─ verify() — run verifier, collect rewards
└─ cleanup() — stop sandbox
```

Key: `disconnect()` kills the agent process between scenes to prevent context bleed. Each scene gets a fresh agent session.

## Multi-Turn vs Multi-Round

| Pattern | Roles | Turns | Communication | Example |
|---------|-------|-------|---------------|---------|
| **Single-turn** | 1 | 1 | — | Baseline benchmark |
| **Multi-turn** | 1 | 2+ | Same session, sequential prompts | Self-review |
| **Multi-round** | 2+ | 2+ | Outbox files between roles | Coder + Reviewer |

**Multi-turn** = same agent gets multiple prompts. Use when a second pass catches errors (self-review, iterative refinement). The agent keeps its context across turns.

**Multi-round** = different agents exchange turns. Use when tasks need multiple perspectives (code review, client-advisor). The scheduler reads outbox files and injects messages.

Both use the same API — `TrialConfig` with different `Scene` configurations.
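The contrast can be made concrete with two `Scene` sketches. The import path and the `Role(...)` keyword arguments are assumptions — those signatures are collapsed in the diff above — but the turn structure follows the table:

```python
from benchflow import Role, Scene, Turn  # public import path assumed

# Multi-turn: one role, several prompts, context carried across turns
self_review = Scene(
    name="self-review",
    roles=[Role("solver", agent="gemini", model="gemini-3.1-flash-lite-preview")],
    turns=[
        Turn("solver"),  # None prompt = instruction.md
        Turn("solver", "Review your previous answer and fix any mistakes."),
    ],
)

# Multi-round: two roles alternating; outbox messages injected between turns
code_review = Scene(
    name="code-review",
    roles=[
        Role("coder", agent="gemini", model="gemini-3.1-flash-lite-preview"),
        Role("reviewer", agent="gemini", model="gemini-3.1-flash-lite-preview"),
    ],
    turns=[Turn("coder"), Turn("reviewer"), Turn("coder")],
)
```

Either scene drops into `TrialConfig(scenes=[...])` unchanged; only the role/turn shape differs.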

## Multi-Agent Patterns

### Coder + Reviewer (followup-bench)
Expand Down Expand Up @@ -169,6 +218,17 @@ config = TrialConfig(
)
```

## 0.3 Limitations

The Scene API in 0.3 covers coder-reviewer and multi-turn patterns. It does **not** yet support:

- **Dynamic termination** — turn count is fixed at config time. A "user" role cannot decide to stop early based on agent output. Workaround: use `max_rounds` in the standalone `_scene.py` scheduler.
- **Oracle access** — no mechanism for a "user" role to read `/solution` during setup.
- **Per-round verification** — `verify()` runs once after all scenes complete, not between rounds.
- **Inter-round trajectory inspection** — a "user" role cannot read the agent's trajectory between turns.

These are tracked for 0.4. See the [Harbor PR #1462 mapping](docs/notebooks/scene-patterns.ipynb) for details.

## YAML Trial Configs

```python
Expand Down
7 changes: 6 additions & 1 deletion docs/cli-reference.md
@@ -40,7 +40,8 @@
bench eval create \
-a gemini \
-m gemini-3.1-flash-lite-preview \
-e daytona \
-c 64
-c 64 \
--sandbox-setup-timeout 300
```

| Flag | Default | Description |
@@ -53,6 +54,7 @@
| `--concurrency`, `-c` | `4` | Max concurrent tasks (batch mode only) |
| `--jobs-dir`, `-o` | `jobs` | Output directory |
| `--sandbox-user` | `agent` | Sandbox user (null for root) |
| `--sandbox-setup-timeout` | `120` | Timeout in seconds for sandbox user setup |

### bench eval list

@@ -145,6 +147,7 @@ bench environment list
task_dir: .ref/terminal-bench-2
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300

scenes:
- name: solve
@@ -165,6 +168,7 @@
environment: daytona
concurrency: 64
max_retries: 2
sandbox_setup_timeout: 300
```

### Multi-scene (BYOS skill generation)
@@ -173,6 +177,7 @@
task_dir: tasks/
environment: daytona
concurrency: 10
sandbox_setup_timeout: 300

scenes:
- name: skill-gen