Merged
28 commits
8540fda
release: benchflow 0.3.0
xdotli Apr 21, 2026
d6497f3
fix: openhands install — uv tool install or pip install openhands-ai …
xdotli Apr 22, 2026
d3345dd
release: benchflow 0.3.1 — fix openhands install
xdotli Apr 22, 2026
3ee1ade
fix: openhands install — bootstrap curl + uv in bare Ubuntu sandboxes
xdotli Apr 22, 2026
21b356d
fix: openhands install — set PATH before command -v check
xdotli Apr 22, 2026
6fbd320
fix: OpenHands agent — sandbox launch, auth, model, and cwd
xdotli Apr 22, 2026
046567c
feat: wire outbox messaging into Trial._run_scene() for multi-role sc…
xdotli Apr 22, 2026
a0dbe8e
fix: address Devin review — shell injection + outbox dir ownership
xdotli Apr 22, 2026
53b910c
fix: address Codex critical review — persistence, agent install, hone…
xdotli Apr 22, 2026
0fbfe6d
feat: notebook with real execution outputs from Daytona runs
xdotli Apr 22, 2026
a911c51
fix: address Devin review round 2 — credentials + disconnect for non-…
xdotli Apr 23, 2026
0b76a41
fix: clamp Daytona storage_mb to configurable max (10 GB default)
xdotli Apr 23, 2026
faeb0d2
Merge pull request #185 from benchflow-ai/fix/daytona-storage-clamp
xdotli Apr 23, 2026
94c05cf
Merge pull request #179 from benchflow-ai/feat/scene-outbox-messaging
xdotli Apr 23, 2026
cdccac7
Fix DinD compose exec missing project flags (#188)
xdotli Apr 23, 2026
ea1c728
release: benchflow 0.3.2 — Daytona DinD fix + storage clamp
xdotli Apr 23, 2026
d813dc2
Fix DinD compose ACP: use Daytona PTY WebSocket for live agent pipes …
xdotli Apr 25, 2026
e1cc115
merge: main → dev-0.3 (release prep for v0.3.2) (#195)
xdotli Apr 25, 2026
a582c0b
chore: clean up ruff lint debt across repo (#197)
xdotli Apr 25, 2026
9fd2863
fix: merge cfg.agent_env into connect_as() env resolution (#191)
EYH0602 Apr 25, 2026
871bd21
Fix/openhands sandbox launch (#182)
AmyTao Apr 25, 2026
d4d61ae
docs: use uv tool install instead of pip install (#176)
xdotli Apr 25, 2026
1fccf70
feat: wire sandbox_setup_timeout through all configs (#180)
EYH0602 Apr 25, 2026
e66537c
fix: stop copying root tool installs into sandbox home (#181)
EYH0602 Apr 25, 2026
1762123
feat: BaseUser abstraction + per-task verifier hardening opt-outs (#194)
xdotli Apr 25, 2026
5c583ed
Infra fixes for SkillsBench Apr 2026 trials: PTY timeout + Daytona lo…
xdotli Apr 25, 2026
46d5eda
fix: env-file path mismatch in DinD compose mode (#198)
xdotli Apr 25, 2026
e77e190
Merge remote-tracking branch 'origin/main' into release-0.3.2-rebase
xdotli Apr 25, 2026
4 changes: 2 additions & 2 deletions .claude/skills/benchflow/SKILL.md
@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

1. Check if benchflow is installed: `pip show benchflow`
1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -199,7 +199,7 @@ asyncio.run(main())
## Setup

```bash
pip install benchflow # or: pip install -e . (from source)
uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

1. Check if benchflow is installed: `pip show benchflow`
1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
## Setup

```bash
pip install benchflow # or: pip install -e . (from source)
uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

@@ -23,7 +23,7 @@ Arguments passed: `$ARGUMENTS`

### No args or `status` — show current state

1. Check if benchflow is installed: `pip show benchflow`
1. Check if benchflow is installed: `uv tool list | grep benchflow`
2. Check if `.env` exists with API keys
3. Check available agents: `benchflow agents`
4. Show recent job results if any exist in `jobs/`
@@ -182,7 +182,7 @@ asyncio.run(main())
## Setup

```bash
pip install benchflow # or: pip install -e . (from source)
uv tool install benchflow # or: uv tool install -e . (from source)
source .env # ANTHROPIC_API_KEY, DAYTONA_API_KEY
```

2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -2,7 +2,7 @@

Multi-turn agent benchmarking with ACP.

Docs: `docs/quickstart.md`, `docs/cli-reference.md`, `docs/api-reference.md`, `docs/task-authoring.md`, `docs/use-cases.md`.
Docs: `docs/quickstart.md`, `docs/cli-reference.md`, `docs/api-reference.md`, `docs/task-authoring.md`, `docs/use-cases.md`, `docs/progressive-disclosure.md`.

## Setup

23 changes: 21 additions & 2 deletions README.md
@@ -21,10 +21,10 @@ BenchFlow runs AI agents against benchmark tasks in sandboxed environments. It s
## Install

```bash
pip install benchflow==0.3.0a3
uv tool install benchflow
```

Requires Python 3.12+. For cloud sandboxes, set `DAYTONA_API_KEY`.
Requires Python 3.12+ and [uv](https://docs.astral.sh/uv/). For cloud sandboxes, set `DAYTONA_API_KEY`.

## Quick Start

@@ -129,6 +129,21 @@ bench environment create Spin up sandbox from task dir
bench environment list List active sandboxes
```

## Terminology

| Term | Definition | Example |
|------|-----------|---------|
| **Turn** | One prompt in one ACP session — one role acts | Coder writes a regex |
| **Multi-turn** | Same role, multiple turns | Self-review: agent → agent |
| **Round** | One A→B exchange between different roles | Coder → Reviewer |
| **Multi-round** | Different roles exchanging turns | Coder → Reviewer → Coder |
| **Scene** | Interaction region with roles + turns | A code-review scene |
| **Trial** | Sequence of scenes in a shared sandbox | Skill-gen → Solve |

**Inter-role messaging:** In multi-role scenes, agents communicate via outbox files.
An agent writes `/app/.outbox/{recipient}.json` with `{"to": "role", "content": "..."}`.
The scheduler reads these after each turn and injects the message into the next role's prompt.
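A minimal sketch of this handoff (only the `/app/.outbox/{recipient}.json` path and the `{"to", "content"}` shape come from the scheme above; the helper names and the consume-on-read behavior are assumptions about the scheduler):

```python
import json
from pathlib import Path


def send_message(outbox: Path, recipient: str, content: str) -> Path:
    """Agent side: drop a message the scheduler will deliver after this turn."""
    outbox.mkdir(parents=True, exist_ok=True)
    path = outbox / f"{recipient}.json"
    path.write_text(json.dumps({"to": recipient, "content": content}))
    return path


def drain_messages(outbox: Path, recipient: str) -> list[str]:
    """Scheduler side: collect pending messages for a role, consuming each file."""
    path = outbox / f"{recipient}.json"
    if not path.exists():
        return []
    payload = json.loads(path.read_text())
    path.unlink()  # consume so the message is injected into only one prompt
    return [payload["content"]]
```

In a real trial the outbox lives at `/app/.outbox/` inside the sandbox; a drained message is prepended to the recipient's next prompt.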

## Architecture

```
bf.run(config)
→ trial.start() # spin up sandbox, upload task files
→ for scene in config.scenes:
→ trial._run_scene(scene) # connect/execute/disconnect per role
→ setup /app/.outbox/ # (multi-role scenes only)
→ for turn in scene.turns:
→ read outbox → inject messages into prompt
→ connect as role → execute → disconnect
→ trial.verify() # run verifier, score
→ trial.cleanup() # stop sandbox
```
2 changes: 1 addition & 1 deletion benchmarks/followup-bench/runner.py
@@ -28,7 +28,7 @@
from benchflow._acp_run import connect_acp, execute_prompts
from benchflow._agent_setup import install_agent
from benchflow._scene import Role, Scene
from benchflow.agents.registry import AGENTS, AGENT_LAUNCH
from benchflow.agents.registry import AGENT_LAUNCH, AGENTS
from benchflow.runtime import Environment

logger = logging.getLogger(__name__)
78 changes: 69 additions & 9 deletions docs/api-reference.md
@@ -5,7 +5,7 @@ The Trial/Scene API is the primary way to run agent benchmarks programmatically.
## Install

```bash
pip install benchflow==0.3.0a3
uv tool install benchflow
```

## Quick Start
@@ -34,6 +34,7 @@ config = TrialConfig(
task_path=Path("tasks/my-task"),
scenes=[Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")],
environment="daytona",
sandbox_setup_timeout=120,
)

# Multi-scene BYOS (skill-gen → solve)
@@ -46,9 +47,13 @@
turns=[Turn("solver")]),
],
environment="daytona",
sandbox_setup_timeout=120,
)
```

Set `sandbox_setup_timeout` when sandbox user setup needs more than the default 120 seconds.
The same field is also available on `JobConfig` and `RuntimeConfig`.

### Scene

One interaction region — roles take turns executing prompts.
@@ -57,7 +62,9 @@
# Single-role shortcut
scene = Scene.single(agent="gemini", model="gemini-3.1-flash-lite-preview")

# Multi-role with turn order
# Multi-role with turn order (coder-reviewer pattern)
# Agents communicate via outbox: write /app/.outbox/{recipient}.json
# Scheduler reads outbox after each turn, injects into next role's prompt
scene = Scene(
name="coder-reviewer",
roles=[
Expand All @@ -66,8 +73,9 @@ scene = Scene(
],
turns=[
Turn("coder"), # None prompt = instruction.md
Turn("reviewer", "Review..."),
Turn("coder", "Fix issues..."),
Turn("reviewer", "Review the code. Write feedback to "
'/app/.outbox/coder.json as {"to":"coder","content":"..."}'),
Turn("coder", "Fix the issues."), # reviewer's feedback auto-injected
],
)
```
@@ -95,6 +103,20 @@
await trial.verify()
await trial.cleanup()
```

### RuntimeConfig

Runtime-level configuration for the `Agent + Environment` execution path.

```python
from benchflow.runtime import Agent, Environment, Runtime, RuntimeConfig

config = RuntimeConfig(sandbox_setup_timeout=300)
agent = Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = Environment.from_task("tasks/X", backend="daytona")
runtime = Runtime(env, agent, config=config)
result = await runtime.execute()
```

### bf.run()

Convenience function — multiple calling conventions:
@@ -108,10 +130,16 @@
# 2. Agent + Environment (0.3 style)
agent = bf.Agent("gemini", model="gemini-3.1-flash-lite-preview")
env = bf.Environment.from_task("tasks/X", backend="daytona")
result = await bf.run(agent, env)
runtime_config = bf.RuntimeConfig(sandbox_setup_timeout=300)
result = await bf.run(agent, env, runtime_config)

# 3. String shortcut (simplest)
result = await bf.run("gemini", task_path="tasks/X", model="gemini-3.1-flash-lite-preview")
result = await bf.run(
"gemini",
task_path="tasks/X",
model="gemini-3.1-flash-lite-preview",
config=bf.RuntimeConfig(sandbox_setup_timeout=300),
)
```

## Trial Lifecycle
Expand All @@ -122,17 +150,38 @@ Trial.run()
├─ setup() — resolve config, create env object
├─ start() — spin up sandbox, upload task files, start services
├─ install_agent() — install agent binary, credentials, sandbox user
│ (sandbox user setup: create non-root user, prepare
│ small config/auth dirs, chown the workspace — no
│ recursive copy of /root tool trees; agent binaries
│ must live on shared prefixes like /usr/local/bin)
├─ for scene in scenes:
│ └─ _run_scene(scene)
│ ├─ connect_as(role) — open ACP session for this role
│ ├─ execute(prompts) — send prompts, collect trajectory
│ └─ disconnect() — kill agent process, clean up
│ ├─ setup /app/.outbox/ — (multi-role scenes only)
│ └─ for turn in scene.turns:
│ ├─ read outbox — inject messages into prompt
│ ├─ connect_as(role) — open ACP session for this role
│ ├─ execute(prompts) — send prompts, collect trajectory
│ └─ disconnect() — kill agent process, clean up
├─ verify() — run verifier, collect rewards
└─ cleanup() — stop sandbox
```

Key: `disconnect()` kills the agent process between scenes to prevent context bleed. Each scene gets a fresh agent session.

## Multi-Turn vs Multi-Round

| Pattern | Roles | Turns | Communication | Example |
|---------|-------|-------|---------------|---------|
| **Single-turn** | 1 | 1 | — | Baseline benchmark |
| **Multi-turn** | 1 | 2+ | Same session, sequential prompts | Self-review |
| **Multi-round** | 2+ | 2+ | Outbox files between roles | Coder + Reviewer |

**Multi-turn** = same agent gets multiple prompts. Use when a second pass catches errors (self-review, iterative refinement). The agent keeps its context across turns.

**Multi-round** = different agents exchange turns. Use when tasks need multiple perspectives (code review, client-advisor). The scheduler reads outbox files and injects messages.

Both use the same API — `TrialConfig` with different `Scene` configurations.
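The contrast can be made concrete with two `Scene` sketches. The import path and the `Role(...)` keyword arguments are assumptions — those signatures are collapsed in the diff above — but the turn structure follows the table:

```python
from benchflow import Role, Scene, Turn  # public import path assumed

# Multi-turn: one role, several prompts, context carried across turns
self_review = Scene(
    name="self-review",
    roles=[Role("solver", agent="gemini", model="gemini-3.1-flash-lite-preview")],
    turns=[
        Turn("solver"),  # None prompt = instruction.md
        Turn("solver", "Review your previous answer and fix any mistakes."),
    ],
)

# Multi-round: two roles alternating; outbox messages injected between turns
code_review = Scene(
    name="code-review",
    roles=[
        Role("coder", agent="gemini", model="gemini-3.1-flash-lite-preview"),
        Role("reviewer", agent="gemini", model="gemini-3.1-flash-lite-preview"),
    ],
    turns=[Turn("coder"), Turn("reviewer"), Turn("coder")],
)
```

Either scene drops into `TrialConfig(scenes=[...])` unchanged; only the role/turn shape differs.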

## Multi-Agent Patterns

### Coder + Reviewer (followup-bench)
Expand Down Expand Up @@ -169,6 +218,17 @@ config = TrialConfig(
)
```

## 0.3 Limitations

The Scene API in 0.3 covers coder-reviewer and multi-turn patterns. It does **not** yet support:

- **Dynamic termination** — turn count is fixed at config time. A "user" role cannot decide to stop early based on agent output. Workaround: use `max_rounds` in the standalone `_scene.py` scheduler.
- **Oracle access** — no mechanism for a "user" role to read `/solution` during setup.
- **Per-round verification** — `verify()` runs once after all scenes complete, not between rounds.
- **Inter-round trajectory inspection** — a "user" role cannot read the agent's trajectory between turns.

These are tracked for 0.4. See the [Harbor PR #1462 mapping](docs/notebooks/scene-patterns.ipynb) for details.

## YAML Trial Configs

```python
Expand Down
7 changes: 6 additions & 1 deletion docs/cli-reference.md
@@ -40,7 +40,8 @@
bench eval create \
-a gemini \
-m gemini-3.1-flash-lite-preview \
-e daytona \
-c 64
-c 64 \
--sandbox-setup-timeout 300
```

| Flag | Default | Description |
@@ -53,6 +54,7 @@
| `--concurrency`, `-c` | `4` | Max concurrent tasks (batch mode only) |
| `--jobs-dir`, `-o` | `jobs` | Output directory |
| `--sandbox-user` | `agent` | Sandbox user (null for root) |
| `--sandbox-setup-timeout` | `120` | Timeout in seconds for sandbox user setup |

### bench eval list

@@ -145,6 +147,7 @@ bench environment list
task_dir: .ref/terminal-bench-2
environment: daytona
concurrency: 64
sandbox_setup_timeout: 300

scenes:
- name: solve
@@ -165,6 +168,7 @@
environment: daytona
concurrency: 64
max_retries: 2
sandbox_setup_timeout: 300
```

### Multi-scene (BYOS skill generation)
@@ -173,6 +177,7 @@
task_dir: tasks/
environment: daytona
concurrency: 10
sandbox_setup_timeout: 300

scenes:
- name: skill-gen