
docs: v0.3.2 overhaul — concepts + lifecycle + progressive disclosure case study #203

Open
xdotli wants to merge 11 commits into main from docs/v0.3.2-overhaul

Conversation


xdotli (Member) commented Apr 25, 2026

Why

Now that v0.3.2 is shipped (BaseUser, hardening opt-outs, DinD compose, lint cleanup), the docs need to catch up. They were sprawling, version-stale (README badge said v0.3.0a3), missing a mental-model page, and didn't surface the SWE-bench Pro / Josh @ GitHub case study or the labs/ research artifacts.

What changed

New files:

  • `docs/concepts.md` — primitives (Task/Agent/Environment/Verifier/Trial), full trial-lifecycle ASCII diagram, Scenes/Roles/Turns, User abstraction summary, multi-turn vs multi-round vs multi-scene table.
  • `docs/sandbox-hardening.md` — threat model, 10-step hardening sequence, `[verifier.hardening]` opt-outs, labs/ as empirical validation.

Rewritten:

  • `docs/progressive-disclosure.md` — full SWE-bench Pro case study (built for Josh @ GitHub), lifecycle integration diagram, soft-verify vs full-verify split, Harbor #1316 parity comparison, expanded API reference.
  • `README.md` — audience routing (eval researchers / task authors / agent builders / Harbor migrators), documentation index, featured progressive-disclosure callout, Research artifacts section linking labs/. PyPI badge 0.3.0a3 → 0.3.2.
  • `CLAUDE.md` — trimmed to essential conventions only.

Renamed (content unchanged unless noted):

  • `docs/quickstart.md` → `docs/getting-started.md` (rewritten lighter — links forward to other docs)
  • `docs/skill-eval-guide.md` → `docs/skill-eval.md`
  • `docs/cli-reference.md` → `docs/reference/cli.md`
  • `docs/api-reference.md` → `docs/reference/python-api.md`
  • `docs/notebooks/` → `docs/examples/`

Unchanged:

  • `docs/task-authoring.md`, `docs/use-cases.md` (already solid; will tighten in a follow-up if needed)
  • `labs/` stays at repo root with relative imports intact; only docs/sandbox-hardening.md and README link to it

Validation

  • `ruff check .` clean
  • All internal cross-references updated (no broken links to old paths)
  • `README` structure validated against the audience-routing principle: each role has a one-line entry point

Out of scope (follow-ups)

  • Tightening `use-cases.md` (currently 467 lines)
  • Splitting `task-toml.md` into a separate reference page when there's enough material
  • Architecture deep-dive in `docs/explanation/` if the audience asks for it

Test plan

  • CI runs against the merge commit
  • Eyeball-pass of README + concepts + progressive-disclosure for accuracy
  • Devin/human review


…sandbox hardening

Wholesale restructure now that v0.3.2 is shipped. Goals:
- audience routing from README so eval researchers / task authors / agent
  builders / Harbor migrators land on the right page
- pin the mental model in one place (Trial / Scene / Role / Verifier +
  full lifecycle diagram) instead of scattering it across api-reference
  and use-cases
- promote the SWE-bench Pro / Josh @ GitHub progressive-disclosure case
  study to a first-class section in progressive-disclosure.md, with
  Harbor #1316 parity comparison and the soft-verify vs full-verify split
- give sandbox hardening its own page and route to labs/ for the
  empirical validation

## Structure

  README.md                          — audience routing, doc map, featured + research
  CLAUDE.md                          — minimal: setup, conventions, release shape
  docs/getting-started.md            — was quickstart.md, slim, links forward
  docs/concepts.md                   — NEW: primitives, lifecycle, multi-turn vs round vs scene
  docs/progressive-disclosure.md     — REWRITE: Josh case study, lifecycle integration,
                                        soft- vs full-verify, Harbor #1316 parity table
  docs/sandbox-hardening.md          — NEW: threat model, hardening sequence, labs index
  docs/task-authoring.md             — unchanged (already solid)
  docs/use-cases.md                  — unchanged (multi-agent patterns)
  docs/skill-eval.md                 — was skill-eval-guide.md (rename only)
  docs/examples/                     — was docs/notebooks/ (rename only)
  docs/reference/cli.md              — was docs/cli-reference.md
  docs/reference/python-api.md       — was docs/api-reference.md

labs/ stays at repo root — runnable research code with relative imports;
referenced from docs/sandbox-hardening.md and README "Research artifacts."

## What's new content-wise

- docs/concepts.md (new): the five primitives (Task / Agent / Environment /
  Verifier / Trial), trial lifecycle ASCII diagram, Scenes/Roles/Turns,
  User abstraction summary, multi-turn vs multi-round vs multi-scene table.
- docs/progressive-disclosure.md (rewrite): full SWE-bench Pro case study
  with the 2026-04-24 Daytona validation table, lifecycle integration
  diagram showing where _run_user_loop plugs in, soft-verify vs full-verify
  comparison, Harbor #1316 parity discussion, expanded API reference.
- docs/sandbox-hardening.md (new): the BenchJack/Meerkat threat context,
  the 10-step hardening sequence, the per-task [verifier.hardening]
  opt-out semantics, and labs/ as empirical validation.
- README.md: PyPI badge 0.3.0a3 → 0.3.2, audience routing table,
  documentation index, featured progressive-disclosure callout, research
  artifacts section linking labs.
- CLAUDE.md: trimmed to the essential conventions (test discipline,
  human review, trunk-based, release shape).

Cross-references updated throughout (no broken links to old paths).

xdotli added 3 commits April 25, 2026 05:42
…re data

- Removed all "Josh @ GitHub/Microsoft" / "Josh's" references from docs,
  README, example script, and notebook. Reframed as "the SWE-bench Pro
  progressive-disclosure use case" with Harbor #1316 as the cited PR.

- Ran progressive disclosure (3 rounds, Gemini 3.1 Pro Preview, Daytona)
  on all 5 oracle-passing SWE-bench Pro tasks. Results aggregated to
  experiments/swebench-pro-progressive-results.json and rendered into
  the notebook + docs:
    ansible      error: stdout closed at 17min
    flipt        0.0 (195 tools, 3 rounds)
    openlibrary  1.0 (82 tools, 3 rounds — soft 0.0 each, final 1.0)
    navidrome    0.0 (145 tools, 3 rounds)
    qutebrowser  error: agent timeout at 50min

  Honest take in the docs: infrastructure works, two infra failures
  unrelated to disclosure, no measurable lift on flipt with this model
  on this run. Single-model run, not a paper comparison.

- Fixed AttributeError in swebench_pro_user_dogfood.py (RunResult has
  no trial_dir attribute) — script was crashing post-trial.
…h docs

Fixes the two infra failure modes surfaced by the 5-task progressive
disclosure run on Daytona:

1. ansible: 'Process closed stdout (rc=None)' after 17min with 0 tool calls
2. qutebrowser: 'Agent timed out after 3000s' with 0 tool calls

Both came from agents hanging (no output, no progress) while the local
subprocess wrapper was still alive — the existing error messages didn't
make the failure mode actionable.

## Changes

**src/benchflow/process.py**: when stdout returns EOF, distinguish
'local subprocess still alive but transport closed' (rc=None — Daytona
idle sleep, SSH drop, agent hung) from 'local subprocess actually
exited' (rc set). Surface the distinction in the error message.
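The distinction can be sketched as follows — a minimal illustration of the logic described above, not benchflow's actual code (the function name is made up):

```python
def diagnose_stdout_eof(returncode):
    """Classify a stdout EOF by polling the local wrapper's returncode."""
    if returncode is None:
        # Wrapper still alive: the transport dropped under it (Daytona idle
        # sleep, SSH drop) or the remote agent hung without exiting.
        return ("Process closed stdout but is still running (rc=None): "
                "transport closed or agent hung")
    # Wrapper actually exited; rc carries the real exit status.
    return f"Process exited with rc={returncode} and closed stdout"
```

Either way the caller gets an actionable message instead of a bare EOF.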

**src/benchflow/_acp_run.py**: new `idle_timeout` parameter on
`execute_prompts()` and a `_prompt_with_idle_watchdog()` helper. Polls
`session.tool_calls` every few seconds and aborts the prompt if no new
tool call arrives for `idle_timeout` seconds. Catches the qutebrowser-
style hang where the agent connects, never produces a tool call, and
chews through the full agent timeout (50min in our case).
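The watchdog shape might look roughly like this — an illustrative sketch under assumed names (`prompt_with_idle_watchdog` and `activity_count` approximate the helpers described above, not the exact API):

```python
import asyncio

async def prompt_with_idle_watchdog(prompt_coro, activity_count,
                                    idle_timeout, poll_interval=0.5):
    """Run the prompt; abort if activity_count() stops growing for idle_timeout s."""
    loop = asyncio.get_running_loop()
    task = asyncio.ensure_future(prompt_coro)
    last_count = activity_count()
    last_activity = loop.time()
    try:
        while not task.done():
            await asyncio.sleep(poll_interval)
            if task.done():
                break  # completed during the sleep; skip the timeout checks
            count = activity_count()
            if count != last_count:
                last_count, last_activity = count, loop.time()
            elif loop.time() - last_activity >= idle_timeout:
                raise TimeoutError(
                    f"Agent idle for {idle_timeout}s with no new activity")
        return await task
    finally:
        # Runs on the timeout branch AND on external cancellation: cancel the
        # prompt and drain it so asyncio never warns about an unretrieved
        # exception.
        if not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass
```

A hung agent that never produces a tool call is now aborted after `idle_timeout` seconds instead of burning the full wall-clock budget.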

**src/benchflow/trial.py**: new `TrialConfig.agent_idle_timeout` field
(default 600s = 10min), wired through to `execute_prompts()`. Tasks /
callers can override with `None` to disable, or with a higher number
for tasks that legitimately spend long stretches in agent thinking.
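As a shape sketch (the real `TrialConfig` has many more fields; only the field described above is shown, with assumed types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialConfig:
    # Only the field discussed above; the real class carries much more.
    agent_idle_timeout: Optional[float] = 600.0  # seconds; None disables

# Per-caller overrides, per the description above:
patient = TrialConfig(agent_idle_timeout=1800)   # long agent-thinking stretches
disabled = TrialConfig(agent_idle_timeout=None)  # watchdog off entirely
```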

**docs/getting-started.md**: OAuth / subscription auth section. Lists
the three agents that pick up host CLI logins (claude-agent-acp,
codex-acp, gemini) and the detect files for each. 'No API key needed
if you ran `claude login`'.

## Validation

- ruff clean
- 88 tests pass (test_user, test_process, test_sandbox_hardening)
- Will re-run ansible + qutebrowser progressive disclosure to confirm
  the new idle timeout aborts them cleanly with a clear error message
…p-token)

Per Anthropic Claude Code authentication docs, the third auth path users
have is generating a 1-year OAuth token with 'claude setup-token' and
setting CLAUDE_CODE_OAUTH_TOKEN. This is the right option for CI /
headless / sandbox environments where browser login isn't available.

Reorganized the auth section into three numbered options:
1. Host CLI login (subscription_auth, file detection)
2. Long-lived CLAUDE_CODE_OAUTH_TOKEN env var (Claude only)
3. API key (works with every agent)

Plus a precedence note from Anthropic's auth docs.

benchflow already auto-inherits CLAUDE_CODE_OAUTH_TOKEN per
src/benchflow/_agent_env.py:63 — this is just docs catching up.
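The inheritance behaviour can be illustrated like this — hypothetical names and variable list, not `_agent_env.py`'s actual code:

```python
import os

# Illustrative allow-list; the real one lives in src/benchflow/_agent_env.py.
AUTH_VARS = ("CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY")

def inherited_auth_env():
    """Forward only the allow-listed auth vars that are set on the host."""
    return {k: os.environ[k] for k in AUTH_VARS if k in os.environ}
```

Unset variables are simply omitted, so the agent environment never sees empty-string credentials.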

After the 'agent_idle_timeout + EOF diagnostics' fix, re-ran ansible
and qutebrowser (the two that flaked on first attempt). Both completed
3 rounds and reached final reward 1.0.

Final 5-task results (Gemini 3.1 Pro, Daytona, 3 rounds each):

  Task         Final  Tools  Soft-verify (rounds 1-3)  Notes
  ansible      1.0    126    0.0 / 0.0 / 0.0           passed on retry (1st: stdout EOF)
  flipt        0.0    195    0.0 / 0.0 / 0.0           hard fail
  openlibrary  1.0    82     0.0 / 0.0 / 0.0           baseline already passed
  navidrome    0.0    145    0.0 / 0.0 / 0.0           hard fail
  qutebrowser  1.0    183    0.0 / 0.0 / 0.0           passed on retry (1st: 50min timeout)

3/5 final pass. flipt and navidrome stayed at 0.0 across all rounds —
Gemini 3.1 Pro doesn't crack them with this hint schedule.

Updated:
- experiments/swebench-pro-progressive-results.json
- examples/swebench_pro_progressive_disclosure.ipynb (re-executed with new data)
- docs/progressive-disclosure.md validation table + commentary

Three findings, all real:

1. docs/reference/python-api.md:230 — relative link broken after the
   docs/api-reference.md → docs/reference/python-api.md move. Fix: use
   ../examples/ instead of docs/examples/.

2. _acp_run.py _prompt_with_idle_watchdog — race condition: after
   `await asyncio.sleep(poll_interval)`, prompt_task could have
   completed during the sleep. Without re-checking `done()` before the
   timeout evaluations, we'd cancel a completed task and silently
   discard a successful result. Added a `done()` re-check that breaks
   out of the loop.

3. trial.py:674 — the TimeoutError handler was overwriting the idle
   watchdog's detailed message ("Agent idle for 600s with no new tool
   call (last activity 642s ago, 0 tool calls so far)") with a generic
   "Agent timed out after {self._timeout}s" using the wall-clock
   budget, not the idle timeout. Preserve the watchdog's message when
   present; fall back to the generic message only when the exception
   has no detail.
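Finding 3's fix reduces to a small precedence rule, sketched here with a hypothetical helper name (not `trial.py`'s actual handler):

```python
def timeout_message(exc, wall_clock_budget):
    """Prefer the watchdog's detailed message; fall back to the generic one."""
    detail = str(exc)
    if detail:
        return detail  # e.g. "Agent idle for 600s with no new tool call ..."
    return f"Agent timed out after {wall_clock_budget}s"
```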

Devin caught: _prompt_with_idle_watchdog created prompt_task with
asyncio.create_task() but only cancelled it on the explicit timeout
branches. If the parent coroutine was cancelled externally
(asyncio.timeout, task.cancel(), Ctrl+C), CancelledError propagated
out of the sleep without cancelling prompt_task — leaking the agent
prompt until Trial.cleanup() eventually killed the process, plus
asyncio's "Task exception was never retrieved" warning.

Wrap the polling loop in try/finally so cancel + drain always runs,
including the implicit-cancellation path. Both timeout branches now
just `raise TimeoutError(...)` without their own cancel/drain block —
the finally handles it uniformly.
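The cancel-and-drain-in-finally pattern, shown generically (not the benchflow code itself; `asyncio.wait_for` stands in for the polling loop):

```python
import asyncio

async def supervise(coro, budget):
    """Run coro with a wall-clock budget; clean the child up on every exit path."""
    task = asyncio.ensure_future(coro)
    try:
        # shield() keeps wait_for's timeout from cancelling `task` directly,
        # so the finally below owns cleanup on every path — timeout,
        # external cancellation, or Ctrl+C alike.
        return await asyncio.wait_for(asyncio.shield(task), budget)
    finally:
        if not task.done():
            task.cancel()
            try:
                await task  # drain: avoids "Task exception was never retrieved"
            except asyncio.CancelledError:
                pass
```

Because the cleanup lives in `finally`, the timeout branch only has to raise; nothing leaks even when the supervisor itself is cancelled mid-sleep.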

xdotli added 3 commits April 25, 2026 10:43
Devin caught: the execute_prompts docstring says idle_timeout fires
when "no tool call OR message arrives", but _prompt_with_idle_watchdog
was only polling session.tool_calls. ACPSession also accumulates
message_chunks and thought_chunks via handle_update — agents actively
streaming text without producing a new tool call would be falsely
aborted.

Use a single _activity_count() that sums tool_calls + message_chunks +
thought_chunks so any of the three resets the idle timer. Updated the
TimeoutError message to mention all three categories.
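A minimal sketch of the combined counter (the session field names come from the description above; the function name is assumed):

```python
from types import SimpleNamespace

def activity_count(session):
    # Any of the three streams growing counts as activity and resets the
    # idle timer; a streaming-but-toolless agent is no longer aborted.
    return (len(session.tool_calls)
            + len(session.message_chunks)
            + len(session.thought_chunks))

# e.g. an agent that has streamed text but made no tool call yet:
session = SimpleNamespace(tool_calls=[], message_chunks=["still thinking"],
                          thought_chunks=[])
```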
Single source of truth for runnable examples + teaching notebooks.
Previous split (docs/examples/ for teaching, examples/ for scripts)
was arbitrary — both directories held the same kinds of files (.py
demo scripts and .ipynb notebooks), and the duplication confused
readers about where to look.

Now: examples/ at repo root holds everything; docs/ has no examples
subdir. labs/ stays separate at repo root for research artifacts
(separate purpose: validation + reproducible experiments).

Files moved (git mv):
  docs/examples/coder-reviewer-demo.py    → examples/
  docs/examples/scene-patterns.{ipynb,md,py} → examples/
  docs/examples/nanofirm-task/            → examples/

References updated:
  README.md — single "examples/" link
  examples/scene-patterns.md — `python docs/examples/...` → `python examples/...`
  examples/coder-reviewer-demo.py — same
  examples/scene-patterns.py — same

Devin caught: poll_interval was computed solely from idle_timeout
(min(30, max(5, idle_timeout // 4))). With the default idle_timeout=600,
poll_interval was always 30s. The wall-clock deadline was only checked
after each `await asyncio.sleep(poll_interval)`, so a task with
timeout_sec=60 could overshoot to 90s (50%); timeout_sec=30 could
overshoot to 60s (100%).

Pre-PR, execute_prompts used asyncio.wait_for() which enforced the
wall-clock timeout precisely. Adding the idle watchdog as the new
default path silently regressed timeout precision.

Fix: factor timeout into the poll interval too — `min(30, idle_timeout
// 4, max(1, timeout // 4))` floored at 1. Short total budgets now get
proportionally shorter poll intervals.
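The fixed formula, as a standalone sketch:

```python
def poll_interval(idle_timeout, timeout):
    # Factor the wall-clock budget into the poll interval too, floored at
    # 1 second, so short total budgets get proportionally shorter polls.
    return max(1, min(30, idle_timeout // 4, max(1, timeout // 4)))
```

With the default `idle_timeout=600` and `timeout=60` the poll drops from 30s to 15s, so the worst-case overshoot (at most one poll interval) shrinks from 50% of the budget to 25%.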