
docs: v0.3.2 overhaul — concepts + lifecycle + progressive disclosure case study #203

Open
xdotli wants to merge 11 commits into main from docs/v0.3.2-overhaul

Conversation


xdotli (Member) commented Apr 25, 2026

Why

Now that v0.3.2 is shipped (BaseUser, hardening opt-outs, DinD compose, lint cleanup), the docs need to catch up. They were sprawling, version-stale (README badge said v0.3.0a3), missing a mental-model page, and didn't surface the SWE-bench Pro / Josh @ GitHub case study or the labs/ research artifacts.

What changed

New files:

  • `docs/concepts.md` — primitives (Task/Agent/Environment/Verifier/Trial), full trial-lifecycle ASCII diagram, Scenes/Roles/Turns, User abstraction summary, multi-turn vs multi-round vs multi-scene table.
  • `docs/sandbox-hardening.md` — threat model, 10-step hardening sequence, `[verifier.hardening]` opt-outs, labs/ as empirical validation.

Rewritten:

  • `docs/progressive-disclosure.md` — full SWE-bench Pro case study (built for Josh @ GitHub), lifecycle integration diagram, soft-verify vs full-verify split, Harbor #1316 parity comparison, expanded API reference.
  • `README.md` — audience routing (eval researchers / task authors / agent builders / Harbor migrators), documentation index, featured progressive-disclosure callout, Research artifacts section linking labs/. PyPI badge 0.3.0a3 → 0.3.2.
  • `CLAUDE.md` — trimmed to essential conventions only.

Renamed (content unchanged unless noted):

  • `docs/quickstart.md` → `docs/getting-started.md` (rewritten lighter — links forward to other docs)
  • `docs/skill-eval-guide.md` → `docs/skill-eval.md`
  • `docs/cli-reference.md` → `docs/reference/cli.md`
  • `docs/api-reference.md` → `docs/reference/python-api.md`
  • `docs/notebooks/` → `docs/examples/`

Unchanged:

  • `docs/task-authoring.md`, `docs/use-cases.md` (already solid; will tighten in a follow-up if needed)
  • `labs/` stays at repo root with relative imports intact; only docs/sandbox-hardening.md and README link to it

Validation

  • `ruff check .` clean
  • All internal cross-references updated (no broken links to old paths)
  • `README` structure validated against the audience-routing principle: each role has a one-line entry point

Out of scope (follow-ups)

  • Tightening `use-cases.md` (currently 467 lines)
  • Splitting `task-toml.md` into a separate reference page when there's enough material
  • Architecture deep-dive in `docs/explanation/` if the audience asks for it

Test plan

  • CI runs against the merge commit
  • Eyeball-pass of README + concepts + progressive-disclosure for accuracy
  • Devin/human review


…sandbox hardening

Wholesale restructure now that v0.3.2 is shipped. Goals:
- audience routing from README so eval researchers / task authors / agent
  builders / Harbor migrators land on the right page
- pin the mental model in one place (Trial / Scene / Role / Verifier +
  full lifecycle diagram) instead of scattering it across api-reference
  and use-cases
- promote the SWE-bench Pro / Josh @ GitHub progressive-disclosure case
  study to a first-class section in progressive-disclosure.md, with
  Harbor #1316 parity comparison and the soft-verify vs full-verify split
- give sandbox hardening its own page and route to labs/ for the
  empirical validation

## Structure

  README.md                          — audience routing, doc map, featured + research
  CLAUDE.md                          — minimal: setup, conventions, release shape
  docs/getting-started.md            — was quickstart.md, slim, links forward
  docs/concepts.md                   — NEW: primitives, lifecycle, multi-turn vs round vs scene
  docs/progressive-disclosure.md     — REWRITE: Josh case study, lifecycle integration,
                                        soft- vs full-verify, Harbor #1316 parity table
  docs/sandbox-hardening.md          — NEW: threat model, hardening sequence, labs index
  docs/task-authoring.md             — unchanged (already solid)
  docs/use-cases.md                  — unchanged (multi-agent patterns)
  docs/skill-eval.md                 — was skill-eval-guide.md (rename only)
  docs/examples/                     — was docs/notebooks/ (rename only)
  docs/reference/cli.md              — was docs/cli-reference.md
  docs/reference/python-api.md       — was docs/api-reference.md

labs/ stays at repo root — runnable research code with relative imports;
referenced from docs/sandbox-hardening.md and README "Research artifacts."

## What's new content-wise

- docs/concepts.md (new): the five primitives (Task / Agent / Environment /
  Verifier / Trial), trial lifecycle ASCII diagram, Scenes/Roles/Turns,
  User abstraction summary, multi-turn vs multi-round vs multi-scene table.
- docs/progressive-disclosure.md (rewrite): full SWE-bench Pro case study
  with the 2026-04-24 Daytona validation table, lifecycle integration
  diagram showing where _run_user_loop plugs in, soft-verify vs full-verify
  comparison, Harbor #1316 parity discussion, expanded API reference.
- docs/sandbox-hardening.md (new): the BenchJack/Meerkat threat context,
  the 10-step hardening sequence, the per-task [verifier.hardening]
  opt-out semantics, and labs/ as empirical validation.
- README.md: PyPI badge 0.3.0a3 → 0.3.2, audience routing table,
  documentation index, featured progressive-disclosure callout, research
  artifacts section linking labs.
- CLAUDE.md: trimmed to the essential conventions (test discipline,
  human review, trunk-based, release shape).

Cross-references updated throughout (no broken links to old paths).

xdotli added 3 commits April 25, 2026 05:42
…re data

- Removed all "Josh @ GitHub/Microsoft" / "Josh's" references from docs,
  README, example script, and notebook. Reframed as "the SWE-bench Pro
  progressive-disclosure use case" with Harbor #1316 as the cited PR.

- Ran progressive disclosure (3 rounds, Gemini 3.1 Pro Preview, Daytona)
  on all 5 oracle-passing SWE-bench Pro tasks. Results aggregated to
  experiments/swebench-pro-progressive-results.json and rendered into
  the notebook + docs:
    ansible      error: stdout closed at 17min
    flipt        0.0 (195 tools, 3 rounds)
    openlibrary  1.0 (82 tools, 3 rounds — soft 0.0 each, final 1.0)
    navidrome    0.0 (145 tools, 3 rounds)
    qutebrowser  error: agent timeout at 50min

  Honest take in the docs: infrastructure works, two infra failures
  unrelated to disclosure, no measurable lift on flipt with this model
  on this run. Single-model run, not a paper comparison.

- Fixed AttributeError in swebench_pro_user_dogfood.py (RunResult has
  no trial_dir attribute) — script was crashing post-trial.
…h docs

Fixes the two infra failure modes surfaced by the 5-task progressive
disclosure run on Daytona:

1. ansible: 'Process closed stdout (rc=None)' after 17min with 0 tool calls
2. qutebrowser: 'Agent timed out after 3000s' with 0 tool calls

Both came from agents hanging (no output, no progress) while the local
subprocess wrapper was still alive — the existing error messages didn't
make the failure mode actionable.

## Changes

**src/benchflow/process.py**: when stdout returns EOF, distinguish
'local subprocess still alive but transport closed' (rc=None — Daytona
idle sleep, SSH drop, agent hung) from 'local subprocess actually
exited' (rc set). Surface the distinction in the error message.
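The distinction can be sketched as follows — a minimal illustration of the logic described above, not benchflow's actual code (the function name is made up):

```python
def diagnose_stdout_eof(returncode):
    """Classify a stdout EOF by polling the local wrapper's returncode."""
    if returncode is None:
        # Wrapper still alive: the transport dropped under it (Daytona idle
        # sleep, SSH drop) or the remote agent hung without exiting.
        return ("Process closed stdout but is still running (rc=None): "
                "transport closed or agent hung")
    # Wrapper actually exited; rc carries the real exit status.
    return f"Process exited with rc={returncode} and closed stdout"
```

Either way the caller gets an actionable message instead of a bare EOF.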

**src/benchflow/_acp_run.py**: new `idle_timeout` parameter on
`execute_prompts()` and a `_prompt_with_idle_watchdog()` helper. Polls
`session.tool_calls` every few seconds and aborts the prompt if no new
tool call arrives for `idle_timeout` seconds. Catches the qutebrowser-
style hang where the agent connects, never produces a tool call, and
chews through the full agent timeout (50min in our case).
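The watchdog shape might look roughly like this — an illustrative sketch under assumed names (`prompt_with_idle_watchdog` and `activity_count` approximate the helpers described above, not the exact API):

```python
import asyncio

async def prompt_with_idle_watchdog(prompt_coro, activity_count,
                                    idle_timeout, poll_interval=0.5):
    """Run the prompt; abort if activity_count() stops growing for idle_timeout s."""
    loop = asyncio.get_running_loop()
    task = asyncio.ensure_future(prompt_coro)
    last_count = activity_count()
    last_activity = loop.time()
    try:
        while not task.done():
            await asyncio.sleep(poll_interval)
            if task.done():
                break  # completed during the sleep; skip the timeout checks
            count = activity_count()
            if count != last_count:
                last_count, last_activity = count, loop.time()
            elif loop.time() - last_activity >= idle_timeout:
                raise TimeoutError(
                    f"Agent idle for {idle_timeout}s with no new activity")
        return await task
    finally:
        # Runs on the timeout branch AND on external cancellation: cancel the
        # prompt and drain it so asyncio never warns about an unretrieved
        # exception.
        if not task.done():
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass
```

A hung agent that never produces a tool call is now aborted after `idle_timeout` seconds instead of burning the full wall-clock budget.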

**src/benchflow/trial.py**: new `TrialConfig.agent_idle_timeout` field
(default 600s = 10min), wired through to `execute_prompts()`. Tasks /
callers can override with `None` to disable, or with a higher number
for tasks that legitimately spend long stretches in agent thinking.
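As a shape sketch (the real `TrialConfig` has many more fields; only the field described above is shown, with assumed types):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrialConfig:
    # Only the field discussed above; the real class carries much more.
    agent_idle_timeout: Optional[float] = 600.0  # seconds; None disables

# Per-caller overrides, per the description above:
patient = TrialConfig(agent_idle_timeout=1800)   # long agent-thinking stretches
disabled = TrialConfig(agent_idle_timeout=None)  # watchdog off entirely
```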

**docs/getting-started.md**: OAuth / subscription auth section. Lists
the three agents that pick up host CLI logins (claude-agent-acp,
codex-acp, gemini) and the detect files for each. 'No API key needed
if you ran `claude login`'.

## Validation

- ruff clean
- 88 tests pass (test_user, test_process, test_sandbox_hardening)
- Will re-run ansible + qutebrowser progressive disclosure to confirm
  the new idle timeout aborts them cleanly with a clear error message
…p-token)

Per Anthropic Claude Code authentication docs, the third auth path users
have is generating a 1-year OAuth token with 'claude setup-token' and
setting CLAUDE_CODE_OAUTH_TOKEN. This is the right option for CI /
headless / sandbox environments where browser login isn't available.

Reorganized the auth section into three numbered options:
1. Host CLI login (subscription_auth, file detection)
2. Long-lived CLAUDE_CODE_OAUTH_TOKEN env var (Claude only)
3. API key (works with every agent)

Plus a precedence note from Anthropic's auth docs.

benchflow already auto-inherits CLAUDE_CODE_OAUTH_TOKEN per
src/benchflow/_agent_env.py:63 — this is just docs catching up.
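The inheritance behaviour can be illustrated like this — hypothetical names and variable list, not `_agent_env.py`'s actual code:

```python
import os

# Illustrative allow-list; the real one lives in src/benchflow/_agent_env.py.
AUTH_VARS = ("CLAUDE_CODE_OAUTH_TOKEN", "ANTHROPIC_API_KEY")

def inherited_auth_env():
    """Forward only the allow-listed auth vars that are set on the host."""
    return {k: os.environ[k] for k in AUTH_VARS if k in os.environ}
```

Unset variables are simply omitted, so the agent environment never sees empty-string credentials.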

After the 'agent_idle_timeout + EOF diagnostics' fix, re-ran ansible
and qutebrowser (the two that flaked on first attempt). Both completed
3 rounds and reached final reward 1.0.

Final 5-task results (Gemini 3.1 Pro, Daytona, 3 rounds each):

  Task         Final  Tools  Soft-verify (rounds 1-3)  Notes
  ansible      1.0    126    0.0 / 0.0 / 0.0           passed on retry (1st: stdout EOF)
  flipt        0.0    195    0.0 / 0.0 / 0.0           hard fail
  openlibrary  1.0    82     0.0 / 0.0 / 0.0           baseline already passed
  navidrome    0.0    145    0.0 / 0.0 / 0.0           hard fail
  qutebrowser  1.0    183    0.0 / 0.0 / 0.0           passed on retry (1st: 50min timeout)

3/5 final pass. flipt and navidrome stayed at 0.0 across all rounds —
Gemini 3.1 Pro doesn't crack them with this hint schedule.

Updated:
- experiments/swebench-pro-progressive-results.json
- examples/swebench_pro_progressive_disclosure.ipynb (re-executed with new data)
- docs/progressive-disclosure.md validation table + commentary

Three findings, all real:

1. docs/reference/python-api.md:230 — relative link broken after the
   docs/api-reference.md → docs/reference/python-api.md move. Fix: use
   ../examples/ instead of docs/examples/.

2. _acp_run.py _prompt_with_idle_watchdog — race condition: after
   `await asyncio.sleep(poll_interval)`, prompt_task could have
   completed during the sleep. Without re-checking `done()` before the
   timeout evaluations, we'd cancel a completed task and silently
   discard a successful result. Added a `done()` re-check that breaks
   out of the loop.

3. trial.py:674 — the TimeoutError handler was overwriting the idle
   watchdog's detailed message ("Agent idle for 600s with no new tool
   call (last activity 642s ago, 0 tool calls so far)") with a generic
   "Agent timed out after {self._timeout}s" using the wall-clock
   budget, not the idle timeout. Preserve the watchdog's message when
   present; fall back to the generic message only when the exception
   has no detail.
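Finding 3's fix reduces to a small precedence rule, sketched here with a hypothetical helper name (not `trial.py`'s actual handler):

```python
def timeout_message(exc, wall_clock_budget):
    """Prefer the watchdog's detailed message; fall back to the generic one."""
    detail = str(exc)
    if detail:
        return detail  # e.g. "Agent idle for 600s with no new tool call ..."
    return f"Agent timed out after {wall_clock_budget}s"
```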

Devin caught: _prompt_with_idle_watchdog created prompt_task with
asyncio.create_task() but only cancelled it on the explicit timeout
branches. If the parent coroutine was cancelled externally
(asyncio.timeout, task.cancel(), Ctrl+C), CancelledError propagated
out of the sleep without cancelling prompt_task — leaking the agent
prompt until Trial.cleanup() eventually killed the process, plus
asyncio's "Task exception was never retrieved" warning.

Wrap the polling loop in try/finally so cancel + drain always runs,
including the implicit-cancellation path. Both timeout branches now
just `raise TimeoutError(...)` without their own cancel/drain block —
the finally handles it uniformly.
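The cancel-and-drain-in-finally pattern, shown generically (not the benchflow code itself; `asyncio.wait_for` stands in for the polling loop):

```python
import asyncio

async def supervise(coro, budget):
    """Run coro with a wall-clock budget; clean the child up on every exit path."""
    task = asyncio.ensure_future(coro)
    try:
        # shield() keeps wait_for's timeout from cancelling `task` directly,
        # so the finally below owns cleanup on every path — timeout,
        # external cancellation, or Ctrl+C alike.
        return await asyncio.wait_for(asyncio.shield(task), budget)
    finally:
        if not task.done():
            task.cancel()
            try:
                await task  # drain: avoids "Task exception was never retrieved"
            except asyncio.CancelledError:
                pass
```

Because the cleanup lives in `finally`, the timeout branch only has to raise; nothing leaks even when the supervisor itself is cancelled mid-sleep.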

xdotli added 3 commits April 25, 2026 10:43
Devin caught: the execute_prompts docstring says idle_timeout fires
when "no tool call OR message arrives", but _prompt_with_idle_watchdog
was only polling session.tool_calls. ACPSession also accumulates
message_chunks and thought_chunks via handle_update — agents actively
streaming text without producing a new tool call would be falsely
aborted.

Use a single _activity_count() that sums tool_calls + message_chunks +
thought_chunks so any of the three resets the idle timer. Updated the
TimeoutError message to mention all three categories.
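A minimal sketch of the combined counter (the session field names come from the description above; the function name is assumed):

```python
from types import SimpleNamespace

def activity_count(session):
    # Any of the three streams growing counts as activity and resets the
    # idle timer; a streaming-but-toolless agent is no longer aborted.
    return (len(session.tool_calls)
            + len(session.message_chunks)
            + len(session.thought_chunks))

# e.g. an agent that has streamed text but made no tool call yet:
session = SimpleNamespace(tool_calls=[], message_chunks=["still thinking"],
                          thought_chunks=[])
```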
Single source of truth for runnable examples + teaching notebooks.
Previous split (docs/examples/ for teaching, examples/ for scripts)
was arbitrary — both directories held the same kinds of files (.py
demo scripts and .ipynb notebooks), and the duplication confused
readers about where to look.

Now: examples/ at repo root holds everything; docs/ has no examples
subdir. labs/ stays separate at repo root for research artifacts
(separate purpose: validation + reproducible experiments).

Files moved (git mv):
  docs/examples/coder-reviewer-demo.py    → examples/
  docs/examples/scene-patterns.{ipynb,md,py} → examples/
  docs/examples/nanofirm-task/            → examples/

References updated:
  README.md — single "examples/" link
  examples/scene-patterns.md — `python docs/examples/...` → `python examples/...`
  examples/coder-reviewer-demo.py — same
  examples/scene-patterns.py — same

Devin caught: poll_interval was computed solely from idle_timeout
(min(30, max(5, idle_timeout // 4))). With the default idle_timeout=600,
poll_interval was always 30s. The wall-clock deadline was only checked
after each `await asyncio.sleep(poll_interval)`, so a task with
timeout_sec=60 could overshoot to 90s (50%); timeout_sec=30 could
overshoot to 60s (100%).

Pre-PR, execute_prompts used asyncio.wait_for() which enforced the
wall-clock timeout precisely. Adding the idle watchdog as the new
default path silently regressed timeout precision.

Fix: factor timeout into the poll interval too — `min(30, idle_timeout
// 4, max(1, timeout // 4))` floored at 1. Short total budgets now get
proportionally shorter poll intervals.
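The fixed formula, as a standalone sketch:

```python
def poll_interval(idle_timeout, timeout):
    # Factor the wall-clock budget into the poll interval too, floored at
    # 1 second, so short total budgets get proportionally shorter polls.
    return max(1, min(30, idle_timeout // 4, max(1, timeout // 4)))
```

With the default `idle_timeout=600` and `timeout=60` the poll drops from 30s to 15s, so the worst-case overshoot (at most one poll interval) shrinks from 50% of the budget to 25%.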