Highlights
- `BaseUser` progressive-disclosure abstraction (#194): Python callback drives multi-round agent runs. Built for SWE-bench Pro use case (Josh @ GitHub) and as parity answer to Harbor #1316 in the no-second-LLM case. See `docs/progressive-disclosure.md`.
- Per-task `[verifier.hardening]` opt-outs in `task.toml` (#194): tasks with legitimate `conftest.py` setups (qutebrowser-style) opt out of specific cleanup steps. Achieves 5/5 SWE-bench Pro oracle on hardened verifier.
- DinD compose ACP via Daytona PTY WebSocket (#193, #196): live agent pipes for SkillsBench / DinD compose tasks.
- `--rootdir=/app` in `PYTEST_ADDOPTS` (#194): anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.
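To illustrate the `--rootdir=/app` item above: pytest reads extra options from the `PYTEST_ADDOPTS` environment variable, and test node IDs are reported relative to the rootdir. The sketch below shows one way such an option could be appended to an environment mapping before launching pytest. The helper name `with_rootdir` is hypothetical and is not taken from the benchflow codebase.

```python
import shlex


def with_rootdir(env, rootdir="/app"):
    """Return a copy of env whose PYTEST_ADDOPTS also pins pytest's rootdir.

    Hypothetical helper for illustration; not benchflow's actual code.
    """
    new_env = dict(env)
    # Keep whatever options the caller already set in PYTEST_ADDOPTS.
    existing = shlex.split(new_env.get("PYTEST_ADDOPTS", ""))
    # Append the rootdir option so node IDs are anchored to the repo root.
    new_env["PYTEST_ADDOPTS"] = shlex.join(existing + [f"--rootdir={rootdir}"])
    return new_env


print(with_rootdir({"PYTEST_ADDOPTS": "-q"})["PYTEST_ADDOPTS"])
# -q --rootdir=/app
```

The returned mapping could then be passed as `env=` to `subprocess.run(["pytest", ...])`, so that the verifier's expected node IDs and the IDs pytest reports share the same root.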
Fixes
- `cfg.agent_env` reaches `connect_as()` (#191, closes #190): YAML-supplied provider creds now reach the agent.
- DinD env-file path mismatch (#198): `shlex.join()` was quoting `$$` literally so written/read paths diverged; switched to `uuid.uuid4()` for unique paths.
- OpenHands sandbox launch + ACP CLI path (#182).
- Stop copying root tool installs into sandbox home (#181, closes #178).
- `sandbox_setup_timeout` wired through configs (#180).
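The `shlex.join()` behavior behind the DinD env-file fix (#198) can be reproduced in isolation. The sketch below is not the benchflow code: the `/tmp/agent-env` prefix and the helper `unique_env_path` are hypothetical, chosen only to show why a `$$`-based path is fragile and how a UUID-based path avoids the problem.

```python
import shlex
import uuid

# shlex.join() single-quotes any argument containing shell metacharacters,
# so the shell never expands "$$" to its PID; the path stays literal.
quoted = shlex.join(["cat", "/tmp/agent-env.$$"])
print(quoted)  # cat '/tmp/agent-env.$$'


def unique_env_path(prefix="/tmp/agent-env"):
    """Build a unique path in Python, with no reliance on shell expansion."""
    return f"{prefix}.{uuid.uuid4().hex}"


# The same concrete string is used for writing and reading, so the two
# commands cannot diverge regardless of how each one is quoted.
path = unique_env_path()
write_cmd = shlex.join(["tee", path])
read_cmd = shlex.join(["cat", path])
```

Generating the name in Python means the path is fixed before any shell sees it, which is presumably why the quoted and unquoted code paths no longer disagree.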
Chores
SWE-bench Pro validation
- Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
- Single-round Gemini 3.1 Pro baseline: 2/4.
Install
pip install benchflow==0.3.2