v0.3.2 — BaseUser, verifier hardening, DinD compose

@xdotli xdotli released this 25 Apr 10:55
· 5 commits to main since this release
23b4de4

Highlights

  • BaseUser progressive-disclosure abstraction (#194): a Python callback drives multi-round agent runs. Built for the SWE-bench Pro use case (Josh @ GitHub) and as a parity answer to Harbor #1316 in the no-second-LLM case. See docs/progressive-disclosure.md.
  • Per-task [verifier.hardening] opt-outs in task.toml (#194): tasks with legitimate conftest.py setups (qutebrowser-style) can opt out of specific cleanup steps. With these opt-outs, the SWE-bench Pro oracle passes 5/5 on the hardened verifier.
  • DinD compose ACP via Daytona PTY WebSocket (#193, #196): live agent pipes for SkillsBench / DinD compose tasks.
  • --rootdir=/app in PYTEST_ADDOPTS (#194): anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.
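
As a rough illustration of the --rootdir change above: pytest reads extra command-line options from the PYTEST_ADDOPTS environment variable, and test node IDs are reported relative to the rootdir. The sketch below shows one way such a flag could be merged into an environment mapping. The helper name `with_rootdir` is hypothetical and is not benchflow's API; it only handles the `--rootdir=<path>` spelling.

```python
import shlex


def with_rootdir(env, rootdir="/app"):
    """Return a copy of env whose PYTEST_ADDOPTS ends with --rootdir=<rootdir>.

    Hypothetical helper for illustration only; not part of benchflow.
    """
    new_env = dict(env)
    # Split any options that are already present, respecting shell quoting.
    opts = shlex.split(new_env.get("PYTEST_ADDOPTS", ""))
    # Drop an earlier --rootdir=<path> so the flag is set exactly once.
    # (The two-token form "--rootdir <path>" is not handled in this sketch.)
    opts = [o for o in opts if not o.startswith("--rootdir=")]
    opts.append(f"--rootdir={rootdir}")
    new_env["PYTEST_ADDOPTS"] = shlex.join(opts)
    return new_env


env = with_rootdir({"PYTEST_ADDOPTS": "-q"})
print(env["PYTEST_ADDOPTS"])  # -q --rootdir=/app
```

Passing the resulting mapping as the environment of the pytest process would anchor node IDs at /app, assuming the repository is checked out there.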

Fixes

  • cfg.agent_env reaches connect_as() (#191, closes #190): YAML-supplied provider creds now reach the agent.
  • DinD env-file path mismatch (#198): shlex.join() quoted $$ literally, so the written and read paths diverged; paths are now made unique with uuid.uuid4() instead.
  • OpenHands sandbox launch + ACP CLI path (#182).
  • Stop copying root tool installs into sandbox home (#181, closes #178).
  • sandbox_setup_timeout wired through configs (#180).
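
The shlex.join() behaviour behind the env-file fix above can be reproduced in a few lines. shlex.join() shell-quotes every argument, so a `$$` that was meant to expand to the shell's PID reaches the shell as a literal string. A plausible reading of the bug is that one side of the write/read pair saw the literal path while the other saw the expanded one. The paths and the `unique_env_file` helper below are hypothetical, chosen only to illustrate the uuid-based alternative.

```python
import shlex
import uuid

# "$" is not a shell-safe character, so the whole argument gets single-quoted
# and the shell will NOT expand $$ to its PID.
quoted = shlex.join(["cat", "/tmp/env.$$"])
print(quoted)  # cat '/tmp/env.$$'


def unique_env_file(prefix="/tmp/env"):
    """Build a unique path in Python, so no shell expansion is needed.

    Hypothetical helper for illustration only; not part of benchflow.
    """
    return f"{prefix}-{uuid.uuid4().hex}"


path = unique_env_file()
# The hex suffix contains only shell-safe characters, so quoting leaves the
# path unchanged and writer and reader agree on it.
print(shlex.join(["cat", path]))
```

Generating the name once in Python and passing the same string to both the writer and the reader removes the dependence on shell expansion entirely.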

Chores

  • Repo-wide ruff lint debt cleanup (#197): 126 errors → 0.
  • Docs: uv tool install (#176).

SWE-bench Pro validation

Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4.

Install

pip install benchflow==0.3.2