Releases: benchflow-ai/benchflow
v0.3.2 — BaseUser, verifier hardening, DinD compose
Highlights
- `BaseUser` progressive-disclosure abstraction (#194) — a Python callback drives multi-round agent runs. Built for the SWE-bench Pro use case (Josh @ GitHub) and as a parity answer to Harbor #1316 in the no-second-LLM case. See `docs/progressive-disclosure.md`.
- Per-task `[verifier.hardening]` opt-outs in `task.toml` (#194) — tasks with legitimate `conftest.py` setups (qutebrowser-style) can opt out of specific cleanup steps. Achieves 5/5 on the SWE-bench Pro oracle with the hardened verifier.
- DinD compose ACP via Daytona PTY WebSocket (#193, #196) — live agent pipes for SkillsBench / DinD compose tasks.
- `--rootdir=/app` in `PYTEST_ADDOPTS` (#194) — anchors test node IDs to the repo root; the openlibrary oracle goes 0/18 → 18/18.
Fixes
- `cfg.agent_env` reaches `connect_as()` (#191, closes #190) — YAML-supplied provider creds now reach the agent.
- DinD env-file path mismatch (#198) — `shlex.join()` was quoting `$$` literally, so written and read paths diverged; switched to `uuid.uuid4()` for unique paths.
- OpenHands sandbox launch + ACP CLI path (#182).
- Stop copying root tool installs into the sandbox home (#181, closes #178).
- `sandbox_setup_timeout` wired through configs (#180).
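The `$$` pitfall in #198 is easy to reproduce: `shlex.join()` correctly quotes `$$`, so the shell never expands it into a PID, which is exactly why a path built around `$$` diverges from the one read back. A small sketch of the failure mode and the `uuid` fix (the `/tmp/env.*` path is illustrative):

```python
import shlex
import uuid

# shlex.join quotes the dollar signs, so the shell sees a literal "$$"
# instead of expanding it to the current PID.
cmd = shlex.join(["cat", "/tmp/env.$$"])
print(cmd)  # cat '/tmp/env.$$'

# The approach in #198: derive uniqueness in Python, not in the shell,
# so there is no metacharacter left for quoting to interfere with.
path = f"/tmp/env.{uuid.uuid4().hex}"
cmd = shlex.join(["cat", path])
```

Generating the unique component in Python means the writer and the reader both see the exact same string.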
Chores
SWE-bench Pro validation
Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4.
Install
pip install benchflow==0.3.2
v0.2.3 — verifier hardening follow-ups
Added
- `benchmarks/tb2_multiturn-claude-haiku45.yaml` — shipped config for the README's TB2 multi-turn Claude result.
- Daytona resource clamping via `BENCHFLOW_DAYTONA_MAX_CPUS` / `MAX_MEMORY_MB`.
Changed
- Renamed `skillsbench-claude-glm5.yaml` → `skillsbench-claude-glm51.yaml` to match the model ID.
- `codex --login` correction in `docs/getting-started.md`.
- Restricted the sdist build to `src/`, `tests/`, and metadata.
Fixed
- Verifier sandbox hardening follow-ups across several base-image and tooling edge cases.
- Preserve trusted verifier path entries and workspace answer files.
- Redirect oracle output to container log.
- Align YAML path resolution to config file location.
v0.2.2 — BenchJack defenses: 32.6% → 0.15% exploit rate on 666 tasks
Added
- Sandbox hardening tiers 1–4 — layered defense blocking the F1–F6 red-team findings: env scrubbing, path lockdown, workspace freeze, wider build-config snapshot, oracle privilege drop, and a filesystem scrub that deletes injected `conftest.py` / `.pth` / `sitecustomize.py` files outside `/tests`, plus `*.py` drops in `/tmp` and `/var/tmp`, before every verifier run.
- `labs/reward-hack-matrix` — end-to-end sweep of BenchJack-shaped exploits against 666 real tasks across skillsbench, swebench-verified, and terminal-bench-2 (1332 trials). Per-cell results in `sweep_0.2.0_vs_0.2.2.json`. Includes a per-trial timeout and a long-lived worker pool that keeps local RAM at ~1 GB regardless of trial count.
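The filesystem-scrub tier above can be sketched in a few lines. This is a simplified illustration under assumptions (function name, protected-path handling, and scope are mine), not the shipped tiered implementation:

```python
from pathlib import Path

# Hook files an agent could plant to hijack the verifier.
SCRUB_NAMES = {"conftest.py", "sitecustomize.py"}


def scrub(root: Path, protected: Path) -> list[Path]:
    """Delete injected hook files under `root`, sparing everything
    inside the `protected` subtree (e.g. a task's legitimate /tests)."""
    removed = []
    for path in root.rglob("*"):
        if path == protected or protected in path.parents:
            continue  # legitimate test fixtures stay untouched
        if path.is_file() and (path.name in SCRUB_NAMES or path.suffix == ".pth"):
            path.unlink()
            removed.append(path)
    return removed
```

Running this before every verifier invocation means a planted `conftest.py` outside the protected tree never gets a chance to execute under pytest.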
Fixed
- Multiple sandbox bypass vectors identified in red-team testing (F1–F6).
- `PYTHONHOME=""` crashing `Py_Initialize` — an empty value is NOT equivalent to unset; dropped from `VERIFIER_ENV`.
- `PYTHONSAFEPATH=1` breaking matplotlib `setupext.py` imports — dropped from `VERIFIER_ENV`.
- `pytest_plugins` AttributeError during hardened verify — guarded with `getattr(...)`.
- matplotlib LFS `EOVERFLOW` on qhull build artifacts — replaced the `rmtree + copytree` fallback with a `shutil.copytree(dirs_exist_ok=True)` merge-copy in `harden_before_verify`, so inert overlay stragglers no longer block workspace restore.
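The `PYTHONHOME` lesson generalizes: to neutralize an environment variable for a child process, remove the key rather than setting it to `""`, because many consumers distinguish "unset" from "set but empty". A sketch of the safe pattern (the variable list and function name are illustrative, not the real `VERIFIER_ENV` contents):

```python
# Variables we never want to influence the verifier subprocess.
# Popping the key is the real "unset"; VAR="" is still *set* and can
# crash consumers (e.g. an empty PYTHONHOME broke Py_Initialize).
BLOCKED = ("PYTHONHOME", "PYTHONSTARTUP", "PYTHONPATH")


def verifier_env(base: dict[str, str]) -> dict[str, str]:
    env = dict(base)  # never mutate the caller's mapping
    for name in BLOCKED:
        env.pop(name, None)  # remove the key; do NOT assign ""
    return env
```

Passing the result to `subprocess.run(..., env=verifier_env(os.environ.copy()))` gives the child a clean slate for those variables.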
Results
BenchJack-shaped exploit success rate on 666 real tasks:
| benchmark | tasks | 0.2.0 exploits | 0.2.2 exploits | Δ |
|---|---|---|---|---|
| skillsbench | 77 | 16 (20.8%) | 0 (0%) | −20.8 pp |
| swebench-verified | 500 | 119 (23.8%) | 1 (0.2%)¹ | −23.6 pp |
| terminal-bench-2 | 89 | 82 (92.1%) | 0 (0%) | −92.1 pp |
| total | 666 | 217 (32.6%) | 1 (0.15%) | −32.4 pp |
¹ swebench-verified/django__django-7530 scores reward = 1.0 on BOTH versions because its FAIL_TO_PASS test passes at baseline without any patch — a SWE-bench task-definition quirk, not a 0.2.2 bypass. True bypass count on 0.2.2: 0.
Install: pip install benchflow==0.2.2
Full audit: labs/reward-hack-matrix/ — Hardening design: .dev-docs/harden-sandbox.md
v0.2.1 — Sandbox hardening on by default
Added
- Sandbox hardening on by default — `sandbox_user` now defaults to `"agent"` (was `None`/root). Blocks conftest-hook and answer-lookup exploit patterns.
- Path lockdown — new `sandbox_locked_paths` parameter makes `/solution` and `/tests` read-only before the verifier runs, blocking `.pth`-injection and similar pre-verify tampering.
- Verifier failure isolation — agent errors and verifier errors are now stored separately; a crashing verifier no longer masks the agent result.
- `labs/benchjack-sandbox-hardening` — cookbook demonstrating three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 `.pth`-injection) and their defenses.
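A minimal sketch of what a read-only lockdown like `sandbox_locked_paths` could do with plain permission bits (illustrative only: the shipped mechanism may differ, and `chmod` alone does not stop a root user, which is why the oracle also drops to `sandbox_user`):

```python
import os
import stat
from pathlib import Path


def lock_read_only(root: Path) -> None:
    """Strip all write bits from a tree so a non-root agent
    cannot tamper with it before the verifier runs."""
    for path in [root, *root.rglob("*")]:
        mode = path.stat().st_mode
        no_write = mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
        os.chmod(path, no_write)
```

Locking the directory itself (not just the files) also blocks creating new files, which is what defeats dropping a fresh `.pth` into the tree.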
Fixed
- Oracle runs as `sandbox_user` — the oracle agent now respects path lockdown instead of running as root and bypassing it.
- Multi-endpoint provider routing — providers with multiple endpoints now route by the agent's native API protocol.
- Stale API key shadowing subscription auth — emits a warning when the `ANTHROPIC_API_KEY` env var is present alongside `claude login` credentials.
- pytest ini-injection bypass — closed a verifier hardening edge case.
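The shadowing check is simple to reason about; a hedged sketch of the idea (function name and credentials path are assumptions, not BenchFlow's actual code):

```python
import os
import warnings
from pathlib import Path


def warn_on_shadowed_auth(creds_path: Path) -> None:
    # If subscription credentials exist but an API key is also set,
    # the key may silently take precedence, so surface it loudly.
    if os.environ.get("ANTHROPIC_API_KEY") and creds_path.exists():
        warnings.warn(
            "ANTHROPIC_API_KEY is set and will shadow your `claude login` "
            "subscription credentials; unset it to use subscription auth."
        )
```

A warning (rather than a hard error) keeps runs working for users who genuinely intend to use the API key.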
Changed
- Version is now single-sourced via `importlib.metadata`; no more duplicate version string in `__init__.py`.
- User-facing docs — new `docs/` directory with getting-started guide, CLI reference, architecture overview, task-authoring guide, and labs index. README trimmed; detailed content moved to `docs/`.
Install: pip install benchflow==0.2.1
v0.2.0 — First Public Release
First public release 🎉
Major rearchitecture from the 0.1.x era. API surface has changed — assume breaking changes. Future releases will maintain compatibility within the 0.2.x line.
Install
pip install benchflow
Highlights
- Multi-agent, multi-provider, multi-auth matrix — 12 end-to-end tested agent × model × provider × auth combinations (`docs/tested-agents.md`)
- Subscription auth — use `claude login`, `codex --login`, `gemini` OAuth directly, no API keys required
- Vertex AI support — ADC auth for `google-vertex/`, `anthropic-vertex/`, `vertex-zai/` prefixed models
- Provider registry — data-driven custom LLM endpoints via `ProviderConfig`. Adding a new provider = a single registry entry
- vLLM support — `benchflow run -a pi-acp --model vllm/Qwen3-80B --ae BENCHFLOW_PROVIDER_BASE_URL=...`
- SDK refactor — `SDK.run()` decomposed into focused private methods; core modules extracted (`_models.py`, `_trajectory.py`, `_env_setup.py`, `_scoring.py`)
- Harbor switched to PyPI — `harbor==0.3.0` pin, no more git URL dependency
- `benchmarks/` directory with reusable YAML configs and runner scripts for TB2 and SkillsBench
- `benchflow tasks init` / `tasks check` commands for scaffolding and validating new tasks
- Oracle agent support — run `solution/solve.sh` directly for task validation
- 232 unit tests (up from 66 in 0.1.x)
Benchmark Results
| Benchmark | Agent | Model | Score |
|---|---|---|---|
| TB2 single-turn | codex-acp | GPT-5.4 | 69.7% (62/89) |
| TB2 single-turn | claude-agent-acp | Sonnet 4.6 | 58.4% (52/89) |
| TB2 multi-turn | codex-acp | GPT-5.4 | 62.9% (56/89) |
| TB2 multi-turn | claude-agent-acp | Haiku 4.5 | 37.1% (33/89) |
| SkillsBench | codex-acp | GPT-5.4 | 37.2% (32/86) |
Notable finding: Multi-turn self-critique hurts capable models (GPT-5.4 regresses −6.8pp from single-turn) but helps weaker models (Haiku 4.5 gains +9.6pp).
Security fixes
- API keys no longer leak in `ps aux`
- ADC credentials fixed for `sandbox_user` setups (#111)
- Daytona sandbox orphan cleanup with `--max-age` filter (#102)
- `litellm` upgraded to 1.83.0 for CVE-2026-35030
- `cryptography` upgraded to 46.0.7 for CVE-2026-39892
- 13 transitive Dependabot alerts resolved
Contributors
- @kywch (Kyoung Whan Choe) — core refactor, benchmark runs, agent test scripts
- @xdotli (Xiangyi Li) — SDK, providers, Vertex AI, OpenClaw shim, subscription auth
Full changelog
https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md
Migration from 0.1.x
0.1.x users should treat this as a fresh install. The SDK API, CLI, registry pattern, and task format have all changed. There is no automatic migration path. See docs/sdk-reference.md and examples/ for starting points.
v0.1.12
What's Changed
- add crag benchmark by @danielfang001 in #12
- fix: remove binary code by @kk-xuhj in #13
- Fix/crag by @xdotli in #14
- Feat/hot load by @xdotli in #16
- Docs/readme by @kk-xuhj in #17
- docs: udpate readme by @kk-xuhj in #18
- Fix/demo agents by @kk-xuhj in #19
New Contributors
- @danielfang001 made their first contribution in #12
Full Changelog: v0.1.8...v0.1.12
v0.1.8
What's Changed
- Feat/v0.1.5 by @kk-xuhj in #1
- Feat/new_interface_for_benchmark by @kk-xuhj in #2
- add swebench by @kk-xuhj in #3
- Readme suggestion by @tom-doerr in #4
- Feat/bff integrate by @kk-xuhj in #6
- Docs/better docs by @kk-xuhj in #7
- feat: add MMLU-PRO by @kk-xuhj in #9
- Feat/type_checking by @kk-xuhj in #10
New Contributors
- @kk-xuhj made their first contribution in #1
- @tom-doerr made their first contribution in #4
Full Changelog: https://github.com/benchflow-ai/benchflow/commits/v0.1.8