Releases: benchflow-ai/benchflow

v0.3.2 — BaseUser, verifier hardening, DinD compose

25 Apr 10:55
23b4de4

Highlights

  • BaseUser progressive-disclosure abstraction (#194): a Python callback drives multi-round agent runs. Built for the SWE-bench Pro use case (Josh @ GitHub) and as a parity answer to Harbor #1316 in the no-second-LLM case. See docs/progressive-disclosure.md.
  • Per-task [verifier.hardening] opt-outs in task.toml (#194): tasks with legitimate conftest.py setups (qutebrowser-style) opt out of specific cleanup steps. Achieves 5/5 SWE-bench Pro oracle on hardened verifier.
  • DinD compose ACP via Daytona PTY WebSocket (#193, #196): live agent pipes for SkillsBench / DinD compose tasks.
  • --rootdir=/app in PYTEST_ADDOPTS (#194): anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.
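The per-task opt-out shape might look like the sketch below. The table and key names here are hypothetical, chosen only to illustrate the idea of a task keeping its legitimate conftest.py while the rest of the hardened verifier stays on; see #194 for the real schema.

```toml
# task.toml — hypothetical key names, for illustration only
[verifier.hardening]
# This task ships a legitimate conftest.py (qutebrowser-style setup),
# so skip the conftest cleanup step; all other hardening stays enabled.
remove_conftest = false
```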

Fixes

  • cfg.agent_env reaches connect_as() (#191, closes #190): YAML-supplied provider creds now reach the agent.
  • DinD env-file path mismatch (#198): shlex.join() was quoting $$ literally so written/read paths diverged; switched to uuid.uuid4() for unique paths.
  • OpenHands sandbox launch + ACP CLI path (#182).
  • Stop copying root tool installs into sandbox home (#181, closes #178).
  • sandbox_setup_timeout wired through configs (#180).
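The #198 fix comes down to how shlex.join() treats "$": it quotes any token containing it, so a "$$" that one side of the pipeline expected the shell to expand to a PID survives literally, and the written and read paths diverge. A minimal sketch of the symptom and the UUID workaround (paths are illustrative):

```python
import shlex
import uuid

# shlex.join() quotes tokens containing "$", so "$$" stays literal here
# instead of being expanded by the shell to the current PID:
cmd = shlex.join(["cat", "/tmp/env.$$"])
print(cmd)  # cat '/tmp/env.$$'

# An unquoted "$$" elsewhere in the pipeline *does* expand, so the path
# written and the path read no longer match. A UUID-based path sidesteps
# shell expansion entirely:
env_file = f"/tmp/env.{uuid.uuid4().hex}"
```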

Chores

  • Repo-wide ruff lint debt cleanup (#197): 126 errors → 0.
  • Docs: uv tool install (#176).

SWE-bench Pro validation

Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4.

Install

pip install benchflow==0.3.2

v0.2.3 — verifier hardening follow-ups

16 Apr 00:14
0e5c9e9

Added

  • benchmarks/tb2_multiturn-claude-haiku45.yaml — shipped config for the README's TB2 multi-turn Claude result.
  • Daytona resource clamping via BENCHFLOW_DAYTONA_MAX_CPUS / MAX_MEMORY_MB.

Changed

  • Renamed skillsbench-claude-glm5.yaml → skillsbench-claude-glm51.yaml to match the model ID.
  • codex --login correction in docs/getting-started.md.
  • Restricted sdist build to src/, tests/, and metadata.

Fixed

  • Verifier sandbox hardening follow-ups across several base-image and tooling edge cases.
  • Preserve trusted verifier path entries and workspace answer files.
  • Redirect oracle output to container log.
  • Align YAML path resolution to config file location.

v0.2.2 — BenchJack defenses: 32.6% → 0.15% exploit rate on 666 tasks

14 Apr 18:49
540b011

Added

  • Sandbox hardening tiers 1–4 — layered defense blocking F1–F6 red-team findings: env scrubbing, path lockdown, workspace freeze, wider build-config snapshot, oracle privilege drop, and a filesystem scrub that deletes injected conftest.py / .pth / sitecustomize.py files outside /tests plus *.py drops in /tmp and /var/tmp before every verifier run.
  • labs/reward-hack-matrix — end-to-end sweep of BenchJack-shaped exploits against 666 real tasks across skillsbench, swebench-verified, and terminal-bench-2 (1332 trials). Per-cell results in sweep_0.2.0_vs_0.2.2.json. Includes per-trial timeout and a long-lived worker pool that keeps local RAM ~1 GB regardless of trial count.
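The filesystem-scrub idea can be sketched in a few lines. This is a hypothetical illustration of the technique, not benchflow's actual code: walk the tree, delete known pytest/interpreter hook files outside the protected tests directory, and leave everything else alone.

```python
import os

# Hypothetical sketch of the pre-verify scrub described above. The names
# and scope are illustrative; benchflow also scrubs *.py drops in /tmp
# and /var/tmp.
INJECTED_NAMES = {"conftest.py", "sitecustomize.py"}

def scrub(root: str, keep_prefix: str = "/tests") -> list[str]:
    """Delete injected hook files under root, sparing keep_prefix."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if dirpath.startswith(keep_prefix):
            continue  # legitimate test fixtures live here
        for name in filenames:
            if name in INJECTED_NAMES or name.endswith(".pth"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```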

Fixed

  • Multiple sandbox bypass vectors identified in red-team testing (F1–F6).
  • PYTHONHOME="" crashing Py_Initialize — empty value is NOT equivalent to unset; dropped from VERIFIER_ENV.
  • PYTHONSAFEPATH=1 breaking matplotlib setupext.py imports — dropped from VERIFIER_ENV.
  • pytest_plugins AttributeError during hardened verify — guarded with getattr(...).
  • matplotlib LFS EOVERFLOW on qhull build artifacts — replaced rmtree + copytree fallback with shutil.copytree(dirs_exist_ok=True) merge-copy in harden_before_verify so inert overlay stragglers no longer block workspace restore.
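The merge-copy fix relies on standard shutil behavior: with dirs_exist_ok=True (Python 3.8+), shutil.copytree copies into an existing destination instead of raising FileExistsError, overwriting files it can and leaving unrelated stragglers in place. A minimal sketch under assumed names:

```python
import shutil
from pathlib import Path

# Sketch of the restore step: merge the clean snapshot over the workspace.
# A leftover file in the workspace that rmtree could not remove no longer
# aborts the restore. Function name is illustrative.
def restore_workspace(snapshot: Path, workspace: Path) -> None:
    shutil.copytree(snapshot, workspace, dirs_exist_ok=True)
```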

Results

BenchJack-shaped exploit success rate on 666 real tasks:

benchmark           tasks   0.2.0 exploits   0.2.2 exploits   Δ
skillsbench         77      16 (20.8%)       0 (0%)           −20.8 pp
swebench-verified   500     119 (23.8%)      1 (0.2%)¹        −23.6 pp
terminal-bench-2    89      82 (92.1%)       0 (0%)           −92.1 pp
total               666     217 (32.6%)      1 (0.15%)        −32.4 pp

¹ swebench-verified/django__django-7530 scores reward = 1.0 on BOTH versions because its FAIL_TO_PASS test passes at baseline without any patch — a SWE-bench task-definition quirk, not a 0.2.2 bypass. True bypass count on 0.2.2: 0.

Install: pip install benchflow==0.2.2

Full audit: labs/reward-hack-matrix/ — Hardening design: .dev-docs/harden-sandbox.md

v0.2.1 — Sandbox hardening on by default

14 Apr 18:49
27b5139

Added

  • Sandbox hardening on by default — sandbox_user now defaults to "agent" (was None/root). Blocks conftest-hook and answer-lookup exploit patterns.
  • Path lockdown — new sandbox_locked_paths parameter makes /solution and /tests read-only before the verifier runs, blocking .pth-injection and similar pre-verify tampering.
  • Verifier failure isolation — agent errors and verifier errors are now stored separately; a crashing verifier no longer masks the agent result.
  • labs/benchjack-sandbox-hardening — cookbook demonstrating three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 .pth-injection) and their defenses.
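The path-lockdown idea amounts to stripping write bits from a tree before the verifier runs. The sketch below is a hypothetical illustration of the mechanism, not benchflow's sandbox_locked_paths implementation, and it assumes the verifier runs as a non-root user (root ignores mode bits):

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def lock_path(root: str) -> None:
    """Make every file and directory under root read-only (sketch)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            p = os.path.join(dirpath, name)
            os.chmod(p, os.stat(p).st_mode & ~WRITE_BITS)
    # lock the root itself last, so the walk above could still descend
    os.chmod(root, os.stat(root).st_mode & ~WRITE_BITS)
```

Read and execute bits are kept, so pytest can still traverse and collect from the locked tree; it just cannot drop a .pth or conftest.py into it.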

Fixed

  • Oracle runs as sandbox_user — oracle agent now respects path lockdown instead of running as root and bypassing it.
  • Multi-endpoint provider routing — providers with multiple endpoints now route by the agent's native API protocol.
  • Stale API key shadowing subscription auth — emits a warning when ANTHROPIC_API_KEY env var is present alongside claude login credentials.
  • pytest ini-injection bypass — closed a verifier hardening edge case.

Changed

  • Version is now single-sourced via importlib.metadata; no more duplicate version string in __init__.py.
  • User-facing docs — new docs/ directory with getting-started guide, CLI reference, architecture overview, task-authoring guide, and labs index. README trimmed; detailed content moved to docs/.

Install: pip install benchflow==0.2.1

v0.2.0 — First Public Release

09 Apr 22:27
01ee396

First public release 🎉

Major rearchitecture from the 0.1.x era. API surface has changed — assume breaking changes. Future releases will maintain compatibility within the 0.2.x line.

Install

pip install benchflow

Highlights

  • Multi-agent, multi-provider, multi-auth matrix — 12 end-to-end tested agent × model × provider × auth combinations (docs/tested-agents.md)
  • Subscription auth — use claude login, codex --login, gemini OAuth directly, no API keys required
  • Vertex AI support — ADC auth for google-vertex/, anthropic-vertex/, vertex-zai/ prefixed models
  • Provider registry — data-driven custom LLM endpoints via ProviderConfig. Adding a new provider = single registry entry
  • vLLM support — benchflow run -a pi-acp --model vllm/Qwen3-80B --ae BENCHFLOW_PROVIDER_BASE_URL=...
  • SDK refactor — SDK.run() decomposed into focused private methods; core modules extracted (_models.py, _trajectory.py, _env_setup.py, _scoring.py)
  • Harbor switched to PyPI — harbor==0.3.0 pin, no more git URL dependency
  • benchmarks/ directory with reusable YAML configs and runner scripts for TB2 and SkillsBench
  • benchflow tasks init / tasks check commands for scaffolding and validating new tasks
  • Oracle agent support — run solution/solve.sh directly for task validation
  • 232 unit tests (up from 66 in 0.1.x)
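The "single registry entry" claim can be pictured with a small sketch. ProviderConfig is real per the notes above, but its fields and the registry helpers here are assumptions chosen to show the data-driven routing idea:

```python
from dataclasses import dataclass

# Illustrative sketch only — benchflow's actual ProviderConfig fields
# and registry API may differ.
@dataclass(frozen=True)
class ProviderConfig:
    prefix: str       # model-name prefix that routes to this provider
    base_url: str
    api_key_env: str  # env var holding the credential

REGISTRY: dict[str, ProviderConfig] = {}

def register(cfg: ProviderConfig) -> None:
    REGISTRY[cfg.prefix] = cfg

def resolve(model: str) -> ProviderConfig:
    """Route "vllm/Qwen3-80B" to the provider registered for "vllm"."""
    return REGISTRY[model.split("/", 1)[0]]

# Adding a new provider is a single registry entry:
register(ProviderConfig("vllm", "http://localhost:8000/v1", "VLLM_API_KEY"))
```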

Benchmark Results

Benchmark         Agent              Model        Score
TB2 single-turn   codex-acp          GPT-5.4      69.7% (62/89)
TB2 single-turn   claude-agent-acp   Sonnet 4.6   58.4% (52/89)
TB2 multi-turn    codex-acp          GPT-5.4      62.9% (56/89)
TB2 multi-turn    claude-agent-acp   Haiku 4.5    37.1% (33/89)
SkillsBench       codex-acp          GPT-5.4      37.2% (32/86)

Notable finding: Multi-turn self-critique hurts capable models (GPT-5.4 regresses −6.8pp from single-turn) but helps weaker models (Haiku 4.5 gains +9.6pp).

Security fixes

  • API keys no longer leak in ps aux
  • ADC credentials fixed for sandbox_user setups (#111)
  • Daytona sandbox orphan cleanup with --max-age filter (#102)
  • litellm upgraded to 1.83.0 for CVE-2026-35030
  • cryptography upgraded to 46.0.7 for CVE-2026-39892
  • 13 transitive Dependabot alerts resolved

Contributors

  • @kywch (Kyoung Whan Choe) — core refactor, benchmark runs, agent test scripts
  • @xdotli (Xiangyi Li) — SDK, providers, Vertex AI, OpenClaw shim, subscription auth

Full changelog

https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md

Migration from 0.1.x

0.1.x users should treat this as a fresh install. The SDK API, CLI, registry pattern, and task format have all changed. There is no automatic migration path. See docs/sdk-reference.md and examples/ for starting points.

v0.1.12

07 Mar 18:04
c407dfd

What's Changed

  • add crag benchmark by @danielfang001 in #12
  • fix: remove binary code by @kk-xuhj in #13
  • Fix/crag by @xdotli in #14
  • Feat/hot load by @xdotli in #16
  • Docs/readme by @kk-xuhj in #17
  • docs: udpate readme by @kk-xuhj in #18
  • Fix/demo agents by @kk-xuhj in #19

Full Changelog: v0.1.8...v0.1.12

v0.1.8

27 Feb 02:36
a177063

What's Changed

  • Feat/v0.1.5 by @kk-xuhj in #1
  • Feat/new_interface_for_benchmark by @kk-xuhj in #2
  • add swebench by @kk-xuhj in #3
  • Readme suggestion by @tom-doerr in #4
  • Feat/bff integrate by @kk-xuhj in #6
  • Docs/better docs by @kk-xuhj in #7
  • feat: add MMLU-PRO by @kk-xuhj in #9
  • Feat/type_checking by @kk-xuhj in #10

New Contributors

  • @kk-xuhj made their first contribution in #1
  • @tom-doerr made their first contribution in #4

Full Changelog: https://github.com/benchflow-ai/benchflow/commits/v0.1.8