Releases: benchflow-ai/benchflow

v0.3.2 — BaseUser, verifier hardening, DinD compose

25 Apr 10:55
23b4de4

Highlights

  • BaseUser progressive-disclosure abstraction (#194): a Python callback drives multi-round agent runs. Built for the SWE-bench Pro use case (Josh @ GitHub) and as a parity answer to Harbor #1316 in the no-second-LLM case. See docs/progressive-disclosure.md.
  • Per-task [verifier.hardening] opt-outs in task.toml (#194): tasks with legitimate conftest.py setups (qutebrowser-style) opt out of specific cleanup steps. Achieves 5/5 SWE-bench Pro oracle on hardened verifier.
  • DinD compose ACP via Daytona PTY WebSocket (#193, #196): live agent pipes for SkillsBench / DinD compose tasks.
  • --rootdir=/app in PYTEST_ADDOPTS (#194): anchors test node IDs to repo root; openlibrary oracle goes 0/18 → 18/18.
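The per-task opt-out shape might look like the sketch below. The table and key names here are hypothetical, chosen only to illustrate the idea of a task keeping its legitimate conftest.py while the rest of the hardened verifier stays on; see #194 for the real schema.

```toml
# task.toml — hypothetical key names, for illustration only
[verifier.hardening]
# This task ships a legitimate conftest.py (qutebrowser-style setup),
# so skip the conftest cleanup step; all other hardening stays enabled.
remove_conftest = false
```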

Fixes

  • cfg.agent_env reaches connect_as() (#191, closes #190): YAML-supplied provider creds now reach the agent.
  • DinD env-file path mismatch (#198): shlex.join() was quoting $$ literally so written/read paths diverged; switched to uuid.uuid4() for unique paths.
  • OpenHands sandbox launch + ACP CLI path (#182).
  • Stop copying root tool installs into sandbox home (#181, closes #178).
  • sandbox_setup_timeout wired through configs (#180).
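The #198 fix comes down to how shlex.join() treats "$": it quotes any token containing it, so a "$$" that one side of the pipeline expected the shell to expand to a PID survives literally, and the written and read paths diverge. A minimal sketch of the symptom and the UUID workaround (paths are illustrative):

```python
import shlex
import uuid

# shlex.join() quotes tokens containing "$", so "$$" stays literal here
# instead of being expanded by the shell to the current PID:
cmd = shlex.join(["cat", "/tmp/env.$$"])
print(cmd)  # cat '/tmp/env.$$'

# An unquoted "$$" elsewhere in the pipeline *does* expand, so the path
# written and the path read no longer match. A UUID-based path sidesteps
# shell expansion entirely:
env_file = f"/tmp/env.{uuid.uuid4().hex}"
```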

Chores

  • Repo-wide ruff lint debt cleanup (#197): 126 errors → 0.
  • Docs: uv tool install (#176).

SWE-bench Pro validation

Oracle 5/5 on Daytona (ansible, flipt, openlibrary, navidrome, qutebrowser).
Single-round Gemini 3.1 Pro baseline: 2/4.

Install

pip install benchflow==0.3.2

v0.2.3 — verifier hardening follow-ups

16 Apr 00:14
0e5c9e9

Added

  • benchmarks/tb2_multiturn-claude-haiku45.yaml — shipped config for the README's TB2 multi-turn Claude result.
  • Daytona resource clamping via BENCHFLOW_DAYTONA_MAX_CPUS / MAX_MEMORY_MB.

Changed

  • Renamed skillsbench-claude-glm5.yaml → skillsbench-claude-glm51.yaml to match the model ID.
  • codex --login correction in docs/getting-started.md.
  • Restricted sdist build to src/, tests/, and metadata.

Fixed

  • Verifier sandbox hardening follow-ups across several base-image and tooling edge cases.
  • Preserve trusted verifier path entries and workspace answer files.
  • Redirect oracle output to container log.
  • Align YAML path resolution to config file location.

v0.2.2 — BenchJack defenses: 32.6% → 0.15% exploit rate on 666 tasks

14 Apr 18:49
540b011

Added

  • Sandbox hardening tiers 1–4 — layered defense blocking F1–F6 red-team findings: env scrubbing, path lockdown, workspace freeze, wider build-config snapshot, oracle privilege drop, and a filesystem scrub that deletes injected conftest.py / .pth / sitecustomize.py files outside /tests plus *.py drops in /tmp and /var/tmp before every verifier run.
  • labs/reward-hack-matrix — end-to-end sweep of BenchJack-shaped exploits against 666 real tasks across skillsbench, swebench-verified, and terminal-bench-2 (1332 trials). Per-cell results in sweep_0.2.0_vs_0.2.2.json. Includes per-trial timeout and a long-lived worker pool that keeps local RAM ~1 GB regardless of trial count.
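The filesystem-scrub idea can be sketched in a few lines. This is a hypothetical illustration of the technique, not benchflow's actual code: walk the tree, delete known pytest/interpreter hook files outside the protected tests directory, and leave everything else alone.

```python
import os

# Hypothetical sketch of the pre-verify scrub described above. The names
# and scope are illustrative; benchflow also scrubs *.py drops in /tmp
# and /var/tmp.
INJECTED_NAMES = {"conftest.py", "sitecustomize.py"}

def scrub(root: str, keep_prefix: str = "/tests") -> list[str]:
    """Delete injected hook files under root, sparing keep_prefix."""
    removed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if dirpath.startswith(keep_prefix):
            continue  # legitimate test fixtures live here
        for name in filenames:
            if name in INJECTED_NAMES or name.endswith(".pth"):
                path = os.path.join(dirpath, name)
                os.remove(path)
                removed.append(path)
    return removed
```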

Fixed

  • Multiple sandbox bypass vectors identified in red-team testing (F1–F6).
  • PYTHONHOME="" crashing Py_Initialize — empty value is NOT equivalent to unset; dropped from VERIFIER_ENV.
  • PYTHONSAFEPATH=1 breaking matplotlib setupext.py imports — dropped from VERIFIER_ENV.
  • pytest_plugins AttributeError during hardened verify — guarded with getattr(...).
  • matplotlib LFS EOVERFLOW on qhull build artifacts — replaced rmtree + copytree fallback with shutil.copytree(dirs_exist_ok=True) merge-copy in harden_before_verify so inert overlay stragglers no longer block workspace restore.
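The merge-copy fix relies on standard shutil behavior: with dirs_exist_ok=True (Python 3.8+), shutil.copytree copies into an existing destination instead of raising FileExistsError, overwriting files it can and leaving unrelated stragglers in place. A minimal sketch under assumed names:

```python
import shutil
from pathlib import Path

# Sketch of the restore step: merge the clean snapshot over the workspace.
# A leftover file in the workspace that rmtree could not remove no longer
# aborts the restore. Function name is illustrative.
def restore_workspace(snapshot: Path, workspace: Path) -> None:
    shutil.copytree(snapshot, workspace, dirs_exist_ok=True)
```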

Results

BenchJack-shaped exploit success rate on 666 real tasks:

benchmark           tasks   0.2.0 exploits   0.2.2 exploits   Δ
skillsbench         77      16 (20.8%)       0 (0%)           −20.8 pp
swebench-verified   500     119 (23.8%)      1 (0.2%)¹        −23.6 pp
terminal-bench-2    89      82 (92.1%)       0 (0%)           −92.1 pp
total               666     217 (32.6%)      1 (0.15%)        −32.4 pp

¹ swebench-verified/django__django-7530 scores reward = 1.0 on BOTH versions because its FAIL_TO_PASS test passes at baseline without any patch — a SWE-bench task-definition quirk, not a 0.2.2 bypass. True bypass count on 0.2.2: 0.

Install: pip install benchflow==0.2.2

Full audit: labs/reward-hack-matrix/ — Hardening design: .dev-docs/harden-sandbox.md

v0.2.1 — Sandbox hardening on by default

14 Apr 18:49
27b5139

Added

  • Sandbox hardening on by default — sandbox_user now defaults to "agent" (was None/root). Blocks conftest-hook and answer-lookup exploit patterns.
  • Path lockdown — new sandbox_locked_paths parameter makes /solution and /tests read-only before the verifier runs, blocking .pth-injection and similar pre-verify tampering.
  • Verifier failure isolation — agent errors and verifier errors are now stored separately; a crashing verifier no longer masks the agent result.
  • labs/benchjack-sandbox-hardening — cookbook demonstrating three exploit patterns (P1 conftest-hook, P2 answer-lookup, P7 .pth-injection) and their defenses.
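The path-lockdown idea amounts to stripping write bits from a tree before the verifier runs. The sketch below is a hypothetical illustration of the mechanism, not benchflow's sandbox_locked_paths implementation, and it assumes the verifier runs as a non-root user (root ignores mode bits):

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH

def lock_path(root: str) -> None:
    """Make every file and directory under root read-only (sketch)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            p = os.path.join(dirpath, name)
            os.chmod(p, os.stat(p).st_mode & ~WRITE_BITS)
    # lock the root itself last, so the walk above could still descend
    os.chmod(root, os.stat(root).st_mode & ~WRITE_BITS)
```

Read and execute bits are kept, so pytest can still traverse and collect from the locked tree; it just cannot drop a .pth or conftest.py into it.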

Fixed

  • Oracle runs as sandbox_user — oracle agent now respects path lockdown instead of running as root and bypassing it.
  • Multi-endpoint provider routing — providers with multiple endpoints now route by the agent's native API protocol.
  • Stale API key shadowing subscription auth — emits a warning when ANTHROPIC_API_KEY env var is present alongside claude login credentials.
  • pytest ini-injection bypass — closed a verifier hardening edge case.

Changed

  • Version is now single-sourced via importlib.metadata; no more duplicate version string in __init__.py.
  • User-facing docs — new docs/ directory with getting-started guide, CLI reference, architecture overview, task-authoring guide, and labs index. README trimmed; detailed content moved to docs/.

Install: pip install benchflow==0.2.1

v0.2.0 — First Public Release

09 Apr 22:27
01ee396

First public release 🎉

Major rearchitecture from the 0.1.x era. API surface has changed — assume breaking changes. Future releases will maintain compatibility within the 0.2.x line.

Install

pip install benchflow

Highlights

  • Multi-agent, multi-provider, multi-auth matrix — 12 end-to-end tested agent × model × provider × auth combinations (docs/tested-agents.md)
  • Subscription auth — use claude login, codex --login, gemini OAuth directly, no API keys required
  • Vertex AI support — ADC auth for google-vertex/, anthropic-vertex/, vertex-zai/ prefixed models
  • Provider registry — data-driven custom LLM endpoints via ProviderConfig. Adding a new provider = single registry entry
  • vLLM support — benchflow run -a pi-acp --model vllm/Qwen3-80B --ae BENCHFLOW_PROVIDER_BASE_URL=...
  • SDK refactor — SDK.run() decomposed into focused private methods; core modules extracted (_models.py, _trajectory.py, _env_setup.py, _scoring.py)
  • Harbor switched to PyPI — harbor==0.3.0 pin, no more git URL dependency
  • benchmarks/ directory with reusable YAML configs and runner scripts for TB2 and SkillsBench
  • benchflow tasks init / tasks check commands for scaffolding and validating new tasks
  • Oracle agent support — run solution/solve.sh directly for task validation
  • 232 unit tests (up from 66 in 0.1.x)
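The "single registry entry" claim can be pictured with a small sketch. ProviderConfig is real per the notes above, but its fields and the registry helpers here are assumptions chosen to show the data-driven routing idea:

```python
from dataclasses import dataclass

# Illustrative sketch only — benchflow's actual ProviderConfig fields
# and registry API may differ.
@dataclass(frozen=True)
class ProviderConfig:
    prefix: str       # model-name prefix that routes to this provider
    base_url: str
    api_key_env: str  # env var holding the credential

REGISTRY: dict[str, ProviderConfig] = {}

def register(cfg: ProviderConfig) -> None:
    REGISTRY[cfg.prefix] = cfg

def resolve(model: str) -> ProviderConfig:
    """Route "vllm/Qwen3-80B" to the provider registered for "vllm"."""
    return REGISTRY[model.split("/", 1)[0]]

# Adding a new provider is a single registry entry:
register(ProviderConfig("vllm", "http://localhost:8000/v1", "VLLM_API_KEY"))
```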

Benchmark Results

Benchmark         Agent              Model        Score
TB2 single-turn   codex-acp          GPT-5.4      69.7% (62/89)
TB2 single-turn   claude-agent-acp   Sonnet 4.6   58.4% (52/89)
TB2 multi-turn    codex-acp          GPT-5.4      62.9% (56/89)
TB2 multi-turn    claude-agent-acp   Haiku 4.5    37.1% (33/89)
SkillsBench       codex-acp          GPT-5.4      37.2% (32/86)

Notable finding: Multi-turn self-critique hurts capable models (GPT-5.4 regresses −6.8pp from single-turn) but helps weaker models (Haiku 4.5 gains +9.6pp).

Security fixes

  • API keys no longer leak in ps aux
  • ADC credentials fixed for sandbox_user setups (#111)
  • Daytona sandbox orphan cleanup with --max-age filter (#102)
  • litellm upgraded to 1.83.0 for CVE-2026-35030
  • cryptography upgraded to 46.0.7 for CVE-2026-39892
  • 13 transitive Dependabot alerts resolved

Contributors

  • @kywch (Kyoung Whan Choe) — core refactor, benchmark runs, agent test scripts
  • @xdotli (Xiangyi Li) — SDK, providers, Vertex AI, OpenClaw shim, subscription auth

Full changelog

https://github.com/benchflow-ai/benchflow/blob/main/CHANGELOG.md

Migration from 0.1.x

0.1.x users should treat this as a fresh install. The SDK API, CLI, registry pattern, and task format have all changed. There is no automatic migration path. See docs/sdk-reference.md and examples/ for starting points.

v0.1.12

07 Mar 18:04
c407dfd

What's Changed

  • add crag benchmark by @danielfang001 in #12
  • fix: remove binary code by @kk-xuhj in #13
  • Fix/crag by @xdotli in #14
  • Feat/hot load by @xdotli in #16
  • Docs/readme by @kk-xuhj in #17
  • docs: udpate readme by @kk-xuhj in #18
  • Fix/demo agents by @kk-xuhj in #19

Full Changelog: v0.1.8...v0.1.12

v0.1.8

27 Feb 02:36
a177063

What's Changed

  • Feat/v0.1.5 by @kk-xuhj in #1
  • Feat/new_interface_for_benchmark by @kk-xuhj in #2
  • add swebench by @kk-xuhj in #3
  • Readme suggestion by @tom-doerr in #4
  • Feat/bff integrate by @kk-xuhj in #6
  • Docs/better docs by @kk-xuhj in #7
  • feat: add MMLU-PRO by @kk-xuhj in #9
  • Feat/type_checking by @kk-xuhj in #10

New Contributors

  • @kk-xuhj made their first contribution in #1
  • @tom-doerr made their first contribution in #4

Full Changelog: https://github.com/benchflow-ai/benchflow/commits/v0.1.8