merge: main → dev-0.3 (release prep for v0.3.2) by xdotli · Pull Request #195 · benchflow-ai/benchflow

xdotli · 2026-04-25T08:08:46Z

Why

dev-0.3 is 9 commits behind main, including PR #173 (oracle agent: skip default model assignment) and PR #174 (oracle chokepoint refactor). These commits modify cli/eval.py and job.py and were never integrated into dev-0.3, so any PR branched off main that touches these files conflicts with dev-0.3.

This is a release-prep merge: bring main fully into dev-0.3, then cut v0.3.2 from dev-0.3 → main, then retire dev-0.3.

Conflicts resolved

Two regions, both trivially favoring main's post-#173/#174 versions:

src/benchflow/job.py:

 @dataclass
 class JobConfig:
     agent: str = DEFAULT_AGENT
-    model: str | None = DEFAULT_MODEL    # dev-0.3
+    model: str | None = None             # main (PR #173)

src/benchflow/cli/eval.py (in both _run_single and _run_batch):

-    effective_model = None if agent == "oracle" else (model or DEFAULT_MODEL)
+    eff_model = _effective_model(agent, model)
     config = JobConfig(
         agent=agent,
-        model=effective_model,
+        model=eff_model,

PR #174 extracted the inline ternary into a _effective_model(agent, model) helper imported from benchflow.job. dev-0.3 still has the inline form. Resolution: take main's helper-based version. The import already exists at the top of the file.

Validation

580 tests passing locally
8 failing — all pre-existing on dev-0.3 (Docker compose env, env-var pollution between subscription auth tests, judge_model default mismatch in test_skill_eval). None caused by this merge.

Test plan

No conflict markers remain in source
Local imports succeed
CI runs against the merge commit
Devin reviews the resolution choice
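The first item in the test plan can be checked mechanically. The sketch below is a hypothetical helper, not part of benchflow. It flags lines that start with the markers Git writes into conflicted files, and it is a heuristic: a line consisting of exactly seven `=` characters in a non-conflicted file would also be reported.

```python
from __future__ import annotations

import re

# Git conflict markers at the start of a line:
#   "<<<<<<< ours", "||||||| base" (diff3 style), "=======", ">>>>>>> theirs"
_MARKER = re.compile(r"^(<{7} |\|{7} |={7}$|>{7} )")


def find_conflict_markers(text: str) -> list[int]:
    """Return 1-based line numbers that look like Git conflict markers."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if _MARKER.match(line)
    ]
```

To cover a source tree, one could call this on the text of each file yielded by `pathlib.Path("src").rglob("*.py")` and fail if any list is non-empty.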

Next steps after this merges

Open PRs #176 (docs: use uv tool install instead of pip install), #180 (feat: wire sandbox_setup_timeout through all configs), #181 (fix: stop copying root tool installs into sandbox home), and #191 (fix: merge cfg.agent_env into connect_as() env resolution) should re-evaluate to MERGEABLE against dev-0.3 (they were branched off main and are blocked by these same files)
Land remaining dev-0.3 PRs: #194 (feat: BaseUser abstraction + per-task verifier hardening opt-outs, mine) and #182 (Fix/openhands sandbox launch, AmyTao, after rebase)
dev-0.3 → main release PR
Tag v0.3.2 on main
Bump main to 0.3.3.dev0
Delete dev-0.3

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.

Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.

fix: skip model/API-key validation for oracle agent

PR #173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation.

Three changes:

- Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests; the live eval_create lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth. It was verified to fail on main with the user-facing ANTHROPIC_API_KEY error and to pass after the fix.
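The chokepoint guard and the shape of its regression test can be illustrated with a simplified stand-in. Only the name resolve_agent_env, the call shape resolve_agent_env("oracle", DEFAULT_MODEL, {}), and the quoted error text come from this PR. The provider check below is a toy assumption and not benchflow's real logic, which also considers host auth.

```python
from __future__ import annotations

DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # from the error quoted in this PR


def resolve_agent_env(
    agent: str, model: str | None, env: dict[str, str]
) -> dict[str, str]:
    """Toy stand-in: validate API keys only for agents that call an LLM."""
    resolved = dict(env)
    # The guard: oracle never calls an LLM, so key validation is skipped
    # even when a caller forwards a model.
    if agent != "oracle" and model and model.startswith("claude"):
        if "ANTHROPIC_API_KEY" not in resolved:
            raise ValueError(f"ANTHROPIC_API_KEY required for model {model!r}")
    return resolved


def test_oracle_skips_key_validation() -> None:
    # Mirrors the regression test described above: a model is forwarded,
    # no key is present, and the call must still succeed for oracle.
    assert resolve_agent_env("oracle", DEFAULT_MODEL, {}) == {}
```

Keeping the guard at the lowest layer means a future caller that forgets to null out the model for oracle still cannot trigger the key check.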

Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude:

- TestOrphanRemoval: cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting: `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR #173 missed.
- TestEffectiveModel: unit tests for the helper: oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders: Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved.
- TestEvalCreateOracleCLI: end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".

The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).
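One way to express the "cli/main.py does not import cli/eval" invariant is a static check over the module's source. The sketch below is a hypothetical illustration using the standard-library `ast` module and is not the test that landed in the repository. The `package` parameter is an assumption added so that relative imports such as `from . import eval` can be resolved.

```python
from __future__ import annotations

import ast


def _from_base(node: ast.ImportFrom, package: str) -> str:
    """Resolve the module named by a from-import to an absolute dotted name."""
    if node.level == 0:
        return node.module or ""
    parts = package.split(".") if package else []
    # level 1 is the current package; each extra level goes one package up
    parts = parts[: max(0, len(parts) - (node.level - 1))]
    if node.module:
        parts.append(node.module)
    return ".".join(parts)


def imports_module(source: str, target: str, package: str = "") -> bool:
    """True if `source` imports `target` or one of its submodules."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            base = _from_base(node, package)
            # `from X import a` may bind the submodule X.a, so check both
            names = [base] + [
                f"{base}.{alias.name}" if base else alias.name
                for alias in node.names
            ]
        else:
            continue
        if any(n == target or n.startswith(target + ".") for n in names):
            return True
    return False
```

A guard test could then read the text of cli/main.py and assert that `imports_module(text, "benchflow.cli.eval", package="benchflow.cli")` is false. Dynamic imports through `importlib` would not be caught by this approach.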

…e CLI

fix: oracle chokepoint guard + effective_model helper

# Conflicts:
#	src/benchflow/cli/eval.py
#	src/benchflow/job.py

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 5 additional findings.

xdotli and others added 10 commits April 21, 2026 16:38

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

7dc18fc

release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign

Merge pull request #173 from EYH0602/fix/oracle-skip-api-key-check

a099ff9

fix: skip model/API-key validation for oracle agent

docs: clarify cli/eval.py and test_eval_cli.py are not wired into liv…

bc04c59

…e CLI

Merge pull request #174 from EYH0602/fix/oracle-chokepoint-and-cleanup

144b6dc

fix: oracle chokepoint guard + effective_model helper

Merge remote-tracking branch 'origin/main' into merge/main-into-dev-0.3

acd1541

# Conflicts:
#	src/benchflow/cli/eval.py
#	src/benchflow/job.py

devin-ai-integration Bot reviewed Apr 25, 2026

xdotli merged commit e1cc115 into dev-0.3 Apr 25, 2026
2 checks passed

xdotli deleted the merge/main-into-dev-0.3 branch April 25, 2026 10:04

xdotli mentioned this pull request Apr 25, 2026

release: benchflow 0.3.2 — BaseUser, verifier hardening, DinD compose, lint cleanup #199

Merged

4 tasks
