fix: oracle chokepoint guard + effective_model helper by EYH0602 · Pull Request #174 · benchflow-ai/benchflow

EYH0602 · 2026-04-21T22:46:15Z

Summary

Follow-up to #173. The reviewer flagged that the fix landed in src/benchflow/cli/eval.py, but bench eval create actually dispatches to src/benchflow/cli/main.py:707 — cli/eval.py is not wired into the live CLI. On top of that, #173 deliberately removed the chokepoint guard from _agent_env.py:147 on the assumption that the CLI would pre-filter, so oracle is now more broken on the live code path: bench eval create -a oracle raises ValueError("ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001' but not set …") even though oracle never calls an LLM.

Three layers, one PR:

Layer 1 — restore the chokepoint (_agent_env.py:147). Put back if model and agent != "oracle":. Single point of defense; catches every caller (current + future, CLI + SDK + YAML).
Layer 2 — clarify the not-yet-live module. Keep src/benchflow/cli/eval.py and tests/test_eval_cli.py (future-facing eval design), but add docstring notes that they are not wired into the live CLI. Update eval.py to use the effective_model() helper for the oracle guard. A regression test ensures cli/main.py does not import cli/eval.
Layer 3 — kill the structural pressure. New helper effective_model(agent, model) in job.py. Change JobConfig.model default to None. Replace all 7 model or DEFAULT_MODEL sites in cli/main.py and job.py YAML loaders with the helper. Oracle gets honest model=None end-to-end (in result.json, summary.json, prints).

Regression tests

tests/test_resolve_env_helpers.py::TestResolveAgentEnvOracle pins the chokepoint at unit level. tests/test_oracle_chokepoint.py (13 tests) pins each layer at the right altitude:

Class	What it pins
`TestEvalModuleNotWiredIntoCLI`	`cli/main.py` does not import `cli/eval`
`TestEvalCreateRouting`	`bench eval create` → `cli/main.py:eval_create`
`TestEffectiveModel`	helper behaviour for oracle / non-oracle / empty model
`TestOracleYamlLoaders`	`Job.from_yaml(oracle config) → model is None` (native + Harbor)
`TestEvalCreateOracleCLI`	end-to-end: live `eval_create(agent="oracle")` + no API key → no raise

Test plan

pytest tests/test_oracle_chokepoint.py → 13 passed
pytest tests/test_eval_cli.py → 7 passed
pytest tests/test_resolve_env_helpers.py → all passed
Full suite: same pre-existing failures as main (Docker not running locally). Zero new failures.
ty check src/: same pre-existing diagnostics as main. Zero new ones.

PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation. Three changes: - Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model. - Delete the orphan cli/eval.py and its tests — the live eval_create lives in cli/main.py and was the actual code path users hit. - Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial). Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth — verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.

Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude: - TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way. - TestEvalCreateRouting — `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed. - TestEffectiveModel — unit tests for the helper: oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset. - TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved. - TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern. Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".

The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).

…e CLI

EYH0602

Review Summary

Reviewed 7 files, 346 lines added / 28 deleted across bug detection, type design, test coverage, and guidelines compliance.

Critical Issues (confidence 90-100)

None.

Important Issues (confidence 75-89)

None.

All findings scored below the 75 confidence threshold — no actionable issues found.

Positive Observations

Three-layer defense is excellent architecture. Layer 1 (chokepoint guard in resolve_agent_env), Layer 2 (docstring/structural clarification on the not-yet-live cli/eval.py), and Layer 3 (effective_model() helper) each defend independently — any single layer failing still leaves the others protecting.
effective_model() eliminates 9 duplicate model or DEFAULT_MODEL sites across cli/main.py, cli/eval.py, and the YAML loaders in job.py. Centralizing the rule "oracle → None, else → model or default" in one helper is a meaningful duplication reduction.
Test layering matches the code layering. Tests pin each defense at the right altitude: TestResolveAgentEnvOracle pins Layer 1, TestEvalModuleNotWiredIntoCLI pins Layer 2 structurally, TestEffectiveModel pins the helper, TestOracleYamlLoaders pins the YAML entry points, and TestEvalCreateOracleCLI pins the full CLI → chokepoint path end-to-end. 13 new tests for a 3-line core change is proportionate to the bug's subtlety.
Structural regression test is clever. test_cli_main_does_not_import_cli_eval reads the source text of cli/main.py and asserts no cli/eval import — this directly prevents the mistake that made PR #173's fix land in dead code.
JobConfig.model default → None is clean. Forcing callers through effective_model() means config objects carry honest data (oracle shows model=null in summary.json, not a bogus default). All internal callers were updated.
Edge cases covered. Empty string model from Harbor YAML, explicit model forwarded to oracle, non-oracle fallback to DEFAULT_MODEL — all verified in tests.

Verified: all 9 model or DEFAULT_MODEL sites in cli/main.py + cli/eval.py are replaced, both YAML loaders in job.py route through the helper, and DEFAULT_MODEL is only imported where still needed. No remaining unguarded call sites. LGTM — request an independent reviewer per CLAUDE.md policy.

EYH0602 · 2026-04-21T23:14:36Z

@xdotli this fix should do the work

xdotli · 2026-04-22T00:50:07Z

@xdotli this fix should do the work

waiting for @devin to review it

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 4 additional findings.

xdotli · 2026-04-22T01:10:49Z


-The future-facing entry point for running evaluations. Anthropic-style shape:
-resource creation, one command, return the result or a job-id.
+NOTE: This module is **not wired into the live CLI**.  The active


@claude is this comment good?

EYH0602 added 2 commits April 21, 2026 15:32

EYH0602 marked this pull request as draft April 21, 2026 22:48

EYH0602 added 2 commits April 21, 2026 15:53

docs: clarify cli/eval.py and test_eval_cli.py are not wired into liv…

bc04c59

…e CLI

EYH0602 marked this pull request as ready for review April 21, 2026 22:55

EYH0602 changed the title ~~fix: oracle chokepoint guard + drop orphan eval CLI + effective_model helper~~ fix: oracle chokepoint guard + effective_model helper Apr 21, 2026

EYH0602 commented Apr 21, 2026

View reviewed changes

xdotli self-requested a review April 22, 2026 00:49

devin-ai-integration Bot reviewed Apr 22, 2026

View reviewed changes

xdotli reviewed Apr 22, 2026

View reviewed changes

xdotli approved these changes Apr 22, 2026

View reviewed changes

xdotli merged commit 144b6dc into benchflow-ai:main Apr 22, 2026
1 of 2 checks passed

EYH0602 deleted the fix/oracle-chokepoint-and-cleanup branch April 22, 2026 07:13

xdotli mentioned this pull request Apr 25, 2026

merge: main → dev-0.3 (release prep for v0.3.2) #195

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: oracle chokepoint guard + effective_model helper#174

fix: oracle chokepoint guard + effective_model helper#174
xdotli merged 4 commits intobenchflow-ai:mainfrom
EYH0602:fix/oracle-chokepoint-and-cleanup

EYH0602 commented Apr 21, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

EYH0602 left a comment

Uh oh!

EYH0602 commented Apr 21, 2026

Uh oh!

xdotli commented Apr 22, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

xdotli Apr 22, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

EYH0602 commented Apr 21, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Regression tests

Test plan

Uh oh!

EYH0602 left a comment

Choose a reason for hiding this comment

Review Summary

Critical Issues (confidence 90-100)

Important Issues (confidence 75-89)

Positive Observations

Uh oh!

EYH0602 commented Apr 21, 2026

Uh oh!

xdotli commented Apr 22, 2026

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

xdotli Apr 22, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

EYH0602 commented Apr 21, 2026 •

edited by devin-ai-integration Bot

Loading