fix: oracle chokepoint guard + effective_model helper#174
fix: oracle chokepoint guard + effective_model helper#174xdotli merged 4 commits intobenchflow-ai:mainfrom
Conversation
PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation. Three changes: - Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model. - Delete the orphan cli/eval.py and its tests — the live eval_create lives in cli/main.py and was the actual code path users hit. - Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial). Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth — verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.
Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude: - TestOrphanRemoval — cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way. - TestEvalCreateRouting — `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed. - TestEffectiveModel — unit tests for the helper: oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset. - TestOracleYamlLoaders — Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved. - TestEvalCreateOracleCLI — end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern. Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).
EYH0602
left a comment
There was a problem hiding this comment.
Review Summary
Reviewed 7 files, 346 lines added / 28 deleted across bug detection, type design, test coverage, and guidelines compliance.
Critical Issues (confidence 90-100)
None.
Important Issues (confidence 75-89)
None.
All findings scored below the 75 confidence threshold — no actionable issues found.
Positive Observations
-
Three-layer defense is excellent architecture. Layer 1 (chokepoint guard in
resolve_agent_env), Layer 2 (docstring/structural clarification on the not-yet-livecli/eval.py), and Layer 3 (effective_model()helper) each defend independently — any single layer failing still leaves the others protecting. -
effective_model()eliminates 9 duplicatemodel or DEFAULT_MODELsites acrosscli/main.py,cli/eval.py, and the YAML loaders injob.py. Centralizing the rule "oracle → None, else → model or default" in one helper is a meaningful duplication reduction. -
Test layering matches the code layering. Tests pin each defense at the right altitude:
TestResolveAgentEnvOraclepins Layer 1,TestEvalModuleNotWiredIntoCLIpins Layer 2 structurally,TestEffectiveModelpins the helper,TestOracleYamlLoaderspins the YAML entry points, andTestEvalCreateOracleCLIpins the full CLI → chokepoint path end-to-end. 13 new tests for a 3-line core change is proportionate to the bug's subtlety. -
Structural regression test is clever.
test_cli_main_does_not_import_cli_evalreads the source text ofcli/main.pyand asserts nocli/evalimport — this directly prevents the mistake that made PR #173's fix land in dead code. -
JobConfig.model default → None is clean. Forcing callers through
effective_model()means config objects carry honest data (oracle showsmodel=nullinsummary.json, not a bogus default). All internal callers were updated. -
Edge cases covered. Empty string model from Harbor YAML, explicit model forwarded to oracle, non-oracle fallback to DEFAULT_MODEL — all verified in tests.
Verified: all 9 model or DEFAULT_MODEL sites in cli/main.py + cli/eval.py are replaced, both YAML loaders in job.py route through the helper, and DEFAULT_MODEL is only imported where still needed. No remaining unguarded call sites. LGTM — request an independent reviewer per CLAUDE.md policy.
|
@xdotli this fix should do the work |
|
|
||
| The future-facing entry point for running evaluations. Anthropic-style shape: | ||
| resource creation, one command, return the result or a job-id. | ||
| NOTE: This module is **not wired into the live CLI**. The active |
Summary
Follow-up to #173. The reviewer flagged that the fix landed in
src/benchflow/cli/eval.py, butbench eval createactually dispatches tosrc/benchflow/cli/main.py:707—cli/eval.pyis not wired into the live CLI. On top of that, #173 deliberately removed the chokepoint guard from_agent_env.py:147on the assumption that the CLI would pre-filter, so oracle is now more broken on the live code path:bench eval create -a oracleraisesValueError("ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001' but not set …")even though oracle never calls an LLM.Three layers, one PR:
_agent_env.py:147). Put backif model and agent != "oracle":. Single point of defense; catches every caller (current + future, CLI + SDK + YAML).src/benchflow/cli/eval.pyandtests/test_eval_cli.py(future-facing eval design), but add docstring notes that they are not wired into the live CLI. Updateeval.pyto use theeffective_model()helper for the oracle guard. A regression test ensurescli/main.pydoes not importcli/eval.effective_model(agent, model)injob.py. ChangeJobConfig.modeldefault toNone. Replace all 7model or DEFAULT_MODELsites incli/main.pyandjob.pyYAML loaders with the helper. Oracle gets honestmodel=Noneend-to-end (inresult.json,summary.json, prints).Regression tests
tests/test_resolve_env_helpers.py::TestResolveAgentEnvOraclepins the chokepoint at unit level.tests/test_oracle_chokepoint.py(13 tests) pins each layer at the right altitude:TestEvalModuleNotWiredIntoCLIcli/main.pydoes not importcli/evalTestEvalCreateRoutingbench eval create→cli/main.py:eval_createTestEffectiveModelTestOracleYamlLoadersJob.from_yaml(oracle config) → model is None(native + Harbor)TestEvalCreateOracleCLIeval_create(agent="oracle")+ no API key → no raiseTest plan
pytest tests/test_oracle_chokepoint.py→ 13 passedpytest tests/test_eval_cli.py→ 7 passedpytest tests/test_resolve_env_helpers.py→ all passedmain(Docker not running locally). Zero new failures.ty check src/: same pre-existing diagnostics asmain. Zero new ones.