fix: skip model/API-key validation for oracle agent #173
xdotli merged 2 commits into benchflow-ai:main
The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.
CI failed on main, not triggered by this PR.
EYH0602 left a comment
PR #173 fix is installed and works for its intended purpose (OAuth token as alternative auth), but it doesn't help the oracle case — oracle needs no model/key at all, yet benchflow always forces DEFAULT_MODEL = "claude-haiku-4-5-20251001" (in job.py:141). When you run bench eval create -t <task> -a oracle without -m, the CLI does model=model or DEFAULT_MODEL (cli/eval.py:235), which always falls back to haiku, triggering the API key validation even though oracle just runs solve.sh and never calls any LLM.
This is a separate bug worth tracking — the oracle agent should either skip model/API-key validation entirely or not have a model assigned by default.
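The failure mode described in this comment can be reproduced in a few lines. This is a minimal sketch, not the benchflow code: `validate_api_key` and `run` are hypothetical stand-ins, and only `DEFAULT_MODEL`, the `model or DEFAULT_MODEL` pattern, and the error wording come from the discussion.

```python
from __future__ import annotations

DEFAULT_MODEL = "claude-haiku-4-5-20251001"  # value quoted in the review comment


def validate_api_key(model: str, env: dict[str, str]) -> None:
    """Hypothetical stand-in for the key check done during env resolution."""
    if model.startswith("claude") and "ANTHROPIC_API_KEY" not in env:
        raise ValueError(f"ANTHROPIC_API_KEY required for model {model!r}")


def run(agent: str, model: str | None, env: dict[str, str]) -> str:
    """Hypothetical CLI entry point showing the problematic fallback."""
    # An unset model always becomes DEFAULT_MODEL, regardless of the agent.
    resolved = model or DEFAULT_MODEL
    # The check never looks at `agent`, so oracle is validated like any LLM agent.
    validate_api_key(resolved, env)
    return resolved
```

Calling `run("oracle", None, {})` raises even though oracle only runs solve.sh, which is the behaviour reported for `bench eval create -a oracle`.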
Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.
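A rough sketch of what the CLI-layer change amounts to, assuming a dataclass-style config. The `agent` field and the `build_config` function are hypothetical; only `JobConfig.model` being widened to `str | None` and oracle receiving `model=None` are taken from the commit message.

```python
from __future__ import annotations

from dataclasses import dataclass

DEFAULT_MODEL = "claude-haiku-4-5-20251001"


@dataclass
class JobConfig:
    agent: str                # hypothetical field, for illustration only
    model: str | None = None  # widened from str so "no model" is representable


def build_config(agent: str, model: str | None) -> JobConfig:
    """Hypothetical stand-in for the config construction in _run_single/_run_batch."""
    if agent == "oracle":
        # Oracle never calls an LLM, so it is not given a model at all.
        return JobConfig(agent=agent, model=None)
    return JobConfig(agent=agent, model=model or DEFAULT_MODEL)
```

Widening the type lets downstream code distinguish "no model" from "default model" instead of carrying a placeholder value.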
Moved the fix to the CLI layer as suggested. Oracle now gets Changes in
/review
EYH0602 left a comment
The fix in cli/eval.py is correct, but bench eval create actually dispatches through cli/main.py:707 (eval_create registered on eval_app), not cli/eval.py. That code path still has the unfixed model or DEFAULT_MODEL on lines 763, 769, and 780. The same oracle skip needs to be applied there too.
Should we open another PR?
I think so, @xdotli. Should I do it?
Sure!
PR #173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation.

Three changes:
- Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests. The live eval_create lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth. Verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.
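The two layers described above (a helper that decides the model, plus a guard at the chokepoint) might look roughly like this. It is a sketch under assumptions: the names `effective_model` and `resolve_agent_env`, the three-argument call shape, and the empty-string behaviour come from the commit messages, while both function bodies and the key check are hypothetical.

```python
from __future__ import annotations

DEFAULT_MODEL = "claude-haiku-4-5-20251001"


def effective_model(agent: str, model: str | None) -> str | None:
    """Return the model an agent should actually run with."""
    if agent == "oracle":
        return None  # oracle drops any model, even an explicit one
    # None and "" are both treated as unset and fall back to the default.
    return model or DEFAULT_MODEL


def resolve_agent_env(agent: str, model: str | None, env: dict[str, str]) -> dict[str, str]:
    """Hypothetical body: validate credentials unless the agent is oracle."""
    resolved = dict(env)
    # Chokepoint guard: skip validation even if a caller forwards a model.
    if agent != "oracle" and model:
        if model.startswith("claude") and "ANTHROPIC_API_KEY" not in resolved:
            raise ValueError(f"ANTHROPIC_API_KEY required for model {model!r}")
    return resolved
```

Keeping the guard in both places is redundant by design: the helper keeps result JSON and prints honest, and the chokepoint still protects against a caller that bypasses the helper.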
Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude:
- TestOrphanRemoval: cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting: the `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR #173 missed.
- TestEffectiveModel: unit tests for the helper. Oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders: Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved.
- TestEvalCreateOracleCLI: end-to-end. Live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop afterwards to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
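One way to check that no source file references a given module, as the orphan-removal test is described as doing, is to parse each file and inspect its import statements. The helper below is a hypothetical illustration of that idea using the standard `ast` module; it is not the test from the PR and it ignores relative imports.

```python
import ast


def imports_module(source: str, target: str) -> bool:
    """Return True if `source` imports `target` or one of its submodules.

    Hypothetical helper; relative imports (from . import x) are not resolved.
    """
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import benchflow.cli.eval [as alias]
            for alias in node.names:
                if alias.name == target or alias.name.startswith(target + "."):
                    return True
        elif isinstance(node, ast.ImportFrom) and node.module:
            # from benchflow.cli.eval import something
            if node.module == target or node.module.startswith(target + "."):
                return True
            # from benchflow.cli import eval
            if any(f"{node.module}.{alias.name}" == target for alias in node.names):
                return True
    return False
```

A regression test could read each file under src/ and assert that `imports_module(text, "benchflow.cli.eval")` is false; parsing avoids false positives from comments or strings that a plain text search would hit.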
* fix: skip model/API-key validation for oracle agent
* fix: don't assign default model to oracle agent
* fix: oracle agent - chokepoint guard, drop orphan eval CLI, helper
* test: regression suite pinning oracle chokepoint + orphan removal
* fix: restore cli/eval.py and test_eval_cli.py, apply oracle guard

  The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).
* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI

Co-authored-by: Yifeng He <yfhe.prsn@gmail.com>
* docs: clarify cli/eval.py and test_eval_cli.py are not wired into live CLI
* docs(plan): add plan to fix sandbox io problem
* test: lock sandbox setup contract

  Plan step 1/6: Lock the new sandbox contract in tests
* fix: stop copying root tool installs into sandbox home

  Plan step 2/6: Narrow setup_sandbox_user() to user state only
* refactor: derive sandbox home dirs from registry config

  Plan step 3/6: Align registry semantics with the new contract
* refactor: symlink skills into sandbox, enforce shared install prefixes

  Replace per-trial skill-tree copies with ln -sfn into a shared /skills (or task skills_dir) root, drop skill_paths from get_sandbox_home_dirs(), and add registry + sandbox-setup invariants that keep agent binaries on /usr/local/* rather than /root-only home paths. Updates task-authoring and api-reference docs to describe the new lightweight sandbox contract.
* chore: remove completed sandbox plan doc

Co-authored-by: Xiangyi Li <xiangyi@benchmarkthing.com>
Co-authored-by: Xiangyi Li <xiangyi@benchflow.ai>
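The symlink approach mentioned in the refactor commit can be illustrated in Python. This is only a sketch of the `ln -sfn` idea (replace an existing link without following it); the function name and paths are hypothetical, and the actual sandbox setup presumably runs the shell command inside the sandbox rather than this code.

```python
import os
from pathlib import Path


def link_skills(shared_root: Path, link: Path) -> None:
    """Point `link` at `shared_root`, replacing an existing symlink.

    Hypothetical helper approximating `ln -sfn shared_root link`.
    """
    if link.is_symlink():
        # Remove the old link itself rather than whatever it points to.
        link.unlink()
    elif link.exists():
        # Refuse to touch a real file or directory left over from a copy.
        raise FileExistsError(f"{link} exists and is not a symlink")
    link.parent.mkdir(parents=True, exist_ok=True)
    os.symlink(shared_root, link, target_is_directory=True)
```

Compared with copying the skill tree per trial, a link costs one filesystem entry and always reflects the shared root.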
Closes #172
Summary
- The oracle agent runs `solution/solve.sh` and never calls an LLM, but `resolve_agent_env()` was validating API keys for the CLI's default model (claude-haiku-4-5-20251001)
- `bench eval create -a oracle` now works without `ANTHROPIC_API_KEY` set
- Skips validation when `agent == "oracle"`

Test plan
- `test_resolve_env_helpers.py` and `test_subscription_auth.py` pass
- `bench eval create -t <task> -a oracle` without `ANTHROPIC_API_KEY`: should succeed
- `bench eval create -t <task> -a claude-agent-acp`: API key validation still works as before