feat: wire sandbox_setup_timeout through all configs #180
Merged
xdotli merged 13 commits into benchflow-ai:dev-0.3 on Apr 25, 2026
release: benchflow 0.3.0 — Scene lifecycle, multi-agent, CLI redesign
The oracle agent runs solution/solve.sh and never calls an LLM, but resolve_agent_env() was validating API keys for whatever model the CLI defaulted to (claude-haiku-4-5-20251001). This made `bench eval create -a oracle` fail without ANTHROPIC_API_KEY set, even though oracle doesn't need it.
Move the fix from resolve_agent_env to the CLI layer: oracle runs solve.sh and never calls an LLM, so it should not receive DEFAULT_MODEL at all. Both _run_single and _run_batch now pass model=None for oracle. Widen JobConfig.model to str | None to support this.
…key-check fix: skip model/API-key validation for oracle agent
PR benchflow-ai#173 moved the oracle/DEFAULT_MODEL guard from resolve_agent_env to cli/eval.py, but cli/eval.py is orphaned (never imported into the live CLI), so `bench eval create` still passes DEFAULT_MODEL to oracle and trips ANTHROPIC_API_KEY validation.

Three changes:
- Restore the `agent != "oracle"` guard in resolve_agent_env so the chokepoint defends against any caller that forwards a model.
- Delete the orphan cli/eval.py and its tests; the live eval_create lives in cli/main.py and was the actual code path users hit.
- Add effective_model(agent, model) helper, change JobConfig.model default to None, replace seven `model or DEFAULT_MODEL` sites in cli/main.py and job.py YAML loaders so oracle gets honest model=None end-to-end (in result/summary JSON, prints, and downstream Trial).

Regression test in test_resolve_env_helpers.py pins the chokepoint by calling resolve_agent_env("oracle", DEFAULT_MODEL, {}) with no API key and no host auth. It was verified to fail on main with the user-facing ANTHROPIC_API_KEY error and pass after the fix.
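The helper's behavior, as the commit messages describe it (oracle drops the model, other agents fall back to DEFAULT_MODEL, an empty string counts as unset), can be sketched roughly as follows. This is an illustrative reconstruction, not the code merged in the PR; the body and the constant's placement are assumptions, and only the name `effective_model(agent, model)` and the default model string come from the text above.

```python
from __future__ import annotations

# Assumed value: the default model named in this PR's commit messages.
DEFAULT_MODEL = "claude-haiku-4-5-20251001"


def effective_model(agent: str, model: str | None) -> str | None:
    """Return the model an agent should actually run with (illustrative sketch)."""
    if agent == "oracle":
        # oracle runs solution/solve.sh and never calls an LLM, so it gets no model.
        return None
    # `or` treats both None and "" as unset, so either falls back to the default.
    return model or DEFAULT_MODEL
```

Centralizing the rule in one function is what lets the seven `model or DEFAULT_MODEL` call sites share a single oracle exception instead of each needing its own guard.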
Bundle 14 tests in tests/test_oracle_chokepoint.py that pin each layer of the prior fix at the right altitude:
- TestOrphanRemoval: cli/eval.py is gone (ModuleNotFoundError) and no src/ file references benchflow.cli.eval, guarding against a future re-introduction that could swallow the next bug fix the same way.
- TestEvalCreateRouting: `bench eval create` callback lives in cli/main.py:eval_create. Pins the architectural fact PR benchflow-ai#173 missed.
- TestEffectiveModel: unit tests for the helper: oracle drops model, non-oracle falls back to DEFAULT_MODEL, empty string treated as unset.
- TestOracleYamlLoaders: Job.from_yaml(oracle config) → model is None for both native and Harbor formats; non-oracle backwards-compat preserved.
- TestEvalCreateOracleCLI: end-to-end: live eval_create(agent="oracle") with no API key in env does not raise. Mocks Trial.create and resets the asyncio loop after to avoid polluting pre-existing tests that use the deprecated asyncio.get_event_loop() pattern.

Verified to fail on main in the right shape: 9 of 14 fail (each pinning a deleted/added behavior), 5 pass (asserting structural facts already true). The CLI test fails on main with the user-reported error "ANTHROPIC_API_KEY required for model 'claude-haiku-4-5-20251001'…".
The previous commit deleted cli/eval.py and its tests as orphans, but they are intentionally kept. Restore both from main, update eval.py to use the effective_model() helper for the oracle chokepoint fix, and replace the "module is gone" regression test with a guard that cli/main.py does not import cli/eval (the actual invariant).
…t-and-cleanup fix: oracle chokepoint guard + effective_model helper
`setup_sandbox_user()` already accepted a `timeout_sec` kwarg (default 120s) but no live call site surfaced it, so the knob was unreachable for normal runs. Under heavy sandbox bootstrap (parallel containers copying large tool caches into /home/<sandbox_user>) the 120s cap was hit with no user override.

Add `sandbox_setup_timeout: int = 120` to TrialConfig, JobConfig, and RuntimeConfig, and forward it through:
- trial YAML (`trial_config_from_dict`)
- job YAML (both native and Harbor-compatible loaders)
- `SDK.run(..., sandbox_setup_timeout=...)`
- `bench eval create --sandbox-setup-timeout`
- `Trial.install_agent()` into both `setup_sandbox_user()` call sites (oracle + normal agent)

The value is also recorded in the run's `config.json` snapshot to aid post-hoc diagnosis. Default stays at 120s: this change is about making the value configurable, not changing runtime behavior.
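The wiring pattern in this commit can be sketched in miniature as below. Everything here is a trimmed-down, hypothetical stand-in: the real `TrialConfig`, `trial_config_from_dict`, `setup_sandbox_user()`, and `Trial.install_agent()` have more fields and different signatures, and the `agent` field and `install_agent` free function exist only for illustration. The sketch only shows a config field with a 120s default being read from a parsed YAML dict and forwarded as `timeout_sec`.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any

DEFAULT_SANDBOX_SETUP_TIMEOUT = 120  # seconds; the default stated in this PR


@dataclass
class TrialConfig:
    # Hypothetical, trimmed-down stand-in for the real TrialConfig.
    agent: str
    sandbox_setup_timeout: int = DEFAULT_SANDBOX_SETUP_TIMEOUT


def trial_config_from_dict(raw: dict[str, Any]) -> TrialConfig:
    """Build a config from parsed YAML, keeping the default when the key is absent."""
    return TrialConfig(
        agent=raw["agent"],
        sandbox_setup_timeout=int(
            raw.get("sandbox_setup_timeout", DEFAULT_SANDBOX_SETUP_TIMEOUT)
        ),
    )


def setup_sandbox_user(*, timeout_sec: int = 120) -> int:
    # Stand-in for the real function: it only echoes the timeout it was given.
    return timeout_sec


def install_agent(config: TrialConfig) -> int:
    # Forward the configured value instead of relying on the callee's default.
    return setup_sandbox_user(timeout_sec=config.sandbox_setup_timeout)
```

With this shape, a trial YAML that omits the key behaves exactly as before, while one that sets `sandbox_setup_timeout: 300` reaches the sandbox setup call with 300.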
EYH0602 added a commit to EYH0602/benchflow that referenced this pull request on Apr 23, 2026
This was referenced Apr 25, 2026
xdotli added a commit that referenced this pull request on Apr 25, 2026
Brings 126 ruff errors → 0 so CI's lint check goes green and unblocks the 5 PRs targeting dev-0.3 (#176, #180, #181, #182, #191) that were landing on top of pre-existing repo lint debt.

What changed:

1. Auto-fixes via `ruff check --fix --unsafe-fixes`:
   - 40 F401 unused-imports across src/, tests/, examples/
   - 8 I001 unsorted-imports
   - 6 UP037 quoted-annotations modernized
   - Other auto-fixable rules
2. Hand fixes:
   - src/benchflow/__init__.py: removed `Trial` from the `from harbor` re-export block (it was shadowed by `from benchflow.trial import Trial` at line 65, which is the canonical public Trial). Added `trial_config_from_yaml` to __all__.
   - src/benchflow/process.py: 3x `raise ConnectionError(...) from e` for B904 (errors raised inside except clauses).
   - src/benchflow/mcp/reviewer_server.py: same B904 fix for fastmcp ImportError reraise.
   - tests/test_skill_eval.py: raw string for `pytest.raises(match=...)` pattern (RUF043).
   - 3 files: replaced `×` (Unicode multiplication sign) in comments and f-strings with `x` (latin x) to clear RUF001/RUF003.
3. Per-file ignores added to pyproject.toml `[tool.ruff.lint.per-file-ignores]`:
   - `experiments/*.py` and `tests/conformance/*.py` ignore E402: these are standalone scripts that legitimately set sys.path before importing.
   - `src/benchflow/runtime.py` ignores F821: it uses forward references resolved by `from __future__ import annotations`; explicit TYPE_CHECKING imports would force eager loads.

No code behavior changes. 580 tests pass; the 8 pre-existing failures (env-leak between subscription auth tests, Docker compose env, judge model default mismatch) are unrelated to this PR.
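For readers unfamiliar with B904, the hand fix in process.py follows the general pattern sketched here. The function names and messages are hypothetical and do not come from benchflow; the only thing taken from the commit message is the `raise ConnectionError(...) from e` form.

```python
def connect(host: str) -> None:
    """Hypothetical helper that always fails at the socket level."""
    raise OSError(f"cannot reach {host}")


def open_session(host: str) -> None:
    try:
        connect(host)
    except OSError as e:
        # `from e` sets __cause__, so the traceback reports the original error
        # as the direct cause instead of an unrelated error during handling.
        raise ConnectionError(f"session to {host} failed") from e
```

Without `from e`, Python still attaches the original exception as implicit context, but the traceback then reads as if the second error happened by accident while handling the first, which is what the lint rule flags.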
# Conflicts:
#	src/benchflow/_agent_env.py
#	src/benchflow/cli/eval.py
#	tests/test_oracle_chokepoint.py
Summary
- `setup_sandbox_user()` already accepted a `timeout_sec` kwarg (default 120s), but no live call site surfaced it, so the knob was unreachable for normal runs. Under heavy sandbox bootstrap (parallel containers copying large tool caches into `/home/<sandbox_user>`) users hit the 120s cap with no override.
- Adds `sandbox_setup_timeout: int = 120` to `TrialConfig`, `JobConfig`, and `RuntimeConfig`, and forwards it through every live config entry point: trial YAML, job YAML (native + Harbor), `SDK.run()`, `bench eval create --sandbox-setup-timeout`, and both `setup_sandbox_user()` call sites in `Trial.install_agent()` (oracle + normal agent).
- Records the value in the run's `config.json` snapshot for post-hoc diagnosis.

Commits

- bc7e841 feat: wire sandbox setup timeout through configs
- 055f605 test: cover sandbox setup timeout wiring
- db9d99a docs: document sandbox setup timeout