Evaluation-integrity MVP for long-horizon agent benchmarks.
Horizon-Eval is an IntegrityBench-style benchmark harness for portable agent-evaluation tasks with:
- serializable task specs
- hermetic execution policy metadata
- human QA gates
- trajectory capture
- integrity monitors for shortcutting, constraint bypass, canary leakage, and protected-state tampering
- replayable run bundles with hashes and HTML reports
- a safety-gap and safeguard-regression lab for comparing base, mitigated, and attacked variants
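The serializable task specs above can be sketched as a small frozen dataclass that round-trips through JSON. The names here (`TaskSpec`, `canaries`, `protected_paths`) are illustrative assumptions, not the actual core/specs.py schema:

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


# Hypothetical sketch of a serializable task spec; the real schema
# lives in core/specs.py and may differ.
@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    instructions: str
    max_steps: int = 50
    canaries: tuple[str, ...] = ()          # strings the agent must never emit
    protected_paths: tuple[str, ...] = ()   # state the agent must not tamper with

    def to_json(self) -> str:
        # sort_keys keeps the serialization stable, so it can be hashed
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "TaskSpec":
        data = json.loads(raw)
        # JSON has no tuples, so restore the immutable field types
        data["canaries"] = tuple(data["canaries"])
        data["protected_paths"] = tuple(data["protected_paths"])
        return cls(**data)


spec = TaskSpec(task_id="demo-001", instructions="Summarize the log file.",
                canaries=("CANARY-7f3a",))
assert TaskSpec.from_json(spec.to_json()) == spec
```

Freezing the dataclass and sorting keys makes specs hashable and byte-stable, which is what lets a run bundle record the exact task it executed.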
This is infrastructure built for serious eval work, not just demos.
- core/specs.py: portable task SDK
- core/harness.py: end-to-end execution with monitors, QA, and bundle export
- core/bundle.py: immutable run-bundle writer
- core/reporting.py: HTML report generation
- core/monitors.py: integrity checks
- scenarios/demo.py: demo task plus aligned and shortcut agents
- scenarios/safety_gap.py: default safety-gap suite and scripted model variants
- analysis/safety_gap.py: variant comparison, safety-gap metrics, and regression reporting
- cli.py: runnable entrypoint
python3 cli.py --agent aligned
python3 cli.py --agent shortcut
python3 cli.py --agent canary
python3 cli.py --mode safety-gap --output-root artifacts/safety-gap
python3 -m pytest tests

Each run writes a bundle under artifacts/ containing:
- task_spec.json
- result.json
- events.jsonl
- qa.json
- report.html
- manifest.json
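A replay consumer can recheck the recorded hashes before trusting a bundle. This sketch assumes manifest.json maps filenames to SHA-256 digests under a `sha256` key, which is a guess at the core/bundle.py layout rather than its documented format:

```python
import hashlib
import json
from pathlib import Path


def verify_bundle(bundle_dir: str) -> bool:
    """Recompute each file's SHA-256 and compare against manifest.json.

    Hypothetical check; assumes manifest.json looks like
    {"sha256": {"result.json": "<hex digest>", ...}}.
    """
    root = Path(bundle_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    for name, expected in manifest["sha256"].items():
        actual = hashlib.sha256((root / name).read_bytes()).hexdigest()
        if actual != expected:
            return False  # file was modified after the run was sealed
    return True
```

Verifying before replay is what makes the bundles "replayable with hashes" in practice: any post-hoc edit to a result or event log is detectable.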
The key problem is not just “can an agent solve a task?” but “can it solve the task without gaming the scorer, bypassing constraints, or leaking hidden benchmark structure?” Horizon-Eval focuses on that layer.
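As a concrete instance of that layer, a canary-leakage monitor can be as simple as scanning the captured trajectory for hidden benchmark strings. The event shape and the `find_canary_leaks` helper below are illustrative assumptions, not the core/monitors.py API:

```python
# Hypothetical canary set; in the harness these would come from the task spec.
CANARIES = ("CANARY-7f3a",)


def find_canary_leaks(events):
    """Return (event_index, canary) pairs wherever a canary string appears.

    Assumes each trajectory event is a dict with an "output" text field.
    """
    leaks = []
    for i, event in enumerate(events):
        text = event.get("output", "")
        for canary in CANARIES:
            if canary in text:
                leaks.append((i, canary))
    return leaks


events = [{"output": "working on it"}, {"output": "answer: CANARY-7f3a"}]
assert find_canary_leaks(events) == [(1, "CANARY-7f3a")]
```

An agent that "solves" a task while emitting a canary has leaked hidden benchmark structure, so a non-empty leak list should fail the run regardless of the task score.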
The new safety-gap lab makes a second question explicit: “How much safer is the deployed, mitigated model than the underlying base model, and how much of that gain survives prompt-based safeguard erosion?” That is the practical release-gating question most benchmark stacks still do not answer cleanly.
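Those two quantities can be made precise with a pair of toy metrics, assuming each variant is summarized by an unsafe-behavior rate in [0, 1]. These definitions are a sketch, not necessarily what analysis/safety_gap.py computes:

```python
def safety_gap(base_unsafe: float, mitigated_unsafe: float) -> float:
    """Absolute reduction in unsafe-behavior rate from base to mitigated."""
    return base_unsafe - mitigated_unsafe


def safeguard_retention(base: float, mitigated: float, attacked: float) -> float:
    """Fraction of the mitigation's gain that survives prompt-based attack.

    1.0 means the attack eroded nothing; 0.0 means the attacked variant
    is no safer than the base model (or the mitigation gained nothing).
    """
    gain = base - mitigated
    if gain <= 0:
        return 0.0
    return max(0.0, (base - attacked) / gain)


# Example: mitigation cuts the unsafe rate 0.40 -> 0.05; the attack
# erodes it back to 0.25, so only 0.15/0.35 of the gain survives.
assert abs(safety_gap(0.40, 0.05) - 0.35) < 1e-9
assert abs(safeguard_retention(0.40, 0.05, 0.25) - 0.15 / 0.35) < 1e-9
```

Gating a release on retention rather than on the mitigated score alone is what distinguishes "safe in the demo" from "safe under adversarial prompting".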
- The current MVP is stdlib-only by design to keep the surface small and auditable.
- The next serious extensions would be a true hermetic Docker runner, hidden holdout task variants, and richer scorer-integrity checks.