A lightweight but production-oriented evaluation harness for experimenting with frontier models.
This repository grew out of a practical need: running consistent capability and safety evaluations across rapidly changing model variants without rewriting pipelines every week. The focus is reproducibility, statistical rigor, and iteration speed, with minimal infrastructure overhead.
If you are benchmarking checkpoints, comparing providers, or tracking regressions across releases, this is designed to be a clean evaluation spine you can rely on long term.
In practice, model evaluation workflows degrade over time:
- Scripts fork and drift across experiments
- Dataset versions become ambiguous
- Results are compared informally without statistical grounding
This harness addresses those issues without becoming a heavyweight platform.
Key properties:
- Deterministic run metadata with suite hashing
- Modular provider and scoring abstractions
- Paired statistical comparisons with confidence intervals
- Human-readable HTML reports
- Small dependency surface and fast startup
The goal is not to replace large benchmark ecosystems. It is to provide a reliable evaluation backbone that scales from local experiments to research infrastructure.
Frontier model development moves fast. Evaluation often does not.
This harness is built around three principles:
1. Single prompts do not constitute evidence. Every claim should be tied to a versioned prompt set and reproducible metrics.
2. Naive average comparisons are noisy and misleading. This harness uses paired comparisons and bootstrap confidence intervals over per-case deltas to reduce false conclusions.
3. Prompt sets are versioned, hashed, and embedded in run artifacts. Results are reproducible across machines and time.
Evaluation should be treated with the same rigor as model training.
What this is:

- A reproducible evaluation spine for capability, safety, and tool-use testing
- A regression detection framework with statistical grounding
- A lightweight foundation that integrates into larger research stacks
- Suitable for both local experimentation and production validation workflows
What this is not:

- A leaderboard platform
- A benchmark aggregation suite
- A model hosting system
- A training framework
The emphasis is disciplined evaluation, not feature breadth.
```bash
pip install -e .
python -m frontier_evals run --model dummy --out out/run.json --html out/report.html
frontier-evals compare --a out/run.json --b out/run.json --html out/compare.html
```

A full run with explicit suites and seed:

```bash
frontier-evals run \
  --model dummy \
  --suites capability,safety,tool_use \
  --seed 0 \
  --out out/run.json \
  --html out/report.html
```

Model backends:
- `--model dummy`: deterministic local reference model for pipeline validation.
- `--model hf --hf_model gpt2`: local HuggingFace model. Requires `pip install -e .[hf]`.
- `--model openai_http --base_url http://localhost:8000/v1 --openai_model your-model --endpoint chat|responses`: OpenAI-compatible HTTP endpoint, compatible with hosted and self-served stacks.
```bash
frontier-evals compare \
  --a out/run_a.json \
  --b out/run_b.json \
  --html out/compare.html
```

The comparison pipeline reports:
- Category-level deltas
- Overall score differences
- Paired bootstrap confidence intervals over per-case changes
- Overlap-aware statistics when case IDs match
This makes regression detection materially more reliable than simple averages.
Comparisons between runs use paired bootstrap resampling over per-case score deltas when case IDs overlap.
Given two runs A and B:
1. Compute per-case differences: `delta_i = score_A_i - score_B_i`
2. Resample the deltas with replacement
3. Compute confidence intervals over the mean delta
This approach:
- Reduces variance relative to unpaired comparisons
- Accounts for shared prompt structure
- Produces confidence intervals that better reflect true regression risk
By default, 2000 bootstrap iterations are used. This is configurable.
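The paired bootstrap described above can be sketched as follows. The function and parameter names here are illustrative, not the harness's actual API; it assumes scores are already aligned by case ID.

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, iters=2000, alpha=0.05, seed=0):
    """Percentile confidence interval for the mean per-case delta
    between two paired runs. Illustrative sketch only."""
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed keeps the CI reproducible
    means = []
    for _ in range(iters):
        # Resample the per-case deltas with replacement
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return sum(deltas) / n, (lo, hi)
```

Because the resampling operates on per-case deltas rather than on the two score distributions separately, shared prompt difficulty cancels out, which is what makes the paired comparison lower-variance than an unpaired one.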
The architecture separates evaluation concerns so components can evolve independently:
Suites → Model Provider → Scorer → Run Artifact → Report
Core components:
- Suites: JSONL datasets under `frontier_evals/suites/`, one case per line.
- Suite versioning: each suite has a manifest (`*.manifest.json`). Runs embed a `suite_hash` to guarantee dataset integrity.
- Models: provider adapters live in `frontier_evals/models/`.
- Scorers: modular scoring logic in `frontier_evals/scorers.py`.
- Schema: stable dataclasses in `frontier_evals/schema.py` define run artifacts.
- Reporting: HTML and summaries generated in `frontier_evals/report.py`.
The structure is intentionally compact. The entire codebase should be readable in a short sitting.
```json
{
  "meta": {
    "model": "hf:gpt2",
    "suite_hash": "3f8a2c9d1b7e4a10",
    "seed": 0,
    "created_utc": "2026-02-19T18:14:02Z"
  },
  "summary": {
    "overall": {
      "mean_score": 0.73,
      "ci_mean": {"lo": 0.68, "hi": 0.77}
    }
  },
  "results": [
    {
      "case_id": "cap_math_1",
      "score": 1.0,
      "latency_ms": 42.1
    }
  ]
}
```

Each run is self-contained and auditable.
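Because run artifacts are plain JSON, they are easy to audit programmatically. A minimal sketch, using the field names from the example above (the helper name is hypothetical):

```python
import json

def load_summary(path: str) -> dict:
    """Load a run artifact and pull out the fields most useful for
    auditing: model, suite hash, headline score, and case count."""
    with open(path) as f:
        run = json.load(f)
    return {
        "model": run["meta"]["model"],
        "suite_hash": run["meta"]["suite_hash"],
        "mean_score": run["summary"]["overall"]["mean_score"],
        "n_cases": len(run["results"]),
    }
```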
Designed for:
- Checkpoint comparison during rapid iteration
- Safety boundary testing and alignment regression tracking
- Tool-use reliability validation in agentic systems
- Reproducible academic experimentation
It is particularly well-suited for small research teams that need rigor without heavy infrastructure.
Each run artifact captures:
- Model configuration
- Suite manifest hash
- Random seed
- Per-case outputs and scores
- Aggregated metrics
This enables experiment replay and auditability across environments.
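The suite manifest hash captured above could be computed along these lines. This is a sketch under the assumption that the hash covers the suite file's raw bytes; the helper name is hypothetical and the harness's actual manifest logic may differ.

```python
import hashlib

def suite_hash(jsonl_path: str, length: int = 16) -> str:
    """Hypothetical sketch: hash a suite's raw bytes so that any edit
    to the prompt set changes the hash embedded in run artifacts."""
    h = hashlib.sha256()
    with open(jsonl_path, "rb") as f:
        # Stream in chunks so large suites do not need to fit in memory
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()[:length]
```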
Typical extension paths are straightforward.
To add a suite:

- Create a JSONL file in `frontier_evals/suites/`
- Add a manifest file
- Register the suite in the CLI

To add a provider:

- Implement a provider in `frontier_evals/models/`
- Map CLI arguments to configuration
- Return standardized outputs
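A provider adapter might look roughly like this. The class and method names are illustrative, not the harness's actual interface; the real output schema lives in `frontier_evals/schema.py` and may differ.

```python
import time
from dataclasses import dataclass

@dataclass
class Completion:
    # Illustrative standardized output: generated text plus latency
    text: str
    latency_ms: float

class EchoProvider:
    """Toy provider that returns the prompt unchanged. A real adapter
    would call a model API here and translate its response into a
    Completion, so every backend produces the same output shape."""

    def __init__(self, prefix: str = ""):
        self.prefix = prefix

    def generate(self, prompt: str) -> Completion:
        start = time.perf_counter()
        text = self.prefix + prompt  # stand-in for an actual model call
        return Completion(text=text, latency_ms=(time.perf_counter() - start) * 1000)
```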
To add a scorer, extend `frontier_evals/scorers.py` and register the scorer.
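As a sketch, an exact-match scorer could look like the following. The function name and signature are illustrative, not the registry's actual contract.

```python
def exact_match_scorer(output: str, expected: str) -> float:
    """Return 1.0 if the normalized model output matches the expected
    answer exactly, else 0.0. Normalization here is only case folding
    and whitespace stripping; real scorers may normalize further."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
```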
See `examples/quickstart.sh` for runnable end-to-end commands.
Planned extensions include:
- Long-context stress testing suites
- Adversarial prompt generation
- Cost and latency benchmarking modules
- Multi-model leaderboard mode
- Plugin-based scorer registry
- Async and distributed execution backends
Contributions aligned with rigorous evaluation methodology are welcome.
- Assumes case independence for statistical comparisons
- Reporting is optimized for inspection rather than dashboards
- No built-in orchestration layer for distributed runs
These can be layered externally if required.
MIT
Contributions are welcome, especially:
- New evaluation suites
- Additional provider adapters
- Statistical analysis improvements
- Reporting and visualization upgrades
If this is useful in research or production workflows, feel free to reference the repository.