frontier-evals-harness

A lightweight but production-oriented evaluation harness for experimenting with frontier models.

This repository grew out of a practical need: running consistent capability and safety evaluations across rapidly changing model variants without rewriting pipelines every week. The focus is reproducibility, statistical rigor, and iteration speed, with minimal infrastructure overhead.

If you are benchmarking checkpoints, comparing providers, or tracking regressions across releases, this is designed to be a clean evaluation spine you can rely on long term.


Why this exists

In practice, model evaluation workflows degrade over time:

  • Scripts fork and drift across experiments
  • Dataset versions become ambiguous
  • Results are compared informally without statistical grounding

This harness addresses those issues without becoming a heavyweight platform.

Key properties:

  • Deterministic run metadata with suite hashing
  • Modular provider and scoring abstractions
  • Paired statistical comparisons with confidence intervals
  • Human-readable HTML reports
  • Small dependency surface and fast startup

The goal is not to replace large benchmark ecosystems. It is to provide a reliable evaluation backbone that scales from local experiments to research infrastructure.


Evaluation Philosophy

Frontier model development moves fast. Evaluation often does not.

This harness is built around three principles:

1. Measurement over anecdotes

Single prompts do not constitute evidence. Every claim should be tied to a versioned prompt set and reproducible metrics.

2. Regressions must be statistically grounded

Naive average comparisons are noisy and misleading. This harness uses paired comparisons and bootstrap confidence intervals over per-case deltas to reduce false conclusions.

3. Evaluation is infrastructure, not a script

Prompt sets are versioned, hashed, and embedded in run artifacts. Results are reproducible across machines and time.

Evaluation should be treated with the same rigor as model training.


What This Is

  • A reproducible evaluation spine for capability, safety, and tool-use testing
  • A regression detection framework with statistical grounding
  • A lightweight foundation that integrates into larger research stacks
  • Suitable for both local experimentation and production validation workflows

What This Is Not

  • A leaderboard platform
  • A benchmark aggregation suite
  • A model hosting system
  • A training framework

The emphasis is on disciplined evaluation, not feature breadth.


Quickstart

pip install -e .
python -m frontier_evals run --model dummy --out out/run.json --html out/report.html
frontier-evals compare --a out/run.json --b out/run.json --html out/compare.html

CLI Usage

Run an evaluation

frontier-evals run \
  --model dummy \
  --suites capability,safety,tool_use \
  --seed 0 \
  --out out/run.json \
  --html out/report.html

Model backends:

  • --model dummy: deterministic local reference model for pipeline validation.

  • --model hf --hf_model gpt2: local Hugging Face model. Requires pip install -e .[hf]

  • --model openai_http --base_url http://localhost:8000/v1 --openai_model your-model --endpoint chat|responses: generic OpenAI-compatible HTTP endpoint, usable with both hosted and self-hosted stacks.


Compare two runs

frontier-evals compare \
  --a out/run_a.json \
  --b out/run_b.json \
  --html out/compare.html

The comparison pipeline reports:

  • Category-level deltas
  • Overall score differences
  • Paired bootstrap confidence intervals over per-case changes
  • Overlap-aware statistics when case IDs match

This makes regression detection materially more reliable than simple averages.


Statistical Methodology

Comparisons between runs use paired bootstrap resampling over per-case score deltas when case IDs overlap.

Given two runs A and B:

  1. Compute per-case differences delta_i = score_A_i - score_B_i

  2. Resample the deltas with replacement

  3. Compute confidence intervals over the mean delta

This approach:

  • Reduces variance relative to unpaired comparisons
  • Accounts for shared prompt structure
  • Produces confidence intervals that better reflect true regression risk

By default, 2000 bootstrap iterations are used. This is configurable.
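The three steps above can be sketched in a few lines of Python. This is a simplified illustration of paired bootstrap resampling, not the harness's actual implementation; the function name and defaults are hypothetical:

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, iters=2000, alpha=0.05, seed=0):
    """Confidence interval over the mean per-case delta between two runs
    with matched case IDs (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: per-case differences delta_i = score_A_i - score_B_i.
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(iters):
        # Step 2: resample the deltas with replacement, keeping pairing intact.
        sample = [rng.choice(deltas) for _ in deltas]
        means.append(sum(sample) / len(sample))
    # Step 3: percentile interval over the resampled mean deltas.
    means.sort()
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

# Example: run A is consistently ~0.1 better per case, so the interval
# is tight around 0.1 despite the small sample.
a = [0.9, 0.8, 1.0, 0.7, 0.85]
b = [0.8, 0.7, 0.9, 0.6, 0.75]
lo, hi = paired_bootstrap_ci(a, b)
```

Because resampling operates on per-case deltas rather than on the two score lists independently, between-case variance cancels out, which is why the paired interval is tighter than an unpaired one.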


Design Overview

The architecture separates evaluation concerns so components can evolve independently:

Suites → Model Provider → Scorer → Run Artifact → Report

Core components:

  • Suites: JSONL datasets under frontier_evals/suites/, one case per line.

  • Suite Versioning: each suite has a manifest (*.manifest.json). Runs embed a suite_hash to guarantee dataset integrity.

  • Models: provider adapters live in frontier_evals/models/.

  • Scorers: modular scoring logic in frontier_evals/scorers.py.

  • Schema: stable dataclasses in frontier_evals/schema.py define run artifacts.

  • Reporting: HTML reports and summaries generated in frontier_evals/report.py.

The structure is intentionally compact. The entire codebase should be readable in a short sitting.
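One simple way to realize a suite_hash guarantee is to hash the raw bytes of the suite file. The sketch below is an assumption about how such a hash could work, not the harness's actual scheme:

```python
import hashlib

def suite_hash(data: bytes, length: int = 16) -> str:
    """Deterministic short hash over raw suite bytes; any edit changes it."""
    return hashlib.sha256(data).hexdigest()[:length]

# A single hypothetical JSONL case; real suites have many lines.
suite = b'{"case_id": "cap_math_1", "prompt": "2+2=?", "answer": "4"}\n'
h = suite_hash(suite)
```

Embedding this value in every run artifact means two runs can only be meaningfully compared when their suite hashes match.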


Example Run Artifact (truncated)

{
  "meta": {
    "model": "hf:gpt2",
    "suite_hash": "3f8a2c9d1b7e4a10",
    "seed": 0,
    "created_utc": "2026-02-19T18:14:02Z"
  },
  "summary": {
    "overall": {
      "mean_score": 0.73,
      "ci_mean": {"lo": 0.68, "hi": 0.77}
    }
  },
  "results": [
    {
      "case_id": "cap_math_1",
      "score": 1.0,
      "latency_ms": 42.1
    }
  ]
}

Each run is self-contained and auditable.
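An artifact like the one above can be audited mechanically. The check below is a hypothetical sketch using the field names from the truncated example; a real artifact may carry more:

```python
def check_artifact(run: dict) -> None:
    """Minimal structural audit of a run artifact dict (illustrative)."""
    meta = run["meta"]
    # Reproducibility requires at least a suite hash and a seed.
    assert meta["suite_hash"] and "seed" in meta, "run is not reproducible"
    overall = run["summary"]["overall"]
    ci = overall["ci_mean"]
    # The point estimate should sit inside its own confidence interval.
    assert ci["lo"] <= overall["mean_score"] <= ci["hi"]

artifact = {
    "meta": {"model": "hf:gpt2", "suite_hash": "3f8a2c9d1b7e4a10", "seed": 0},
    "summary": {"overall": {"mean_score": 0.73,
                            "ci_mean": {"lo": 0.68, "hi": 0.77}}},
    "results": [{"case_id": "cap_math_1", "score": 1.0, "latency_ms": 42.1}],
}
check_artifact(artifact)
```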


Intended Use Cases

Designed for:

  • Checkpoint comparison during rapid iteration
  • Safety boundary testing and alignment regression tracking
  • Tool-use reliability validation in agentic systems
  • Reproducible academic experimentation

It is particularly well-suited for small research teams that need rigor without heavy infrastructure.


Reproducibility

Each run artifact captures:

  • Model configuration
  • Suite manifest hash
  • Random seed
  • Per-case outputs and scores
  • Aggregated metrics

This enables experiment replay and auditability across environments.


Extending the Harness

Typical extension paths are straightforward.

Add a new evaluation suite

  1. Create a JSONL file in frontier_evals/suites/
  2. Add a manifest file
  3. Register the suite in the CLI
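A suite file is one JSON object per line. The sketch below writes a hypothetical two-case suite and a minimal manifest; the manifest fields shown are an assumption, so check the existing *.manifest.json files for the real schema:

```python
import hashlib
import json
import os
import tempfile

cases = [
    {"case_id": "cap_math_1", "prompt": "What is 2 + 2?", "expected": "4"},
    {"case_id": "cap_math_2", "prompt": "What is 3 * 3?", "expected": "9"},
]

# Stand-in directory for frontier_evals/suites/.
suite_dir = tempfile.mkdtemp()
suite_path = os.path.join(suite_dir, "capability_mini.jsonl")
with open(suite_path, "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")  # one case per line

# Minimal manifest: name, case count, and a content hash for integrity.
with open(suite_path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()[:16]
manifest = {"name": "capability_mini",
            "num_cases": len(cases),
            "suite_hash": digest}
with open(os.path.join(suite_dir, "capability_mini.manifest.json"), "w") as f:
    json.dump(manifest, f, indent=2)
```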

Add a new model provider

  1. Implement a provider in frontier_evals/models/
  2. Map CLI arguments to configuration
  3. Return standardized outputs
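Conceptually, a provider only needs to turn a prompt into a standardized output. The toy adapter below is illustrative; the base interface and field names are assumptions, not the harness's real API in frontier_evals/models/:

```python
import time
from dataclasses import dataclass

@dataclass
class ModelOutput:
    """Standardized output every provider returns (illustrative schema)."""
    text: str
    latency_ms: float

class EchoProvider:
    """Toy provider: deterministic echo, useful for wiring tests."""

    def __init__(self, prefix: str = "echo: "):
        self.prefix = prefix

    def generate(self, prompt: str) -> ModelOutput:
        start = time.perf_counter()
        text = self.prefix + prompt  # a real provider would call a model here
        elapsed_ms = (time.perf_counter() - start) * 1000
        return ModelOutput(text=text, latency_ms=elapsed_ms)

out = EchoProvider().generate("What is 2 + 2?")
```

Keeping the output type fixed is what lets scorers and reports stay provider-agnostic.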

Add custom scoring

Extend frontier_evals/scorers.py and register the scorer.
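As a sketch, a scorer can be thought of as a function from (expected, actual) to a score in [0, 1]; the exact signatures in frontier_evals/scorers.py may differ:

```python
def exact_match_scorer(expected: str, actual: str) -> float:
    """1.0 if the normalized answers match exactly, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def contains_scorer(expected: str, actual: str) -> float:
    """Softer check: credit if the expected answer appears anywhere."""
    return 1.0 if expected.strip().lower() in actual.strip().lower() else 0.0

s1 = exact_match_scorer("4", " 4 ")            # → 1.0
s2 = contains_scorer("4", "The answer is 4.")  # → 1.0
s3 = exact_match_scorer("4", "five")           # → 0.0
```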


Example Scripts

See:

examples/quickstart.sh

for runnable end-to-end commands.


Roadmap

Planned extensions include:

  • Long-context stress testing suites
  • Adversarial prompt generation
  • Cost and latency benchmarking modules
  • Multi-model leaderboard mode
  • Plugin-based scorer registry
  • Async and distributed execution backends

Contributions aligned with rigorous evaluation methodology are welcome.


Limitations

  • Assumes case independence for statistical comparisons
  • Reporting is optimized for inspection rather than dashboards
  • No built-in orchestration layer for distributed runs

These can be layered externally if required.


License

MIT


Contributions

Contributions are welcome, especially:

  • New evaluation suites
  • Additional provider adapters
  • Statistical analysis improvements
  • Reporting and visualization upgrades

If this is useful in research or production workflows, feel free to reference the repository.

