frontier-evals-harness

A lightweight but production-oriented evaluation harness for experimenting with frontier models.

This repository grew out of a practical need: running consistent capability and safety evaluations across rapidly changing model variants without rewriting pipelines every week. The focus is reproducibility, statistical rigor, and iteration speed, with minimal infrastructure overhead.

If you are benchmarking checkpoints, comparing providers, or tracking regressions across releases, this is designed to be a clean evaluation spine you can rely on long term.


Why this exists

In practice, model evaluation workflows degrade over time:

  • Scripts fork and drift across experiments
  • Dataset versions become ambiguous
  • Results are compared informally without statistical grounding

This harness addresses those issues without becoming a heavyweight platform.

Key properties:

  • Deterministic run metadata with suite hashing
  • Modular provider and scoring abstractions
  • Paired statistical comparisons with confidence intervals
  • Human-readable HTML reports
  • Small dependency surface and fast startup

The goal is not to replace large benchmark ecosystems. It is to provide a reliable evaluation backbone that scales from local experiments to research infrastructure.


Evaluation Philosophy

Frontier model development moves fast. Evaluation often does not.

This harness is built around three principles:

1. Measurement over anecdotes

Single prompts do not constitute evidence. Every claim should be tied to a versioned prompt set and reproducible metrics.

2. Regressions must be statistically grounded

Naive average comparisons are noisy and misleading. This harness uses paired comparisons and bootstrap confidence intervals over per-case deltas to reduce false conclusions.

3. Evaluation is infrastructure, not a script

Prompt sets are versioned, hashed, and embedded in run artifacts. Results are reproducible across machines and time.

Evaluation should be treated with the same rigor as model training.


What This Is

  • A reproducible evaluation spine for capability, safety, and tool-use testing
  • A regression detection framework with statistical grounding
  • A lightweight foundation that integrates into larger research stacks
  • Suitable for both local experimentation and production validation workflows

What This Is Not

  • A leaderboard platform
  • A benchmark aggregation suite
  • A model hosting system
  • A training framework

The emphasis is on disciplined evaluation, not feature breadth.


Quickstart

pip install -e .
python -m frontier_evals run --model dummy --out out/run.json --html out/report.html
frontier-evals compare --a out/run.json --b out/run.json --html out/compare.html

CLI Usage

Run an evaluation

frontier-evals run \
  --model dummy \
  --suites capability,safety,tool_use \
  --seed 0 \
  --out out/run.json \
  --html out/report.html

Model backends:

  • --model dummy: deterministic local reference model for pipeline validation.

  • --model hf --hf_model gpt2: local Hugging Face model. Requires pip install -e .[hf]

  • --model openai_http --base_url http://localhost:8000/v1 --openai_model your-model --endpoint chat|responses: generic OpenAI-compatible HTTP endpoint, usable with both hosted and self-hosted stacks.


Compare two runs

frontier-evals compare \
  --a out/run_a.json \
  --b out/run_b.json \
  --html out/compare.html

The comparison pipeline reports:

  • Category-level deltas
  • Overall score differences
  • Paired bootstrap confidence intervals over per-case changes
  • Overlap-aware statistics when case IDs match

This makes regression detection materially more reliable than simple averages.


Statistical Methodology

Comparisons between runs use paired bootstrap resampling over per-case score deltas when case IDs overlap.

Given two runs A and B:

  1. Compute per-case differences delta_i = score_A_i - score_B_i

  2. Resample the deltas with replacement

  3. Compute confidence intervals over the mean delta

This approach:

  • Reduces variance relative to unpaired comparisons
  • Accounts for shared prompt structure
  • Produces confidence intervals that better reflect true regression risk

By default, 2000 bootstrap iterations are used. This is configurable.
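The three steps above can be sketched in a few lines of Python. This is a simplified illustration of paired bootstrap resampling, not the harness's actual implementation; the function name and defaults are hypothetical:

```python
import random

def paired_bootstrap_ci(scores_a, scores_b, iters=2000, alpha=0.05, seed=0):
    """Confidence interval over the mean per-case delta between two runs
    with matched case IDs (illustrative sketch)."""
    rng = random.Random(seed)
    # Step 1: per-case differences delta_i = score_A_i - score_B_i.
    deltas = [a - b for a, b in zip(scores_a, scores_b)]
    means = []
    for _ in range(iters):
        # Step 2: resample the deltas with replacement, keeping pairing intact.
        sample = [rng.choice(deltas) for _ in deltas]
        means.append(sum(sample) / len(sample))
    # Step 3: percentile interval over the resampled mean deltas.
    means.sort()
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

# Example: run A is consistently ~0.1 better per case, so the interval
# is tight around 0.1 despite the small sample.
a = [0.9, 0.8, 1.0, 0.7, 0.85]
b = [0.8, 0.7, 0.9, 0.6, 0.75]
lo, hi = paired_bootstrap_ci(a, b)
```

Because resampling operates on per-case deltas rather than on the two score lists independently, between-case variance cancels out, which is why the paired interval is tighter than an unpaired one.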


Design Overview

The architecture separates evaluation concerns so components can evolve independently:

Suites → Model Provider → Scorer → Run Artifact → Report

Core components:

  • Suites: JSONL datasets under frontier_evals/suites/, one case per line.

  • Suite Versioning: each suite has a manifest (*.manifest.json). Runs embed a suite_hash to guarantee dataset integrity.

  • Models: provider adapters live in frontier_evals/models/.

  • Scorers: modular scoring logic in frontier_evals/scorers.py.

  • Schema: stable dataclasses in frontier_evals/schema.py define run artifacts.

  • Reporting: HTML reports and summaries generated in frontier_evals/report.py.

The structure is intentionally compact. The entire codebase should be readable in a short sitting.
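One simple way to realize a suite_hash guarantee is to hash the raw bytes of the suite file. The sketch below is an assumption about how such a hash could work, not the harness's actual scheme:

```python
import hashlib

def suite_hash(data: bytes, length: int = 16) -> str:
    """Deterministic short hash over raw suite bytes; any edit changes it."""
    return hashlib.sha256(data).hexdigest()[:length]

# A single hypothetical JSONL case; real suites have many lines.
suite = b'{"case_id": "cap_math_1", "prompt": "2+2=?", "answer": "4"}\n'
h = suite_hash(suite)
```

Embedding this value in every run artifact means two runs can only be meaningfully compared when their suite hashes match.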


Example Run Artifact (truncated)

{
  "meta": {
    "model": "hf:gpt2",
    "suite_hash": "3f8a2c9d1b7e4a10",
    "seed": 0,
    "created_utc": "2026-02-19T18:14:02Z"
  },
  "summary": {
    "overall": {
      "mean_score": 0.73,
      "ci_mean": {"lo": 0.68, "hi": 0.77}
    }
  },
  "results": [
    {
      "case_id": "cap_math_1",
      "score": 1.0,
      "latency_ms": 42.1
    }
  ]
}

Each run is self-contained and auditable.
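An artifact like the one above can be audited mechanically. The check below is a hypothetical sketch using the field names from the truncated example; a real artifact may carry more:

```python
def check_artifact(run: dict) -> None:
    """Minimal structural audit of a run artifact dict (illustrative)."""
    meta = run["meta"]
    # Reproducibility requires at least a suite hash and a seed.
    assert meta["suite_hash"] and "seed" in meta, "run is not reproducible"
    overall = run["summary"]["overall"]
    ci = overall["ci_mean"]
    # The point estimate should sit inside its own confidence interval.
    assert ci["lo"] <= overall["mean_score"] <= ci["hi"]

artifact = {
    "meta": {"model": "hf:gpt2", "suite_hash": "3f8a2c9d1b7e4a10", "seed": 0},
    "summary": {"overall": {"mean_score": 0.73,
                            "ci_mean": {"lo": 0.68, "hi": 0.77}}},
    "results": [{"case_id": "cap_math_1", "score": 1.0, "latency_ms": 42.1}],
}
check_artifact(artifact)
```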


Intended Use Cases

Designed for:

  • Checkpoint comparison during rapid iteration
  • Safety boundary testing and alignment regression tracking
  • Tool-use reliability validation in agentic systems
  • Reproducible academic experimentation

It is particularly well-suited for small research teams that need rigor without heavy infrastructure.


Reproducibility

Each run artifact captures:

  • Model configuration
  • Suite manifest hash
  • Random seed
  • Per-case outputs and scores
  • Aggregated metrics

This enables experiment replay and auditability across environments.


Extending the Harness

Typical extension paths are straightforward.

Add a new evaluation suite

  1. Create a JSONL file in frontier_evals/suites/
  2. Add a manifest file
  3. Register the suite in the CLI
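A suite file is one JSON object per line. The sketch below writes a hypothetical two-case suite and a minimal manifest; the manifest fields shown are an assumption, so check the existing *.manifest.json files for the real schema:

```python
import hashlib
import json
import os
import tempfile

cases = [
    {"case_id": "cap_math_1", "prompt": "What is 2 + 2?", "expected": "4"},
    {"case_id": "cap_math_2", "prompt": "What is 3 * 3?", "expected": "9"},
]

# Stand-in directory for frontier_evals/suites/.
suite_dir = tempfile.mkdtemp()
suite_path = os.path.join(suite_dir, "capability_mini.jsonl")
with open(suite_path, "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")  # one case per line

# Minimal manifest: name, case count, and a content hash for integrity.
with open(suite_path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()[:16]
manifest = {"name": "capability_mini",
            "num_cases": len(cases),
            "suite_hash": digest}
with open(os.path.join(suite_dir, "capability_mini.manifest.json"), "w") as f:
    json.dump(manifest, f, indent=2)
```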

Add a new model provider

  1. Implement a provider in frontier_evals/models/
  2. Map CLI arguments to configuration
  3. Return standardized outputs
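Conceptually, a provider only needs to turn a prompt into a standardized output. The toy adapter below is illustrative; the base interface and field names are assumptions, not the harness's real API in frontier_evals/models/:

```python
import time
from dataclasses import dataclass

@dataclass
class ModelOutput:
    """Standardized output every provider returns (illustrative schema)."""
    text: str
    latency_ms: float

class EchoProvider:
    """Toy provider: deterministic echo, useful for wiring tests."""

    def __init__(self, prefix: str = "echo: "):
        self.prefix = prefix

    def generate(self, prompt: str) -> ModelOutput:
        start = time.perf_counter()
        text = self.prefix + prompt  # a real provider would call a model here
        elapsed_ms = (time.perf_counter() - start) * 1000
        return ModelOutput(text=text, latency_ms=elapsed_ms)

out = EchoProvider().generate("What is 2 + 2?")
```

Keeping the output type fixed is what lets scorers and reports stay provider-agnostic.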

Add custom scoring

Extend frontier_evals/scorers.py and register the scorer.
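As a sketch, a scorer can be thought of as a function from (expected, actual) to a score in [0, 1]; the exact signatures in frontier_evals/scorers.py may differ:

```python
def exact_match_scorer(expected: str, actual: str) -> float:
    """1.0 if the normalized answers match exactly, else 0.0."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

def contains_scorer(expected: str, actual: str) -> float:
    """Softer check: credit if the expected answer appears anywhere."""
    return 1.0 if expected.strip().lower() in actual.strip().lower() else 0.0

s1 = exact_match_scorer("4", " 4 ")            # → 1.0
s2 = contains_scorer("4", "The answer is 4.")  # → 1.0
s3 = exact_match_scorer("4", "five")           # → 0.0
```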


Example Scripts

See:

examples/quickstart.sh

for runnable end-to-end commands.


Roadmap

Planned extensions include:

  • Long-context stress testing suites
  • Adversarial prompt generation
  • Cost and latency benchmarking modules
  • Multi-model leaderboard mode
  • Plugin-based scorer registry
  • Async and distributed execution backends

Contributions aligned with rigorous evaluation methodology are welcome.


Limitations

  • Assumes case independence for statistical comparisons
  • Reporting is optimized for inspection rather than dashboards
  • No built-in orchestration layer for distributed runs

These can be layered externally if required.


License

MIT


Contributions

Contributions are welcome, especially:

  • New evaluation suites
  • Additional provider adapters
  • Statistical analysis improvements
  • Reporting and visualization upgrades

If this is useful in research or production workflows, feel free to reference the repository.

