Agentic LLM-as-judge evaluation harness for Mathlib (Lean 4) proof quality.
The dataset is a collection of Mathlib PRs, pairing each PR's first commit with its final one. The judge counts as successful on a pair when it rates the final accepted version of the PR above the initial rejected one.
The evaluation dataset is hosted on HuggingFace at SJCaldwell/proofjudge. Files are downloaded and cached automatically on first run via huggingface_hub.
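The success criterion above is a pairwise comparison over the dataset. A minimal sketch in Python (the `JudgedPair` fields and score scale are hypothetical illustrations, not the harness's actual data model):

```python
from dataclasses import dataclass

@dataclass
class JudgedPair:
    """One PR with judge scores for its initial and final commits (hypothetical fields)."""
    pr_number: int
    initial_score: float  # judge's rating of the PR's first commit
    final_score: float    # judge's rating of the final merged commit

def judge_accuracy(pairs: list[JudgedPair]) -> float:
    """Fraction of pairs where the judge prefers the final commit over the initial one."""
    if not pairs:
        return 0.0
    wins = sum(1 for p in pairs if p.final_score > p.initial_score)
    return wins / len(pairs)

pairs = [
    JudgedPair(254, initial_score=4.0, final_score=8.5),  # final preferred: success
    JudgedPair(301, initial_score=7.0, final_score=6.0),  # initial preferred: failure
]
print(judge_accuracy(pairs))  # → 0.5
```

Ties count as failures here; whether the real harness breaks ties differently is not specified above.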
## Build Sandbox
The sandbox provides the Mathlib source tree as it existed at the time of each PR, so the agentic judge can inspect the PR's surrounding context.
```sh
python scripts/run_eval.py build-image
```

## Run Eval
You can run the eval on one datapoint, or the whole set.
```sh
# Full run
python scripts/run_eval.py evaluate

# Single PR for testing
python scripts/run_eval.py evaluate --pr 254

# Different model
python scripts/run_eval.py evaluate --model "openai:gpt-4o"
```

## Analyze Results
```sh
python scripts/run_eval.py analyze results/results_*.jsonl --with-metadata
```

All settings are configured via environment variables (prefix `PROOFJUDGE_`):
| Variable | Default | Description |
|---|---|---|
| `PROOFJUDGE_HF_DATASET` | `SJCaldwell/proofjudge` | HuggingFace dataset repo ID |
| `PROOFJUDGE_HF_DATASET_REVISION` | `v0.1.0` | Dataset revision (tag, branch, or commit SHA) |
| `PROOFJUDGE_JUDGE_MODEL` | `anthropic:claude-sonnet-4-20250514` | Pydantic AI model |
| `PROOFJUDGE_MAX_CONTAINERS` | `4` | Concurrent sandboxes |
| `PROOFJUDGE_DOCKER_IMAGE` | `proofjudge-sandbox:latest` | Sandbox image |