Agentic LLM-as-judge evaluation harness for Mathlib (Lean 4) proof quality.
The dataset is a collection of Mathlib PRs, pairing each PR's first commit with its final one. The judge counts as successful on a pair when it rates the final accepted version of the PR above the initial rejected one.
The evaluation dataset is hosted on HuggingFace at SJCaldwell/proofjudge. Files are downloaded and cached automatically on first run via huggingface_hub.
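The success criterion above is a pairwise comparison over the dataset. A minimal sketch in Python (the `JudgedPair` fields and score scale are hypothetical illustrations, not the harness's actual data model):

```python
from dataclasses import dataclass

@dataclass
class JudgedPair:
    """One PR with judge scores for its initial and final commits (hypothetical fields)."""
    pr_number: int
    initial_score: float  # judge's rating of the PR's first commit
    final_score: float    # judge's rating of the final merged commit

def judge_accuracy(pairs: list[JudgedPair]) -> float:
    """Fraction of pairs where the judge prefers the final commit over the initial one."""
    if not pairs:
        return 0.0
    wins = sum(1 for p in pairs if p.final_score > p.initial_score)
    return wins / len(pairs)

pairs = [
    JudgedPair(254, initial_score=4.0, final_score=8.5),  # final preferred: success
    JudgedPair(301, initial_score=7.0, final_score=6.0),  # initial preferred: failure
]
print(judge_accuracy(pairs))  # → 0.5
```

Ties count as failures here; whether the real harness breaks ties differently is not specified above.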
## Build Sandbox
The sandbox provides the Mathlib source tree as it existed at the time of each PR, so the agentic judge can inspect the PR's surrounding context.
```sh
python scripts/run_eval.py build-image
```

## Run Eval
You can run the eval on one datapoint, or the whole set.
```sh
# Full run
python scripts/run_eval.py evaluate

# Single PR for testing
python scripts/run_eval.py evaluate --pr 254

# Different model
python scripts/run_eval.py evaluate --model "openai:gpt-4o"
```

## Analyze Results
```sh
python scripts/run_eval.py analyze results/results_*.jsonl --with-metadata
```

All settings are configured via environment variables (prefix `PROOFJUDGE_`):
| Variable | Default | Description |
|---|---|---|
| `PROOFJUDGE_HF_DATASET` | `SJCaldwell/proofjudge` | HuggingFace dataset repo ID |
| `PROOFJUDGE_HF_DATASET_REVISION` | `v0.1.0` | Dataset revision (tag, branch, or commit SHA) |
| `PROOFJUDGE_JUDGE_MODEL` | `anthropic:claude-sonnet-4-20250514` | Pydantic AI model |
| `PROOFJUDGE_MAX_CONTAINERS` | `4` | Concurrent sandboxes |
| `PROOFJUDGE_DOCKER_IMAGE` | `proofjudge-sandbox:latest` | Sandbox image |