ProofJudge

Agentic LLM-as-judge evaluation harness for Mathlib (Lean 4) proof quality.

The dataset is a collection of PRs to mathlib, pairing each PR's first commit with its final one. The judge counts as successful on a datapoint when it rates the final, accepted version of the PR above the initial, rejected one.
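As an illustrative sketch only (the field names below are assumptions, not the actual ProofJudge dataset schema), each datapoint can be thought of as an initial/final pair, with success defined by a pairwise comparison:

```python
# Illustrative only: "pr_number", "initial", and "final" are assumed names,
# not documented fields of the ProofJudge dataset.
datapoint = {
    "pr_number": 254,
    "initial": "<proof text at the PR's first commit>",
    "final": "<proof text at the PR's final, accepted commit>",
}

def judge_succeeded(score_initial: float, score_final: float) -> bool:
    """The judge succeeds when it rates the accepted version strictly higher."""
    return score_final > score_initial
```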

Setup

The evaluation dataset is hosted on HuggingFace at SJCaldwell/proofjudge. Files are downloaded and cached automatically on first run via huggingface_hub.

Usage

Build Sandbox

The sandbox holds the mathlib source tree as it existed at the time of the PR, so the agentic judge can evaluate the change in context.

python scripts/run_eval.py build-image
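The harness handles this itself when building the image; purely as an illustration of the idea (the helper function and the commit SHA below are hypothetical, not part of ProofJudge), materializing mathlib at a PR's commit amounts to a clone-and-checkout:

```python
# Hypothetical helper, not part of ProofJudge: builds the shell command a
# sandbox would run to materialize mathlib at a given commit.
def checkout_command(repo_url: str, commit_sha: str, dest: str = "mathlib4") -> str:
    return f"git clone {repo_url} {dest} && git -C {dest} checkout {commit_sha}"

cmd = checkout_command(
    "https://github.com/leanprover-community/mathlib4.git",
    "abc1234",  # placeholder SHA for illustration
)
```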

Run Eval

You can run the eval on one datapoint, or the whole set.

# Full run
python scripts/run_eval.py evaluate

# Single PR for testing
python scripts/run_eval.py evaluate --pr 254

# Different model
python scripts/run_eval.py evaluate --model "openai:gpt-4o"

Analyze Results

python scripts/run_eval.py analyze results/results_*.jsonl --with-metadata
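If you want to post-process results yourself, here is a hedged sketch: the JSONL field name `correct` is an assumption about the output schema, not a documented field.

```python
import json
from pathlib import Path

def accuracy(results_path: str) -> float:
    """Fraction of datapoints where the judge preferred the accepted PR.

    Assumes each JSONL line carries a boolean "correct" field; this is an
    illustrative guess at the schema, not the documented format.
    """
    lines = Path(results_path).read_text().splitlines()
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    return sum(r["correct"] for r in records) / len(records)
```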

Configuration

All settings are read from environment variables (prefix PROOFJUDGE_):

| Variable | Default | Description |
| --- | --- | --- |
| PROOFJUDGE_HF_DATASET | SJCaldwell/proofjudge | HuggingFace dataset repo ID |
| PROOFJUDGE_HF_DATASET_REVISION | v0.1.0 | Dataset revision (tag, branch, or commit SHA) |
| PROOFJUDGE_JUDGE_MODEL | anthropic:claude-sonnet-4-20250514 | Pydantic AI model |
| PROOFJUDGE_MAX_CONTAINERS | 4 | Concurrent sandboxes |
| PROOFJUDGE_DOCKER_IMAGE | proofjudge-sandbox:latest | Sandbox image |
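Resolution follows the usual environment-variable-with-default pattern; a minimal stdlib sketch (not the harness's actual implementation, which may resolve settings differently):

```python
import os

# Defaults copied from the configuration table; illustrative only.
DEFAULTS = {
    "PROOFJUDGE_HF_DATASET": "SJCaldwell/proofjudge",
    "PROOFJUDGE_HF_DATASET_REVISION": "v0.1.0",
    "PROOFJUDGE_JUDGE_MODEL": "anthropic:claude-sonnet-4-20250514",
    "PROOFJUDGE_MAX_CONTAINERS": "4",
    "PROOFJUDGE_DOCKER_IMAGE": "proofjudge-sandbox:latest",
}

def setting(name: str) -> str:
    """Return the env var value if set, otherwise the documented default."""
    return os.environ.get(name, DEFAULTS[name])
```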
