

ProofJudge v0.1.0

29 Mar 16:22
b13629a


Initial public release of the ProofJudge evaluation harness and dataset.

What is ProofJudge?

ProofJudge is an agentic LLM-as-judge system for evaluating Lean 4 / Mathlib proof quality. It measures agreement between human Mathlib reviewers and LLM judges. Autoformalization has tended to produce proofs that are correct but not library-ready: not idiomatic, and not decomposed into reusable lemmas.
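As a toy illustration of the gap the judge is meant to detect (this example is mine, not from the dataset): both proofs below are accepted by Lean, but the first re-proves commutativity by hand, while the second reuses the existing `Nat.add_comm` lemma the way a Mathlib reviewer would expect.

```lean
-- Correct but non-idiomatic: re-derives a fact the library already provides.
example (a b : ℕ) : a + b = b + a := by
  induction a with
  | zero => simp
  | succ n ih => rw [Nat.succ_add, ih, Nat.add_succ]

-- Idiomatic: a one-line appeal to the existing lemma.
example (a b : ℕ) : a + b = b + a := Nat.add_comm a b
```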

The judge scores proofs on a 1-7 scale. For each PR, the judge agent grades the initial (rejected) version of a proof and the final (accepted) version. These are separate runs, and the judge does not know which version it is looking at. "Alignment" is defined as the judge assigning a higher score to the final proof than to the initial proof, suggesting the model can discriminate between two "correct" proofs.
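The alignment criterion above can be sketched as follows. This is a minimal illustration of the definition, not the package's actual API; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PairResult:
    """Judge scores (1-7) for the two versions of one proof pair.
    Hypothetical structure for illustration only."""
    initial_score: int  # score given to the initial (rejected) version
    final_score: int    # score given to the final (accepted) version

def is_aligned(pair: PairResult) -> bool:
    """A pair is aligned when the judge scores the accepted version
    strictly higher than the rejected one."""
    return pair.final_score > pair.initial_score

def alignment_rate(pairs: list[PairResult]) -> float:
    """Fraction of proof pairs on which the judge agrees with reviewers."""
    return sum(is_aligned(p) for p in pairs) / len(pairs)
```

Note that ties count against alignment: a judge that assigns identical scores to both versions is not discriminating between them.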

What's in this release

Evaluation harness (proofjudge Python package)

  • Agentic judge with tools for inspecting the Mathlib codebase via Docker sandboxes
  • Pairwise evaluation pipeline with sandbox pooling

Dataset (SJCaldwell/proofjudge on HuggingFace, tagged v0.1.0)

Consists of 123 proof pairs from 100 PRs.
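Assuming the dataset is published as a standard HuggingFace dataset, loading the pinned version might look like this sketch (the split name is a guess; `revision` pins the v0.1.0 tag for reproducibility):

```python
from datasets import load_dataset  # pip install datasets

# Pin to the v0.1.0 tag so results remain reproducible as the dataset evolves.
# The split name "train" is an assumption about how the dataset is published.
ds = load_dataset("SJCaldwell/proofjudge", revision="v0.1.0", split="train")
print(len(ds))  # the release describes 123 proof pairs
```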

Note: This represents early research; the dataset and harness are likely to change in response to community feedback. This tag will allow the initial results of the research to be reproduced in the future.