

ProofJudge v0.1.0

29 Mar 16:22
b13629a


Initial public release of the ProofJudge evaluation harness and dataset.

What is ProofJudge?

ProofJudge is an agentic LLM-as-judge system for evaluating Lean 4 / Mathlib proof quality. It measures agreement between human Mathlib reviewers and LLM judges. Autoformalization has tended to produce proofs that are correct but not library-ready: not idiomatic, and not decomposed into reusable lemmas.
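As a toy illustration of the gap the judge is meant to detect (this example is mine, not from the dataset): both proofs below are accepted by Lean, but the first re-proves commutativity by hand, while the second reuses the existing `Nat.add_comm` lemma the way a Mathlib reviewer would expect.

```lean
-- Correct but non-idiomatic: re-derives a fact the library already provides.
example (a b : ℕ) : a + b = b + a := by
  induction a with
  | zero => simp
  | succ n ih => rw [Nat.succ_add, ih, Nat.add_succ]

-- Idiomatic: a one-line appeal to the existing lemma.
example (a b : ℕ) : a + b = b + a := Nat.add_comm a b
```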

The judge scores proofs on a 1-7 scale. For each PR, the judge agent grades the initial (rejected) version of a proof and the final (accepted) version. These are separate runs, and the judge does not know which version it is looking at. "Alignment" is defined as the judge assigning a higher score to the final proof than to the initial proof, suggesting the model can discriminate between two "correct" proofs.
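The alignment criterion above can be sketched as follows. This is a minimal illustration of the definition, not the package's actual API; the class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class PairResult:
    """Judge scores (1-7) for the two versions of one proof pair.
    Hypothetical structure for illustration only."""
    initial_score: int  # score given to the initial (rejected) version
    final_score: int    # score given to the final (accepted) version

def is_aligned(pair: PairResult) -> bool:
    """A pair is aligned when the judge scores the accepted version
    strictly higher than the rejected one."""
    return pair.final_score > pair.initial_score

def alignment_rate(pairs: list[PairResult]) -> float:
    """Fraction of proof pairs on which the judge agrees with reviewers."""
    return sum(is_aligned(p) for p in pairs) / len(pairs)
```

Note that ties count against alignment: a judge that assigns identical scores to both versions is not discriminating between them.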

What's in this release

Evaluation harness (proofjudge Python package)

  • Agentic judge with tools for inspecting the Mathlib codebase via Docker sandboxes
  • Pairwise evaluation pipeline with sandbox pooling

Dataset (SJCaldwell/proofjudge on HuggingFace, tagged v0.1.0)

Consists of 123 proof pairs from 100 PRs.
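Assuming the dataset is published as a standard HuggingFace dataset, loading the pinned version might look like this sketch (the split name is a guess; `revision` pins the v0.1.0 tag for reproducibility):

```python
from datasets import load_dataset  # pip install datasets

# Pin to the v0.1.0 tag so results remain reproducible as the dataset evolves.
# The split name "train" is an assumption about how the dataset is published.
ds = load_dataset("SJCaldwell/proofjudge", revision="v0.1.0", split="train")
print(len(ds))  # the release describes 123 proof pairs
```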

Note: This represents early research; the dataset and harness are likely to change in response to community feedback. This tag will allow the initial results of the research to be reproduced in the future.