A framework for generating and evaluating mathematical proofs using large language models.
ProofGrader provides three independent scripts:
- generate.py: Generate solutions from multiple models (run once)
- generate_marking_schemes.py: Generate marking schemes for problems (optional, run once)
- evaluate.py: Evaluate solutions with workflows (run many times with different evaluators)
Generation and evaluation are completely separate. Generate expensive solutions once, optionally add marking schemes, then evaluate with multiple evaluators without re-generating.
# 1. Install Git LFS (required for large data files)
# On Ubuntu/Debian:
sudo apt-get install git-lfs
# On macOS:
brew install git-lfs
# On Windows (use Git Bash or WSL)
# Download from: https://git-lfs.github.com/
# 2. Clone and install
git clone https://github.com/euclidgame/proofgrader.git
cd proofgrader
git lfs install
git lfs pull # Download large files (problems.jsonl, etc.)
pip install -r requirements.txt

# 3. Set API keys
export OPENAI_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export OPENROUTER_API_KEY="your-api-key"

If you run into problems with API calls, see _get_model_response_async in proofgrader/api_client.py and modify it as needed.
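Before running anything, it can help to confirm that the keys you need are actually visible to the scripts. A minimal sketch (adjust the list to the providers you actually call; the variable names match the exports above):

```python
import os

# Keep only the keys for the providers you actually call.
required = ["OPENAI_API_KEY", "GOOGLE_API_KEY", "ANTHROPIC_API_KEY", "OPENROUTER_API_KEY"]

missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise SystemExit(f"Missing API keys: {', '.join(missing)}")
print("All required API keys are set.")
```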
# Create test data
mkdir -p data/test
echo '{"id": "test1", "problem": "What is 2+2?"}' > data/test/problems.jsonl
# Step 1: Generate (once)
python scripts/generate.py \
--data-dir data/test \
--models gpt-4 gemini-2.5-pro
# Step 2: Evaluate (can run multiple times!)
python scripts/evaluate.py \
--data-dir data/test \
--model gemini-2.5-pro

That's it! Solutions are saved to data/test/model_solutions.jsonl and evaluations to data/test/outputs/evaluations/.
Purpose: Generate solutions from one or more models
python scripts/generate.py \
--data-dir data/my_dataset \
--models gpt-4 o3 openrouter/qwen/qwen3-235b-a22b-thinking-2507 gemini-2.5-pro

Key Options:
- --data-dir: Directory with problems.jsonl (required)
- --models: One or more model names (required)
- --output: Output file (default: data-dir/model_solutions.jsonl)
- --template: Generation template (default: default)
- --max-concurrent: Concurrent requests (default: 100)
- --max-problems: Limit problems for testing
- --strict-validation: Exit on validation failure
- --no-cache: Disable caching
What it does:
- Validates problems.jsonl
- Generates solutions from each model (sequentially)
- Validates generated solutions
- Saves to model_solutions.jsonl (one solution per (problem, model) pair)
Output: model_solutions.jsonl with fields:
- problem_id: Problem identifier
- generator: Model name
- solution: Generated solution text
- reference_solutions: Preserved from problems.jsonl
- All other problem fields preserved
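For a quick sanity check of the generation output, the file can be read with plain Python (field names as listed above; the path assumes the default output location):

```python
import json
from collections import Counter

# Count how many solutions each generator produced.
counts = Counter()
with open("data/my_dataset/model_solutions.jsonl") as f:
    for line in f:
        record = json.loads(line)
        counts[record["generator"]] += 1

for generator, n in counts.most_common():
    print(f"{generator}: {n} solutions")
```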
Purpose: Generate detailed grading rubrics for problems
python scripts/generate_marking_schemes.py \
--data-dir data/my_dataset \
--model gemini-2.5-pro

Key Options:
- --data-dir: Directory with problems.jsonl (required)
- --model: Model to use for generation (default: gemini-2.5-pro)
- --template: Template name (default: marking_scheme)
- --output: Output file (default: data-dir/problems_with_marking_schemes.jsonl)
- --overwrite: Overwrite original problems.jsonl
- --max-problems: Limit problems for testing
What it does:
- Reads problems.jsonl with reference solutions
- Generates marking schemes using an LLM
- Adds a marking_scheme field to each problem
- Saves to a new file (or overwrites if --overwrite)
Output: Problems with added marking_scheme field containing:
- Checkpoints with point values
- Zero-credit items
- Deductions for common errors
Why use it?: Marking schemes improve evaluation consistency and enable more accurate grading with templates like with_marking_scheme_and_reference.
See MARKING_SCHEMES_GUIDE.md for detailed usage.
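For orientation only, a marking_scheme entry might look roughly like the sketch below; the actual schema is produced by the marking_scheme template (see MARKING_SCHEMES_GUIDE.md), so treat these keys and values as illustrative assumptions rather than the real format:

```python
# Hypothetical shape of a marking_scheme field (keys are illustrative, not the actual schema).
example_marking_scheme = {
    "checkpoints": [
        {"description": "States and proves the key lemma", "points": 3},
        {"description": "Completes the main argument from the lemma", "points": 4},
    ],
    "zero_credit": ["Restating the problem or checking small cases without progress"],
    "deductions": [
        {"description": "Final step asserted without justification", "points": -1},
    ],
}
```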
Purpose: Evaluate solutions using various workflows (completely independent of generation)
python scripts/evaluate.py \
--data-dir data/my_dataset \
--model gemini-2.5-pro

Key Options:
- --data-dir: Directory with solutions (required)
- --model: Evaluator model name (default: gemini-2.5-pro)
- --dataset: Solutions file (default: data-dir/model_solutions.jsonl)
- --workflow: Evaluation strategy (default: single)
  - single: Basic single-shot evaluation
  - decompose-then-judge: Break into steps, then evaluate
  - repeat-and-aggregate: Multiple evaluations, aggregate
  - reflect-and-revise: Self-critique and revision
- --template: Evaluation template (default: basic)
- --compute-metrics: Compute metrics if expert gradings exist
- --output-dir: Custom output directory
Workflow-Specific Options:
- --steps-model MODEL: For decompose-then-judge
- --num-runs N: For repeat-and-aggregate
- --critic-model MODEL: For reflect-and-revise
What it does:
- Reads solutions from model_solutions.jsonl
- Evaluates using the specified workflow
- Saves to data-dir/outputs/evaluations/
- Computes metrics if --compute-metrics is set (optional)
Output: *.eval.jsonl files with fields:
- id: Problem ID
- generator: Model that generated the solution
- score: Evaluation score
- assessment: Detailed feedback
- comments: Specific notes
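A small sketch for summarising these files, assuming the field names above; the output directory varies by workflow and evaluator, so point eval_dir at the run you care about:

```python
import json
from collections import defaultdict
from pathlib import Path

# Average score per generator across all *.eval.jsonl files in one output directory.
eval_dir = Path("data/my_dataset/outputs/evaluations")  # adjust to your evaluation output directory
scores = defaultdict(list)
for eval_file in eval_dir.rglob("*.eval.jsonl"):
    with open(eval_file) as f:
        for line in f:
            record = json.loads(line)
            scores[record["generator"]].append(record["score"])

for generator, values in sorted(scores.items()):
    print(f"{generator}: mean score {sum(values) / len(values):.2f} over {len(values)} evaluations")
```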
Location: data-dir/problems.jsonl
Format: One JSON per line
Required fields:
- id: Unique identifier
- problem: Problem statement
Optional:
- reference_solutions: Ground truth (preserved throughout)
- Any metadata (contest, year, difficulty, etc.)
Example:
{
"id": "APMO-2025-1",
"problem": "Let ABC be an acute triangle...",
"reference_solutions": ["Solution: First notice that..."],
"contest": "APMO",
"year": "2025"
}

Location: data-dir/ (one of these names):
- expert_gradings.jsonl
- evaluation_merged.jsonl
- evaluations.jsonl
Format:
{
"problem_id": "APMO-2025-1",
"model_name": "gpt-4",
"score": 7.5,
"comment": "Correct approach but missing final step"
}

Required for: Metrics computation
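Before computing metrics, it can be worth checking that every line in the gradings file has the required fields; a minimal sketch using the field names above:

```python
import json

# Flag any expert grading records that are missing required fields.
required = {"problem_id", "model_name", "score"}
with open("data/my_dataset/expert_gradings.jsonl") as f:
    for lineno, line in enumerate(f, start=1):
        record = json.loads(line)
        missing = required - record.keys()
        if missing:
            print(f"Line {lineno}: missing fields {sorted(missing)}")
```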
First, prepare your problems.jsonl with reference solutions:
mkdir -p data/my_dataset
# Add problems.jsonl with id, problem, reference_solutions fields

Then generate marking schemes (optional but recommended):
python scripts/generate_marking_schemes.py \
--data-dir data/my_dataset \
--model gemini-2.5-pro \
--overwrite

Output: data/my_dataset/problems.jsonl (now includes marking_scheme field)
Generate solutions from multiple models:
python scripts/generate.py \
--data-dir data/my_dataset \
--models gpt-4o o3 gemini-2.5-pro

Output: data/my_dataset/model_solutions.jsonl
# Basic evaluation
python scripts/evaluate.py \
--data-dir data/my_dataset \
--model gemini-2.5-pro
# With marking schemes (if generated in Step 1)
python scripts/evaluate.py \
--data-dir data/my_dataset \
--model gpt-4o \
--template with_marking_scheme_and_reference
# Different workflow
python scripts/evaluate.py \
--data-dir data/my_dataset \
--model o3 \
--workflow decompose-then-judge \
--steps-model gemini-2.5-pro

Output: data/my_dataset/evaluation_outputs/evaluator_gradings/
If you have human expert scores, create data/my_dataset/expert_gradings.jsonl:
{"problem_id": "problem-1", "model_name": "gpt-4o", "score": 6.0}
{"problem_id": "problem-1", "model_name": "o3", "score": 7.0}Then compute metrics to compare evaluators against experts:
python scripts/evaluate.py \
--data-dir data/my_dataset \
--metrics-only

Output: data/my_dataset/evaluation_outputs/metrics/
See EXPERT_GRADINGS_FORMAT.md for details on creating ground truth data.
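The metrics step writes the CSV summaries shown in the layout below; if you just want a quick spot check of one evaluator against the experts, something like this rough sketch works (paths are examples from this walkthrough, field names as documented above):

```python
import json

def load_scores(path, id_key, model_key):
    """Map (problem, generator) pairs to scores for one JSONL file."""
    scores = {}
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            scores[(record[id_key], record[model_key])] = record["score"]
    return scores

expert = load_scores("data/my_dataset/expert_gradings.jsonl", "problem_id", "model_name")
judge = load_scores(
    "data/my_dataset/evaluation_outputs/evaluator_gradings/"
    "single__gemini-2.5-pro__basic/gpt-4o.eval.jsonl",
    "id",
    "generator",
)

# Mean absolute difference on the items both files cover.
shared = expert.keys() & judge.keys()
if shared:
    mae = sum(abs(expert[k] - judge[k]) for k in shared) / len(shared)
    print(f"MAE vs experts on {len(shared)} shared items: {mae:.2f}")
```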
data/my_dataset/
├── problems.jsonl              # Input (with marking_scheme if Step 1 done)
├── model_solutions.jsonl       # Step 2 output: Generated solutions
├── expert_gradings.jsonl       # Step 4 input: Human expert scores (optional)
└── evaluation_outputs/         # All evaluation outputs
    ├── evaluation_runs/        # Step 3: Raw evaluator outputs
    │   ├── single__gemini-2.5-pro__basic__<timestamp>/
    │   └── single__gpt-4o__with_marking_scheme__<timestamp>/
    ├── evaluator_gradings/     # Step 3: Parsed per-generator scores
    │   ├── single__gemini-2.5-pro__basic/
    │   │   ├── gpt-4o.eval.jsonl
    │   │   ├── o3.eval.jsonl
    │   │   └── gemini-2.5-pro.eval.jsonl
    │   ├── single__gpt-4o__with_marking_scheme/
    │   └── ...
    └── metrics/                # Step 4: Metrics (if expert_gradings exist)
        ├── per_evaluator_overall.csv
        ├── per_evaluator_per_generator.csv
        └── per_evaluator_per_source.csv
Both scripts include automatic validation:
Default mode (recommended):
python scripts/generate.py --data-dir data/test --models gpt-4
# Validates but continues on warnings

Strict mode (production):
python scripts/generate.py --data-dir data/test --models gpt-4 --strict-validation
# Exits immediately on any validation failure

Skip mode (fast):
python scripts/generate.py --data-dir data/test --models gpt-4 --skip-validation
# No validation (not recommended)

ProofGrader/
├── scripts/
│   ├── generate.py             ← Generate solutions (run once)
│   └── evaluate.py             ← Evaluate solutions (run many times)
│
├── proofgrader/                # Core library
│   ├── inference.py
│   ├── api_client.py
│   ├── data_validation.py
│   ├── workflow_runner.py
│   ├── workflows/              # Evaluation workflows
│   └── metrics/                # Metrics computation
│
├── templates/                  # Prompt templates
│   ├── generation.yaml
│   ├── evaluation.yaml
│   └── workflows.yaml
│
└── data/                       # Your datasets
    └── my_dataset/
        ├── problems.jsonl            # You create this
        ├── expert_gradings.jsonl     # Optional
        ├── model_solutions.jsonl     # generate.py creates this
        └── outputs/                  # evaluate.py creates this