ProofGrader

A framework for generating and evaluating mathematical proofs using large language models.

Overview

ProofGrader provides three independent scripts:

  • generate.py: Generate solutions from multiple models (run once)
  • generate_marking_schemes.py: Generate marking schemes for problems (optional, run once)
  • evaluate.py: Evaluate solutions with workflows (run many times with different evaluators)

Generation and evaluation are completely separate: generate solutions once (the expensive step), optionally add marking schemes, then evaluate with as many evaluators as you like without re-generating.


Quick Start

Installation

# 1. Install Git LFS (required for large data files)
# On Ubuntu/Debian:
sudo apt-get install git-lfs

# On macOS:
brew install git-lfs

# On Windows (use Git Bash or WSL)
# Download from: https://git-lfs.github.com/

# 2. Clone and install
git clone https://github.com/euclidgame/proofgrader.git
cd proofgrader
git lfs install
git lfs pull  # Download large files (problems.jsonl, etc.)
pip install -r requirements.txt

API Keys

export OPENAI_API_KEY="your-key"
export GOOGLE_API_KEY="your-key"
export ANTHROPIC_API_KEY="your-key"

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your-credentials.json"

export OPENROUTER_API_KEY="your-api-key"

If you encounter problems with API calls, refer to _get_model_response_async in proofgrader/api_client.py and modify it as needed.
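
As a quick sanity check before running the scripts, you can verify that the keys for the providers you plan to call are set. A minimal sketch, assuming you only need the variables listed above that correspond to the models you actually use:

import os

# Check whichever provider keys you intend to use; not every key is required.
keys = [
    "OPENAI_API_KEY",
    "GOOGLE_API_KEY",
    "ANTHROPIC_API_KEY",
    "GOOGLE_APPLICATION_CREDENTIALS",
    "OPENROUTER_API_KEY",
]
for key in keys:
    print(f"{key}: {'set' if os.environ.get(key) else 'missing'}")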

Your First Run

# Create test data
mkdir -p data/test
echo '{"id": "test1", "problem": "What is 2+2?"}' > data/test/problems.jsonl

# Step 1: Generate (once)
python scripts/generate.py \
  --data-dir data/test \
  --models gpt-4 gemini-2.5-pro

# Step 2: Evaluate (can run multiple times!)
python scripts/evaluate.py \
  --data-dir data/test \
  --model gemini-2.5-pro

That's it! Solutions are saved to data/test/model_solutions.jsonl and evaluations to data/test/outputs/evaluations/.


The Three Scripts

📝 generate.py - Solution Generation

Purpose: Generate solutions from one or more models

python scripts/generate.py \
  --data-dir data/my_dataset \
  --models gpt-4 o3 openrouter/qwen/qwen3-235b-a22b-thinking-2507 gemini-2.5-pro

Key Options:

  • --data-dir: Directory with problems.jsonl (required)
  • --models: One or more model names (required)
  • --output: Output file (default: data-dir/model_solutions.jsonl)
  • --template: Generation template (default: default)
  • --max-concurrent: Concurrent requests (default: 100)
  • --max-problems: Limit problems for testing
  • --strict-validation: Exit on validation failure
  • --no-cache: Disable caching

What it does:

  1. ✅ Validates problems.jsonl
  2. 📝 Generates solutions from each model (sequentially)
  3. 🔍 Validates generated solutions
  4. ✅ Saves to model_solutions.jsonl (one solution per (problem, model) pair)

Output: model_solutions.jsonl with fields:

  • problem_id: Problem identifier
  • generator: Model name
  • solution: Generated solution text
  • reference_solutions: Preserved from problems.jsonl
  • All other problem fields preserved
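
As an illustration, a minimal sketch (assuming the field names above and an example data directory) for checking how many solutions each model produced:

import json
from collections import Counter

# Count generated solutions per model in model_solutions.jsonl.
counts = Counter()
with open("data/my_dataset/model_solutions.jsonl") as f:
    for line in f:
        if line.strip():
            counts[json.loads(line)["generator"]] += 1

for generator, n in counts.most_common():
    print(f"{generator}: {n} solutions")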

📋 generate_marking_schemes.py - Marking Scheme Generation (Optional)

Purpose: Generate detailed grading rubrics for problems

python scripts/generate_marking_schemes.py \
  --data-dir data/my_dataset \
  --model gemini-2.5-pro

Key Options:

  • --data-dir: Directory with problems.jsonl (required)
  • --model: Model to use for generation (default: gemini-2.5-pro)
  • --template: Template name (default: marking_scheme)
  • --output: Output file (default: data-dir/problems_with_marking_schemes.jsonl)
  • --overwrite: Overwrite original problems.jsonl
  • --max-problems: Limit problems for testing

What it does:

  1. ✅ Reads problems.jsonl with reference solutions
  2. 📋 Generates marking schemes using LLM
  3. 💾 Adds marking_scheme field to each problem
  4. ✅ Saves to new file (or overwrites if --overwrite)

Output: Problems with added marking_scheme field containing:

  • Checkpoints with point values
  • Zero-credit items
  • Deductions for common errors

Why use it? Marking schemes improve evaluation consistency and enable more accurate grading with templates like with_marking_scheme_and_reference.

See MARKING_SCHEMES_GUIDE.md for detailed usage.
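
For example, to see which problems still lack a marking_scheme before (re-)running the script, a small sketch (point it at whatever file the script wrote, or at problems.jsonl if you used --overwrite):

import json

# Flag problems that do not yet have a marking_scheme field.
with open("data/my_dataset/problems_with_marking_schemes.jsonl") as f:
    for line in f:
        if not line.strip():
            continue
        problem = json.loads(line)
        if not problem.get("marking_scheme"):
            print(f"missing marking scheme: {problem['id']}")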


🎯 evaluate.py - Solution Evaluation

Purpose: Evaluate solutions using various workflows (completely independent of generation)

python scripts/evaluate.py \
  --data-dir data/my_dataset \
  --model gemini-2.5-pro

Key Options:

  • --data-dir: Directory with solutions (required)
  • --model: Evaluator model name (default: gemini-2.5-pro)
  • --dataset: Solutions file (default: data-dir/model_solutions.jsonl)
  • --workflow: Evaluation strategy (default: single)
    • single: Basic single-shot evaluation
    • decompose-then-judge: Break into steps, then evaluate
    • repeat-and-aggregate: Multiple evaluations, aggregate
    • reflect-and-revise: Self-critique and revision
  • --template: Evaluation template (default: basic)
  • --compute-metrics: Compute metrics if expert gradings exist
  • --metrics-only: Only compute metrics against existing expert gradings (see Step 4)
  • --output-dir: Custom output directory

Workflow-Specific Options:

  • --steps-model MODEL: For decompose-then-judge
  • --num-runs N: For repeat-and-aggregate
  • --critic-model MODEL: For reflect-and-revise

What it does:

  1. 🎯 Reads solutions from model_solutions.jsonl
  2. 📊 Evaluates using specified workflow
  3. ✅ Saves to data-dir/outputs/evaluations/
  4. 📈 Computes metrics if --compute-metrics (optional)

Output: *.eval.jsonl files with fields:

  • id: Problem ID
  • generator: Model that generated solution
  • score: Evaluation score
  • assessment: Detailed feedback
  • comments: Specific notes
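
For a quick qualitative look at one evaluator's output, a minimal sketch (assuming the field names above; the filename is illustrative and lives under evaluator_gradings/):

import json

# Print the score and a short assessment excerpt for each evaluated solution.
with open("gpt-4o.eval.jsonl") as f:
    for line in f:
        evaluation = json.loads(line)
        excerpt = str(evaluation.get("assessment", ""))[:80]
        print(f"{evaluation['id']} ({evaluation['generator']}): "
              f"score={evaluation['score']}  {excerpt}")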

Input Format

Problems File (Required)

Location: data-dir/problems.jsonl

Format: One JSON per line

Required fields:

  • id: Unique identifier
  • problem: Problem statement

Optional:

  • reference_solutions: Ground truth (preserved throughout)
  • Any metadata (contest, year, difficulty, etc.)

Example:

{
  "id": "APMO-2025-1",
  "problem": "Let ABC be an acute triangle...",
  "reference_solutions": ["Solution: First notice that..."],
  "contest": "APMO",
  "year": "2025"
}
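
If your problems start out as Python objects (for example, converted from another format), a small sketch for writing a valid problems.jsonl with the required fields; the entries below are illustrative:

import json

# Each entry needs at least "id" and "problem"; everything else is optional metadata.
problems = [
    {
        "id": "APMO-2025-1",
        "problem": "Let ABC be an acute triangle...",
        "reference_solutions": ["Solution: First notice that..."],
        "contest": "APMO",
        "year": "2025",
    },
]

with open("data/my_dataset/problems.jsonl", "w") as f:
    for problem in problems:
        assert "id" in problem and "problem" in problem, "missing required field"
        f.write(json.dumps(problem, ensure_ascii=False) + "\n")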

Expert Gradings (Optional)

Location: data-dir/, under one of these filenames:

  • expert_gradings.jsonl
  • evaluation_merged.jsonl
  • evaluations.jsonl

Format:

{
  "problem_id": "APMO-2025-1",
  "model_name": "gpt-4",
  "score": 7.5,
  "comment": "Correct approach but missing final step"
}

Required for: Metrics computation
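
A small sanity-check sketch (assuming the field names above) to confirm that every expert grading refers to a (problem, model) pair that actually appears in model_solutions.jsonl:

import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

solutions = load_jsonl("data/my_dataset/model_solutions.jsonl")
known = {(s["problem_id"], s["generator"]) for s in solutions}

for grading in load_jsonl("data/my_dataset/expert_gradings.jsonl"):
    pair = (grading["problem_id"], grading["model_name"])
    if pair not in known:
        print(f"no matching solution for {pair}")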


Complete Workflow

Step 1: Prepare Problems with Reference Solutions and Generate Marking Schemes

First, prepare your problems.jsonl with reference solutions:

mkdir -p data/my_dataset
# Add problems.jsonl with id, problem, reference_solutions fields

Then generate marking schemes (optional but recommended):

python scripts/generate_marking_schemes.py \
  --data-dir data/my_dataset \
  --model gemini-2.5-pro \
  --overwrite

Output: data/my_dataset/problems.jsonl (now includes marking_scheme field)


Step 2: Generate Solutions

Generate solutions from multiple models:

python scripts/generate.py \
  --data-dir data/my_dataset \
  --models gpt-4o o3 gemini-2.5-pro

Output: data/my_dataset/model_solutions.jsonl


Step 3: Evaluate with Multiple Evaluators

# Basic evaluation
python scripts/evaluate.py \
  --data-dir data/my_dataset \
  --model gemini-2.5-pro

# With marking schemes (if generated in Step 1)
python scripts/evaluate.py \
  --data-dir data/my_dataset \
  --model gpt-4o \
  --template with_marking_scheme_and_reference

# Different workflow
python scripts/evaluate.py \
  --data-dir data/my_dataset \
  --model o3 \
  --workflow decompose-then-judge \
  --steps-model gemini-2.5-pro

Output: data/my_dataset/evaluation_outputs/evaluator_gradings/


Step 4: (Optional) Gather Expert Gradings and Compute Metrics

If you have human expert scores, create data/my_dataset/expert_gradings.jsonl:

{"problem_id": "problem-1", "model_name": "gpt-4o", "score": 6.0}
{"problem_id": "problem-1", "model_name": "o3", "score": 7.0}

Then compute metrics to compare evaluators against experts:

python scripts/evaluate.py \
  --data-dir data/my_dataset \
  --metrics-only

Output: data/my_dataset/evaluation_outputs/metrics/

See EXPERT_GRADINGS_FORMAT.md for details on creating ground truth data.
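
Before running the full metrics, you can get a rough feel for how close an evaluator is to the experts with a sketch like the one below. It matches evaluator scores to expert scores by (problem, model) pair and reports the mean absolute difference, assuming numeric scores; paths are illustrative and this is not the project's own metrics code:

import json
from statistics import mean

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

expert = {(g["problem_id"], g["model_name"]): g["score"]
          for g in load_jsonl("data/my_dataset/expert_gradings.jsonl")}

# One parsed evaluator file from evaluator_gradings/ (illustrative path).
evaluator = load_jsonl("data/my_dataset/evaluation_outputs/evaluator_gradings/"
                       "single__gemini-2.5-pro__basic/gpt-4o.eval.jsonl")

diffs = [abs(e["score"] - expert[(e["id"], e["generator"])])
         for e in evaluator if (e["id"], e["generator"]) in expert]
if diffs:
    print(f"mean absolute difference vs experts: {mean(diffs):.2f} over {len(diffs)} pairs")
else:
    print("no overlapping (problem, model) pairs found")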


Output Structure

data/my_dataset/
├── problems.jsonl                      # Input (with marking_scheme if Step 1 done)
├── model_solutions.jsonl               # Step 2 output: Generated solutions
├── expert_gradings.jsonl               # Step 4 input: Human expert scores (optional)
└── evaluation_outputs/                 # All evaluation outputs
    ├── evaluation_runs/                # Step 3: Raw evaluator outputs
    │   ├── single__gemini-2.5-pro__basic__<timestamp>/
    │   └── single__gpt-4o__with_marking_scheme__<timestamp>/
    ├── evaluator_gradings/             # Step 3: Parsed per-generator scores
    │   ├── single__gemini-2.5-pro__basic/
    │   │   ├── gpt-4o.eval.jsonl
    │   │   ├── o3.eval.jsonl
    │   │   └── gemini-2.5-pro.eval.jsonl
    │   └── single__gpt-4o__with_marking_scheme/
    │       └── ...
    └── metrics/                        # Step 4: Metrics (if expert_gradings exist)
        ├── per_evaluator_overall.csv
        ├── per_evaluator_per_generator.csv
        └── per_evaluator_per_source.csv
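
To collect results across runs, a small sketch (following the layout above and assuming numeric score values) that walks evaluator_gradings/ and reports the mean score each evaluator configuration gave each generator:

import json
from pathlib import Path
from statistics import mean

gradings_dir = Path("data/my_dataset/evaluation_outputs/evaluator_gradings")

for eval_file in sorted(gradings_dir.glob("*/*.eval.jsonl")):
    evaluator_config = eval_file.parent.name            # e.g. single__gemini-2.5-pro__basic
    generator = eval_file.name[: -len(".eval.jsonl")]   # e.g. gpt-4o
    scores = [json.loads(line)["score"]
              for line in eval_file.read_text().splitlines() if line.strip()]
    if scores:
        print(f"{evaluator_config} -> {generator}: mean {mean(scores):.2f} (n={len(scores)})")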

Data Validation

Both scripts include automatic validation:

Default mode (recommended):

python scripts/generate.py --data-dir data/test --models gpt-4
# Validates but continues on warnings

Strict mode (production):

python scripts/generate.py --data-dir data/test --models gpt-4 --strict-validation
# Exits immediately on any validation failure

Skip mode (fast):

python scripts/generate.py --data-dir data/test --models gpt-4 --skip-validation
# No validation (not recommended)

Project Structure

ProofGrader/
├── scripts/
│   ├── generate.py                     ⭐ Generate solutions (run once)
│   ├── generate_marking_schemes.py        Generate marking schemes (optional, run once)
│   └── evaluate.py                     ⭐ Evaluate solutions (run many times)
│
├── proofgrader/             # Core library
│   ├── inference.py
│   ├── api_client.py
│   ├── data_validation.py
│   ├── workflow_runner.py
│   ├── workflows/           # Evaluation workflows
│   └── metrics/             # Metrics computation
│
├── templates/               # Prompt templates
│   ├── generation.yaml
│   ├── evaluation.yaml
│   └── workflows.yaml
│
└── data/                    # Your datasets
    └── my_dataset/
        ├── problems.jsonl            # You create this
        ├── expert_gradings.jsonl     # Optional
        ├── model_solutions.jsonl     # generate.py creates this
        └── outputs/                  # evaluate.py creates this
