
tactics

Code for training and evaluating a "tactics" model, which suggests proof steps generatively.

Quick Start

1. Generate Training Data

Generate proof data from acornlib:

acorn --lib /path/to/acornlib training ./data/proofs

This creates one proof file per theorem in data/proofs/. Each file uses the markers @T (theorem prefix), @G (goal), @C (counterfactual), and @P (proof).
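
The marker format above can be read with a few lines of Python. This is a sketch for illustration only: the marker names come from the README, but the assumption that each marker starts a line followed by its text is hypothetical, as is the example file.

```python
# Sketch of reading the @T/@G/@C/@P marker format. The one-marker-per-line
# layout and the example contents below are assumptions for illustration.

def parse_proof_file(text: str) -> dict[str, list[str]]:
    """Group the lines of a proof file by their leading marker."""
    sections: dict[str, list[str]] = {"@T": [], "@G": [], "@C": [], "@P": []}
    for line in text.splitlines():
        for marker in sections:
            if line.startswith(marker):
                sections[marker].append(line[len(marker):].strip())
                break
    return sections

# Hypothetical example file contents:
example = """@T theorem add_comm
@G forall a b: a + b = b + a
@P induction on a"""

parsed = parse_proof_file(example)
```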

When you regenerate proof data, clean up stale checkpoints and tokenizers first, since they were built against the old data:

rm ./checkpoints/*

2. Train Tokenizer

Train a BPE tokenizer on your proof dataset:

uv run train_tokenizer.py

This will:

  • Read all proof files from data/proofs/
  • Train a BPE tokenizer with vocab_size=4096
  • Save tokenizer.json to checkpoints/ directory
  • Show compression ratio (expect ~3-4x compression)

Options: --vocab_size 4096, --data_dir data/proofs, --output_dir checkpoints

3. Train Model

Train the model from scratch:

uv run train.py

This will:

  • Load tokenizer from checkpoints/tokenizer.json
  • Tokenize proof files from data/proofs/
  • Train GPT model (~9M parameters)
  • Save checkpoints to checkpoints/
  • Save best model as checkpoints/best_model.pt

Configuration (edit config.py):

  • vocab_size: 4096 (set from tokenizer)
  • context_length: 256 tokens
  • max_epochs: 30
  • batch_size: 32
  • learning_rate: 3e-4
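
A minimal sketch of the kind of settings config.py holds, using the values listed above. The dataclass and field names here are assumptions for illustration; the actual structure of config.py may differ.

```python
from dataclasses import dataclass

@dataclass
class TrainingConfig:
    # Hypothetical field names; values match the README's configuration list.
    vocab_size: int = 4096      # set from the trained tokenizer
    context_length: int = 256   # tokens per training window
    max_epochs: int = 30
    batch_size: int = 32
    learning_rate: float = 3e-4

cfg = TrainingConfig()
```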

4. Resume Training (Optional)

Resume from checkpoint:

uv run resume_training.py

Or specify in config.py: training_config.resume_from = "checkpoints/checkpoint_step_5000.pt"

5. Export to ONNX

Export for use by Acorn:

uv run export_onnx.py

This creates a timestamped directory in export/ following HuggingFace convention:

export/tactics-2025-11-10-14-30-45/
├── model.onnx       # ONNX model with KV caching
├── tokenizer.json   # HuggingFace tokenizer
└── config.json      # Model architecture config

The exported model uses KV caching for fast autoregressive generation (50-90% speedup). See the KV Cache section below for details.

Custom export directory: uv run export_onnx.py checkpoints/best_model.pt export/my-model

Tokenization

Uses BPE (Byte-Pair Encoding) instead of character-level:

Benefits:

  • Efficient compression: "theorem add_comm" → ~4 tokens
  • Longer effective context: 256 tokens ≈ 768-1024 characters
  • Domain-specific vocabulary learned from your proofs
  • Common terms like "theorem", "proof", "forall" become single tokens
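
The merge process behind BPE can be sketched in a few lines: start from characters and repeatedly replace the most frequent adjacent token pair with one merged token. This illustrates the principle only; the repo's actual tokenizer is a trained HuggingFace BPE, not this toy loop.

```python
from collections import Counter

def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def bpe(text, num_merges):
    tokens = list(text)  # start from individual characters
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        tokens = merge_pair(tokens, pairs.most_common(1)[0][0])
    return tokens

# Frequent substrings like "theorem" collapse into fewer, longer tokens:
text = "theorem theorem theorem"
merged = bpe(text, 6)
```

After enough merges on a real corpus, common words like "theorem" become single tokens, which is where the ~3-4x compression comes from.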

Files:

  • checkpoints/tokenizer.json - Trained tokenizer (vocabulary and merge rules)

Model Architecture

  • Type: BPE-tokenized GPT (decoder-only transformer)
  • Vocab size: 4096
  • Context length: 256 tokens (~768-1024 chars effective)
  • Model dimension: 256
  • Layers: 6
  • Attention heads: 8
  • Head dimension: 32
  • Parameters: ~9M
  • Features: RMSNorm, causal self-attention, tied embeddings, KV caching

Troubleshooting

"Tokenizer file not found" - Run uv run train_tokenizer.py first

Out of memory - Reduce batch_size or context_length in config.py

Vocab size mismatch - Delete old checkpoints after retraining tokenizer


Reference Information

KV Cache (ONNX Export)

The exported ONNX model uses Key-Value caching for efficient autoregressive generation:

Model Inputs (13 total):

  • input_ids: [batch, seq_len] - Input token IDs (typically [1, 1] during generation)
  • past_key_values.{0-5}.key: [batch, 8, cache_len, 32] - Cached keys per layer
  • past_key_values.{0-5}.value: [batch, 8, cache_len, 32] - Cached values per layer

Model Outputs (13 total):

  • logits: [batch, seq_len, vocab_size] - Next token predictions
  • present_key_values.{0-5}.key: [batch, 8, cache_len+1, 32] - Updated keys
  • present_key_values.{0-5}.value: [batch, 8, cache_len+1, 32] - Updated values

Usage Pattern:

  1. First token: Pass input with cache shape [1, 8, 1, 32] filled with zeros
  2. Subsequent tokens: Feed back the present_key_values as past_key_values
  3. Cache grows: [1,8,1,32] → [1,8,2,32] → ... → [1,8,256,32] (max context)

Performance: 50-90% speedup for generation compared to non-cached inference.
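
The shape bookkeeping in the usage pattern above can be checked with a small simulation. This tracks shapes only (no real tensors or ONNX calls); the dimensions come from the README's cache spec.

```python
# Simulate how one layer's KV cache shape evolves during generation:
# [batch, heads, cache_len, head_dim] with 8 heads, head_dim 32, max 256.
N_HEADS, HEAD_DIM, MAX_CONTEXT = 8, 32, 256

def decode_step(past_shape):
    """One autoregressive step: the model appends one key/value entry."""
    batch, heads, cache_len, head_dim = past_shape
    return (batch, heads, cache_len + 1, head_dim)

# First call: zero-filled cache of shape [1, 8, 1, 32], per the README.
shape = (1, N_HEADS, 1, HEAD_DIM)
history = [shape]
while shape[2] < MAX_CONTEXT:
    shape = decode_step(shape)  # feed present_key_values back as past
    history.append(shape)
```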

Dataset

  • ~6,600 proof files, 3.7MB text
  • Mathematical proofs for algebraic structures
  • Structured format: @T, @G, @C, @P markers

Training Details

  • Optimizer: AdamW (lr=3e-4, weight_decay=0.1)
  • Schedule: Cosine decay, 1000-step warmup, min_lr=3e-5
  • Data split: 90% train / 10% validation
  • Early stopping: Patience of 10 evaluations
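
The schedule above can be sketched as a warmup-then-cosine curve. The lr, min_lr, and warmup values match the README; the total step count is an illustrative assumption, and the actual code in train.py may differ in detail.

```python
import math

# Cosine decay with linear warmup. TOTAL_STEPS is a made-up example value.
BASE_LR, MIN_LR, WARMUP_STEPS, TOTAL_STEPS = 3e-4, 3e-5, 1000, 20_000

def lr_at(step):
    if step < WARMUP_STEPS:
        return BASE_LR * step / WARMUP_STEPS            # linear warmup
    t = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    t = min(t, 1.0)                                     # clamp past the end
    # Cosine from BASE_LR (t=0) down to MIN_LR (t=1).
    return MIN_LR + 0.5 * (BASE_LR - MIN_LR) * (1 + math.cos(math.pi * t))
```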

Tokenization Stats

  • Dataset size: 3.7MB, ~6,600 files
  • Tokenized: ~1.1M tokens (3.4x compression ratio)
  • Effective context: 256 tokens ≈ 870 chars average
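
The arithmetic behind these stats is straightforward: dividing the dataset size by the token count gives the compression ratio, and multiplying by the context length gives the effective context in characters (the README's ~870 figure is presumably measured, so the rough estimate lands slightly lower).

```python
# Rough reproduction of the quoted tokenization stats.
dataset_chars = 3.7e6   # ~3.7 MB of (roughly one-byte-per-char) proof text
token_count = 1.1e6     # ~1.1M tokens after BPE

chars_per_token = dataset_chars / token_count        # ~3.4x compression
effective_context = 256 * chars_per_token            # ~860 characters
```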

Files

  • train.py - Training script
  • model.py - GPT architecture
  • data.py - Data loading
  • config.py - Configuration
  • tokenizer.py - BPE tokenizer
  • train_tokenizer.py - Train tokenizer
  • export_onnx.py - ONNX export
