Code for training and evaluating a "tactics" model, which suggests proof steps generatively.
Generate proof data from acornlib:

```bash
acorn --lib /path/to/acornlib training ./data/proofs
```

This creates one proof file per theorem in `data/proofs/`. The format uses `@T` (theorem prefix), `@G` (goal), `@C` (counterfactual), and `@P` (proof) markers.
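The marker format above can be split back into sections with a few lines of Python. This is a sketch, not the project's actual parser; the one-marker-per-line-prefix layout is an assumption.

```python
# Sketch: split a proof file into its marked sections. The marker
# meanings (@T theorem, @G goal, @C counterfactual, @P proof) come
# from this README; the exact file layout is assumed.
MARKERS = {"@T": "theorem", "@G": "goal", "@C": "counterfactual", "@P": "proof"}

def parse_proof(text: str) -> dict:
    sections = {name: [] for name in MARKERS.values()}
    current = None
    for line in text.splitlines():
        tag = line[:2]
        if tag in MARKERS:
            current = MARKERS[tag]
            sections[current].append(line[2:].strip())
        elif current is not None and line.strip():
            # continuation line belongs to the most recent marker
            sections[current].append(line.strip())
    return sections

example = "@T add_comm\n@G forall a b: a + b = b + a\n@P induction on a"
print(parse_proof(example)["goal"])  # ['forall a b: a + b = b + a']
```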
When you regenerate proof data, you typically want to clean up old checkpoints and tokenizers:

```bash
rm ./checkpoints/*
```

Train a BPE tokenizer on your proof dataset:
```bash
uv run train_tokenizer.py
```

This will:
- Read all proof files from `data/proofs/`
- Train a BPE tokenizer with `vocab_size=4096`
- Save `tokenizer.json` to the `checkpoints/` directory
- Show the compression ratio (expect ~3-4x compression)

Options: `--vocab_size 4096`, `--data_dir data/proofs`, `--output_dir checkpoints`
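The core of BPE training is a simple loop: count adjacent symbol pairs and merge the most frequent one, repeatedly. The toy below illustrates that loop in pure Python; `train_tokenizer.py` presumably delegates to a tokenizer library rather than doing this by hand.

```python
# Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
# Illustrative only -- not the real train_tokenizer.py implementation.
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list:
    symbols = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges

print(bpe_merges("theorem theorem theorem", 3))
```

With enough merges on real proof data, frequent substrings like `theorem` end up as single tokens, which is where the ~3-4x compression comes from.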
Train the model from scratch:

```bash
uv run train.py
```

This will:
- Load the tokenizer from `checkpoints/tokenizer.json`
- Tokenize proof files from `data/proofs/`
- Train the GPT model (~9M parameters)
- Save checkpoints to `checkpoints/`
- Save the best model as `checkpoints/best_model.pt`
Configuration (edit `config.py`):
- `vocab_size`: 4096 (set from tokenizer)
- `context_length`: 256 tokens
- `max_epochs`: 30
- `batch_size`: 32
- `learning_rate`: 3e-4
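As a rough sketch, the settings above could live in a dataclass like the one below. The field names match this README; the dataclass shape itself is an assumption about how `config.py` is organized.

```python
# Hypothetical shape of config.py, using the values quoted above.
# Anything beyond the listed fields is an assumption.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    vocab_size: int = 4096       # set from the trained tokenizer
    context_length: int = 256    # tokens
    max_epochs: int = 30
    batch_size: int = 32
    learning_rate: float = 3e-4
    resume_from: Optional[str] = None  # path to a checkpoint, or None

training_config = TrainingConfig()
```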
Resume from a checkpoint:

```bash
uv run resume_training.py
```

Or specify the checkpoint in `config.py`:

```python
training_config.resume_from = "checkpoints/checkpoint_step_5000.pt"
```
Export for use by Acorn:

```bash
uv run export_onnx.py
```

This creates a timestamped directory in `export/` following HuggingFace convention:

```
export/tactics-2025-11-10-14-30-45/
├── model.onnx       # ONNX model with KV caching
├── tokenizer.json   # HuggingFace tokenizer
└── config.json      # Model architecture config
```
The exported model uses KV caching for fast autoregressive generation (50-90% speedup). See the KV Cache section below for details.
Custom export directory:

```bash
uv run export_onnx.py checkpoints/best_model.pt export/my-model
```
Uses BPE (Byte-Pair Encoding) instead of character-level tokenization.
Benefits:
- Efficient compression: "theorem add_comm" → ~4 tokens
- Longer effective context: 256 tokens ≈ 768-1024 characters
- Domain-specific vocabulary learned from your proofs
- Common terms like "theorem", "proof", "forall" become single tokens
Files:
- `checkpoints/tokenizer.json` - Trained tokenizer (vocabulary and merge rules)
- Type: BPE-tokenized GPT (decoder-only transformer)
- Vocab size: 4096
- Context length: 256 tokens (~768-1024 chars effective)
- Model dimension: 256
- Layers: 6
- Attention heads: 8
- Head dimension: 32
- Parameters: ~9M
- Features: RMSNorm, causal self-attention, tied embeddings, KV caching
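RMSNorm, one of the features listed above, normalizes by the root-mean-square over the feature dimension and applies a learned per-feature gain, with no mean subtraction (unlike LayerNorm). Below is a generic textbook implementation in numpy, not the code from `model.py`:

```python
# RMSNorm: x / rms(x) * gain, computed over the feature dimension.
# Generic illustration -- not taken from model.py.
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # root-mean-square over the last (feature) axis, with eps for stability
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.random.randn(2, 256)      # [batch, model_dim]
y = rms_norm(x, np.ones(256))    # gain initialized to ones
```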
- `"Tokenizer file not found"` - Run `uv run train_tokenizer.py` first
- Out of memory - Reduce `batch_size` or `context_length` in `config.py`
- Vocab size mismatch - Delete old checkpoints after retraining the tokenizer
The exported ONNX model uses Key-Value caching for efficient autoregressive generation:
Model Inputs (13 total):
- `input_ids`: `[batch, seq_len]` - Input token IDs (typically `[1, 1]` during generation)
- `past_key_values.{0-5}.key`: `[batch, 8, cache_len, 32]` - Cached keys per layer
- `past_key_values.{0-5}.value`: `[batch, 8, cache_len, 32]` - Cached values per layer

Model Outputs (13 total):
- `logits`: `[batch, seq_len, vocab_size]` - Next-token predictions
- `present_key_values.{0-5}.key`: `[batch, 8, cache_len+1, 32]` - Updated keys
- `present_key_values.{0-5}.value`: `[batch, 8, cache_len+1, 32]` - Updated values
Usage Pattern:
- First token: pass the input with cache tensors of shape `[1, 8, 1, 32]` filled with zeros
- Subsequent tokens: feed the returned `present_key_values` back in as `past_key_values`
- The cache grows: `[1, 8, 1, 32]` → `[1, 8, 2, 32]` → ... → `[1, 8, 256, 32]` (max context)
Performance: 50-90% speedup for generation compared to non-cached inference.
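The feed-back pattern above can be sketched with numpy arrays standing in for the ONNX session. `dummy_step` below only mimics the shape contract (each present cache is one slot longer than the past cache); a real runner would call `session.run()` there instead.

```python
# Sketch of the KV-cache feed-back loop. Shapes match this README:
# 6 layers, 8 heads, head dim 32. dummy_step is a stand-in, not the model.
import numpy as np

LAYERS, HEADS, HEAD_DIM = 6, 8, 32

def dummy_step(input_ids, past):
    batch, seq_len = input_ids.shape
    logits = np.zeros((batch, seq_len, 4096))  # [batch, seq_len, vocab_size]
    # append one slot to each layer's key/value cache along the cache axis
    present = [(np.concatenate([k, np.zeros((batch, HEADS, 1, HEAD_DIM))], axis=2),
                np.concatenate([v, np.zeros((batch, HEADS, 1, HEAD_DIM))], axis=2))
               for k, v in past]
    return logits, present

# First token: zero-filled caches of shape [1, 8, 1, 32], one pair per layer.
past = [(np.zeros((1, HEADS, 1, HEAD_DIM)), np.zeros((1, HEADS, 1, HEAD_DIM)))
        for _ in range(LAYERS)]
for step in range(3):
    # feed present_key_values back in as past_key_values each step
    logits, past = dummy_step(np.array([[0]]), past)
print(past[0][0].shape)  # cache has grown along axis 2
```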
- ~6,600 proof files, 3.7MB text
- Mathematical proofs for algebraic structures
- Structured format: @T, @G, @C, @P markers
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1)
- Schedule: Cosine decay, 1000-step warmup, min_lr=3e-5
- Data split: 90% train / 10% validation
- Early stopping: Patience of 10 evaluations
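The schedule above (1000-step linear warmup, then cosine decay from `3e-4` down to `3e-5`) fits in a few lines. `total_steps` is an assumed parameter here; the real `train.py` may size the decay horizon differently.

```python
# Warmup + cosine decay, using the hyperparameters listed above.
# A generic sketch, not the scheduler code from train.py.
import math

def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          min_lr: float = 3e-5, warmup: int = 1000) -> float:
    if step < warmup:
        return base_lr * step / warmup  # linear warmup from 0
    # cosine decay from base_lr to min_lr over the remaining steps
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```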
- Dataset size: 3.7MB, ~6,600 files
- Tokenized: ~1.1M tokens (3.4x compression ratio)
- Effective context: 256 tokens ≈ 870 chars average
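The last two figures are consistent with each other: the effective context in characters is just the context length times the measured compression ratio.

```python
# Quick consistency check of the stats above.
context_length = 256   # tokens
compression = 3.4      # chars per token, from the tokenized stats above
effective_chars = context_length * compression
print(round(effective_chars))  # ≈ 870 chars, matching the figure above
```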
- `train.py` - Training script
- `model.py` - GPT architecture
- `data.py` - Data loading
- `config.py` - Configuration
- `tokenizer.py` - BPE tokenizer
- `train_tokenizer.py` - Train the tokenizer
- `export_onnx.py` - ONNX export