Code for training and evaluating a "tactics" model, which suggests proof steps generatively.
Generate proof data from acornlib:

```bash
acorn --lib /path/to/acornlib training ./data/proofs
```

This creates one proof file per theorem in `data/proofs/`. The format uses `@T` (theorem prefix), `@G` (goal), `@C` (counterfactual), and `@P` (proof) markers.
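The marker format above can be split back into sections with a few lines of Python. This is a sketch, not the project's actual parser; the one-marker-per-line-prefix layout is an assumption.

```python
# Sketch: split a proof file into its marked sections. The marker
# meanings (@T theorem, @G goal, @C counterfactual, @P proof) come
# from this README; the exact file layout is assumed.
MARKERS = {"@T": "theorem", "@G": "goal", "@C": "counterfactual", "@P": "proof"}

def parse_proof(text: str) -> dict:
    sections = {name: [] for name in MARKERS.values()}
    current = None
    for line in text.splitlines():
        tag = line[:2]
        if tag in MARKERS:
            current = MARKERS[tag]
            sections[current].append(line[2:].strip())
        elif current is not None and line.strip():
            # continuation line belongs to the most recent marker
            sections[current].append(line.strip())
    return sections

example = "@T add_comm\n@G forall a b: a + b = b + a\n@P induction on a"
print(parse_proof(example)["goal"])  # ['forall a b: a + b = b + a']
```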
When you regenerate proof data, you typically want to clean up old checkpoints and tokenizers:

```bash
rm ./checkpoints/*
```

Train a BPE tokenizer on your proof dataset:
```bash
uv run train_tokenizer.py
```

This will:
- Read all proof files from `data/proofs/`
- Train a BPE tokenizer with `vocab_size=4096`
- Save `tokenizer.json` to the `checkpoints/` directory
- Show the compression ratio (expect ~3-4x compression)

Options: `--vocab_size 4096`, `--data_dir data/proofs`, `--output_dir checkpoints`
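The core of BPE training is a simple loop: count adjacent symbol pairs and merge the most frequent one, repeatedly. The toy below illustrates that loop in pure Python; `train_tokenizer.py` presumably delegates to a tokenizer library rather than doing this by hand.

```python
# Toy BPE trainer: repeatedly merge the most frequent adjacent pair.
# Illustrative only -- not the real train_tokenizer.py implementation.
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list:
    symbols = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges

print(bpe_merges("theorem theorem theorem", 3))
```

With enough merges on real proof data, frequent substrings like `theorem` end up as single tokens, which is where the ~3-4x compression comes from.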
Train the model from scratch:

```bash
uv run train.py
```

This will:
- Load the tokenizer from `checkpoints/tokenizer.json`
- Tokenize proof files from `data/proofs/`
- Train the GPT model (~9M parameters)
- Save checkpoints to `checkpoints/`
- Save the best model as `checkpoints/best_model.pt`
Configuration (edit `config.py`):
- `vocab_size`: 4096 (set from tokenizer)
- `context_length`: 256 tokens
- `max_epochs`: 30
- `batch_size`: 32
- `learning_rate`: 3e-4
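As a rough sketch, the settings above could live in a dataclass like the one below. The field names match this README; the dataclass shape itself is an assumption about how `config.py` is organized.

```python
# Hypothetical shape of config.py, using the values quoted above.
# Anything beyond the listed fields is an assumption.
from dataclasses import dataclass
from typing import Optional

@dataclass
class TrainingConfig:
    vocab_size: int = 4096       # set from the trained tokenizer
    context_length: int = 256    # tokens
    max_epochs: int = 30
    batch_size: int = 32
    learning_rate: float = 3e-4
    resume_from: Optional[str] = None  # path to a checkpoint, or None

training_config = TrainingConfig()
```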
Resume from a checkpoint:

```bash
uv run resume_training.py
```

Or specify the checkpoint in `config.py`:

```python
training_config.resume_from = "checkpoints/checkpoint_step_5000.pt"
```
Export for use by Acorn:

```bash
uv run export_onnx.py
```

This creates a timestamped directory in `export/` following HuggingFace convention:

```
export/tactics-2025-11-10-14-30-45/
├── model.onnx       # ONNX model with KV caching
├── tokenizer.json   # HuggingFace tokenizer
└── config.json      # Model architecture config
```
The exported model uses KV caching for fast autoregressive generation (50-90% speedup). See the KV Cache section below for details.
Custom export directory:

```bash
uv run export_onnx.py checkpoints/best_model.pt export/my-model
```
Uses BPE (Byte-Pair Encoding) instead of character-level tokenization.
Benefits:
- Efficient compression: "theorem add_comm" → ~4 tokens
- Longer effective context: 256 tokens ≈ 768-1024 characters
- Domain-specific vocabulary learned from your proofs
- Common terms like "theorem", "proof", "forall" become single tokens
Files:
- `checkpoints/tokenizer.json` - Trained tokenizer (vocabulary and merge rules)
- Type: BPE-tokenized GPT (decoder-only transformer)
- Vocab size: 4096
- Context length: 256 tokens (~768-1024 chars effective)
- Model dimension: 256
- Layers: 6
- Attention heads: 8
- Head dimension: 32
- Parameters: ~9M
- Features: RMSNorm, causal self-attention, tied embeddings, KV caching
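RMSNorm, one of the features listed above, normalizes by the root-mean-square over the feature dimension and applies a learned per-feature gain, with no mean subtraction (unlike LayerNorm). Below is a generic textbook implementation in numpy, not the code from `model.py`:

```python
# RMSNorm: x / rms(x) * gain, computed over the feature dimension.
# Generic illustration -- not taken from model.py.
import numpy as np

def rms_norm(x: np.ndarray, gain: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    # root-mean-square over the last (feature) axis, with eps for stability
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

x = np.random.randn(2, 256)      # [batch, model_dim]
y = rms_norm(x, np.ones(256))    # gain initialized to ones
```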
- `"Tokenizer file not found"` - Run `uv run train_tokenizer.py` first
- Out of memory - Reduce `batch_size` or `context_length` in `config.py`
- Vocab size mismatch - Delete old checkpoints after retraining the tokenizer
The exported ONNX model uses Key-Value caching for efficient autoregressive generation:
Model Inputs (13 total):
- `input_ids`: `[batch, seq_len]` - Input token IDs (typically `[1, 1]` during generation)
- `past_key_values.{0-5}.key`: `[batch, 8, cache_len, 32]` - Cached keys per layer
- `past_key_values.{0-5}.value`: `[batch, 8, cache_len, 32]` - Cached values per layer

Model Outputs (13 total):
- `logits`: `[batch, seq_len, vocab_size]` - Next-token predictions
- `present_key_values.{0-5}.key`: `[batch, 8, cache_len+1, 32]` - Updated keys
- `present_key_values.{0-5}.value`: `[batch, 8, cache_len+1, 32]` - Updated values
Usage Pattern:
- First token: pass the input with cache tensors of shape `[1, 8, 1, 32]` filled with zeros
- Subsequent tokens: feed the returned `present_key_values` back in as `past_key_values`
- The cache grows: `[1, 8, 1, 32]` → `[1, 8, 2, 32]` → ... → `[1, 8, 256, 32]` (max context)
Performance: 50-90% speedup for generation compared to non-cached inference.
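The feed-back pattern above can be sketched with numpy arrays standing in for the ONNX session. `dummy_step` below only mimics the shape contract (each present cache is one slot longer than the past cache); a real runner would call `session.run()` there instead.

```python
# Sketch of the KV-cache feed-back loop. Shapes match this README:
# 6 layers, 8 heads, head dim 32. dummy_step is a stand-in, not the model.
import numpy as np

LAYERS, HEADS, HEAD_DIM = 6, 8, 32

def dummy_step(input_ids, past):
    batch, seq_len = input_ids.shape
    logits = np.zeros((batch, seq_len, 4096))  # [batch, seq_len, vocab_size]
    # append one slot to each layer's key/value cache along the cache axis
    present = [(np.concatenate([k, np.zeros((batch, HEADS, 1, HEAD_DIM))], axis=2),
                np.concatenate([v, np.zeros((batch, HEADS, 1, HEAD_DIM))], axis=2))
               for k, v in past]
    return logits, present

# First token: zero-filled caches of shape [1, 8, 1, 32], one pair per layer.
past = [(np.zeros((1, HEADS, 1, HEAD_DIM)), np.zeros((1, HEADS, 1, HEAD_DIM)))
        for _ in range(LAYERS)]
for step in range(3):
    # feed present_key_values back in as past_key_values each step
    logits, past = dummy_step(np.array([[0]]), past)
print(past[0][0].shape)  # cache has grown along axis 2
```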
- ~6,600 proof files, 3.7MB text
- Mathematical proofs for algebraic structures
- Structured format: @T, @G, @C, @P markers
- Optimizer: AdamW (lr=3e-4, weight_decay=0.1)
- Schedule: Cosine decay, 1000-step warmup, min_lr=3e-5
- Data split: 90% train / 10% validation
- Early stopping: Patience of 10 evaluations
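The schedule above (1000-step linear warmup, then cosine decay from `3e-4` down to `3e-5`) fits in a few lines. `total_steps` is an assumed parameter here; the real `train.py` may size the decay horizon differently.

```python
# Warmup + cosine decay, using the hyperparameters listed above.
# A generic sketch, not the scheduler code from train.py.
import math

def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          min_lr: float = 3e-5, warmup: int = 1000) -> float:
    if step < warmup:
        return base_lr * step / warmup  # linear warmup from 0
    # cosine decay from base_lr to min_lr over the remaining steps
    progress = (step - warmup) / max(1, total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```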
- Dataset size: 3.7MB, ~6,600 files
- Tokenized: ~1.1M tokens (3.4x compression ratio)
- Effective context: 256 tokens ≈ 870 chars average
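The last two figures are consistent with each other: the effective context in characters is just the context length times the measured compression ratio.

```python
# Quick consistency check of the stats above.
context_length = 256   # tokens
compression = 3.4      # chars per token, from the tokenized stats above
effective_chars = context_length * compression
print(round(effective_chars))  # ≈ 870 chars, matching the figure above
```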
- `train.py` - Training script
- `model.py` - GPT architecture
- `data.py` - Data loading
- `config.py` - Configuration
- `tokenizer.py` - BPE tokenizer
- `train_tokenizer.py` - Train the tokenizer
- `export_onnx.py` - ONNX export