A minimal GPT implementation from scratch in PyTorch. Built for learning and experimentation with transformer architectures.
This is a character-level GPT trained on the Tiny Shakespeare dataset. The implementation includes:
- Multi-head causal self-attention
- Transformer blocks with pre-norm architecture
- Position and token embeddings
- Text generation with temperature and top-k sampling
Future work will expand this with different tokenization schemes, pre-processing, architectures, and interfaces.
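The temperature and top-k sampling mentioned in the feature list above can be illustrated in isolation. The helper below is a sketch over a single vector of next-token logits, not the repository's actual generation code:

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8, top_k: int = 50) -> torch.Tensor:
    """Draw one token id from a (vocab_size,) logits vector. Illustrative helper only."""
    logits = logits / temperature                   # <1 sharpens, >1 flattens the distribution
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)            # keep only the top_k largest logits
        logits = logits.masked_fill(logits < v[-1], float('-inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # sample one id

# Example with a fake 65-way character vocabulary
next_id = sample_next_token(torch.randn(65))
```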
Install dependencies:

```bash
pip install -r requirements.txt
```

Train on the Tiny Shakespeare dataset (downloads automatically):

```bash
python src/train.py
```

The script will:
- Download the Tiny Shakespeare dataset to `data/`
- Train for 1000 iterations (~5-10 minutes on GPU)
- Save checkpoints to `checkpoints/` (see the checkpoint sketch below)
- Generate a sample at the end
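The checkpoint format isn't documented here beyond the `model_state_dict` key used by the generation example at the end of this README; the snippet below is a minimal sketch of saving and reloading in that format, with a tiny `nn.Linear` standing in for the GPT model:

```python
import os
import torch
import torch.nn as nn

os.makedirs('checkpoints', exist_ok=True)
model = nn.Linear(8, 8)  # tiny stand-in for the actual GPT model

# Save under the 'model_state_dict' key that the generation example below reads.
torch.save({'model_state_dict': model.state_dict()}, 'checkpoints/best_model.pt')

# Reload later.
checkpoint = torch.load('checkpoints/best_model.pt')
model.load_state_dict(checkpoint['model_state_dict'])
```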
Test the trained model:

```bash
python src/test.py
```

Project structure:

```
src/
├── configs/      # Model configuration
├── model/        # GPT architecture (attention, layers, main model)
├── processing/   # Tokenizer and data batching
├── train.py      # Training script
└── test.py       # Model testing
```
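The tokenizer lives in `processing/`, but its interface isn't shown here. The class below is a hypothetical character-level tokenizer, illustrating the kind of encode/decode mapping a character-level GPT needs; the repository's actual class may differ:

```python
class CharTokenizer:
    """Hypothetical character-level tokenizer; not the repository's actual class."""

    def __init__(self, text: str):
        chars = sorted(set(text))                           # unique characters form the vocabulary
        self.stoi = {ch: i for i, ch in enumerate(chars)}   # char -> id
        self.itos = {i: ch for ch, i in self.stoi.items()}  # id -> char
        self.vocab_size = len(chars)

    def encode(self, s: str) -> list:
        return [self.stoi[ch] for ch in s]

    def decode(self, ids: list) -> str:
        return ''.join(self.itos[i] for i in ids)

tok = CharTokenizer("ROMEO: But soft, what light through yonder window breaks?")
assert tok.decode(tok.encode("ROMEO:")) == "ROMEO:"
```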
Modify GPTConfig to experiment with different architectures:
```python
config = GPTConfig(
    vocab_size=65,        # Character vocabulary size
    max_seq_len=256,      # Maximum sequence length
    d_models=384,         # Model dimension
    n_heads=6,            # Number of attention heads
    n_layers=6,           # Number of transformer blocks
    d_feedforward=1536,   # Feedforward dimension
    dropout=0.2,          # Dropout rate
)
```

Key components:

- Model: `GPT` - Main transformer model with generation
- Attention: `CausalSelfAttention` - Masked multi-head attention
- Layers: `TransformerBlock`, `FeedForward` (see the sketch below)
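For orientation, here is a minimal sketch of the pre-norm wiring those components imply: LayerNorm is applied before the attention and feed-forward sub-layers, each wrapped in a residual connection. The module below uses PyTorch's built-in `nn.MultiheadAttention` for brevity and is not the repository's exact implementation:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Illustrative pre-norm block: x + Attn(LN(x)), then x + FF(LN(x))."""

    def __init__(self, d_model: int, n_heads: int, d_feedforward: int, dropout: float):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_feedforward),
            nn.GELU(),
            nn.Linear(d_feedforward, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        t = x.size(1)
        # Boolean mask: True above the diagonal blocks attention to future positions.
        causal_mask = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask)  # masked self-attention
        x = x + attn_out                                          # residual connection
        x = x + self.ff(self.ln2(x))                              # pre-norm feed-forward
        return x

block = PreNormBlock(d_model=384, n_heads=6, d_feedforward=1536, dropout=0.2)
out = block(torch.randn(1, 16, 384))  # (batch, seq_len, d_model)
```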
- Optimizer: AdamW with learning rate 3e-4
- Scheduler: Cosine annealing
- Loss: Cross-entropy
- Dataset: Tiny Shakespeare (~1MB of text)
- Device: Automatically uses CUDA, MPS (Apple Silicon), or CPU
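As a sketch of how the optimizer, scheduler, and loss listed above fit together over the 1000 training iterations (a toy linear model and random batches stand in for the GPT and the Shakespeare data):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 65, 384
model = nn.Linear(d_model, vocab_size)  # toy stand-in for the GPT
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

for step in range(1000):
    x = torch.randn(32, d_model)                   # fake batch of hidden states
    targets = torch.randint(0, vocab_size, (32,))  # fake next-token targets
    loss = F.cross_entropy(model(x), targets)      # cross-entropy objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                               # cosine-annealed learning rate
```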
After training, generate text with:
```python
import torch
from model.gpt import GPT

# Load model (config and tokenizer must be constructed the same way as during training)
checkpoint = torch.load('checkpoints/best_model.pt')
model = GPT(config)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Generate: encode() yields a sequence of token ids; one pair of brackets gives a (1, seq_len) batch
prompt = torch.tensor([tokenizer.encode("ROMEO:")], dtype=torch.long)
output = model.generate(prompt, max_new_tokens=200, temperature=0.8)
```

Under Construction
License: MIT