This repository contains production-quality implementations of three major LLM architectures with 4,100+ lines of extensively commented code (41% comment ratio). Perfect for learning, interviews, and understanding how modern language models really work.
- Quick Start
- Three Core Architectures
- Repository Structure
- Learning Paths
- Core Concepts
- Training Techniques
- Inference Strategies
- Advanced Topics
- Hyperparameter Reference
- Customization Ideas
- Troubleshooting
- Further Reading
```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Run quick demo (trains all 3 architectures on toy data)
python quick_start.py

# 3. Generate visualizations
python visualizations.py

# 4. Compare architectures
python compare_architectures.py
```

That's it! You now have trained models and visualizations. Read on to understand what just happened.
| Architecture | Purpose | Attention | Best For | Pros | Cons |
|---|---|---|---|---|---|
| Encoder-Only (BERT) | Understand | Bidirectional | Classification, NER | ✅ Full context | ❌ Can't generate naturally |
| Decoder-Only (GPT) | Generate | Unidirectional | Text generation | ✅ Efficient generation | ❌ Limited context integration |
| Encoder-Decoder (T5) | Transform | Both | Translation, summarization | ✅ Best for seq2seq | ❌ More parameters |
- Understanding tasks? → Encoder-Only
- Generating text? → Decoder-Only
- Transforming sequences? → Encoder-Decoder
| File | Lines | Description | Key Concepts |
|---|---|---|---|
| README.md | - | You are here! Complete guide | Everything |
| requirements.txt | - | Python dependencies | torch, numpy, tqdm, matplotlib |
| toy_dataset.py | 350 | Number→word translation | Vocabulary, tokenization, masking, DataLoader |
| encoder_only.py | 650 | BERT-style (bidirectional) | Multi-head attention, LayerNorm, residual connections |
| decoder_only.py | 700 | GPT-style (causal) | Causal masking, autoregressive generation, sampling |
| encoder_decoder.py | 850 | T5-style (seq2seq) | Cross-attention, beam search, teacher forcing |
| quick_start.py | 250 | 5-minute demo of all 3 | Quick overview, minimal training |
| train_all.py | 500 | Full training pipeline | LR scheduling, gradient accumulation |
| compare_architectures.py | 400 | Side-by-side comparison | Decision making, use cases |
| visualizations.py | 400 | Generate 7 diagram types | Positional encoding, masks, architectures |
Total: 4,100+ lines with 1,670+ lines of comments (41% ratio!)
Perfect for getting hands-on quickly:
1. python quick_start.py # 5 min - See everything in action
2. python visualizations.py # 2 min - Generate diagrams
3. python compare_architectures.py # 5 min - Understand differences
4. Read encoder_only.py # 30 min - Study bidirectional attention
5. Read decoder_only.py # 30 min - Study causal attention
6. Read encoder_decoder.py # 45 min - Study cross-attention
7. Experiment with hyperparameters # Variable - Make it your own

Start with theory, then code:
1. Read "Core Concepts" section below # 60 min - Understand theory
2. python visualizations.py # 2 min - See concepts visually
3. Study encoder_only.py # 45 min - Code + theory
4. Study decoder_only.py # 45 min - Code + theory
5. Study encoder_decoder.py # 60 min - Code + theory
6. python train_all.py # Variable - Advanced techniques

Master everything:
1. Read entire README # 90 min - Full context
2. Study toy_dataset.py # 15 min - Data preparation
3. Study + modify encoder_only.py # 90 min - Deep dive
4. Study + modify decoder_only.py # 90 min - Deep dive
5. Study + modify encoder_decoder.py # 120 min - Deep dive
6. Implement your own variations # Variable - Innovation

What: Convert discrete tokens (words, subwords) into continuous vectors.
Why: Neural networks need numerical input, and embeddings learn semantic relationships.

embedding(token_id) -> vector of size d_model

Key Points:
- Each token gets a unique learnable vector
- Similar tokens have similar embeddings
- Typical dimensions: 128 (small), 768 (BERT), 12288 (GPT-3)
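A minimal sketch of this lookup using PyTorch's nn.Embedding (the sizes here are illustrative, not the repository's defaults):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 128          # illustrative sizes
embedding = nn.Embedding(vocab_size, d_model)   # one learnable vector per token id

token_ids = torch.tensor([[5, 42, 7]])   # (batch=1, seq_len=3)
vectors = embedding(token_ids)           # (1, 3, 128): each id mapped to its vector
print(vectors.shape)                     # torch.Size([1, 3, 128])
```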
What: Add position information to embeddings.
Why: Transformers have no inherent notion of order (unlike RNNs).
Sinusoidal Formula:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Benefits:
- No learnable parameters
- Works for any sequence length
- Smooth interpolation between positions
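The formula translates directly into a lookup table. A standalone sketch (not taken from the repository's files; the helper name is illustrative):

```python
import math
import torch

def sinusoidal_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) sinusoidal position table."""
    position = torch.arange(max_len).unsqueeze(1)                     # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions: sin
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions: cos
    return pe

pe = sinusoidal_encoding(max_len=50, d_model=128)
# Added to token embeddings: x = embedding(tokens) + pe[:seq_len]
```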
The Core Innovation:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Components:
- Q (Query): "What am I looking for?"
- K (Key): "What do I contain?"
- V (Value): "What do I output?"
- Scaling by sqrt(d_k): Keeps dot products from growing so large that the softmax saturates and gradients shrink
Intuition: For "The cat sat on the mat":
- "sat" attends to "cat" (who sat?) and "mat" (where?)
- Creates contextual representations automatically!
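A compact sketch of the formula above in PyTorch (standalone, not the repository's exact implementation):

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, heads, seq_len, d_k). Returns weighted values and the attention weights."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)          # (batch, heads, seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # blocked positions -> ~0 after softmax
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights
```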
Why Multiple Heads?
- Each head learns different relationships:
- Head 1: Grammatical relationships
- Head 2: Semantic relationships
- Head 3: Long-range dependencies
Formula:
MultiHead(Q,K,V) = Concat(head_1,...,head_h) * W_O
where head_i = Attention(Q*W_Q^i, K*W_K^i, V*W_V^i)
Typical Values: 8-16 heads, d_model / num_heads = d_k
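A minimal multi-head module following that formula (a sketch, reusing the `scaled_dot_product_attention` function from the example above; not the repository's exact code):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Project, split into heads, attend, merge, project back."""
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "num_heads must divide d_model"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # all heads' Q projections at once
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)   # W_O from the formula above

    def split_heads(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, seq, d_model) -> (batch, heads, seq, d_k)
        batch, seq, _ = x.shape
        return x.view(batch, seq, self.num_heads, self.d_k).transpose(1, 2)

    def forward(self, q, k, v, mask=None):
        q = self.split_heads(self.w_q(q))
        k = self.split_heads(self.w_k(k))
        v = self.split_heads(self.w_v(v))
        out, _ = scaled_dot_product_attention(q, k, v, mask)    # sketch from the previous example
        batch, _, seq, _ = out.shape
        out = out.transpose(1, 2).contiguous().view(batch, seq, self.num_heads * self.d_k)
        return self.w_o(out)                                    # Concat(head_1,...,head_h) * W_O
```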
| Type | Query | Key/Value | Used In | Purpose |
|---|---|---|---|---|
| Self-Attention | Same seq | Same seq | All models | Token attends to same sequence |
| Cross-Attention | Decoder | Encoder | Encoder-Decoder | Decoder attends to encoder output |
Cross-attention is the secret sauce of translation models!
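The only difference is which tensors feed Q versus K/V. A tiny illustration using the `MultiHeadAttention` sketch above (tensor names and sizes are illustrative):

```python
import torch

decoder_states = torch.randn(2, 7, 128)    # (batch, tgt_len, d_model)
encoder_output = torch.randn(2, 11, 128)   # (batch, src_len, d_model)
attn = MultiHeadAttention(d_model=128, num_heads=4)

self_out  = attn(decoder_states, decoder_states, decoder_states)   # self-attention: Q, K, V from one sequence
cross_out = attn(decoder_states, encoder_output, encoder_output)   # cross-attention: Q from decoder, K/V from encoder
```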
Three Types:
- Padding Mask: Ignore padding tokens
  mask = (tokens != PAD_TOKEN)  # True = real, False = padding
- Causal Mask: Prevent looking ahead (decoder-only)
  mask = torch.tril(torch.ones(n, n))  # Lower triangular
- Combined Mask: Both padding and causal
  mask = padding_mask & causal_mask

Implementation:

```python
scores = scores.masked_fill(mask == 0, float('-inf'))
# -inf becomes ~0 after softmax
```

Layer Normalization:

LN(x) = γ * (x - μ) / sqrt(σ² + ε) + β

- Stabilizes training
- Enables deeper networks
Residual Connections:
output = layer(input) + input  # Skip connection

- Addresses vanishing gradients
- Model learns "what to change" not "what to output"
Pre-Norm vs Post-Norm:
- Post-Norm: x = LN(x + sublayer(x)) (original)
- Pre-Norm: x = x + sublayer(LN(x)) ✓ Better for deep models
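Putting LayerNorm and the residual connection together, a minimal pre-norm wrapper (the class name is illustrative, not the repository's):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One pre-norm sublayer: x + sublayer(LN(x)), with dropout on the residual branch."""
    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        return x + self.dropout(self.sublayer(self.norm(x)))   # residual keeps gradients flowing
```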
Architecture:
Input -> Linear(d_model, d_ff) -> ReLU -> Dropout -> Linear(d_ff, d_model)
Why?
- Adds non-linearity after attention
- Typically d_ff = 4 × d_model (expansion then compression)
- Applied to each position independently
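A minimal module matching the architecture above (a sketch; names are illustrative):

```python
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise feed-forward: expand to d_ff, apply non-linearity, compress back."""
    def __init__(self, d_model: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),    # expansion (typically 4x d_model)
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),    # compression back to d_model
        )

    def forward(self, x):
        return self.net(x)               # applied to every position independently
```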
What: Use ground truth as input during training, not model predictions.
Example (translation):
Step 1: Decoder gets "<SOS>" → Predict "Bonjour"
Step 2: Decoder gets "<SOS> Bonjour" → Predict "le"
(Use ground truth "Bonjour", not prediction!)
Benefits: Faster training, stable gradients
Downside: Exposure bias (training ≠ inference)
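A hedged sketch of one teacher-forced training step for a seq2seq model; `model`, `src`, `tgt`, and `PAD_ID` are placeholders for whatever model and batch you have, not the repository's exact names:

```python
import torch.nn.functional as F

# tgt: (batch, tgt_len) of token ids, e.g. "<SOS> Bonjour le monde <EOS>"
decoder_input  = tgt[:, :-1]    # ground-truth tokens fed to the decoder (shifted right)
decoder_target = tgt[:, 1:]     # the tokens we want it to predict

logits = model(src, decoder_input)                  # (batch, tgt_len-1, vocab_size)
loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),            # flatten positions
    decoder_target.reshape(-1),
    ignore_index=PAD_ID,                            # don't penalize padding positions
)
```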
Warmup + Cosine Decay (State-of-the-art):
```python
if step < warmup_steps:
    lr = base_lr * (step / warmup_steps)                          # Linear warmup
else:
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = min_lr + (base_lr - min_lr) * 0.5 * (1 + math.cos(math.pi * progress))
```

Why Warmup? Initial parameters are random, so large early updates can destabilize training.
Why Decay? As the loss decreases, smaller updates allow finer adjustments.
Problem: Limited GPU memory → small batch size
Solution: Accumulate gradients over multiple mini-batches
```python
accumulation_steps = 4   # Effective batch = 32 * 4 = 128

for i, batch in enumerate(loader):
    loss = model(batch) / accumulation_steps
    loss.backward()                          # Accumulate gradients
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()                     # Update weights
        optimizer.zero_grad()
```

Problem: Exploding gradients in deep networks
Solution: Clip gradient norm
```python
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```

When to use: Always! Essential for stable transformer training.
| Method | Speed | Diversity | Quality | Best For |
|---|---|---|---|---|
| Greedy | ⚡⚡⚡ Fastest | ❌ None | 👍 Good | Quick demos, deterministic |
| Temperature | ⚡⚡⚡ Fast | 🎨 Adjustable | 👍👍 Good | Controlled creativity |
| Top-k | ⚡⚡ Fast | 🎨🎨 Medium | 👍👍 Good | General purpose |
| Top-p | ⚡⚡ Fast | 🎨🎨🎨 High | 👍👍 Very Good | Creative writing |
| Beam Search | ⚡ Slow | ❌ Low | 👍👍👍 Best | Translation, summarization |
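Per-method notes follow below. As a single standalone reference, here is a hedged sketch that combines temperature, top-k, and top-p filtering for one decoding step (the function and argument names are illustrative, not the repository's):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Sample one token id from a (vocab_size,) logits vector."""
    logits = logits / temperature                        # <1: sharper, >1: flatter
    if top_k > 0:                                        # keep only the k most likely tokens
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float('-inf'))
    if top_p < 1.0:                                      # nucleus: smallest set with cumulative prob >= p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        cutoff = cum_probs > top_p
        cutoff[1:] = cutoff[:-1].clone()                 # shift right: keep the token that crosses p
        cutoff[0] = False                                # always keep the most likely token
        logits[sorted_idx[cutoff]] = float('-inf')
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```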
Greedy Decoding:

```python
next_token = torch.argmax(probs, dim=-1)  # Pick highest probability
```

Pros: Fast, deterministic
Cons: Can miss better sequences
Temperature Sampling:

```python
probs = softmax(logits / temperature)
# temperature < 1: Conservative (factual)
# temperature > 1: Creative (stories)
```

Top-k Sampling:

```python
# Keep only top k tokens, sample from them
top_k_logits, indices = torch.topk(logits, k)
```

Top-p (Nucleus) Sampling:

```python
# Keep tokens until cumulative probability >= p
# Adaptive: fewer tokens when confident, more when uncertain
```

Beam Search:

```python
# Keep top-k candidates at each step
# More systematic search, better quality
# k times slower
```

Length Penalty:

```python
score = log_prob / (length ** penalty)
# penalty > 1: favor longer sequences
```

- Flash Attention: 2-4x faster, memory-efficient, exact
- Multi-Query Attention (MQA): Single K,V for all heads → faster inference
- Grouped-Query Attention (GQA): Middle ground between MHA and MQA
- RoPE (Rotary): Better length extrapolation (used in LLaMA)
- ALiBi: Add linear bias to attention scores, no embeddings needed
Problem: Full attention is O(n²)
Solutions: Local attention, strided attention, Longformer, BigBird
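As a toy illustration of the local-attention idea (not an implementation of Longformer or BigBird), each position can be restricted to a fixed window, turning the dense O(n²) score matrix into a narrow band:

```python
import torch

def local_attention_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to j only if |i - j| <= window."""
    idx = torch.arange(seq_len)
    return (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window   # (seq_len, seq_len)

mask = local_attention_mask(seq_len=6, window=1)
# Apply like any other mask: scores.masked_fill(~mask, float('-inf'))
```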
Loss = (N_0 / N)^α + (D_0 / D)^β
- Larger models are more sample-efficient
- Chinchilla scaling: Balance model size and data size
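As a rough worked example of the Chinchilla recommendation (on the order of 20 training tokens per parameter), a 70B-parameter model would be paired with roughly 70B × 20 ≈ 1.4T training tokens, rather than spending the same compute on a much larger model trained on less data.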
LoRA (Low-Rank Adaptation):
W_new = W_pretrained + AB # A:(d,r), B:(r,d), r<<d
Only train A and B → 0.1% of parameters!
Other methods: Prefix tuning, adapters, prompt tuning
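A minimal sketch of the LoRA idea above, wrapping an existing nn.Linear; the class name is illustrative and the initialization/scaling details are simplified relative to the original paper:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Freeze the pretrained weight W and learn a low-rank update A @ B on top of it."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = linear
        for p in self.base.parameters():
            p.requires_grad_(False)                        # W_pretrained (and bias) stay frozen
        d_out, d_in = linear.weight.shape
        self.A = nn.Parameter(torch.randn(d_out, r) * 0.01)   # (d, r), small random init (simplified)
        self.B = nn.Parameter(torch.zeros(r, d_in))            # (r, d), zeros -> update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ (self.A @ self.B).T)
```

Wrapping, say, each attention projection this way leaves only the small A and B matrices trainable, which is where the tiny trainable-parameter fraction comes from.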
- Gradient Checkpointing: Trade compute for memory
- Mixed Precision: FP16 operations → 2x speedup, 2x memory reduction
- Parameter Sharing: Share weights across layers (ALBERT)
| Parameter | Small | Medium | Large | Notes |
|---|---|---|---|---|
| d_model | 128-256 | 512-768 | 1024-2048 | Embedding dimension |
| num_heads | 4-8 | 8-12 | 16-32 | Must divide d_model |
| num_layers | 2-4 | 6-12 | 24-96 | More = better but slower |
| d_ff | 512-1024 | 2048-3072 | 4096-8192 | Typically 4×d_model |
| dropout | 0.1 | 0.1 | 0.1-0.3 | More for larger models |
| Parameter | Conservative | Standard | Aggressive |
|---|---|---|---|
| Learning Rate | 1e-5 | 1e-4 | 5e-4 |
| Warmup Steps | 1000 | 4000 | 10000 |
| Batch Size | 16-32 | 64-128 | 256-512 |
| Grad Clip | 0.5 | 1.0 | 5.0 |
| Parameter | Conservative | Balanced | Creative |
|---|---|---|---|
| Temperature | 0.7 | 1.0 | 1.5 |
| Top-k | 10 | 50 | 100 |
| Top-p | 0.75 | 0.9 | 0.95 |
| Beam Width | 3 | 5 | 10 |
```python
# In any model file, change these:
D_MODEL = 128          # Try: 64, 256, 512
NUM_LAYERS = 2         # Try: 1, 4, 6
NUM_HEADS = 4          # Try: 2, 8 (must divide d_model)
BATCH_SIZE = 32        # Try: 16, 64
LEARNING_RATE = 1e-4   # Try: 5e-5, 5e-4
```

- Implement different positional encodings (learnable, RoPE)
- Add dropout in different locations
- Try different activation functions (GELU, Swish)
- Implement weight tying (share embeddings with output layer)
- Implement Flash Attention
- Add sparse attention patterns
- Implement mixture of experts (MoE)
- Add LoRA for fine-tuning
- Implement model quantization
Loss not decreasing?
- ✓ Check learning rate (1e-4 is safe)
- ✓ Verify data preprocessing (tokens, masks)
- ✓ Reduce model size (start with d_model=128)
- ✓ Check for NaN gradients (torch.isnan(loss))
Training too slow?
- ✓ Enable mixed precision (FP16)
- ✓ Increase batch size
- ✓ Use gradient accumulation
- ✓ Profile with torch.profiler
CUDA out of memory?
Solutions (in order of preference):
1. Reduce batch_size: 32 → 16 → 8
2. Reduce sequence length: 512 → 256
3. Enable gradient checkpointing
4. Reduce d_model: 512 → 256
5. Use gradient accumulation

Generating gibberish?
- ✓ Train longer (loss < 1.0 for toy dataset)
- ✓ Lower temperature (1.0 → 0.7)
- ✓ Use beam search (width=5)
- ✓ Check tokenization/detokenization
Generation too slow?
- ✓ Use greedy decoding for prototyping
- ✓ Reduce beam width (10 → 3)
- ✓ Use caching (store past key/values)
- ✓ Consider model quantization
Repetitive outputs?
- ✓ Use top-p/top-k sampling
- ✓ Increase temperature
- ✓ Add repetition penalty
- ✓ Use beam search with diversity
Attention weights are uniform?
- ✓ Check masking logic
- ✓ Verify positional encodings
- ✓ Try different initialization
- ✓ Train longer
Overfitting quickly?
- ✓ Increase dropout (0.1 → 0.3)
- ✓ Add weight decay (L2 reg)
- ✓ Use more data
- ✓ Reduce model capacity
After 1 hour, you should understand:
- ✅ The three architecture types and when to use each
- ✅ Basic attention mechanism
- ✅ How to run and modify the code
After 3 hours, you should understand:
- ✅ Multi-head attention implementation
- ✅ Masking strategies (padding, causal, cross)
- ✅ Training loop with teacher forcing
- ✅ Basic inference strategies
After 6 hours, you should understand:
- ✅ All code in detail
- ✅ Cross-attention in encoder-decoder
- ✅ Advanced inference (beam search, sampling)
- ✅ Hyperparameter tuning
After 10 hours, you should be able to:
- ✅ Implement custom architectures
- ✅ Debug training issues
- ✅ Optimize for production
- ✅ Explain concepts to others
- Attention Is All You Need (2017) - The original Transformer
  - Difficulty: Medium | Priority: Must Read
- BERT: Pre-training of Deep Bidirectional Transformers (2018)
  - Difficulty: Easy | Use Case: Understanding tasks
- Language Models are Unsupervised Multitask Learners (GPT-2, 2019)
  - Difficulty: Easy | Use Case: Generation
- Exploring the Limits of Transfer Learning with T5 (2019)
  - Difficulty: Medium | Use Case: Seq2seq tasks
- LLaMA: Open and Efficient Foundation Language Models (2023)
  - Difficulty: Advanced | Modern best practices
- The Illustrated Transformer - Best visual introduction
- The Annotated Transformer - Line-by-line PyTorch
- Lil'Log: The Transformer Family - Comprehensive variants
- Andrej Karpathy's "Let's build GPT" - From scratch tutorial
- Stanford CS224N - NLP with Deep Learning course
- Hugging Face Course - Practical transformers
- Hugging Face Transformers - Production library
- nanoGPT - Minimal GPT implementation
- minGPT - Educational GPT
```bash
# Training
python encoder_only.py          # Train BERT-style (classification)
python decoder_only.py          # Train GPT-style (generation)
python encoder_decoder.py       # Train T5-style (translation)
python train_all.py             # Advanced training pipeline

# Visualization
python visualizations.py        # Generate all diagrams
python compare_architectures.py # Understand differences

# Quick demos
python quick_start.py           # 5-minute overview
python toy_dataset.py           # Inspect dataset

# Common modifications
# In any training file:
# - Change BATCH_SIZE = 32 → your value
# - Change NUM_EPOCHS = 20 → your value
# - Change D_MODEL = 256 → your value
# Then rerun!
```

This is an educational repository. Contributions welcome!
How to help:
- 🐛 Report bugs or unclear comments
- 📝 Improve documentation
- 💡 Suggest additional concepts to cover
- ✨ Add more visualizations
- 🌍 Translate comments to other languages
Guidelines:
- Keep code heavily commented (40%+ ratio)
- Prioritize clarity over performance
- Include visual examples where possible
- Test on toy dataset (fast iteration)
Inspired by:
- Andrej Karpathy's educational philosophy
- Harvard's Annotated Transformer
- Jay Alammar's visual explanations
Built for:
- ML students preparing for interviews
- Engineers transitioning to LLM work
- Anyone curious about how ChatGPT works under the hood
MIT License - Use freely for learning, teaching, and commercial projects.
Getting stuck? Check:
- Comments in the code (1,670+ lines!)
- Troubleshooting section above
- Generated visualizations (python visualizations.py)
- Output from python compare_architectures.py
Still confused? The code is the documentation! Every line is explained.
Happy Learning! 🚀
Remember: The best way to learn is to read the code, run it, break it, fix it, and make it your own!