A comprehensive implementation of the Transformer architecture from scratch, building understanding through first principles and detailed explanations.
- Complete Transformer Implementation: From positional encodings to multi-head attention
- Educational Focus: Detailed explanations of why each component exists and how it works
- From-Scratch Approach: Built using PyTorch primitives without relying on high-level transformer libraries
- Architecture Coverage:
- Positional Embeddings (sinusoidal; see the sketch after this list)
- Scaled Dot-Product Attention
- Multi-Head Attention
- Feed-Forward Networks
- Encoder and Decoder Blocks
- Causal Masking
- Full Encoder-Decoder Transformer
- Full Decoder-Only Transformer, and its core differences from the encoder-decoder design
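As a reference point for the list above, here is a minimal sketch of the sinusoidal positional encoding. The class name and the `d_model` / `max_len` parameters follow common conventions and are not necessarily the exact signatures used in the notebook; even `d_model` is assumed.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Fixed sinusoidal position embeddings, as in 'Attention Is All You Need'."""

    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        # pe[pos, 2i]   = sin(pos / 10000^(2i / d_model))
        # pe[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
        position = torch.arange(max_len).unsqueeze(1)                       # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        # Stored as a buffer: part of the module's state, but not a trainable parameter
        self.register_buffer("pe", pe.unsqueeze(0))                         # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the positional pattern for the first seq_len positions
        return x + self.pe[:, : x.size(1)]
```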
- PositionalEncoding: Fixed sinusoidal embeddings that provide position information
- ScaledDotProductAttention: Core attention mechanism with proper scaling
- MultiHeadAttention: Parallel attention heads with learned projections (both attention modules are sketched after this list)
- FeedForward: Position-wise non-linear transformations
- EncoderBlock: Self-attention + FFN with residual connections
- DecoderBlock: Masked self-attention + cross-attention + FFN
- Transformer: Complete encoder-decoder architecture
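For orientation, here is a minimal sketch of the two attention pieces listed above: scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ / √d_k) V, wrapped by multi-head projections. The function and parameter names (`num_heads`, `w_q`, and so on) are illustrative and may differ from the notebook's implementation.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)        # (..., seq_q, seq_k)
    if mask is not None:
        # Blocked positions (mask == 0) get -inf, so softmax assigns them zero weight
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate learned projections for queries, keys, and values
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, q, k, v, mask=None):
        batch = q.size(0)

        # Project, split d_model into (num_heads, d_head), and move heads before the sequence dim
        def split(x):
            return x.view(batch, -1, self.num_heads, self.d_head).transpose(1, 2)

        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        out, _ = scaled_dot_product_attention(q, k, v, mask)
        # Concatenate heads back into d_model and apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch, -1, self.num_heads * self.d_head)
        return self.w_o(out)
```

In self-attention the same tensor is passed as `q`, `k`, and `v`; in the decoder's cross-attention, queries come from the decoder and keys/values from the encoder output.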
- Post-LayerNorm: LayerNorm applied after each residual connection, following the original Transformer paper
- Residual Connections: Enable training of deep models
- Causal Masking: Ensures autoregressive behavior in the decoder (see the sketch after this list)
- Learned Projections: Separate Q, K, V projections for flexibility
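To make these design decisions concrete, here is a minimal sketch of a position-wise feed-forward layer and an encoder block using post-LayerNorm residuals, plus a causal-mask helper. It assumes the `MultiHeadAttention` sketch above is in scope; the names (`FeedForward`, `EncoderBlock`, `causal_mask`, `d_ff`) are illustrative, dropout is omitted, and the notebook's actual code may differ.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise FFN: two linear layers with a non-linearity in between."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):
        return self.net(x)

class EncoderBlock(nn.Module):
    """Self-attention + FFN, each wrapped in a residual connection with post-LayerNorm."""
    def __init__(self, d_model: int, num_heads: int, d_ff: int):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)  # from the sketch above
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Post-LN: normalize after adding the residual, as in the original paper
        x = self.norm1(x + self.attn(x, x, x, mask))
        x = self.norm2(x + self.ffn(x))
        return x

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position i may attend only to positions <= i."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
```

Passing `causal_mask(seq_len)` into the attention call blocks every position from attending to later positions, which is what makes the decoder (or a decoder-only model) autoregressive.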
GPT_From_Scratch/
├── GPT_From_Scratch (2).ipynb # Main implementation notebook
├── index.html # Static HTML export of notebook
└── README.md # This file
- Python 3.7+
- PyTorch
- Jupyter Notebook
# Clone the repository
git clone <repository-url>
cd GPT_From_Scratch
# Install dependencies
pip install torch torchvision torchaudio
pip install jupyter notebook

# Start Jupyter notebook
jupyter notebook
# Open and run "GPT_From_Scratch (2).ipynb"

This implementation is designed for learning and understanding. Each component includes:
- Detailed Comments: Explaining the "why" behind design choices
- Mathematical Context: References to original papers and theory
- Architectural Rationale: Why certain components are necessary
- Historical Context: Evolution from RNNs to Transformers
This repository is primarily educational. Contributions that improve clarity, fix bugs, or enhance the educational value are welcome.
This project is provided for educational purposes. Please refer to the original papers for proper attribution when using these concepts.