This repository documents my journey of building a Large Language Model (LLM) from scratch
- Studied the basics of Large Language Models (LLMs)
- Revised Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks
- Started watching videos and reading up on the basics
- Implemented a simple tokenizer from scratch and added special tokens → Tokenizer.ipynb
- Implemented Byte Pair Encoding (BPE) using `tiktoken` → Bytepairencoding.ipynb (see sketch below)
- Implemented input-target data pair generation using PyTorch's `DataLoader` → TargetPair.ipynb (see sketch below)
- Explored vector embeddings → Word2Vec Google News (300D)
- Created a token embedder in PyTorch using `torch.nn.Embedding` → TokenEmbedding.ipynb (see sketch below)
- Implemented positional token embedding in PyTorch → PositionalTokenEmbedding.ipynb (covered in the same sketch)
- Read *Attention Is All You Need*
- Explored simplified, self-, causal, and multi-head attention, and why RNNs fall short
- Reviewed the history of RNNs, LSTMs, and Transformers
- Learned about Bahdanau attention
- Implemented a simplified attention mechanism with non-trainable weights from scratch → SimplifiedAttention.ipynb
- Implemented a self-attention mechanism with trainable query, key, and value weight matrices from scratch → SelfAttention.ipynb
- Implemented a causal attention mechanism with dropout from scratch → CasualAttention.ipynb (see sketch below)
- Implemented a multi-head attention mechanism from scratch using a simple implementation → Multihead.ipynb
- Implemented a multi-head attention mechanism from scratch with weight splits in a single class (no wrapper class) → Multihead.ipynb (see sketch below)
- Added boilerplate code for the GPT-2 architecture → BoilerplateCode.ipynb
- Implemented a layer normalization class for the LLM → LayerNorm.ipynb (see sketch below)
- Implemented a feed-forward network with GELU activations for the LLM → Gelu.ipynb (covered in the same sketch)
- Implemented shortcut (skip) connections for the LLM → ShortCutconnection.ipynb
- Implemented the entire LLM transformer block → Transformer.ipynb
- Coded the 124-million-parameter GPT-2 model → GPT2.ipynb (see config sketch below)
- Coded GPT-2 next-token prediction → nextwordprediction.ipynb
- Implemented cross-entropy loss and perplexity for the LLM → Loss.ipynb (see sketch below)
- Evaluated LLM performance on a real dataset → GPT2_RealDataset.ipynb
- Coded the entire LLM pre-training loop → GPT2_entirePretraining.ipynb
- Implemented temperature scaling in the LLM → TemperatureScaling.ipynb (see sketch below)
- Implemented top-k sampling in the LLM → TOP-Ksampling.ipynb (covered in the same sketch)
- Saved and loaded LLM model weights using PyTorch → Save_Load_weights.ipynb (see sketch below)
- Loaded pre-trained OpenAI GPT-2 124M weights → Loading_OPEN-AI_weights.ipynb
- Trained using OpenAI GPT-2 774M weights → 774M_weights.ipynb
- Classification-finetuned the 124M model on a spam classification dataset → SpamClassificationFinetuned.ipynb
- Classification-finetuned the 124M model on the PubMed 20k dataset → MedicalClassificationFinetuned.ipynb
- Instruction-finetuned the 124M model using Alpaca-style prompt formatting → InstructionFinetuned.ipynb
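
The sketches below are minimal illustrations of some of the building blocks listed above. They follow the general approach of the notebooks but are not the notebook code itself; sample text, sizes, and names are illustrative assumptions. First, BPE tokenization with `tiktoken` (Bytepairencoding.ipynb):

```python
import tiktoken

# GPT-2's byte pair encoding vocabulary (50,257 tokens)
tokenizer = tiktoken.get_encoding("gpt2")

text = "Hello, do you like tea? <|endoftext|> In the sunlit terraces"
ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(ids)                    # list of integer token IDs
print(tokenizer.decode(ids))  # decodes back to the original text
```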
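
A sketch of input-target pair generation with a sliding window over the token IDs, wrapped in a PyTorch `Dataset`/`DataLoader` (TargetPair.ipynb); the class name, window size, and sample sentence are assumptions, not necessarily what the notebook uses:

```python
import tiktoken
import torch
from torch.utils.data import Dataset, DataLoader

class SlidingWindowDataset(Dataset):
    """Slides a fixed-length window over the token IDs; each target
    sequence is the input sequence shifted one position to the right."""
    def __init__(self, text, tokenizer, max_length=4, stride=1):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")
dataset = SlidingWindowDataset("In the heart of the city stood an old library.", tokenizer)
loader = DataLoader(dataset, batch_size=2, shuffle=False)
inputs, targets = next(iter(loader))
print(inputs.shape, targets.shape)   # torch.Size([2, 4]) torch.Size([2, 4])
```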
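
A sketch of token embeddings plus absolute positional embeddings with `torch.nn.Embedding` (TokenEmbedding.ipynb and PositionalTokenEmbedding.ipynb); the dimensions and token IDs here are made up for illustration:

```python
import torch

vocab_size, context_length, emb_dim = 50257, 4, 256   # illustrative sizes

token_emb = torch.nn.Embedding(vocab_size, emb_dim)
pos_emb = torch.nn.Embedding(context_length, emb_dim)

# A batch of token IDs with shape (batch_size, context_length)
token_ids = torch.tensor([[40, 367, 2885, 1464],
                          [1807, 3619, 402, 271]])

token_vectors = token_emb(token_ids)                 # (2, 4, 256)
positions = torch.arange(context_length)             # [0, 1, 2, 3]
input_vectors = token_vectors + pos_emb(positions)   # positions broadcast over the batch
print(input_vectors.shape)                           # torch.Size([2, 4, 256])
```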
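
A sketch of single-head causal attention with trainable query/key/value projections, an upper-triangular mask, and dropout on the attention weights (CasualAttention.ipynb); hyperparameters and names are illustrative:

```python
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    """Single-head causal self-attention with dropout on the attention weights."""
    def __init__(self, d_in, d_out, context_length, dropout=0.1, qkv_bias=False):
        super().__init__()
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # Upper-triangular mask blocks attention to future tokens
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape
        q, k, v = self.W_query(x), self.W_key(x), self.W_value(x)
        scores = q @ k.transpose(1, 2)
        scores.masked_fill_(self.mask[:num_tokens, :num_tokens], float("-inf"))
        weights = torch.softmax(scores / k.shape[-1] ** 0.5, dim=-1)
        weights = self.dropout(weights)
        return weights @ v

x = torch.randn(2, 6, 16)   # (batch, tokens, d_in)
attn = CausalAttention(d_in=16, d_out=16, context_length=6)
print(attn(x).shape)        # torch.Size([2, 6, 16])
```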
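
A sketch of multi-head causal attention in a single class: project once, then split the output dimension across heads instead of wrapping several single-head modules (Multihead.ipynb); again, sizes and names are assumptions:

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Causal multi-head attention with the weight split done inside one class."""
    def __init__(self, d_in, d_out, context_length, num_heads, dropout=0.1, qkv_bias=False):
        super().__init__()
        assert d_out % num_heads == 0, "d_out must be divisible by num_heads"
        self.num_heads = num_heads
        self.head_dim = d_out // num_heads
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.out_proj = nn.Linear(d_out, d_out)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer(
            "mask", torch.triu(torch.ones(context_length, context_length), diagonal=1).bool()
        )

    def forward(self, x):
        b, num_tokens, _ = x.shape

        def split(t):
            # (b, tokens, d_out) -> (b, num_heads, tokens, head_dim)
            return t.view(b, num_tokens, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(self.W_query(x)), split(self.W_key(x)), split(self.W_value(x))
        scores = q @ k.transpose(2, 3)
        scores.masked_fill_(self.mask[:num_tokens, :num_tokens], float("-inf"))
        weights = self.dropout(torch.softmax(scores / self.head_dim ** 0.5, dim=-1))
        # Merge heads back into a single embedding dimension
        context = (weights @ v).transpose(1, 2).reshape(b, num_tokens, -1)
        return self.out_proj(context)

x = torch.randn(2, 6, 32)
mha = MultiHeadAttention(d_in=32, d_out=32, context_length=6, num_heads=4)
print(mha(x).shape)   # torch.Size([2, 6, 32])
```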
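
Sketches of the layer normalization and GELU feed-forward modules used inside the transformer block (LayerNorm.ipynb and Gelu.ipynb); the 4x expansion factor follows the GPT-2 design, the rest is illustrative:

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Normalizes the last dimension to zero mean and unit variance,
    with a learnable scale and shift."""
    def __init__(self, emb_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(emb_dim))
        self.shift = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.scale * (x - mean) / torch.sqrt(var + self.eps) + self.shift

class FeedForward(nn.Module):
    """Position-wise feed-forward block: expand 4x, apply GELU, project back."""
    def __init__(self, emb_dim):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(emb_dim, 4 * emb_dim),
            nn.GELU(),
            nn.Linear(4 * emb_dim, emb_dim),
        )

    def forward(self, x):
        return self.layers(x)

x = torch.randn(2, 6, 768)
print(FeedForward(768)(LayerNorm(768)(x)).shape)   # torch.Size([2, 6, 768])
```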
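
For reference, the published hyperparameters of the 124M-parameter GPT-2 ("small") model that GPT2.ipynb builds, written as a config dict; the dict keys, dropout rate, and `qkv_bias` choice are illustrative rather than the notebook's exact settings:

```python
# Published architecture sizes of the 124M-parameter GPT-2 model
GPT_CONFIG_124M = {
    "vocab_size": 50257,      # BPE vocabulary size
    "context_length": 1024,   # maximum sequence length
    "emb_dim": 768,           # embedding / hidden size
    "n_heads": 12,            # attention heads per layer
    "n_layers": 12,           # transformer blocks
    "drop_rate": 0.1,         # dropout rate (training choice, not fixed by the architecture)
    "qkv_bias": False,        # query/key/value bias; the OpenAI weights use a bias term
}
```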
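
A sketch of the training loss: cross-entropy over flattened (batch, sequence) positions, with perplexity as its exponential (Loss.ipynb); the logits and targets here are random toy tensors:

```python
import torch
import torch.nn.functional as F

# Toy model outputs: (batch, seq_len, vocab_size) logits and the target token IDs
logits = torch.randn(2, 4, 50257)
targets = torch.randint(0, 50257, (2, 4))

# Flatten batch and sequence dimensions before computing cross-entropy
loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten())

# Perplexity is the exponential of the cross-entropy loss
perplexity = torch.exp(loss)
print(loss.item(), perplexity.item())
```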
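
A sketch of decoding with temperature scaling and top-k filtering (TemperatureScaling.ipynb and TOP-Ksampling.ipynb); the helper function name and default values are assumptions:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=None):
    """Pick the next token ID from the logits of the last position."""
    if top_k is not None:
        # Keep only the top_k largest logits and mask out the rest
        top_logits, _ = torch.topk(logits, top_k)
        logits = torch.where(logits < top_logits[..., -1, None],
                             torch.tensor(float("-inf")), logits)
    if temperature > 0:
        # Higher temperature flattens the distribution; lower sharpens it
        probs = torch.softmax(logits / temperature, dim=-1)
        return torch.multinomial(probs, num_samples=1)
    # Temperature of 0 falls back to greedy decoding
    return torch.argmax(logits, dim=-1, keepdim=True)

logits = torch.randn(1, 50257)   # logits for the last token position
print(sample_next_token(logits, temperature=0.8, top_k=50))
```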
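
A sketch of saving and loading model and optimizer state with PyTorch (Save_Load_weights.ipynb); the tiny stand-in model, learning rate, and file name are placeholders:

```python
import torch
import torch.nn as nn

# A tiny stand-in model; in the notebook this would be the GPT-2 model
model = nn.Linear(8, 8)
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

checkpoint_path = "checkpoint.pth"   # illustrative file name

# Save model and optimizer state so training can be resumed later
torch.save({
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
}, checkpoint_path)

# Load the states back
checkpoint = torch.load(checkpoint_path, map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
model.eval()   # switch to evaluation mode; use model.train() to resume training
```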