- 0.1 Understanding the high-level LLM training pipeline (pretraining → finetuning → alignment)
- 0.2 Hardware & software environment setup (PyTorch, CUDA/Mac, mixed precision, profiling tools)
```bash
conda create -n llm_from_scratch python=3.11
conda activate llm_from_scratch
pip install -r requirements.txt
```
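A quick sanity check for the setup above; a minimal sketch assuming PyTorch was installed via requirements.txt. It selects CUDA, Apple-Silicon MPS, or CPU and tries a mixed-precision matmul under autocast (shown only for CUDA here):

```python
import torch

# Pick the best available device: CUDA GPU, Apple-Silicon MPS, or CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print(f"PyTorch {torch.__version__}, device: {device}")

# Mixed precision via autocast (demonstrated only on CUDA in this sketch).
x = torch.randn(4, 8, device=device)
if device.type == "cuda":
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        y = x @ x.T
    print(y.dtype)  # torch.float16
```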
- It is based on an encoder-decoder architecture.
- Built entirely from self-attention and feed-forward layers (no recurrence or convolution).
- Scales well because attention processes all tokens in parallel.
- Tokens -> Embeddings: each token is mapped to a dense vector
- Positional Encodings: since there's no recurrence/convolution, position information is injected (sine-cosine patterns or learned embeddings)
- Each encoder layer has:
- Multi-Head Self-Attention
- Each token attends to all other tokens in the sequence
- Multi-head = multiple attention "views" running in parallel
- Add & Norm (residual connection + layer normalization)
- Feedforward network
- Two dense layers with a non-linearity in between
- Applied independently to each token
- Add & Norm again.
- Output: contextualized embeddings of the input tokens.
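A minimal PyTorch sketch of one encoder layer as described above; class and parameter names such as `EncoderBlock`, `d_model`, `n_heads`, and `d_ff` are illustrative assumptions, not fixed names from later chapters:

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """One encoder layer: multi-head self-attention -> Add & Norm -> FFN -> Add & Norm."""

    def __init__(self, d_model: int = 64, n_heads: int = 4, d_ff: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),  # dimensionality expansion
            nn.GELU(),                 # non-linearity between the two dense layers
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention: every token attends to every other token in the sequence.
        attn_out, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + attn_out)      # Add & Norm
        x = self.norm2(x + self.ffn(x))   # FFN applied per token, then Add & Norm
        return x

# Usage: a batch of 2 sequences, 10 tokens each, embedding size 64.
block = EncoderBlock()
out = block(torch.randn(2, 10, 64))
print(out.shape)  # torch.Size([2, 10, 64])
```

This follows the post-LayerNorm arrangement of the original paper; GPT-style blocks usually move the LayerNorm before each sub-layer (pre-LN).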
- Each decoder layer has:
- Masked Multi-Head Self-Attention
- Prevents tokens from attending to future tokens (causal mask; see the sketch after this list)
- Cross attention
- Decoder tokens attend to encoder outputs
- Feedforward network
- Residuals + Norm after each step.
- Output: representation of the target tokens so far, guided by the encoder context.
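A small sketch of the causal mask used by the masked self-attention above: scores for future positions are set to -inf before the softmax, so their attention weights become zero (tensor shapes and names are illustrative):

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 8
q = k = v = torch.randn(1, seq_len, d)   # stand-in decoder hidden states

# True above the diagonal = positions a token must not attend to (its future).
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

scores = (q @ k.transpose(-2, -1)) / d ** 0.5       # scaled dot-product scores
scores = scores.masked_fill(causal_mask, float("-inf"))
weights = F.softmax(scores, dim=-1)                 # row i has non-zero weights only for positions <= i
output = weights @ v
print(weights[0])
```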
- Final linear layer → softmax → gives probability distribution over the vocabulary.
- Input sentence -> embeddings + positional encodings
- Encoder layers -> produce context-aware vectors
- Decoder layers -> use masked self-attention + encoder info to generate next tokens.
- Prediction -> token-by-token autoregressive decoding
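A hedged sketch of that token-by-token decoding step; `greedy_decode`, `model`, `bos_id`, and `eos_id` are hypothetical placeholders, and it is assumed the model maps a source sequence plus the target prefix to per-position vocabulary logits:

```python
import torch

@torch.no_grad()
def greedy_decode(model, src_ids, bos_id=1, eos_id=2, max_len=50):
    """Greedy autoregressive decoding: append the argmax token until EOS or max_len."""
    tgt = torch.tensor([[bos_id]], device=src_ids.device)   # start with the BOS token
    for _ in range(max_len):
        logits = model(src_ids, tgt)                         # assumed shape: (1, tgt_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, next_id], dim=1)               # feed the grown prefix back in
        if next_id.item() == eos_id:                         # stop once end-of-sequence appears
            break
    return tgt
```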
Implement masked attention so this block can be used as a decoder (see the causal-mask sketch above), and implement a simple positional_encoding.
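A minimal sketch of a sinusoidal positional_encoding as mentioned above; the (seq_len, d_model) shape convention and an even d_model are assumptions:

```python
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sine/cosine positional encodings, shape (seq_len, d_model); d_model assumed even."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    angle = pos / torch.pow(10000.0, i / d_model)                   # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                  # sines on even dims
    pe[:, 1::2] = torch.cos(angle)                                  # cosines on odd dims
    return pe

# Added to the token embeddings before the first encoder/decoder layer.
emb = torch.randn(10, 64)
x = emb + positional_encoding(10, 64)
```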
- 1.1 Positional embeddings (absolute learned vs. sinusoidal)
- 1.2 Self-attention from first principles (manual computation with a tiny example)
- 1.3 Building a single attention head in PyTorch
- 1.4 Multi-head attention (splitting, concatenation, projections)
- 1.5 Feed-forward networks (MLP layers) — GELU, dimensionality expansion
- 1.6 Residual connections & LayerNorm
- 1.7 Stacking into a full Transformer block
