This repository is a from-scratch implementation of Word2Vec (Skip-Gram with naïve softmax).
The goal is not performance, but understanding.
Instead of directly using high-level libraries such as gensim, this project rebuilds Word2Vec step by step in order to expose:
- how text is transformed into training data,
- how the probabilistic model $P(o \mid c)$ is defined,
- what is really optimized,
- how gradients modify the embedding matrices,
- and why semantic structure emerges from co-occurrence.
Everything is written with an explicit math → code correspondence, so that every equation in the README has a concrete implementation in the code.
Let the corpus be a sequence of words:

$$w_1, w_2, \dots, w_T, \quad w_t \in \mathcal{V}$$

Let $m$ be the context window radius. For every center position $t$, Skip-Gram generates training pairs:

$$(w_t, w_{t+j}), \quad -m \le j \le m, \; j \ne 0$$

The training set is:

$$\mathcal{D} = \{ (c, o) \mid c = w_t, \; o = w_{t+j}, \; 1 \le |j| \le m \}$$

The parameters of the model are:

$$\theta = (V, U), \quad V, U \in \mathbb{R}^{|\mathcal{V}| \times D}$$

where:

- $V$ is the matrix of center (input) embeddings,
- $U$ is the matrix of context (output) embeddings,
- $D$ is the embedding dimension.

Each word $w \in \mathcal{V}$ has:

- a center vector $v_w \in \mathbb{R}^D$ (a row of $V$),
- a context vector $u_w \in \mathbb{R}^D$ (a row of $U$).
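These parameters can be sketched in PyTorch. `UM` matches the context-matrix name that appears later in the training code; `VM` and the toy sizes are assumptions for illustration:

```python
import torch

VOCAB_SIZE, D = 11, 16  # toy sizes; the real values come from the corpus

# VM holds the center vectors v_w (rows), UM holds the context vectors u_w (rows).
VM = torch.nn.Parameter(torch.randn(VOCAB_SIZE, D) * 0.01)
UM = torch.nn.Parameter(torch.randn(VOCAB_SIZE, D) * 0.01)
```

Both matrices are trainable, so gradients from the loss flow into every row that participates in a batch.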
A small toy corpus is used:
- "king is a man"
- "queen is a woman"
- "boy is a man"
- "girl is a woman"
- "paris is france"
- "rome is italy"
- "france is europe"
- "italy is europe"
After preprocessing (lowercasing, whitespace tokenization, and removal of stopwords such as "is" and "a"), each sentence becomes a sequence of tokens.
Example:
"king is a man" → ["king", "man"]
Let the set of unique words be $\mathcal{V}$.

We define a bijection between words and integer indices:

$$\mathcal{V} \leftrightarrow \{0, 1, \dots, |\mathcal{V}| - 1\}$$

Implemented as two dictionaries:

```python
word2idx[word] = idx
idx2word[idx] = word
```

Two vectors per word:

- Center embedding: $$v_w \in \mathbb{R}^D$$
- Context embedding: $$u_w \in \mathbb{R}^D$$

Stored as two matrices of shape $(|\mathcal{V}|, D)$: one for center vectors and one for context vectors (the latter appears in the code as `self.UM`).
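Building the word ↔ index bijection from the tokenized corpus can be sketched as follows (`build_vocab` is a hypothetical helper name, not from the repo):

```python
def build_vocab(tokenized_sentences):
    """Map each unique word to a stable integer id and back."""
    words = sorted({w for sent in tokenized_sentences for w in sent})
    word2idx = {w: i for i, w in enumerate(words)}
    idx2word = {i: w for w, i in word2idx.items()}
    return word2idx, idx2word

w2i, i2w = build_vocab([["king", "man"], ["queen", "woman"]])
```

Sorting before enumerating makes the ids deterministic across runs, which is convenient when comparing embeddings between experiments.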
Given a window radius $m$, each pair $(c, o)$ of a center word $c$ and one of its context words $o$ is one training example.
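Pair generation can be sketched as a sliding window (`skipgram_pairs` is a hypothetical helper; window radius $m = 1$ here):

```python
def skipgram_pairs(tokens, m=1):
    """Return (center, context) pairs within a window of radius m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

print(skipgram_pairs(["king", "man"]))  # → [('king', 'man'), ('man', 'king')]
```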
Scores (one logit per vocabulary word):

$$s_w = u_w^\top v_c, \quad w \in \mathcal{V}$$

Probabilities (softmax over the vocabulary):

$$P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in \mathcal{V}} \exp(u_w^\top v_c)}$$
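The score-then-softmax computation can be checked in isolation with random toy tensors (nothing here is repo code):

```python
import torch

torch.manual_seed(0)
D, V = 4, 6
v_c = torch.randn(D)     # center vector v_c
UM = torch.randn(V, D)   # context matrix U, one row u_w per word

scores = UM @ v_c                     # s_w = u_w^T v_c for all w
probs = torch.softmax(scores, dim=0)  # P(w | c), a distribution over the vocabulary
```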
In code:

```python
logits = v @ self.UM.T             # s_w = u_w^T v_c for every word in the vocabulary
loss = F.cross_entropy(logits, o)  # -log P(o | c), the naive-softmax loss
```

Update rule (stochastic gradient descent with learning rate $\eta$):

$$\theta \leftarrow \theta - \eta \, \nabla_\theta J(\theta)$$
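To confirm that `F.cross_entropy` really is the naive-softmax loss $-\log P(o \mid c)$, it can be compared against an explicit log-softmax (a standalone check, not repo code):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(1, 6)   # one example, vocabulary size 6
o = torch.tensor([2])        # index of the true context word

loss = F.cross_entropy(logits, o)
manual = -torch.log_softmax(logits, dim=1)[0, o.item()]
```

The two values agree to floating-point precision, which is why the code never materializes the softmax explicitly: `cross_entropy` fuses it with the log for numerical stability.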
In code:

```python
opt.zero_grad()   # clear gradients from the previous step
loss.backward()   # backpropagate through the softmax and both embedding matrices
opt.step()        # apply the gradient update to the parameters
```

This project is a minimal, math-first Word2Vec implementation.
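Putting the snippets together, a self-contained toy training loop might look like this; every name, size, and pair below is illustrative rather than taken from the repo:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V_SIZE, D = 6, 8
VM = torch.randn(V_SIZE, D, requires_grad=True)  # center embeddings V
UM = torch.randn(V_SIZE, D, requires_grad=True)  # context embeddings U
opt = torch.optim.SGD([VM, UM], lr=0.1)

# Toy (center_id, context_id) training pairs.
c = torch.tensor([0, 1, 2])
o = torch.tensor([1, 0, 3])

losses = []
for _ in range(100):
    logits = VM[c] @ UM.T              # scores for every vocabulary word
    loss = F.cross_entropy(logits, o)  # mean naive-softmax loss over the batch
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

The loss drops steadily on this toy data, which is exactly the mechanism by which co-occurring words end up with similar vectors.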