# Word2Vec: a homemade implementation

This repository is a from-scratch implementation of Word2Vec (Skip-Gram with naïve softmax).

The goal is not performance, but understanding.

Instead of directly using high-level libraries such as gensim, this project rebuilds Word2Vec step by step in order to expose:

- how text is transformed into training data,
- how the probabilistic model $P(o \mid c)$ is defined,
- what is really optimized,
- how gradients modify the embedding matrices,
- and why semantic structure emerges from co-occurrence.

Everything is written with an explicit math → code correspondence, so that every equation in the README has a concrete implementation in the code.


## 0) Implementation (Skip-Gram objective)

Let the corpus be a sequence of words:

$$ (w_1, w_2, \dots, w_T) $$

Let $m$ be the context window radius.

Skip-Gram generates training pairs:

$$ (w_t, w_{t+j}) \quad \text{with} \quad -m \le j \le m, \; j \ne 0 $$

The training set is:

$$ \mathcal{D} = \{(w_t, w_{t+j}) \mid 1 \le t \le T, \; -m \le j \le m, \; j \ne 0\} $$

### Model parameters

The parameters of the model are:

$$ \theta = (\mathbf U, \mathbf V) $$

where:

$$ \mathbf V \in \mathbb{R}^{|\mathcal{V}| \times D}, \qquad \mathbf U \in \mathbb{R}^{|\mathcal{V}| \times D} $$

Each word $w \in \mathcal{V}$ has:

$$ v_w \in \mathbb{R}^D, \qquad u_w \in \mathbb{R}^D $$

### Probability model

$$ P(w_{t+j} \mid w_t; \theta) = \frac{\exp(u_{w_{t+j}}^\top v_{w_t})} {\sum_{w \in \mathcal{V}} \exp(u_w^\top v_{w_t})} $$

### Likelihood

$$ L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta) $$

### Objective function

$$ J(\theta) = - \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta) $$
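For a single pair $(c, o)$ the summand expands into two terms, which is worth writing out because the gradient updates later in the code act on exactly these:

$$ \ell(c, o) = -\log P(o \mid c) = -u_o^\top v_c + \log \sum_{w \in \mathcal{V}} \exp(u_w^\top v_c) $$

Differentiating with respect to the center vector gives

$$ \nabla_{v_c}\, \ell(c, o) = \sum_{w \in \mathcal{V}} \big(P(w \mid c) - \mathbb{1}[w = o]\big)\, u_w $$

so each update pulls $v_c$ toward the observed context vector $u_o$ and pushes it away from the expected context vector under the current model.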


## 1) Data and preprocessing (text → tokens)

A small toy corpus is used:

- "king is a man"
- "queen is a woman"
- "boy is a man"
- "girl is a woman"
- "paris is france"
- "rome is italy"
- "france is europe"
- "italy is europe"

After preprocessing (lowercasing, tokenization, and removal of filler words such as "is" and "a"), each sentence becomes a sequence of content tokens.

Example:

`"king is a man"` → `["king", "man"]`
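The exact preprocessing depends on the repo's code; the following is a minimal sketch consistent with the example above, assuming lowercasing, whitespace tokenization, and a hand-picked stop-word list:

```python
STOPWORDS = {"is", "a"}  # assumed filler-word list, inferred from the example above

def preprocess(sentence):
    """Lowercase, split on whitespace, and drop stop words."""
    return [tok for tok in sentence.lower().split() if tok not in STOPWORDS]

preprocess("king is a man")
# → ["king", "man"]
```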


## 2) Vocabulary

Let the set of unique words be $\mathcal{V}$, with $V = |\mathcal{V}|$.

We define a bijection:

$$ \text{id} : \mathcal{V} \rightarrow \{0, 1, \dots, V-1\} $$

Implemented as a pair of dictionaries:

```python
word2id[word] = idx   # word → integer id
id2word[idx] = word   # integer id → word
```
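Applied to tokenized sentences, the bijection can be built in a few lines (a sketch; the repo's actual helper names may differ):

```python
def build_vocab(tokenized_sentences):
    """Build the bijection word ↔ id over the set of unique words."""
    vocab = sorted({w for sent in tokenized_sentences for w in sent})
    word2id = {w: i for i, w in enumerate(vocab)}
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word

word2id, id2word = build_vocab([["king", "man"], ["queen", "woman"]])
# word2id → {"king": 0, "man": 1, "queen": 2, "woman": 3}
```

Sorting the vocabulary first makes the ids deterministic across runs.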

## 3) Embedding matrices

Two vectors per word:

- Center embedding: $v_w \in \mathbb{R}^D$
- Context embedding: $u_w \in \mathbb{R}^D$

Stored as:

$$ \mathbf{V} \in \mathbb{R}^{V \times D}, \quad \mathbf{U} \in \mathbb{R}^{V \times D} $$
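In PyTorch (which the later snippets suggest the repo uses), these can be created as two trainable matrices; the sizes and names here are illustrative:

```python
import torch

V_size, D = 10, 50  # vocabulary size and embedding dimension (illustrative)

# Small random initialization; each row is one word's vector
VM = torch.nn.Parameter(0.01 * torch.randn(V_size, D))  # center matrix V
UM = torch.nn.Parameter(0.01 * torch.randn(V_size, D))  # context matrix U

v_king = VM[3]  # the center vector v_w for the word with id 3
```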


## 4) Skip-Gram pairs

Given a window radius $m$:

$$ \mathcal{D} = \{(w_t, w_{t+j}) \mid 1 \le t \le T, \; -m \le j \le m, \; j \ne 0\} $$

Each pair $(c, o)$ of a center word $c$ and an outside word $o$ is one training example.
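The pair extraction can be sketched as follows (window boundaries clipped at the sentence edges; the function name is illustrative):

```python
def skipgram_pairs(tokens, m):
    """Yield (center, outside) pairs within a window of radius m."""
    pairs = []
    for t, center in enumerate(tokens):
        for j in range(-m, m + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((center, tokens[t + j]))
    return pairs

skipgram_pairs(["king", "man"], m=1)
# → [("king", "man"), ("man", "king")]
```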


## 5) Model

Scores:

$$ s(w;c) = u_w^\top v_c $$

Probabilities:

$$ P(w\mid c) = \frac{\exp(u_w^\top v_c)}{\sum_{w'} \exp(u_{w'}^\top v_c)} $$

In code:

```python
logits = v @ self.UM.T             # scores u_w^T v_c for every word w in the vocabulary
loss = F.cross_entropy(logits, o)  # softmax + negative log-likelihood in one call
```
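It is worth checking numerically that `F.cross_entropy` on these logits is exactly the $-\log P(o \mid c)$ defined above (a standalone sketch; the tensor names and sizes are illustrative, not the repo's):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
UM = torch.randn(5, 8)    # context matrix U: |V| = 5, D = 8
v = torch.randn(1, 8)     # center vector v_c (batch of one)
o = torch.tensor([2])     # id of the observed outside word

logits = v @ UM.T                                # u_w^T v_c for every w
loss = F.cross_entropy(logits, o)                # library cross-entropy
manual = -torch.log_softmax(logits, 1)[0, o[0]]  # -log P(o | c) by hand
assert torch.allclose(loss, manual)
```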

## 6) Optimization

Update rule:

$$ \theta \leftarrow \theta - \alpha \nabla_\theta \ell(c,o) $$

In code:

```python
opt.zero_grad()   # clear gradients accumulated in the previous step
loss.backward()   # backpropagate to get ∇_θ ℓ(c, o)
opt.step()        # SGD update: θ ← θ − α ∇_θ ℓ(c, o)
```
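Putting all the pieces together, a tiny end-to-end run looks like this (a self-contained sketch: the corpus subset, variable names, and hyperparameters are illustrative, not the repo's):

```python
import torch
import torch.nn.functional as F

corpus = [["king", "man"], ["queen", "woman"], ["king", "queen"]]
vocab = sorted({w for sent in corpus for w in sent})
word2id = {w: i for i, w in enumerate(vocab)}

# Skip-Gram (center, outside) id pairs with window radius m = 1
pairs = []
for sent in corpus:
    ids = [word2id[w] for w in sent]
    for t, c in enumerate(ids):
        for j in (-1, 1):
            if 0 <= t + j < len(ids):
                pairs.append((c, ids[t + j]))

D = 16
VM = torch.nn.Parameter(0.01 * torch.randn(len(vocab), D))  # center vectors V
UM = torch.nn.Parameter(0.01 * torch.randn(len(vocab), D))  # context vectors U
opt = torch.optim.SGD([VM, UM], lr=0.5)

centers = torch.tensor([c for c, o in pairs])
outsides = torch.tensor([o for c, o in pairs])

losses = []
for epoch in range(100):
    logits = VM[centers] @ UM.T               # scores for every vocab word
    loss = F.cross_entropy(logits, outsides)  # mean -log P(o | c) over all pairs
    opt.zero_grad()
    loss.backward()
    opt.step()
    losses.append(loss.item())
# the loss decreases from ≈ log(4) at the near-uniform initialization
```

Full-batch gradient descent is enough here; on a real corpus one would iterate over minibatches of pairs instead.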

This project is a minimal, math-first Word2Vec implementation.
