This project implements Byte Pair Encoding (BPE), originally proposed by Philip Gage (1994) as a data-compression algorithm and later adapted for natural language processing. The repository demonstrates both:
- The original compression-oriented BPE algorithm
- The modern text tokenization variant used in NLP models
It provides minimal, modular Python and Java implementations for educational, experimental, and research use.

## Features

- Python implementation
- Java implementation
- Modular design
- Configurable number of merges (`num_merges`)
- Test corpus and reproducible examples
## How It Works

Byte Pair Encoding iteratively merges the most frequent adjacent symbol pair in a corpus, creating new tokens that represent common subsequences. This reduces the token count and captures recurring patterns without relying on a fixed dictionary.
- Start with a vocabulary consisting of characters
- Count all adjacent pairs in the corpus
- Merge the most frequent pair into a single token
- Repeat until maximum merges or convergence
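The steps above can be sketched in plain Python. This is a minimal illustration of the algorithm, not the repository's actual implementation; the function names are chosen for this example only:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all tokenized words."""
    counts = Counter()
    for symbols in corpus:
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for symbols in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

def train_bpe(words, num_merges):
    """Run up to `num_merges` BPE merge steps, starting from characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break  # convergence: nothing left to merge
        best = max(counts, key=counts.get)
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return corpus, merges
```

Running `train_bpe(["low", "lowest", "newer", "wider"], 3)` reproduces the merge sequence shown in the example below.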
In NLP tokenization, this helps models generalize across rare and unseen words while limiting vocabulary growth.

## Example

Input corpus: `["low", "lowest", "newer", "wider"]`

After a few merges:

```
Initial symbols: ['l o w', 'l o w e s t', 'n e w e r', 'w i d e r']
Merge 1: ('l', 'o') → 'lo'
Merge 2: ('lo', 'w') → 'low'
Merge 3: ('e', 'r') → 'er'
```

Resulting vocabulary includes:
```
{'l', 'o', 'w', 'e', 'r', 'lo', 'low', 'er', ...}
```

## Repository Structure

```
BPE_BytePairEncoding/
├── Files
│   └── Corpus.txt              # Text corpus
├── PDFs                        # PDF of the notebook and Philip Gage's paper
│   ├── BPE_Gage.pdf            # Gage's paper
│   └── BPE.pdf                 # PDF of this notebook
├── Java                        # Java implementation of Byte Pair Encoding (BPE)
│   ├── BPETrain_Interface.java # Interface defining BPE trainer methods (train, encode, decode)
│   ├── BPETrainer.java         # Implementation of the BPE trainer
│   ├── CorpusLoader.java       # Loads the text corpus from a file
│   ├── Main.java               # Driver class to run BPE training and test encoding/decoding
│   ├── PairCounter.java        # Counts frequency of symbol pairs in the tokenized corpus
│   └── Tokenizer.java          # Tokenizes the corpus into character-level tokens with word boundaries
├── BPE.ipynb                   # Jupyter Notebook
├── LICENSE
└── README.md                   # This file
```

## Limitations

- BPE is deterministic and purely frequency-based; it has no semantic awareness.
- Not optimal for morphologically rich languages.
- Rare or unseen patterns may be fragmented into many short tokens.
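The Java interface above names three operations: train, encode, and decode. As an illustrative sketch only (the helper names are hypothetical and do not mirror the repository's code), encoding a new word applies the learned merges in training order, and decoding concatenates the tokens back:

```python
def encode(word, merges):
    """Apply learned merges, in training order, to a new word."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

def decode(symbols):
    """Decoding simply concatenates the subword tokens."""
    return "".join(symbols)
```

With the merges from the example, `('l', 'o')`, `('lo', 'w')`, `('e', 'r')`, the unseen word "lower" encodes to `['low', 'er']`, illustrating how BPE generalizes beyond the training corpus.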
## License

This project is open-source under the MIT License: free to use, modify, and distribute for research and educational purposes.