This project implements Byte Pair Encoding (BPE), originally proposed by Philip Gage (1994) as a data-compression algorithm and later adapted for natural language processing. The repository demonstrates both:
- The original compression-oriented BPE algorithm
- The modern text tokenization variant used in NLP models
It provides minimal, modular Python and Java implementations for educational, experimental, and research use.

## Features

- Python implementation
- Java implementation
- Modular design
- Configurable number of merges (`num_merges`)
- Test corpus and reproducible examples
## How It Works

Byte Pair Encoding iteratively merges the most frequent adjacent symbol pair in a corpus, creating new tokens that represent common subsequences. This reduces the token count and captures recurring patterns without relying on a fixed dictionary.
- Start with a vocabulary consisting of characters
- Count all adjacent pairs in the corpus
- Merge the most frequent pair into a single token
- Repeat until maximum merges or convergence
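The steps above can be sketched in plain Python. This is a minimal illustration of the algorithm, not the repository's actual implementation; the function names are chosen for this example only:

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all tokenized words."""
    counts = Counter()
    for symbols in corpus:
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += 1
    return counts

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with its concatenation."""
    merged = []
    for symbols in corpus:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

def train_bpe(words, num_merges):
    """Run up to `num_merges` BPE merge steps, starting from characters."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(corpus)
        if not counts:
            break  # convergence: nothing left to merge
        best = max(counts, key=counts.get)
        corpus = merge_pair(corpus, best)
        merges.append(best)
    return corpus, merges
```

Running `train_bpe(["low", "lowest", "newer", "wider"], 3)` reproduces the merge sequence shown in the example below.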
In NLP tokenization, this helps models generalize across rare and unseen words while limiting vocabulary growth.

## Example

Input corpus: `["low", "lowest", "newer", "wider"]`

After a few merges:

```
Initial symbols: ['l o w', 'l o w e s t', 'n e w e r', 'w i d e r']
Merge 1: ('l', 'o') → 'lo'
Merge 2: ('lo', 'w') → 'low'
Merge 3: ('e', 'r') → 'er'
```

Resulting vocabulary includes:
```
{'l', 'o', 'w', 'e', 'r', 'lo', 'low', 'er', ...}
```

## Repository Structure

```
BPE_BytePairEncoding/
├── Files
│   └── Corpus.txt              # Text corpus
├── PDFs                        # PDF of the notebook and Philip Gage's paper
│   ├── BPE_Gage.pdf            # Gage's paper
│   └── BPE.pdf                 # PDF of this notebook
├── Java                        # Java implementation of Byte Pair Encoding (BPE)
│   ├── BPETrain_Interface.java # Interface defining BPE trainer methods (train, encode, decode)
│   ├── BPETrainer.java         # Implementation of the BPE trainer
│   ├── CorpusLoader.java       # Loads the text corpus from a file
│   ├── Main.java               # Driver class to run BPE training and test encoding/decoding
│   ├── PairCounter.java        # Counts frequency of symbol pairs in the tokenized corpus
│   └── Tokenizer.java          # Tokenizes the corpus into character-level tokens with word boundaries
├── BPE.ipynb                   # Jupyter Notebook
├── LICENSE
└── README.md                   # This file
```

## Limitations

- BPE is deterministic and purely frequency-based; it has no semantic awareness.
- Not optimal for morphologically rich languages.
- Rare or unseen patterns may be fragmented into many short tokens.
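The Java interface above names three operations: train, encode, and decode. As an illustrative sketch only (the helper names are hypothetical and do not mirror the repository's code), encoding a new word applies the learned merges in training order, and decoding concatenates the tokens back:

```python
def encode(word, merges):
    """Apply learned merges, in training order, to a new word."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

def decode(symbols):
    """Decoding simply concatenates the subword tokens."""
    return "".join(symbols)
```

With the merges from the example, `('l', 'o')`, `('lo', 'w')`, `('e', 'r')`, the unseen word "lower" encodes to `['low', 'er']`, illustrating how BPE generalizes beyond the training corpus.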
## License

This project is open-source under the MIT License: free to use, modify, and distribute for research and educational purposes.