Skip to content

BlackNinjaKR/BPE_BytePairEncoding

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Byte Pair Encoding (BPE) ~ Philip Gage (1994)

Overview

This project implements Byte Pair Encoding (BPE), originally proposed by Philip Gage (1994) for data compression which was later adapted for Natural Language Processing. The repository demonstrates both:

  • The original compression-oriented BPE algorithm
  • The modern text tokenization variant used in NLP models

It provides a minimal, modular Python implementation for educational, experimental, and research use.


Features

  • Python implementation
  • Java implementation
  • Modular design
  • Configurable number of merges (num_merges)
  • Test corpus and reproducible examples

Concepts

Byte Pair Encoding iteratively merges the most frequent adjacent symbol pairs in a corpus, creating new tokens that represent common tokens. This reduces token count and captures recurring patterns without relying on a fixed dictionary

  1. Start with a vocabulary consisting of characters
  2. Count all adjacent pairs in the corpus
  3. Merge the most frequent pair into a single token
  4. Repeat until maximum merges or convergence

In NLP tokenization, this helps models generalize across rare and unseen words while limiting vocabulary growth. Example:

    Input corpus: ["low", "lowest", "newer", "wider"]

After a few merges:

    Initial symbols: ['l o w', 'l o w e s t', 'n e w e r', 'w i d e r']
    Merge 1: ('l', 'o') → 'lo'
    Merge 2: ('lo', 'w') → 'low'
    Merge 3: ('e', 'r') → 'er'

Resulting vocabulary includes:

    {'l', 'o', 'w', 'low', 'er', 'new', 'wid', ...}

Project Structure

BPE_BytePairEncoding/
├── Files                  
|   └── Corpus.txt                  # Corpus.txt
├── PDFs                            # PDF of the notebook and Philip Gage's paper
|   ├── BPE_Gage.pdf                # Gage's paper
|   └── BPE.pdf                     # PDF of this notebook
├── Java                            # Java Implementation of Byte Pair Encoding (BPE)
|   ├── BPETrain_Interface.java     # Interface defining BPE trainer methods (train, encode, decode)
|   ├── BPETrainer.java             # Implementation of BPE trainer
|   ├── CorpusLoader.java           # Load text corpus from file
|   ├── Main.java                   # Driver class to run BPE training and test encoding/decoding
|   ├── PairCounter.java            # Counts frequency of symbol pairs in tokenized corpus
|   └── Tokenizer.java              # Tokenizes corpus into character-level tokens with word boundaries
├── BPE.ipynb                       # Jupyter Notebook
├── LICENSE
└── README.md                       # This file

Limitations

  • BPE is deterministic and purely frequency-based — no semantic awareness.
  • Not optimal for morphologically rich languages.
  • Merges can cause fragmentation for rare patterns.

References

  1. Gage, P. (1994). A New Algorithm for Data Compression | The C Users Journal, 12(2)

License

This project is open-source under the MIT License — free to use, modify, and distribute for research and educational purposes.


Releases

No releases published

Packages

 
 
 

Contributors