This repository contains instructions for training SuperBPE tokenizers, along with analysis notebooks and the pretraining configs used in the paper.
For model developers who wish to experiment quickly with an off-the-shelf tokenizer in their pretraining pipeline, we have released an English SuperBPE tokenizer with a vocab size of 128K. Nonetheless, we highly encourage you to train your own SuperBPE tokenizer to customize it to your use case!
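For instance, once you have located the released tokenizer on the HuggingFace Hub, loading it requires nothing beyond transformers. This is a minimal sketch; the repo id below is a placeholder for illustration, not the actual released name.

```python
from transformers import AutoTokenizer

# NOTE: "UW/superbpe-en-128k" is a placeholder repo id used for illustration only;
# substitute the id of the released English 128K SuperBPE tokenizer.
tok = AutoTokenizer.from_pretrained("UW/superbpe-en-128k")

# SuperBPE tokens ("superwords") can span whitespace, so common multi-word
# expressions may come back as a single token.
print(tok.tokenize("by the way, thank you very much"))
```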
First, clone the project with:
```
git clone --recurse-submodules https://github.com/PythonNut/superbpe.git
```

We use a custom fork of huggingface/tokenizers which conflicts with the original. Because of this, we recommend always installing this project in its own virtual environment.
If you use conda, you can do:

```
conda create -n superbpe python=3.12 rust
conda activate superbpe
pip install -r requirements.txt
```

If you do not use conda, you will need to install Rust and Python 3.12 yourself. Then, you can do:
```
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```
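As a quick sanity check that the new environment picks up the locally installed tokenizers package (the custom fork) rather than another install, you can import it and inspect where it was loaded from:

```python
# Confirm the active environment resolves `tokenizers` to the freshly installed
# package; the printed path should point inside this project's environment.
import tokenizers

print(tokenizers.__file__)
print(tokenizers.__version__)
```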
Our tokenizer training data is available on the HuggingFace Hub as UW/olmo-mix-1124-subset-p99.
You can download it with huggingface-cli (after logging into your HuggingFace account):

```
mkdir -p data/olmo2_p99_truncate
cd data/olmo2_p99_truncate
huggingface-cli download UW/olmo-mix-1124-subset-p99 --repo-type dataset --local-dir .
```
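If you prefer to script the download, the same dataset can also be fetched from Python with huggingface_hub; this is a sketch that reuses the repo id and local directory from the commands above.

```python
from huggingface_hub import snapshot_download

# Download the tokenizer training data into the same local directory
# used by the shell commands above.
snapshot_download(
    repo_id="UW/olmo-mix-1124-subset-p99",
    repo_type="dataset",
    local_dir="data/olmo2_p99_truncate",
)
```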
Training a SuperBPE tokenizer involves two stages:
- Stage 1: Learn subwords by enforcing whitespace pretokenization (equivalent to regular BPE training). See scripts/train_tokenizer.sh.
- Stage 2: Learn superwords by resuming tokenizer training, but this time skip the whitespace pretokenization step. See scripts/extend_tokenizer.sh.
Note that you can choose to use different training data for Stage 1 and Stage 2, or perform Stage 2 directly on top of an existing BPE tokenizer to augment it with superwords (scripts/extend_existing_tokenizer.sh).
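For intuition, Stage 1 is ordinary BPE training with whitespace pretokenization, which can be sketched with the standard tokenizers API; the actual training runs go through the scripts above and the custom fork (which adds the machinery needed for Stage 2). The corpus path and vocab size below are illustrative only.

```python
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Stage 1 sketch: ordinary byte-level BPE where the pretokenizer prevents
# merges from crossing whitespace. Path and vocab size are placeholders,
# not the settings used in the paper.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(vocab_size=50_000)
tokenizer.train(files=["data/olmo2_p99_truncate/example.txt"], trainer=trainer)
tokenizer.save("stage1_bpe.json")
```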
After tokenizer training, you'll want to use the construct_hf_tokenizer() function to build a HuggingFace tokenizer with all the bells and whistles. It adds an EOS token, updates tokenizer.json with a decoder field, and creates additional files such as tokenizer_config.json and special_tokens_map.json. You can adapt this function to fit your own model development pipeline.
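Once those files exist, the resulting directory should load like any other HuggingFace tokenizer; the directory name below is just an example.

```python
from transformers import AutoTokenizer

# Load the directory produced by construct_hf_tokenizer(); the path is an example.
tok = AutoTokenizer.from_pretrained("tokenizers/my_superbpe_tokenizer")

print(tok.eos_token)  # EOS token set up by construct_hf_tokenizer()

# Round-trip through encode/decode to check the decoder field works as expected.
print(tok.decode(tok("Hello world")["input_ids"]))
```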
If you found this codebase helpful, please cite
```
@inproceedings{liu-etal-2025-superbpe,
    title={{SuperBPE}: Space travel for language models},
    author={Alisa Liu and Jonathan Hayase and Valentin Hofmann and Sewoong Oh and Noah A. Smith and Yejin Choi},
    booktitle={Second Conference on Language Modeling},
    year={2025},
    url={https://arxiv.org/abs/2503.13423}
}
```