This repo explores the different subword tokenization algorithms used by modern NLP models.
| Algorithm | Base unit | Implementations | Paper |
|---|---|---|---|
| Byte-pair encoding (BPE) | Unicode character | original implementation, fastBPE, SentencePiece repo | Neural Machine Translation of Rare Words with Subword Units |
| Byte-level BPE | byte | HuggingFace repo, GPT-2 repo | Language Models are Unsupervised Multitask Learners (GPT-2) |
| WordPiece | Unicode character | BERT repo | Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation |
| Unigram Language Model | Unicode character | SentencePiece repo | Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates |
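To make the comparison concrete, here is a minimal sketch of the BPE training loop described in the Sennrich et al. paper above: repeatedly count adjacent symbol pairs and merge the most frequent one. The toy corpus, the `</w>` end-of-word marker, and the merge count are illustrative choices, not taken from any of the repos listed.

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Rewrite every word, replacing the pair with its merged symbol."""
    old = " ".join(pair)
    new = "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Toy corpus: words pre-split into characters, with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2,
         "n e w e s t </w>": 6, "w i d e s t </w>": 3}

merges = []
for _ in range(10):  # the number of merges is BPE's main hyperparameter
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    merges.append(best)
```

The learned `merges` list is the model: at inference time, a new word is split into characters and the merges are replayed in order. Byte-level BPE runs the same loop over raw bytes instead of Unicode characters, which removes out-of-vocabulary symbols entirely.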

The tokenizer each model uses:

| Model | Repo | Tokenizer |
|---|---|---|
| BERT (Google) | GitHub link | WordPiece |
| GPT-2 (OpenAI) | GitHub link | byte-level BPE |
| RoBERTa (Facebook) | GitHub link | byte-level BPE |
| Transformer-XL (CMU) | GitHub link | words |
| XLM (Facebook) | GitHub link | BPE |
| XLNet (CMU) | GitHub link | BPE (from SentencePiece) |
| CTRL (Salesforce) | GitHub link | BPE (from fastBPE) |