Linked PR: https://github.com/huggingface/transformers/pull/41580

## Tokenization

Just as we moved towards a single backend library for model definition, we want our tokenizers, and the `Tokenizer` object, to be a lot more intuitive. With v5, tokenizer definition is much simpler: you can now initialize an empty `LlamaTokenizer` and train it directly on your corpus.

Defining a new tokenizer object should be as simple as this:

```python
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                # ... (the rest of the default vocab and of the __init__ body is elided in this diff)
            }
```

Once the tokenizer is defined as above, you can instantiate it with `Llama5Tokenizer()`. Doing so returns an empty, trainable tokenizer that follows the definition of the authors of `Llama5` (it does not exist yet :wink:).

The above is the main motivation for refactoring tokenization: we want tokenizers to behave like models do, whether trained or empty, and to contain exactly what their class definition specifies.

### Backend Architecture Changes: moving away from the slow/fast tokenizer separation

Up to now, transformers maintained two parallel implementations for many tokenizers:
- "Slow" tokenizers (`tokenization_<model>.py`) - Python-based implementations, often using [SentencePiece](https://github.com/google/sentencepiece) as the backend.
- "Fast" tokenizers (`tokenization_<model>_fast.py`) - Rust-based implementations using the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library.

In v5, we consolidate to a single tokenizer file per model: `tokenization_<model>.py`. This file will use the most appropriate backend available:

1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides the best performance, and it also offers many features that are commonly adopted across the ecosystem, such as handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, automatic offsets, etc.
2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
3. **PythonBackend**: pure Python implementations
4. **MistralCommonBackend**: relies on `MistralCommon`'s tokenization library (previously known as the `MistralCommonTokenizer`)

The `AutoTokenizer` automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use `AutoTokenizer.from_pretrained()` as before. This also keeps transformers future-proof and modular, so future backends can easily be supported.
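
As a quick illustration, loading a pretrained checkpoint keeps working exactly as before (the checkpoint name below is purely illustrative):

```python
from transformers import AutoTokenizer

# The backend (tokenizers, sentencepiece, pure Python, mistral-common) is picked
# automatically from the files present in the checkpoint and the installed libraries.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
```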

### Defining a tokenizer outside of the existing backends

We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as `tokenizers`, `sentencepiece` or `mistral-common`, but we offer the possibility to design the tokenizer at a higher level, without relying on those backends.

To do so, you can import the `PythonBackend` (which was previously known as `PreTrainedTokenizer`). This class encapsulates all the logic related to added tokens, encoding, and decoding.
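
For illustration, here is a minimal sketch of a toy word-level tokenizer built on `PythonBackend`. It assumes `PythonBackend` keeps the hook methods of the v4 `PreTrainedTokenizer` (`_tokenize`, `_convert_token_to_id`, `_convert_id_to_token`, ...); the class and vocabulary are made up for the example:

```python
from transformers import PythonBackend  # previously PreTrainedTokenizer


class ToyWhitespaceTokenizer(PythonBackend):
    """Toy tokenizer that splits on whitespace; hypothetical example."""

    def __init__(self, vocab=None, unk_token="<unk>", **kwargs):
        # Assumption: added-token handling and the encode/decode plumbing come from PythonBackend.
        self._vocab = vocab if vocab is not None else {str(unk_token): 0}
        self._ids_to_tokens = {i: t for t, i in self._vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self._vocab)

    def get_vocab(self):
        return dict(self._vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, self._vocab[str(self.unk_token)])

    def _convert_id_to_token(self, index):
        return self._ids_to_tokens.get(index, str(self.unk_token))
```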

If you want something even higher up the stack, then `PreTrainedTokenizerBase` is what `PythonBackend` inherits from. It contains the very basic tokenizer API features:
- `encode`
- `decode`
- `vocab_size`
- `get_vocab`
- `convert_tokens_to_ids`
- `convert_ids_to_tokens`
- `from_pretrained`
- `save_pretrained`
- among a few others
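
For illustration, this shared base API is what you already use on any tokenizer today (checkpoint name is just an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

ids = tokenizer.encode("hello world")            # text -> ids
tokens = tokenizer.convert_ids_to_tokens(ids)    # ids -> tokens
text = tokenizer.decode(ids)                     # ids -> text
print(tokenizer.vocab_size, len(tokenizer.get_vocab()))
```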

### API Changes

#### 1. Direct tokenizer initialization with vocab and merges

Starting with v5, we now enable initializing blank, untrained `tokenizers`-backed tokenizers:

```py
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()
```

This tokenizer will therefore follow the `LlamaTokenizer` class definition. It can then be trained on a corpus, as shown in [the `tokenizers` documentation](https://huggingface.co/docs/tokenizers/training_from_memory).
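
For example, a minimal training sketch (it assumes the wrapped `tokenizers.Tokenizer` stays reachable through `backend_tokenizer`, as it is for fast tokenizers in v4; the corpus is made up):

```python
from tokenizers.trainers import BpeTrainer
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()  # blank, untrained tokenizer

corpus = ["hello world", "hello there", "tokenizers are fun"]  # toy corpus
trainer = BpeTrainer(vocab_size=1000, special_tokens=["<unk>", "<s>", "</s>"])

# Train the underlying Rust tokenizer in place.
tokenizer.backend_tokenizer.train_from_iterator(corpus, trainer=trainer)
```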

These tokenizers can also be initialized from a vocab and merges (if necessary), like the previous "slow" tokenizers:

```py
from transformers import LlamaTokenizer

# Create a tokenizer with custom vocabulary and merges
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
```

This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab, thereby enabling the comparison of different pre-tokenizers, normalizers, etc.

⚠️ The `vocab_file` (as in, a path towards a file containing the vocabulary) can no longer be used to initialize the `LlamaTokenizer`, as loading from files is reserved for the `from_pretrained` method.
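
As a sketch of that comparison use-case (hypothetical: it assumes other BPE-based tokenizer classes accept the same `vocab`/`merges` keyword arguments):

```python
from transformers import GPT2Tokenizer, LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

# Same vocabulary and merges; the pre-tokenizer, normalizer and special tokens
# come from each class definition, so the outputs can differ.
llama_like = LlamaTokenizer(vocab=vocab, merges=merges)
gpt2_like = GPT2Tokenizer(vocab=vocab, merges=merges)

print(llama_like.tokenize("hello world"))
print(gpt2_like.tokenize("hello world"))
```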

#### 2. Simplified decoding API

The `batch_decode` and `decode` methods have been unified to mirror the behavior of the `encode` method: both single and batch decoding now go through the same `decode` method. See an example of the new behavior below:

```python
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))
```

Gives:
```diff
- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']
```

This change was made mostly because people get a `list[list[int]]` out of `generate`, and when they then called `decode` (mirroring their `encode` call) they would get:

```python
...: tokenizer.decode([[1, 2], [1, 4]])
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[2], line 4
      2 tokenizer = AutoTokenizer.from_pretrained("t5-small")
      3 inputs = ["hey how are you?", "fine"]
----> 4 tokenizer.decode([[1,2], [1,4]])

File /raid/arthur/transformers/src/transformers/tokenization_utils_base.py:3948, in PreTrainedTokenizerBase.decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
   3945 # Convert inputs to python lists
   3946 token_ids = to_py_obj(token_ids)
-> 3948 return self._decode(
   3949     token_ids=token_ids,
   3950     skip_special_tokens=skip_special_tokens,
   3951     clean_up_tokenization_spaces=clean_up_tokenization_spaces,
   3952     **kwargs,
   3953 )

File /raid/arthur/transformers/src/transformers/tokenization_utils_fast.py:682, in PreTrainedTokenizerFast._decode(self, token_ids, skip_special_tokens, clean_up_tokenization_spaces, **kwargs)
    680 if isinstance(token_ids, int):
    681     token_ids = [token_ids]
--> 682 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    684 clean_up_tokenization_spaces = (
    685     clean_up_tokenization_spaces
    686     if clean_up_tokenization_spaces is not None
    687     else self.clean_up_tokenization_spaces
    688 )
    689 if clean_up_tokenization_spaces:

TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
```

We expect `encode` and `decode` to behave as two sides of the same coin: `encode`, process, `decode` should just work.

> [!NOTE]
> A common use-case is `encode`, `model.generate`, `decode`. However, `generate` returns `list[list[int]]`, which was previously incompatible with `decode`.
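
A sketch of that flow with the new `decode` behavior (model and prompts are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer(
    ["translate English to German: hello", "translate English to German: thanks"],
    return_tensors="pt",
    padding=True,
)
generated = model.generate(**inputs)  # one sequence of ids per prompt

# v5: decode accepts the whole batch and returns one string per sequence.
print(tokenizer.decode(generated, skip_special_tokens=True))
```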

#### 3. Unified encoding API

The `encode_plus` method is deprecated in favor of the single `__call__` method.
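
Migrating is a drop-in change:

```python
# v4
encoded = tokenizer.encode_plus("hello world", return_tensors="pt")

# v5
encoded = tokenizer("hello world", return_tensors="pt")
```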

#### 4. `apply_chat_template` returns `BatchEncoding`

Previously, `apply_chat_template` returned `input_ids` for backward compatibility. Starting with v5, it now consistently returns a `BatchEncoding` dict like other tokenizer methods.

```python
# v5
# ... (construction of the chat `messages` list elided in this diff)
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys())  # dict_keys(['input_ids', 'attention_mask'])
```

#### 5. Removed legacy configuration file saving

We simplify the serialization of tokenization attributes:

- `special_tokens_map.json` - special tokens are now stored in `tokenizer_config.json`.
- `added_tokens.json` - added tokens are now stored in `tokenizer.json`.
- `added_tokens_decoder` is only stored when there is no `tokenizer.json`.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes into fewer files so that other libraries and implementations may depend on them more reliably.
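
Concretely, a `save_pretrained` call now produces a leaner directory; the file lists below are indicative rather than exhaustive:

```python
tokenizer.save_pretrained("./my-tokenizer")

# v4 layout (indicative): tokenizer_config.json, special_tokens_map.json,
#                         added_tokens.json, tokenizer.json, ...
# v5 layout (indicative): tokenizer_config.json, tokenizer.json
#   - special tokens live in tokenizer_config.json
#   - added tokens live in tokenizer.json
```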

#### 6. Model-Specific Changes

Several models that had identical tokenizers now import from their base implementation:

- **MT5** → uses T5Tokenizer
- **MVP** → uses BartTokenizer

These modules will eventually be removed altogether.
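
For instance, loading an MT5 checkpoint simply gives you a T5 tokenizer class (the checkpoint name and printed class are illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
print(type(tokenizer).__name__)  # a T5 tokenizer class rather than a dedicated MT5 one
```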

**Removed T5-specific workarounds**

The internal `_eventually_correct_t5_max_length` method has been removed. T5 tokenizers now handle max length consistently with other models.

#### 7. Deprecated / Modified Features

**Special Tokens Structure:**

```python
# v4
tokenizer.special_tokens_map    # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map    # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
```

**Deprecated Methods:**

```python
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
```

**Removed Methods:**

```python
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
```

### Testing Changes

A few testing changes specific to tokenizers have been applied:
- Model-specific tokenization test files now focus on integration tests.
- Common tokenization API tests (e.g., `add_tokens`, `encode`, `decode`) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.

For legacy implementations, the original BERT Python tokenizer code (including `WhitespaceTokenizer`, `BasicTokenizer`, etc.) is preserved in `bert_legacy.py` for reference purposes.

**Linked PRs:**
- https://github.com/huggingface/transformers/issues/40938
- https://github.com/huggingface/transformers/pull/40936
- https://github.com/huggingface/transformers/pull/41626

## Library-wide changes with lesser impact