Migration guide: tokenization #42477
> 2. **SentencePieceBackend**: For models requiring SentencePiece
> 3. **PythonBackend**: Pure Python implementations
> 4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
> 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
Suggested change:

```diff
- 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
+ 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem:
+    - handling additional tokens
+    - a full Python API for setting and updating
+    - automatic parallelization
+    - automatic offsets
+    - customization
+    - training
```
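The "automatic offsets" item in the list above refers to each token carrying its character span in the original string. A minimal pure-Python sketch of the idea (the function and whitespace vocabulary are invented for illustration; real backends also track offsets through normalization and subword splitting):

```python
import re

def encode_with_offsets(text, vocab, unk_id=0):
    # Pair each whitespace-delimited token with its (start, end)
    # character span in the original string -- the information a
    # Rust-backed tokenizer returns automatically as "offsets".
    ids, offsets = [], []
    for m in re.finditer(r"\S+", text):
        ids.append(vocab.get(m.group(), unk_id))
        offsets.append((m.start(), m.end()))
    return ids, offsets

ids, offsets = encode_with_offsets("hello world", {"hello": 1, "world": 2})
```

Offsets like these are what make tasks such as span highlighting or NER label alignment cheap with the Rust backend.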
> 3. **PythonBackend**: Pure Python implementations
> 4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
> 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
> 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
Suggested change:

```diff
- 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
+ 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library. It inherits from `PythonBackend`.
```
> 4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
> 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
> 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
> 3. **PythonBackend**: pure Python implementations
Suggested change:

```diff
- 3. **PythonBackend**: pure Python implementations
+ 3. **PythonBackend**: a Python implementation of the features provided by `tokenizers`; basically, it allows adding tokens.
```
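The "allows adding tokens" part can be pictured as carving user-added tokens out of the text before any further tokenization runs. A hedged pure-Python sketch (the function name is invented; the real backend also handles normalization and token properties like `lstrip`/`rstrip`):

```python
import re

def split_on_added_tokens(text, added_tokens):
    # Match added tokens longest-first so overlapping tokens
    # (e.g. "<extra>" vs "<extra_long>") are carved out greedily;
    # the surviving plain-text segments would then go to the
    # underlying tokenizer.
    pattern = "|".join(re.escape(t) for t in sorted(added_tokens, key=len, reverse=True))
    parts = re.split(f"({pattern})", text)
    return [p for p in parts if p]

segments = split_on_added_tokens("hi<extra>there", ["<extra>"])
```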
```
--> 682 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    684 clean_up_tokenization_spaces = (
    685     clean_up_tokenization_spaces
    686     if clean_up_tokenization_spaces is not None
    687     else self.clean_up_tokenization_spaces
    688 )
    689 if clean_up_tokenization_spaces:

TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
```
I think this is fine to keep no?
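This `TypeError` typically means a nested (batched) list of ids reached `decode`, which expects a flat sequence of ints. A toy sketch of the failure mode and the per-sequence fix (the vocab and function are invented for illustration; in `transformers`, `batch_decode` covers the batched case):

```python
vocab = {0: "hello", 1: "world"}

def decode(ids):
    # Like the Rust backend's decode, accept only a flat list of ints.
    if any(not isinstance(i, int) for i in ids):
        raise TypeError("argument 'ids': 'list' object cannot be interpreted as an integer")
    return " ".join(vocab[i] for i in ids)

flat = decode([0, 1])                               # a single sequence works
batch_decoded = [decode(seq) for seq in [[0, 1], [1]]]  # decode a batch one sequence at a time
```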
> **Removed T5-specific workarounds**
>
> The internal `_eventually_correct_t5_max_length` method has been removed. T5 tokenizers now handle max length consistently with other models.
7. Deprecated / Modified Features

Special Tokens Structure:

- `SpecialTokensMixin`: Merged into `PreTrainedTokenizerBase` to simplify the tokenizer architecture.
- `special_tokens_map`: Now only stores named special token attributes (e.g., `bos_token`, `eos_token`). Use `extra_special_tokens` for additional special tokens (formerly `additional_special_tokens`). `all_special_tokens` includes both named and extra tokens.

```python
# v4
tokenizer.special_tokens_map  # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map    # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
```

- `special_tokens_map_extended` and `all_special_tokens_extended`: Removed. Access `AddedToken` objects directly from `_special_tokens_map` or `_extra_special_tokens` if needed.
- `additional_special_tokens`: Still accepted for backward compatibility but is automatically converted to `extra_special_tokens`.

Deprecated Methods:

- `sanitize_special_tokens()`: Already deprecated in v4, removed in v5.
- `prepare_seq2seq_batch()`: Deprecated; use `__call__()` with the `text_target` parameter instead.

```python
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
```

- `BatchEncoding.words()`: Deprecated; use `word_ids()` instead.

Removed Methods:

- `create_token_type_ids_from_sequences()`: Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
- `clean_up_tokenization()`: Removed from base class. Now defined at model class level for models that need it (e.g., PLBart, CLVP, Wav2Vec2).
- `prepare_for_model()`, `build_inputs_with_special_tokens()`, `truncate_sequences()`: Moved from `tokenization_utils_base.py` to `tokenization_python.py` for `PythonBackend` tokenizers. `TokenizersBackend` provides model-ready input via `tokenize()` and `encode()`, so these methods are no longer needed in the base class.
- `_switch_to_input_mode()`, `_switch_to_target_mode()`, `as_target_tokenizer()`: Removed from base class. Use `__call__()` with the `text_target` parameter instead.

```python
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
```

- `parse_response()`: Removed from base class.
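The `special_tokens_map` split described above can be emulated with plain dicts, to make the v4 → v5 mapping concrete (the dicts below are hypothetical stand-ins for real tokenizer attributes, not the actual implementation):

```python
# Hypothetical stand-in for a v4 tokenizer's special_tokens_map.
v4_map = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "additional_special_tokens": ["<extra_0>", "<extra_1>"],
}

# v5 splits this into two attributes:
extra_special_tokens = v4_map.pop("additional_special_tokens")
special_tokens_map = v4_map  # only named tokens remain

# all_special_tokens still covers both named and extra tokens.
all_special_tokens = list(special_tokens_map.values()) + extra_special_tokens
```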
^ Sorry, I'm commenting instead of using 'suggestion' because as a suggestion it breaks the formatting.