Migration guide: tokenization #42477
> 2. **SentencePieceBackend**: For models requiring SentencePiece
> 3. **PythonBackend**: Pure Python implementations
> 4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
> 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
Suggested change:

```diff
- 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
+ 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem:
+    - handling additional tokens
+    - a full Python API for setting and updating
+    - automatic parallelization
+    - automatic offsets
+    - customization
+    - training
```
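The "automatic offsets" item in the list above refers to each token carrying its character span in the original string. A minimal pure-Python sketch of the idea (the function and whitespace vocabulary are invented for illustration; real backends also track offsets through normalization and subword splitting):

```python
import re

def encode_with_offsets(text, vocab, unk_id=0):
    # Pair each whitespace-delimited token with its (start, end)
    # character span in the original string -- the information a
    # Rust-backed tokenizer returns automatically as "offsets".
    ids, offsets = [], []
    for m in re.finditer(r"\S+", text):
        ids.append(vocab.get(m.group(), unk_id))
        offsets.append((m.start(), m.end()))
    return ids, offsets

ids, offsets = encode_with_offsets("hello world", {"hello": 1, "world": 2})
```

Offsets like these are what make tasks such as span highlighting or NER label alignment cheap with the Rust backend.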
> 3. **PythonBackend**: Pure Python implementations
> 4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
> 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
> 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
Suggested change:

```diff
- 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
+ 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library. It inherits from `PythonBackend`.
```
> 4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
> 1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, like handling additional tokens, easily updating the state of the tokenizer, automatic parallelization, and automatic offsets.
> 2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
> 3. **PythonBackend**: pure Python implementations
Suggested change:

```diff
- 3. **PythonBackend**: pure Python implementations
+ 3. **PythonBackend**: a Python implementation of the features provided by `tokenizers`; basically, it allows adding tokens.
```
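The "allows adding tokens" part can be pictured as carving user-added tokens out of the text before any further tokenization runs. A hedged pure-Python sketch (the function name is invented; the real backend also handles normalization and token properties like `lstrip`/`rstrip`):

```python
import re

def split_on_added_tokens(text, added_tokens):
    # Match added tokens longest-first so overlapping tokens
    # (e.g. "<extra>" vs "<extra_long>") are carved out greedily;
    # the surviving plain-text segments would then go to the
    # underlying tokenizer.
    pattern = "|".join(re.escape(t) for t in sorted(added_tokens, key=len, reverse=True))
    parts = re.split(f"({pattern})", text)
    return [p for p in parts if p]

segments = split_on_added_tokens("hi<extra>there", ["<extra>"])
```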
```
--> 682 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
    684 clean_up_tokenization_spaces = (
    685     clean_up_tokenization_spaces
    686     if clean_up_tokenization_spaces is not None
    687     else self.clean_up_tokenization_spaces
    688 )
    689 if clean_up_tokenization_spaces:

TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
```
I think this is fine to keep no?
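This `TypeError` typically means a nested (batched) list of ids reached `decode`, which expects a flat sequence of ints. A toy sketch of the failure mode and the per-sequence fix (the vocab and function are invented for illustration; in `transformers`, `batch_decode` covers the batched case):

```python
vocab = {0: "hello", 1: "world"}

def decode(ids):
    # Like the Rust backend's decode, accept only a flat list of ints.
    if any(not isinstance(i, int) for i in ids):
        raise TypeError("argument 'ids': 'list' object cannot be interpreted as an integer")
    return " ".join(vocab[i] for i in ids)

flat = decode([0, 1])                               # a single sequence works
batch_decoded = [decode(seq) for seq in [[0, 1], [1]]]  # decode a batch one sequence at a time
```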
> **Removed T5-specific workarounds**
>
> The internal `_eventually_correct_t5_max_length` method has been removed. T5 tokenizers now handle max length consistently with other models.
7. Deprecated / Modified Features

Special Tokens Structure:

- `SpecialTokensMixin`: Merged into `PreTrainedTokenizerBase` to simplify the tokenizer architecture.
- `special_tokens_map`: Now only stores named special token attributes (e.g., `bos_token`, `eos_token`). Use `extra_special_tokens` for additional special tokens (formerly `additional_special_tokens`). `all_special_tokens` includes both named and extra tokens.

```python
# v4
tokenizer.special_tokens_map  # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map    # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
```

- `special_tokens_map_extended` and `all_special_tokens_extended`: Removed. Access `AddedToken` objects directly from `_special_tokens_map` or `_extra_special_tokens` if needed.
- `additional_special_tokens`: Still accepted for backward compatibility but is automatically converted to `extra_special_tokens`.

Deprecated Methods:

- `sanitize_special_tokens()`: Already deprecated in v4, removed in v5.
- `prepare_seq2seq_batch()`: Deprecated; use `__call__()` with the `text_target` parameter instead.

```python
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
```

- `BatchEncoding.words()`: Deprecated; use `word_ids()` instead.

Removed Methods:

- `create_token_type_ids_from_sequences()`: Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
- `clean_up_tokenization()`: Removed from base class. Now defined at model class level for models that need it (e.g., PLBart, CLVP, Wav2Vec2).
- `prepare_for_model()`, `build_inputs_with_special_tokens()`, `truncate_sequences()`: Moved from `tokenization_utils_base.py` to `tokenization_python.py` for `PythonBackend` tokenizers. `TokenizersBackend` provides model-ready input via `tokenize()` and `encode()`, so these methods are no longer needed in the base class.
- `_switch_to_input_mode()`, `_switch_to_target_mode()`, `as_target_tokenizer()`: Removed from base class. Use `__call__()` with the `text_target` parameter instead.

```python
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
```

- `parse_response()`: Removed from base class.
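The `special_tokens_map` split described above can be emulated with plain dicts, to make the v4 → v5 mapping concrete (the dicts below are hypothetical stand-ins for real tokenizer attributes, not the actual implementation):

```python
# Hypothetical stand-in for a v4 tokenizer's special_tokens_map.
v4_map = {
    "bos_token": "<s>",
    "eos_token": "</s>",
    "additional_special_tokens": ["<extra_0>", "<extra_1>"],
}

# v5 splits this into two attributes:
extra_special_tokens = v4_map.pop("additional_special_tokens")
special_tokens_map = v4_map  # only named tokens remain

# all_special_tokens still covers both named and extra tokens.
all_special_tokens = list(special_tokens_map.values()) + extra_special_tokens
```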
^ Sorry, I'm commenting instead of using 'suggestion' because as a suggestion it breaks the formatting.