Conversation

@LysandreJik
Member

No description provided.

2. **SentencePieceBackend**: For models requiring SentencePiece
3. **PythonBackend**: Pure Python implementations
4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, such as handling additional tokens, easily updating the tokenizer's state, automatic parallelization, and automatic offsets.
Collaborator


Suggested change
1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, such as handling additional tokens, easily updating the tokenizer's state, automatic parallelization, and automatic offsets.
1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem:
   - handling additional tokens
   - a full Python API for setting and updating the tokenizer's state
   - automatic parallelization
   - automatic offsets
   - customization
   - training

3. **PythonBackend**: Pure Python implementations
4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, such as handling additional tokens, easily updating the tokenizer's state, automatic parallelization, and automatic offsets.
2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
Collaborator


Suggested change
2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library. It inherits from `PythonBackend`.

4. **MistralCommonBackend**: Relies on `MistralCommon`'s tokenization library. (Previously `MistralCommonTokenizer`)
1. **TokenizersBackend** (preferred): Rust-based tokenizers from the 🤗 [tokenizers](https://github.com/huggingface/tokenizers) library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem, such as handling additional tokens, easily updating the tokenizer's state, automatic parallelization, and automatic offsets.
2. **SentencePieceBackend**: for tokenizers requiring the `sentencepiece` library
3. **PythonBackend**: pure Python implementations
Collaborator


Suggested change
3. **PythonBackend**: pure Python implementations
3. **PythonBackend**: a pure Python implementation of the features provided by `tokenizers`; it mainly allows adding tokens.

Comment on lines -208 to -216
--> 682 text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
684 clean_up_tokenization_spaces = (
685 clean_up_tokenization_spaces
686 if clean_up_tokenization_spaces is not None
687 else self.clean_up_tokenization_spaces
688 )
689 if clean_up_tokenization_spaces:

TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
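This `TypeError` typically means a batch (a list of lists) reached `decode()`, which expects a flat list of integer ids. A minimal, library-agnostic sketch of a dispatch guard (the names here are illustrative, not the actual transformers API):

```python
def decode_any(decode_one, token_ids):
    """Dispatch a flat list of ids to `decode_one`, or map it over a batch.

    `decode_one` stands in for a backend decode function that only accepts
    a flat list of integer ids (illustrative, not the transformers API).
    """
    if token_ids and isinstance(token_ids[0], list):
        # Batch input: decode each sequence separately.
        return [decode_one(ids) for ids in token_ids]
    return decode_one(token_ids)


# Toy decoder: maps ids to characters.
vocab = {0: "a", 1: "b", 2: "c"}
decode_one = lambda ids: "".join(vocab[i] for i in ids)

print(decode_any(decode_one, [0, 1, 2]))      # → "abc"
print(decode_any(decode_one, [[0, 1], [2]]))  # → ["ab", "c"]
```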
Collaborator


I think this is fine to keep, no?

**Removed T5-specific workarounds**

The internal `_eventually_correct_t5_max_length` method has been removed. T5 tokenizers now handle max length consistently with other models.

Collaborator

@itazap Nov 28, 2025


7. Deprecated / Modified Features

Special Tokens Structure:

  • SpecialTokensMixin: Merged into PreTrainedTokenizerBase to simplify the tokenizer architecture.
  • special_tokens_map: Now only stores named special token attributes (e.g., bos_token, eos_token). Use extra_special_tokens for additional special tokens (formerly additional_special_tokens). all_special_tokens includes both named and extra tokens.
# v4
tokenizer.special_tokens_map  # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map  # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
  • special_tokens_map_extended and all_special_tokens_extended: Removed. Access AddedToken objects directly from _special_tokens_map or _extra_special_tokens if needed.
  • additional_special_tokens: Still accepted for backward compatibility but is automatically converted to extra_special_tokens.
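The split described above can be sketched in plain Python — a toy model of the documented behaviour, not the real `PreTrainedTokenizerBase` implementation:

```python
class SpecialTokensSketch:
    """Toy model of the v5 named/extra token split (illustrative only)."""

    NAMED = ("bos_token", "eos_token", "unk_token", "pad_token")

    def __init__(self, extra_special_tokens=None, additional_special_tokens=None, **named):
        # Named attributes (bos_token, eos_token, ...) live in special_tokens_map.
        self._special_tokens_map = {k: v for k, v in named.items() if k in self.NAMED}
        # Backward compatibility: `additional_special_tokens` is folded into
        # `extra_special_tokens`.
        self.extra_special_tokens = list(extra_special_tokens or []) + list(
            additional_special_tokens or []
        )

    @property
    def special_tokens_map(self):
        # Only named tokens, unlike v4 which also listed the additional ones.
        return dict(self._special_tokens_map)

    @property
    def all_special_tokens(self):
        # Named tokens plus extra tokens.
        return list(self._special_tokens_map.values()) + self.extra_special_tokens


tok = SpecialTokensSketch(bos_token="<s>", eos_token="</s>",
                          additional_special_tokens=["<mask2>"])
print(tok.special_tokens_map)    # → {'bos_token': '<s>', 'eos_token': '</s>'}
print(tok.extra_special_tokens)  # → ['<mask2>']
print(tok.all_special_tokens)    # → ['<s>', '</s>', '<mask2>']
```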

Deprecated Methods:

  • sanitize_special_tokens(): Already deprecated in v4, removed in v5.
  • prepare_seq2seq_batch(): Deprecated; use __call__() with text_target parameter instead.
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
  • BatchEncoding.words(): Deprecated; use word_ids() instead.

Removed Methods:

  • create_token_type_ids_from_sequences(): Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
  • clean_up_tokenization(): Removed from base class. Now defined at model class level for models that need it (e.g., PLBart, CLVP, Wav2Vec2).
  • prepare_for_model(), build_inputs_with_special_tokens(), truncate_sequences(): Moved from tokenization_utils_base.py to tokenization_python.py for PythonBackend tokenizers. TokenizersBackend provides model-ready input via tokenize() and encode(), so these methods are no longer needed in the base class.
  • _switch_to_input_mode(), _switch_to_target_mode(), as_target_tokenizer(): Removed from base class. Use __call__() with text_target parameter instead.
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
  • parse_response(): Removed from base class.
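For models that still need `clean_up_tokenization()`, the logic now lives at the model class level; a sketch of the classic space clean-up, based on the well-known v4 behaviour and reproduced here purely for illustration:

```python
def clean_up_tokenization(out_string: str) -> str:
    """Collapse extra spaces before punctuation and contractions,
    mirroring the classic clean-up behaviour (illustrative sketch)."""
    return (
        out_string.replace(" .", ".")
        .replace(" ?", "?")
        .replace(" !", "!")
        .replace(" ,", ",")
        .replace(" ' ", "'")
        .replace(" n't", "n't")
        .replace(" 'm", "'m")
        .replace(" 's", "'s")
        .replace(" 've", "'ve")
        .replace(" 're", "'re")
    )


print(clean_up_tokenization("do n't stop , please ."))  # → "don't stop, please."
```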

Collaborator

@itazap Nov 28, 2025


^ sorry I'm commenting instead of 'suggestion' bc as a suggestion it breaks the formatting

