[Misc] Update `TokenizerLike` interface and move `get_cached_tokenizer` #29730

DarkLight1337 · 2025-11-29T14:58:54Z

Purpose

Add pad_token_id to the TokenizerLike interface for use in the Score API.
Use HF defaults for __call__, encode, decode and convert_ids_to_tokens; apply them to MistralTokenizer as well. cc @patrickvonplaten
Pass more arguments to from_pretrained, in line with TokenizerRegistry.get_tokenizer.
Move get_cached_tokenizer to vllm.tokenizers.hf (with backward compatibility).
Try to run tokenizer tests on CPU.
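To make the first two items concrete, here is a minimal sketch of what a structural tokenizer interface with `pad_token_id` could look like. It is not the actual vLLM definition: the `TokenizerLike` protocol below is a trimmed-down stand-in, and `WhitespaceTokenizer` and `pad_batch` are hypothetical names invented for illustration. The keyword defaults follow the Hugging Face convention (`add_special_tokens=True` for `encode`, `skip_special_tokens=False` for `decode` and `convert_ids_to_tokens`).

```python
from typing import Optional, Protocol, runtime_checkable


@runtime_checkable
class TokenizerLike(Protocol):
    """Hypothetical, reduced stand-in for the interface named in this PR."""

    @property
    def pad_token_id(self) -> Optional[int]: ...

    def encode(self, text: str, add_special_tokens: bool = True) -> list[int]: ...

    def decode(self, ids: list[int], skip_special_tokens: bool = False) -> str: ...

    def convert_ids_to_tokens(
        self, ids: list[int], skip_special_tokens: bool = False
    ) -> list[str]: ...


class WhitespaceTokenizer:
    """Toy tokenizer (assumption) that satisfies the protocol structurally."""

    def __init__(self) -> None:
        self._vocab = {"<pad>": 0, "hello": 1, "world": 2}
        self._inverse = {idx: tok for tok, idx in self._vocab.items()}
        self.pad_token_id = 0

    def encode(self, text, add_special_tokens=True):
        # The toy vocabulary has no BOS/EOS, so the flag is accepted but unused.
        return [self._vocab[tok] for tok in text.split()]

    def convert_ids_to_tokens(self, ids, skip_special_tokens=False):
        return [
            self._inverse[i]
            for i in ids
            if not (skip_special_tokens and i == self.pad_token_id)
        ]

    def decode(self, ids, skip_special_tokens=False):
        return " ".join(self.convert_ids_to_tokens(ids, skip_special_tokens))


def pad_batch(tokenizer: TokenizerLike, texts: list[str]) -> list[list[int]]:
    """Right-pad encoded texts to equal length using the tokenizer's pad id."""
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        raise ValueError("tokenizer does not define a pad token")
    encoded = [tokenizer.encode(text) for text in texts]
    width = max((len(ids) for ids in encoded), default=0)
    return [ids + [pad_id] * (width - len(ids)) for ids in encoded]
```

Because the protocol is structural, any tokenizer that exposes these members can be passed to a caller such as `pad_batch` without inheriting from a common base class, which is the kind of use a scoring endpoint that batches text pairs would need.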

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results.
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

chatgpt-codex-connector · 2025-11-29T14:59:00Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

mergify · 2025-11-29T14:59:29Z

Documentation preview: https://vllm--29730.org.readthedocs.build/en/29730/

gemini-code-assist

Code Review

This pull request refactors the tokenizer handling by updating the TokenizerLike interface, moving get_cached_tokenizer, and introducing HfTokenizer. The changes improve code structure and align tokenizer behavior with Hugging Face conventions. I've found a few issues related to type correctness in the protocol and a missing parameter in the MistralTokenizer implementation that should be addressed.
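For readers unfamiliar with the idea behind a "cached tokenizer", the following is a rough sketch of the general technique: precompute expensive lookups once and serve them from a dynamically created subclass. It is an illustration under stated assumptions, not the code moved by this PR; `get_cached_tokenizer_sketch` and `SlowTokenizer` are hypothetical names.

```python
import copy


class SlowTokenizer:
    """Toy tokenizer (assumption) whose vocabulary lookup is 'expensive'."""

    def __init__(self) -> None:
        self.vocab_calls = 0

    def get_vocab(self) -> dict:
        self.vocab_calls += 1  # stands in for costly work
        return {"<pad>": 0, "hello": 1, "world": 2}

    def __len__(self) -> int:
        return len(self.get_vocab())


def get_cached_tokenizer_sketch(tokenizer):
    """Return a shallow copy of `tokenizer` whose lookups are precomputed."""
    vocab = tokenizer.get_vocab()  # computed exactly once, up front
    vocab_size = len(vocab)

    class CachedTokenizer(tokenizer.__class__):
        # Overrides return the precomputed values instead of recomputing.
        def get_vocab(self):
            return vocab

        def __len__(self):
            return vocab_size

    cached = copy.copy(tokenizer)
    cached.__class__ = CachedTokenizer  # keep state, swap in cached methods
    return cached
```

Subclassing the tokenizer's own class keeps `isinstance` checks against the original type working, while the original object is left unmodified.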

vllm/tokenizers/mistral.py

…r` (vllm-project#29730) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

DarkLight1337 added 2 commits November 29, 2025 14:55

[Misc] Update TokenizerLike interface and move get_cached_tokenizer

f747616

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Pass args

dca47ed

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

DarkLight1337 requested review from Isotr0py and patrickvonplaten November 29, 2025 14:58

DarkLight1337 requested review from aarnphm, chaunceyjiang and hmellor as code owners November 29, 2025 14:58

DarkLight1337 added the ready label Nov 29, 2025

mergify bot added the documentation, ci/build, frontend and v1 labels Nov 29, 2025

DarkLight1337 requested a review from njhill November 29, 2025 15:00

gemini-code-assist bot reviewed Nov 29, 2025

vllm/tokenizers/mistral.py (two review threads, both resolved)

Isotr0py approved these changes Nov 29, 2025

DarkLight1337 added 4 commits November 29, 2025 16:40

Merge branch 'main' into cached-tokenizer

d40f5ec

Update tests

7045c02

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

Merge branch 'main' into cached-tokenizer

8249a02

Fix

ef50006

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

DarkLight1337 merged commit 2afcec4 into vllm-project:main Nov 30, 2025
49 checks passed

DarkLight1337 deleted the cached-tokenizer branch November 30, 2025 06:59

kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025

[Misc] Update TokenizerLike interface and move `get_cached_tokenize…

0291cc0

…r` (vllm-project#29730) Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
