Commit 2afcec4

[Misc] Update TokenizerLike interface and move get_cached_tokenizer (#29730)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
1 parent 9381b5c commit 2afcec4

15 files changed: +260 -174 lines

.buildkite/test-amd.yaml
Lines changed: 6 additions & 7 deletions

@@ -61,8 +61,8 @@ steps:
   - pytest -v -s -m 'not cpu_test' multimodal
   - pytest -v -s utils_

-- label: Async Engine, Inputs, Utils, Worker, Config Test (CPU) # 4 mins
-  timeout_in_minutes: 10
+- label: Async Engine, Inputs, Utils, Worker, Config Test (CPU) # 15min
+  timeout_in_minutes: 20
   mirror_hardwares: [amdexperimental, amdproduction]
   agent_pool: mi325_1
   # grade: Blocking
@@ -72,6 +72,7 @@ steps:
   - tests/test_outputs.py
   - tests/multimodal
   - tests/standalone_tests/lazy_imports.py
+  - tests/tokenizers_
   - tests/transformers_utils
   - tests/config
   no_gpu: true
@@ -80,6 +81,7 @@ steps:
   - pytest -v -s test_inputs.py
   - pytest -v -s test_outputs.py
   - pytest -v -s -m 'cpu_test' multimodal
+  - pytest -v -s tokenizers_
   - pytest -v -s transformers_utils
   - pytest -v -s config

@@ -308,23 +310,20 @@ steps:
   - pytest -v -s test_regression.py
   working_dir: "/vllm-workspace/tests" # optional

-- label: Engine Test # 25min
-  timeout_in_minutes: 40
+- label: Engine Test # 9min
+  timeout_in_minutes: 15
   mirror_hardwares: [amdexperimental, amdproduction]
   agent_pool: mi325_1
   # grade: Blocking
   source_file_dependencies:
   - vllm/
   - tests/engine
-  - tests/tokenizers_
   - tests/test_sequence
   - tests/test_config
   - tests/test_logger
   - tests/test_vllm_port
   commands:
   - pytest -v -s engine test_sequence.py test_config.py test_logger.py test_vllm_port.py
-  # OOM in the CI unless we run this separately
-  - pytest -v -s tokenizers_

 - label: V1 Test e2e + engine # 30min
   timeout_in_minutes: 45

.buildkite/test-pipeline.yaml
Lines changed: 6 additions & 7 deletions

@@ -57,14 +57,15 @@ steps:
   - pytest -v -s -m 'not cpu_test' multimodal
   - pytest -v -s utils_

-- label: Async Engine, Inputs, Utils, Worker, Config Test (CPU) # 4 mins
-  timeout_in_minutes: 10
+- label: Async Engine, Inputs, Utils, Worker, Config Test (CPU) # 15min
+  timeout_in_minutes: 20
   source_file_dependencies:
   - vllm/
   - tests/test_inputs.py
   - tests/test_outputs.py
   - tests/multimodal
   - tests/standalone_tests/lazy_imports.py
+  - tests/tokenizers_
   - tests/transformers_utils
   - tests/config
   no_gpu: true
@@ -73,6 +74,7 @@ steps:
   - pytest -v -s test_inputs.py
   - pytest -v -s test_outputs.py
   - pytest -v -s -m 'cpu_test' multimodal
+  - pytest -v -s tokenizers_
   - pytest -v -s transformers_utils
   - pytest -v -s config

@@ -276,21 +278,18 @@ steps:
   - pytest -v -s test_regression.py
   working_dir: "/vllm-workspace/tests" # optional

-- label: Engine Test # 25min
-  timeout_in_minutes: 40
+- label: Engine Test # 9min
+  timeout_in_minutes: 15
   mirror_hardwares: [amdexperimental]
   source_file_dependencies:
   - vllm/
   - tests/engine
-  - tests/tokenizers_
   - tests/test_sequence
   - tests/test_config
   - tests/test_logger
   - tests/test_vllm_port
   commands:
   - pytest -v -s engine test_sequence.py test_config.py test_logger.py test_vllm_port.py
-  # OOM in the CI unless we run this separately
-  - pytest -v -s tokenizers_

 - label: V1 Test e2e + engine # 30min
   timeout_in_minutes: 45

docs/design/huggingface_integration.md
Lines changed: 1 addition & 1 deletion

@@ -21,7 +21,7 @@ Let's say we want to serve the popular Qwen model by running `vllm serve Qwen/Qw

 Beyond that, there are two more things vLLM depends on Hugging Face for.

-1. **Tokenizer**: vLLM uses the tokenizer from Hugging Face to tokenize the input text. The tokenizer is loaded using [AutoTokenizer.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) with the `model` argument as the model name and the `--revision` argument as the revision. It is also possible to use a tokenizer from another model by specifying the `--tokenizer` argument in the `vllm serve` command. Other relevant arguments are `--tokenizer-revision` and `--tokenizer-mode`. Please check Hugging Face's documentation for the meaning of these arguments. This part of the logic can be found in the [get_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L87) function. After obtaining the tokenizer, notably, vLLM will cache some expensive attributes of the tokenizer in [get_cached_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L24).
+1. **Tokenizer**: vLLM uses the tokenizer from Hugging Face to tokenize the input text. The tokenizer is loaded using [AutoTokenizer.from_pretrained](https://huggingface.co/docs/transformers/en/model_doc/auto#transformers.AutoTokenizer.from_pretrained) with the `model` argument as the model name and the `--revision` argument as the revision. It is also possible to use a tokenizer from another model by specifying the `--tokenizer` argument in the `vllm serve` command. Other relevant arguments are `--tokenizer-revision` and `--tokenizer-mode`. Please check Hugging Face's documentation for the meaning of these arguments. This part of the logic can be found in the [get_tokenizer](https://github.com/vllm-project/vllm/blob/127c07480ecea15e4c2990820c457807ff78a057/vllm/transformers_utils/tokenizer.py#L87) function. After obtaining the tokenizer, notably, vLLM will cache some expensive attributes of the tokenizer in [vllm.tokenizers.hf.get_cached_tokenizer][].

 2. **Model weight**: vLLM downloads the model weight from the Hugging Face model hub using the `model` argument as the model name and the `--revision` argument as the revision. vLLM provides the argument `--load-format` to control what files to download from the model hub. By default, it will try to load the weights in the safetensors format and fall back to the PyTorch bin format if the safetensors format is not available. We can also pass `--load-format dummy` to skip downloading the weights.
     - It is recommended to use the safetensors format, as it is efficient for loading in distributed inference and also safe from arbitrary code execution. See the [documentation](https://huggingface.co/docs/safetensors/en/index) for more information on the safetensors format. This part of the logic can be found [here](https://github.com/vllm-project/vllm/blob/10b67d865d92e376956345becafc249d4c3c0ab7/vllm/model_executor/model_loader/loader.py#L385). Please note that:
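
The tokenizer flow described in that doc bullet can be sketched as follows. This is a minimal sketch, not vLLM's actual get_tokenizer logic (which also handles the `--tokenizer`, `--tokenizer-revision`, and `--tokenizer-mode` arguments mentioned above); the model name and revision are placeholders, and the import path is the new location introduced by this commit.

# Minimal sketch: load the Hugging Face tokenizer, then cache its expensive attributes.
# Model name and revision are placeholder values.
from transformers import AutoTokenizer

from vllm.tokenizers.hf import get_cached_tokenizer

hf_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct", revision="main")
# Wrap the tokenizer so its expensive attributes are computed once and then reused.
tokenizer = get_cached_tokenizer(hf_tokenizer)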

tests/tokenizers_/test_cached_tokenizer.py renamed to tests/tokenizers_/test_hf.py
Lines changed: 1 addition & 1 deletion

@@ -7,7 +7,7 @@
 from transformers import AutoTokenizer

 from vllm.tokenizers import TokenizerLike
-from vllm.transformers_utils.tokenizer import get_cached_tokenizer
+from vllm.tokenizers.hf import get_cached_tokenizer


 @pytest.mark.parametrize("model_id", ["gpt2", "zai-org/chatglm3-6b"])

tests/tokenizers_/test_mistral.py
Lines changed: 3 additions & 3 deletions

@@ -356,8 +356,8 @@ def test_call(self, mistral_tokenizer: MistralTokenizer):
         )
         attn_mask = [1 for _ in range(len(token_ids))]

-        # Test 1: default
-        assert mistral_tokenizer("Hello world !") == {
+        # Test 1: no special tokens
+        assert mistral_tokenizer("Hello world !", add_special_tokens=False) == {
             "attention_mask": attn_mask[1:],
             "input_ids": token_ids[1:],
         }
@@ -381,7 +381,7 @@ def test_call(self, mistral_tokenizer: MistralTokenizer):
             "input_ids": token_ids,
         }
         # Test 5: empty string
-        assert mistral_tokenizer("") == {
+        assert mistral_tokenizer("", add_special_tokens=False) == {
             "attention_mask": [],
             "input_ids": [],
         }

tests/tokenizers_/test_registry.py
Lines changed: 11 additions & 5 deletions

@@ -17,20 +17,26 @@ def bos_token_id(self) -> int:
     def eos_token_id(self) -> int:
         return 1

+    @property
+    def pad_token_id(self) -> int:
+        return 2
+
+    @property
+    def is_fast(self) -> bool:
+        return True
+

 def test_customized_tokenizer():
-    TokenizerRegistry.register(
-        "test_tokenizer",
-        __name__,
-        TestTokenizer.__name__,
-    )
+    TokenizerRegistry.register("test_tokenizer", __name__, TestTokenizer.__name__)

     tokenizer = TokenizerRegistry.get_tokenizer("test_tokenizer")
     assert isinstance(tokenizer, TestTokenizer)
     assert tokenizer.bos_token_id == 0
     assert tokenizer.eos_token_id == 1
+    assert tokenizer.pad_token_id == 2

     tokenizer = get_tokenizer("test_tokenizer", tokenizer_mode="custom")
     assert isinstance(tokenizer, TestTokenizer)
     assert tokenizer.bos_token_id == 0
     assert tokenizer.eos_token_id == 1
+    assert tokenizer.pad_token_id == 2

tools/pre_commit/check_pickle_imports.py
Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@
     "vllm/distributed/device_communicators/shm_broadcast.py",
     "vllm/distributed/device_communicators/shm_object_storage.py",
     "vllm/utils/hashing.py",
-    "tests/tokenizers_/test_cached_tokenizer.py",
+    "tests/tokenizers_/test_hf.py",
     "tests/utils_/test_hashing.py",
     "benchmarks/kernels/graph_machete_bench.py",
     "benchmarks/kernels/benchmark_lora.py",

vllm/entrypoints/llm.py
Lines changed: 1 addition & 1 deletion

@@ -72,7 +72,7 @@
 from vllm.sampling_params import BeamSearchParams, RequestOutputKind, SamplingParams
 from vllm.tasks import PoolingTask
 from vllm.tokenizers import MistralTokenizer, TokenizerLike
-from vllm.transformers_utils.tokenizer import get_cached_tokenizer
+from vllm.tokenizers.hf import get_cached_tokenizer
 from vllm.usage.usage_lib import UsageContext
 from vllm.utils.collection_utils import as_iter, is_list_of
 from vllm.utils.counter import Counter

vllm/entrypoints/score_utils.py
Lines changed: 2 additions & 2 deletions

@@ -51,8 +51,8 @@ def _cosine_similarity(
     for emb_1, emb_2 in zip(embed_1, embed_2):
         pair_score = scorer(emb_1.outputs.data, emb_2.outputs.data)

-        padding = []
-        if (pad_token_id := getattr(tokenizer, "pad_token_id", None)) is not None:
+        padding: list[int] = []
+        if (pad_token_id := tokenizer.pad_token_id) is not None:
             padding = [pad_token_id]

         tokens = emb_1.prompt_token_ids + padding + emb_2.prompt_token_ids
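
Taken on its own, the padding step above reduces to the small self-contained sketch below; the token ids and pad token id are made up for illustration.

# Stand-alone illustration of the padding logic, with hypothetical token ids.
pad_token_id: int | None = 2      # stand-in for tokenizer.pad_token_id
prompt_1_token_ids = [101, 7592]  # hypothetical ids of the first prompt
prompt_2_token_ids = [2088, 102]  # hypothetical ids of the second prompt

padding: list[int] = [pad_token_id] if pad_token_id is not None else []
tokens = prompt_1_token_ids + padding + prompt_2_token_ids
assert tokens == [101, 7592, 2, 2088, 102]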

vllm/tokenizers/__init__.py
Lines changed: 2 additions & 1 deletion

@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project

+from .hf import HfTokenizer
 from .mistral import MistralTokenizer
 from .protocol import TokenizerLike
 from .registry import TokenizerRegistry

-__all__ = ["TokenizerLike", "MistralTokenizer", "TokenizerRegistry"]
+__all__ = ["TokenizerLike", "HfTokenizer", "MistralTokenizer", "TokenizerRegistry"]
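
Together with the import changes elsewhere in this commit, the package surface now looks like the sketch below; the only assumption is that vllm.tokenizers.hf exposes get_cached_tokenizer, which matches the updated imports in vllm/entrypoints/llm.py and tests/tokenizers_/test_hf.py above.

# Names re-exported from vllm.tokenizers per the updated __all__, plus the
# relocated get_cached_tokenizer helper from the new hf submodule.
from vllm.tokenizers import (
    HfTokenizer,
    MistralTokenizer,
    TokenizerLike,
    TokenizerRegistry,
)
from vllm.tokenizers.hf import get_cached_tokenizer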
