
Commit 7bfeb00

[Core] Add a random suffix to frontend-provided request IDs
Since #9550 and #10968 we support clients supplying a custom request ID. The motivation for this is that it can be very helpful when you need to correlate vLLM logs with the logs of a related service. Since the request ID is used ubiquitously across vLLM as a unique key, it is obviously problematic if we ever have multiple in-flight requests using the same client-provided request ID. We saw this happening recently when `vllm serve bench` started including a request ID and the request IDs from multiple concurrent instances caused collisions. See #27723.

We currently try to guard against request ID collisions in the frontend, in OutputProcessor:

```
def add_request(...):
    if request_id in self.request_states:
        raise ValueError(f"Request id {request_id} already running.")
```

However, this is not always effective:

1) We can have abort race conditions where a request is no longer tracked by the frontend, but is still not completed in the engine. See #15326 for an attempt to fix this.
2) We can have async scheduling race conditions where a request ID is removed from the output processor and scheduled again while the older request with that ID is still being completed by the model runner. See #29355.
3) With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has been completed in the frontend, while we wait for the decode side to fetch the KV blocks. See #20139.

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by appending a short random suffix to any request ID provided by the frontend.

We need to ensure we track the external->internal request ID mapping because abort() will be supplied an external request ID. In the case where an external request ID maps to multiple running requests, we assume the caller requires all of those requests to be aborted. The caller can use EngineCoreRequest.request_id as the request ID if they want to be more specific.

A full 32-character random UUID would be overkill as a suffix, so how many random characters would be sufficient? 8 hex characters give us 32 bits of entropy, or 16^8 possible suffixes. Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities: with N = 16^8 and k the number of generated suffixes, the probability of a collision is approximately k^2/(2N). So if a client somehow caused vLLM to hold 10k in-flight requests that reuse the same client-provided ID, there would be about a 1.16% chance of collision:

```
>>> N = 16**8
>>> k = 10_000
>>> (k**2)/(2*N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
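As a minimal sketch of the approach described above (not the vLLM implementation itself; the helper name and the direct use of `uuid` are illustrative assumptions):

```python
import uuid


def make_internal_request_id(external_request_id: str, suffix_len: int = 8) -> str:
    """Append `suffix_len` random hex characters (4 bits of entropy each) to a
    client-provided request ID so the internal ID is effectively unique."""
    return f"{external_request_id}-{uuid.uuid4().hex[:suffix_len]}"


# Two requests reusing the same client-provided ID no longer collide internally:
print(make_internal_request_id("bench-request-1"))  # e.g. bench-request-1-3f9c01ab
print(make_internal_request_id("bench-request-1"))  # e.g. bench-request-1-7d42e6f0

# The external -> internal mapping still has to be tracked so that abort()
# can translate a client-provided ID back to the matching internal ID(s).
```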
1 parent 86e178f commit 7bfeb00

File tree: 11 files changed, +94 −18 lines

tests/tokenizers_/test_detokenize.py
Lines changed: 1 addition & 0 deletions

```diff
@@ -62,6 +62,7 @@ def _run_incremental_decode(
     )
     request = EngineCoreRequest(
         request_id="",
+        external_request_id="",
         prompt_token_ids=prompt_token_ids,
         mm_features=None,
         sampling_params=params,
```

tests/v1/engine/test_process_multi_modal_uuids.py
Lines changed: 15 additions & 5 deletions

```diff
@@ -6,6 +6,7 @@
 from vllm.assets.image import ImageAsset
 from vllm.assets.video import VideoAsset
 from vllm.config import CacheConfig, DeviceConfig, ModelConfig, VllmConfig
+from vllm.multimodal import MultiModalUUIDDict
 from vllm.sampling_params import SamplingParams
 from vllm.v1.engine import input_processor as input_processor_mod
 from vllm.v1.engine.input_processor import InputProcessor
@@ -166,7 +167,7 @@ def test_multi_modal_uuids_ignored_when_caching_disabled(monkeypatch):
         monkeypatch, mm_cache_gb=0.0, enable_prefix_caching=False
     )

-    captured: dict[str, object] = {}
+    captured: dict[str, MultiModalUUIDDict] = {}

     def fake_preprocess(
         prompt, *, tokenization_kwargs=None, lora_request=None, mm_uuids=None
@@ -196,7 +197,16 @@ def fake_preprocess(
     )

     # Expect request-id-based overrides are passed through
-    assert captured["mm_uuids"] == {
-        "image": [f"{request_id}-image-0", f"{request_id}-image-1"],
-        "video": [f"{request_id}-video-0"],
-    }
+    mm_uuids = captured["mm_uuids"]
+    assert set(mm_uuids.keys()) == {"image", "video"}
+    assert len(mm_uuids["image"]) == 2
+    assert len(mm_uuids["video"]) == 1
+    assert mm_uuids["image"][0].startswith(f"{request_id}-") and mm_uuids["image"][
+        0
+    ].endswith("-image-0")
+    assert mm_uuids["image"][1].startswith(f"{request_id}-") and mm_uuids["image"][
+        1
+    ].endswith("-image-1")
+    assert mm_uuids["video"][0].startswith(f"{request_id}-") and mm_uuids["video"][
+        0
+    ].endswith("-video-0")
```
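To make the test change above concrete: with the random suffix in place, a per-item multimodal UUID is derived from the internal (suffixed) request ID, so only prefix/suffix checks stay stable. A small illustration (the IDs below are made up):

```python
import uuid

client_request_id = "req-42"
internal_request_id = f"{client_request_id}-{uuid.uuid4().hex[:8]}"

# Multimodal UUIDs are built from the *internal* request ID:
image_uuid = f"{internal_request_id}-image-0"

assert image_uuid.startswith(f"{client_request_id}-")
assert image_uuid.endswith("-image-0")
# An exact comparison against f"{client_request_id}-image-0" would now fail,
# which is why the test switched to prefix/suffix assertions.
```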

vllm/entrypoints/llm.py
Lines changed: 19 additions & 6 deletions

```diff
@@ -1700,15 +1700,30 @@ def _add_request(
         )

         self.llm_engine.add_request(
-            request_id,
+            engine_request.request_id,
             engine_request,
             params,
             lora_request=lora_request,
             tokenization_kwargs=tokenization_kwargs,
             priority=priority,
             prompt_text=prompt_text,
         )
-        return request_id
+        return engine_request.request_id
+
+    @staticmethod
+    def _sort_outputs(
+        outputs: list[RequestOutput | PoolingRequestOutput],
+    ) -> list[RequestOutput | PoolingRequestOutput]:
+        # Sort the outputs by request ID.
+        # This is necessary because some requests may be finished earlier than
+        # its previous requests.
+
+        # Extract the original request ID prefix for sorting.
+        # See how InputProcessor._generate_request_id() adds a random suffix
+        def extract_request_id_prefix(request_id: str) -> int:
+            return int(request_id.rsplit("-", 1)[0])
+
+        return sorted(outputs, key=lambda x: extract_request_id_prefix(x.request_id))

     def _run_engine(
         self, *, use_tqdm: bool | Callable[..., tqdm] = True
@@ -1756,7 +1771,5 @@ def _run_engine(

         if use_tqdm:
             pbar.close()
-        # Sort the outputs by request ID.
-        # This is necessary because some requests may be finished earlier than
-        # its previous requests.
-        return sorted(outputs, key=lambda x: int(x.request_id))
+
+        return self._sort_outputs(outputs)
```
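A small illustration of the sorting logic above, assuming (as the `LLM` entrypoint does) that external request IDs are integer counters and the internal ID is the counter plus a random suffix; the sample IDs are made up:

```python
def extract_request_id_prefix(request_id: str) -> int:
    # Strip the random suffix appended by InputProcessor._generate_request_id()
    # to recover the integer counter used as the external request ID.
    return int(request_id.rsplit("-", 1)[0])


# Outputs can finish out of submission order; sorting by the recovered
# counter restores that order:
finished = ["2-9e1b7c3d", "0-4fa0c812", "1-77d2ab90"]
print(sorted(finished, key=extract_request_id_prefix))
# ['0-4fa0c812', '1-77d2ab90', '2-9e1b7c3d']
```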

vllm/entrypoints/openai/serving_chat.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -341,7 +341,7 @@ async def create_chat_completion(
             generator = self.engine_client.generate(
                 engine_request,
                 sampling_params,
-                sub_request_id,
+                engine_request.request_id,
                 lora_request=lora_request,
                 trace_headers=trace_headers,
                 priority=request.priority,
```

vllm/entrypoints/openai/serving_completion.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -231,7 +231,7 @@ async def create_completion(
             generator = self.engine_client.generate(
                 engine_request,
                 sampling_params,
-                request_id_item,
+                engine_request.request_id,
                 lora_request=lora_request,
                 trace_headers=trace_headers,
                 priority=request.priority,
```

vllm/entrypoints/openai/serving_engine.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -1260,7 +1260,7 @@ async def _generate_with_builtin_tools(
             generator = self.engine_client.generate(
                 engine_request,
                 sampling_params,
-                sub_request_id,
+                engine_request.request_id,
                 lora_request=lora_request,
                 priority=priority,
                 prompt_text=prompt_text,
```

vllm/v1/engine/__init__.py
Lines changed: 1 addition & 0 deletions

```diff
@@ -49,6 +49,7 @@ class EngineCoreRequest(
     gc=False,
 ):  # type: ignore[call-arg]
     request_id: str
+    external_req_id: str
     prompt_token_ids: list[int] | None
     mm_features: list[MultiModalFeatureSpec] | None
     sampling_params: SamplingParams | None
```

vllm/v1/engine/async_llm.py
Lines changed: 7 additions & 1 deletion

```diff
@@ -304,6 +304,12 @@ async def add_request(
         # Convert Input --> Request.
         if isinstance(prompt, EngineCoreRequest):
             request = prompt
+            if request_id != request.request_id:
+                logger.warning_once(
+                    "AsyncLLM.add_request() was passed a request_id parameter that "
+                    "does not match the EngineCoreRequest.request_id attribute. The "
+                    "latter will be used, and the former will be ignored."
+                )
         else:
             assert prompt_text is None
             request = self.input_processor.process_inputs(
@@ -333,7 +339,7 @@ async def add_request(
         assert isinstance(parent_params, SamplingParams)

         # Fan out child requests (for n>1).
-        parent_request = ParentRequest(request_id, parent_params)
+        parent_request = ParentRequest(request.request_id, parent_params)
         for idx in range(parent_params.n):
             request_id, child_params = parent_request.get_child_info(idx)
             child_request = request if idx == parent_params.n - 1 else copy(request)
```

vllm/v1/engine/input_processor.py
Lines changed: 11 additions & 1 deletion

```diff
@@ -20,7 +20,7 @@
 from vllm.pooling_params import PoolingParams
 from vllm.sampling_params import SamplingParams
 from vllm.tokenizers import MistralTokenizer, TokenizerLike
-from vllm.utils import length_from_prompt_token_ids_or_embeds
+from vllm.utils import length_from_prompt_token_ids_or_embeds, random_uuid
 from vllm.v1.engine import EngineCoreRequest
 from vllm.v1.metrics.stats import MultiModalCacheStats
 from vllm.v1.structured_output.backend_guidance import validate_guidance_grammar
@@ -382,6 +382,12 @@ def _extract_mm_data(p: PromptType):
             mm_uuids[modality] = [f"{request_id}-{modality}-{i}" for i in range(n)]
         return mm_uuids

+    def _generate_request_id(self, request_id: str):
+        """Construct an internal request ID by adding 8 random characters
+        to the supplied request ID in order to ensure uniquness.
+        """
+        return f"{request_id}-{random_uuid()[:8]}"
+
     def process_inputs(
         self,
         request_id: str,
@@ -409,6 +415,9 @@ def process_inputs(
         if arrival_time is None:
             arrival_time = time.time()

+        external_req_id = request_id
+        request_id = self._generate_request_id(request_id)
+
         # Optionally generate multimodal hash overrides to avoid hashing
         # multimodal data items by their content as their identifiers.

@@ -509,6 +518,7 @@ def process_inputs(

         return EngineCoreRequest(
             request_id=request_id,
+            external_req_id=external_req_id,
             prompt_token_ids=prompt_token_ids,
             prompt_embeds=prompt_embeds,
             mm_features=mm_features,
```
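For reference, the 8-character suffix used by `_generate_request_id()` above corresponds to the collision bound discussed in the commit message; a quick sanity check of that bound (plain Python, no vLLM imports):

```python
# Birthday-bound approximation from the commit message: with N possible
# suffixes and k generated suffixes, P(collision) ~= k**2 / (2 * N).
N = 16**8  # 8 hex characters -> 32 bits of entropy

for k in (100, 1_000, 10_000):
    print(f"k={k:>6}: P(collision) ~= {k**2 / (2 * N):.6f}")
# k= 10000: P(collision) ~= 0.011642  (about 1.16%, as in the commit message)
```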

vllm/v1/engine/llm_engine.py
Lines changed: 7 additions & 1 deletion

```diff
@@ -239,6 +239,12 @@ def add_request(
         # Process raw inputs into the request.
         if isinstance(prompt, EngineCoreRequest):
             request = prompt
+            if request_id != request.request_id:
+                logger.warning_once(
+                    "AsyncLLM.add_request() was passed a request_id parameter that "
+                    "does not match the EngineCoreRequest.request_id attribute. The "
+                    "latter will be used, and the former will be ignored."
+                )
         else:
             assert prompt_text is None
             request = self.input_processor.process_inputs(
@@ -269,7 +275,7 @@ def add_request(
             return

         # Fan out child requests (for n>1).
-        parent_req = ParentRequest(request_id, params)
+        parent_req = ParentRequest(request.request_id, params)
         for idx in range(n):
             request_id, child_params = parent_req.get_child_info(idx)
             child_request = request if idx == n - 1 else copy(request)
```
