Commit 17c62a8

[Core] Add a random suffix to frontend-provided request IDs
Since vllm-project#9550 and vllm-project#10968 we support clients supplying a custom request ID. The motivation for this is that it can be very helpful when you need to correlate vLLM logs with the logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique key, it is obviously problematic if we ever have multiple in-flight requests using the same client-provided request ID. We saw this happening recently when `vllm bench serve` started including a request ID and the request IDs from multiple concurrent instances caused collisions. See vllm-project#27723.

We currently try to guard against request ID collisions in the frontend, in OutputProcessor:

```
def add_request(...):
    if request_id in self.request_states:
        raise ValueError(f"Request id {request_id} already running.")
```

However, this is not always effective:

1) We can have abort race conditions where a request is no longer tracked by the frontend, but still not completed in the engine. See vllm-project#15326 for an attempt to fix this.

2) We can have async scheduling race conditions where a request ID is removed from the output processor and scheduled while the older request with that ID is still being completed by the model runner. See vllm-project#29355.

3) With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has been completed in the frontend, while we wait for the decode side to fetch the KV blocks. See vllm-project#20139.

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by appending a short random suffix to any request ID provided by the frontend.

We need to track the external->internal request ID mapping because abort() will be supplied an external request ID. In the case where an external request ID maps to multiple running requests, we assume the caller requires all of those requests to be aborted. The caller can use EngineCoreRequest.request_id as the request ID if they want to be more specific.

A full 32-character random UUID would be overkill as a suffix, so how many random characters would be sufficient? 8 hex characters give us 32 bits of entropy, or 16^8 possible suffixes. Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities: with N = 16^8 possible suffixes and k generated suffixes, the probability of a collision is approximately (k^2)/(2N). So if a client somehow caused vLLM to hold 10k requests that reuse the same client-provided ID, there would be a 1.16% chance of collision:

```
>>> N = 16**8
>>> k = 10_000
>>> (k**2)/(2*N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
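To make the scheme concrete, here is a minimal sketch of deriving an internal request ID by appending eight random hex characters (32 bits) to the client-supplied ID. The function name and the use of `secrets.token_hex` are illustrative assumptions, not the actual vLLM implementation:

```python
import secrets


def make_internal_request_id(external_req_id: str) -> str:
    """Append an 8-hex-character (32-bit) random suffix so that concurrent
    requests reusing the same client-provided ID stay unique in the engine."""
    # secrets.token_hex(4) returns 4 random bytes encoded as 8 hex characters,
    # i.e. one of 16**8 possible suffixes.
    return f"{external_req_id}-{secrets.token_hex(4)}"


# Birthday-bound sanity check from the analysis above: with N = 16**8 suffixes
# and k = 10_000 requests sharing one client ID, P(collision) ~= k**2 / (2 * N).
N, k = 16**8, 10_000
assert round(k**2 / (2 * N), 4) == 0.0116
```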
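And a minimal sketch of the external-to-internal bookkeeping described above, assuming the frontend records every internal ID spawned from a client-provided ID; the class and method names here are hypothetical, not vLLM's actual API. When abort() is given an external request ID, every in-flight internal ID derived from it is returned for abortion:

```python
from collections import defaultdict


class RequestIdTracker:
    """Hypothetical external -> internal request ID bookkeeping."""

    def __init__(self) -> None:
        self._internal_ids: dict[str, set[str]] = defaultdict(set)

    def add(self, external_req_id: str, internal_req_id: str) -> None:
        # Called when a new request is accepted by the frontend.
        self._internal_ids[external_req_id].add(internal_req_id)

    def finish(self, external_req_id: str, internal_req_id: str) -> None:
        # Called when a request completes normally.
        ids = self._internal_ids.get(external_req_id)
        if ids is not None:
            ids.discard(internal_req_id)
            if not ids:
                del self._internal_ids[external_req_id]

    def pop_for_abort(self, external_req_id: str) -> set[str]:
        # abort() receives an external ID; abort *all* requests mapped to it.
        return self._internal_ids.pop(external_req_id, set())
```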
1 parent 2902c34 commit 17c62a8

25 files changed: +311 −98 lines changed

tests/entrypoints/openai/test_serving_chat.py

Lines changed: 11 additions & 2 deletions
```diff
@@ -390,7 +390,9 @@ async def _fake_process_inputs(
         trace_headers,
         priority,
     ):
-        return dict(engine_prompt), {}
+        mock_request = MagicMock()
+        mock_request.request_id = request_id
+        return mock_request, {}
 
     serving_chat._process_inputs = AsyncMock(side_effect=_fake_process_inputs)
     return serving_chat
@@ -662,7 +664,11 @@ async def test_serving_chat_data_parallel_rank_extraction():
     mock_engine.get_tokenizer.return_value = get_tokenizer(MODEL_NAME)
     mock_engine.errored = False
     mock_engine.model_config = MockModelConfig()
+
+    mock_request = MagicMock()
+    mock_request.request_id = "test-request-internal"
     mock_engine.input_processor = MagicMock()
+    mock_engine.input_processor.process_inputs.return_value = mock_request
     mock_engine.io_processor = MagicMock()
 
     # Mock the generate method to return an async generator
@@ -672,6 +678,7 @@ async def mock_generate(*args, **kwargs):
 
         yield RequestOutput(
             request_id="test-request",
+            internal_req_id="test-request-int",
            prompt="test prompt",
             prompt_token_ids=[1, 2, 3],
             prompt_logprobs=None,
@@ -689,7 +696,9 @@ async def mock_generate(*args, **kwargs):
             finished=True,
         )
 
-    mock_engine.generate = AsyncMock(side_effect=mock_generate)
+    mock_engine.generate = MagicMock(
+        side_effect=lambda *args, **kwargs: mock_generate()
+    )
 
     serving_chat = _build_serving_chat(mock_engine)
```

tests/entrypoints/test_context.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -37,6 +37,7 @@ def create_mock_request_output(
 
     return RequestOutput(
         request_id="test-id",
+        internal_req_id="test-id-int",
         prompt="Test prompt",
         prompt_token_ids=prompt_token_ids,
         prompt_logprobs=None,
```

tests/test_outputs.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -11,6 +11,7 @@
 def test_request_output_forward_compatible():
     output = RequestOutput(
         request_id="test_request_id",
+        internal_req_id="test_request_id_internal",
         prompt="test prompt",
         prompt_token_ids=[1, 2, 3],
         prompt_logprobs=None,
```

tests/tokenizers_/test_detokenize.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -62,6 +62,7 @@ def _run_incremental_decode(
     )
     request = EngineCoreRequest(
         request_id="",
+        external_req_id="",
         prompt_token_ids=prompt_token_ids,
         mm_features=None,
         sampling_params=params,
```

tests/v1/engine/test_async_llm.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -253,7 +253,7 @@ async def test_multi_abort(output_kind: RequestOutputKind):
 
     # Use multi-abort to abort multiple requests at once
     abort_request_ids = [request_ids[i] for i in REQUEST_IDS_TO_ABORT]
-    await engine.abort(abort_request_ids)
+    await engine.abort(abort_request_ids, internal=False)
 
     # Wait for all tasks to complete
     results = await asyncio.gather(*tasks, return_exceptions=True)
@@ -548,7 +548,7 @@ async def test_abort_final_output(output_kind: RequestOutputKind):
     await asyncio.sleep(0.5)
 
     # Abort the request
-    await engine.abort(request_id)
+    await engine.abort(request_id, internal=False)
 
     # Wait for generation to complete and return final output
     final_output = await generated
```

tests/v1/engine/test_engine_core.py

Lines changed: 7 additions & 1 deletion
```diff
@@ -40,10 +40,16 @@
 PROMPT = "I am Gyoubu Masataka Oniwa"
 PROMPT_TOKENS = TOKENIZER(PROMPT).input_ids
 
+_REQUEST_COUNTER = 0
+
 
 def make_request() -> EngineCoreRequest:
+    global _REQUEST_COUNTER
+    _REQUEST_COUNTER += 1
+    request_id = f"request-{_REQUEST_COUNTER}"
     return EngineCoreRequest(
-        request_id=str(uuid.uuid4()),
+        request_id=request_id,
+        external_req_id=f"{request_id}-{uuid.uuid4()}",
         prompt_token_ids=PROMPT_TOKENS,
         mm_features=None,
         sampling_params=SamplingParams(),
```
