Commit 8f8c66d
[Core] Add a random suffix to frontend-provided request IDs
Since vllm-project#9550 and vllm-project#10968 we support clients supplying a custom request ID. The motivation for this is that it can be very helpful when you need to correlate vLLM logs with the logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique key, it is clearly problematic if we ever have multiple in-flight requests using the same client-provided request ID. We saw this happening recently when `vllm bench serve` started including a request ID and the request IDs from multiple concurrent instances caused collisions. See vllm-project#27723.

We currently try to guard against request ID collisions in the frontend, in `OutputProcessor`:

```
def add_request(...):
    if request_id in self.request_states:
        raise ValueError(f"Request id {request_id} already running.")
```

However, this is not always effective:

1) We can have abort race conditions where a request is no longer tracked by the frontend but is still not completed in the engine. See vllm-project#15326 for an attempt to fix this.

2) We can have async scheduling race conditions where a request ID is removed from the output processor and is being scheduled again while the older request with that ID is still being completed by the model runner. See vllm-project#29355.

3) With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has completed in the frontend, while we wait for the decode side to fetch the KV blocks. See vllm-project#20139.

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by appending a short random suffix to any request ID provided by the frontend.

A full 32-character random UUID would be overkill as a suffix, so how many random characters would be sufficient? 8 hex characters give us 32 bits of entropy, or 16^8 possible suffixes. Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities, with N = 16^8 and k the number of generated suffixes, the probability of a collision is approximately (k^2)/(2N). So if a client somehow caused vLLM to hold 10k in-flight requests that all reuse the same client-provided ID, there would be a 1.16% chance of collision:

```
>>> N = 16 ** 8
>>> k = 10_000
>>> (k ** 2) / (2 * N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

The key changes to support this are:

1. `InputProcessor.process_inputs()` - we add some randomness to the request ID just before creating an `EngineCoreRequest` (sketched below), and store both the random "internal" request ID (as `request_id`) and the supplied "external" request ID (as `external_req_id`) in the `EngineCoreRequest`.
2. `RequestState.make_request_output()` - we ensure that `RequestOutput.request_id` continues to be the external request ID (for backwards compatibility) and add `internal_request_id`.
3. `OutputProcessor.abort_requests()` - we make `OutputProcessor` track a mapping from external request ID to internal request IDs, so `abort_requests()` can abort based on either ID.
4. `AsyncLLM` - we use `RequestOutputCollector` to track the internal request ID, so we can use the internal ID to abort an in-progress request. We also add an `internal` boolean flag to `abort()` so API users can abort based on either ID.
5. `ParentRequest` - in the case of parallel sampling, we need to track both the internal and external IDs for the later creation of the `RequestOutput` aggregating the child outputs.
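For illustration, a minimal sketch of the suffixing in point 1, assuming a hypothetical helper name (the actual change lives in `InputProcessor.process_inputs()` and may differ in detail):

```python
# Illustrative sketch only: append 8 random hex characters (32 bits of
# entropy) to whatever request ID the client supplied. The helper name is
# hypothetical, not part of vLLM.
import uuid


def make_internal_request_id(external_req_id: str) -> str:
    return f"{external_req_id}-{uuid.uuid4().hex[:8]}"


# e.g. make_internal_request_id("my-client-id") -> "my-client-id-3fa85f64"
```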
We need to ensure we track the external-to-internal request ID mapping because `abort()` will be supplied an external request ID. In the case where an external request ID maps to multiple running requests, we assume the caller wants all of those requests aborted. The caller can use `EngineCoreRequest.request_id` as the request ID if they want to be more specific.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
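For illustration, a rough sketch of that external-to-internal bookkeeping, with hypothetical names (the real tracking lives in `OutputProcessor` and `AsyncLLM`): an external ID resolves to every in-flight internal ID created under it, while an internal ID resolves only to itself.

```python
# Illustrative sketch only -- the names below are hypothetical, not vLLM's
# actual OutputProcessor implementation.
from collections import defaultdict

# external request ID -> internal request IDs currently in flight
_external_to_internal: dict[str, set[str]] = defaultdict(set)


def track(external_req_id: str, internal_req_id: str) -> None:
    _external_to_internal[external_req_id].add(internal_req_id)


def resolve_for_abort(request_id: str, internal: bool = False) -> set[str]:
    """Abort by internal ID directly, or fan out to everything under an external ID."""
    if internal:
        return {request_id}
    return set(_external_to_internal.get(request_id, ()))
```

With this shape, `abort(ids, internal=False)` can fan out to every internal request that shares a client-provided ID, which matches the behavior described above.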
1 parent 92c35ab, commit 8f8c66d

23 files changed: +317 -159 lines

tests/detokenizer/test_min_tokens.py (1 addition, 0 deletions)

```diff
@@ -35,6 +35,7 @@ def test_min_tokens_with_stop(min_tokens: int, stop: str, truth: str):
     )
     request = EngineCoreRequest(
         request_id="",
+        external_req_id="",
         prompt_token_ids=prompt_token_ids,
         mm_features=None,
         sampling_params=params,
```
tests/detokenizer/test_stop_string_while_stop_model_terminates.py (1 addition, 0 deletions)

```diff
@@ -31,6 +31,7 @@ def _make_request(stop, include_stop_str_in_output: bool, min_tokens: int = 0):
     # Keep other fields minimal for unit test purposes.
     req = EngineCoreRequest(
         request_id="test",
+        external_req_id="test-ext",
         prompt_token_ids=[],
         mm_features=None,
         sampling_params=params,
```
tests/entrypoints/openai/test_serving_chat.py (10 additions, 2 deletions)

```diff
@@ -390,7 +390,9 @@ async def _fake_process_inputs(
         trace_headers,
         priority,
     ):
-        return dict(engine_prompt), {}
+        mock_request = MagicMock()
+        mock_request.request_id = request_id
+        return mock_request, {}
 
     serving_chat._process_inputs = AsyncMock(side_effect=_fake_process_inputs)
     return serving_chat
@@ -662,7 +664,11 @@ async def test_serving_chat_data_parallel_rank_extraction():
     mock_engine.get_tokenizer.return_value = get_tokenizer(MODEL_NAME)
     mock_engine.errored = False
     mock_engine.model_config = MockModelConfig()
+
+    mock_request = MagicMock()
+    mock_request.request_id = "test-request-internal"
     mock_engine.input_processor = MagicMock()
+    mock_engine.input_processor.process_inputs.return_value = mock_request
     mock_engine.io_processor = MagicMock()
 
     # Mock the generate method to return an async generator
@@ -689,7 +695,9 @@ async def mock_generate(*args, **kwargs):
             finished=True,
         )
 
-    mock_engine.generate = AsyncMock(side_effect=mock_generate)
+    mock_engine.generate = MagicMock(
+        side_effect=lambda *args, **kwargs: mock_generate()
+    )
 
     serving_chat = _build_serving_chat(mock_engine)
 
```
tests/tokenizers_/test_detokenize.py (1 addition, 0 deletions)

```diff
@@ -62,6 +62,7 @@ def _run_incremental_decode(
     )
     request = EngineCoreRequest(
         request_id="",
+        external_req_id="",
         prompt_token_ids=prompt_token_ids,
         mm_features=None,
         sampling_params=params,
```
tests/v1/engine/test_async_llm.py (2 additions, 2 deletions)

```diff
@@ -253,7 +253,7 @@ async def test_multi_abort(output_kind: RequestOutputKind):
 
     # Use multi-abort to abort multiple requests at once
     abort_request_ids = [request_ids[i] for i in REQUEST_IDS_TO_ABORT]
-    await engine.abort(abort_request_ids)
+    await engine.abort(abort_request_ids, internal=False)
 
     # Wait for all tasks to complete
     results = await asyncio.gather(*tasks, return_exceptions=True)
@@ -548,7 +548,7 @@ async def test_abort_final_output(output_kind: RequestOutputKind):
     await asyncio.sleep(0.5)
 
     # Abort the request
-    await engine.abort(request_id)
+    await engine.abort(request_id, internal=False)
 
     # Wait for generation to complete and return final output
     final_output = await generated
```
tests/v1/engine/test_engine_core.py (7 additions, 1 deletion)

```diff
@@ -40,10 +40,16 @@
 PROMPT = "I am Gyoubu Masataka Oniwa"
 PROMPT_TOKENS = TOKENIZER(PROMPT).input_ids
 
+_REQUEST_COUNTER = 0
+
 
 def make_request() -> EngineCoreRequest:
+    global _REQUEST_COUNTER
+    _REQUEST_COUNTER += 1
+    request_id = f"request-{_REQUEST_COUNTER}"
     return EngineCoreRequest(
-        request_id=str(uuid.uuid4()),
+        request_id=request_id,
+        external_req_id=f"{request_id}-{uuid.uuid4()}",
         prompt_token_ids=PROMPT_TOKENS,
         mm_features=None,
         sampling_params=SamplingParams(),
```
tests/v1/engine/test_engine_core_client.py (7 additions, 1 deletion)

```diff
@@ -39,15 +39,21 @@
 PROMPT = "Hello my name is Robert and I love quantization kernels"
 PROMPT_TOKENS = TOKENIZER(PROMPT).input_ids
 
+_REQUEST_COUNTER = 0
+
 
 def make_request(
     params: SamplingParams, prompt_tokens_ids: list[int] | None = None
 ) -> EngineCoreRequest:
     if not prompt_tokens_ids:
         prompt_tokens_ids = PROMPT_TOKENS
 
+    global _REQUEST_COUNTER
+    _REQUEST_COUNTER += 1
+    request_id = f"request-{_REQUEST_COUNTER}"
     return EngineCoreRequest(
-        request_id=str(uuid.uuid4()),
+        request_id=request_id,
+        external_req_id=f"{request_id}-{uuid.uuid4()}",
         prompt_token_ids=prompt_tokens_ids,
         mm_features=None,
         sampling_params=params,
```

tests/v1/engine/test_fast_incdec_prefix_err.py (1 addition, 0 deletions)

```diff
@@ -27,6 +27,7 @@ def test_fast_inc_detok_invalid_utf8_err_case():
     params = SamplingParams(skip_special_tokens=True)
     request = EngineCoreRequest(
         request_id="test",
+        external_req_id="test-ext",
         prompt_token_ids=prompt_token_ids,
         mm_features=None,
         sampling_params=params,
```
