
Conversation

@markmc markmc commented Nov 3, 2025

Since #9550 and #10968 we support clients supplying a custom request ID. The motivation is that this can be very helpful when you need to correlate vLLM logs with the logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique key, it is obviously problematic if we ever have multiple in-flight requests using the same client-provided request ID.

We saw this happen recently when `vllm serve bench` started including a request ID and the request IDs from multiple concurrent instances collided. See #27723

We currently try to guard against request ID collisions in the frontend, in OutputProcessor:

    def add_request(...):
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")

however, this is not always effective:

  1. We can have abort race conditions where a request is no longer tracked by the frontend but is still not completed in the engine. See #15326 for an attempt to fix this.
  2. We can have async scheduling race conditions where a request ID is removed from the output processor and a new request with that ID is scheduled while the older request is still being completed by the model runner. See #29355
  3. With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has been completed in the frontend, while we wait for the decode side to fetch the KV blocks. See #20139

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by appending a short random suffix to any request ID provided by the frontend.
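
As a minimal sketch of the idea (the helper name and exact format here are illustrative, not the actual implementation):

```python
import uuid

def make_internal_request_id(external_request_id: str) -> str:
    # Append 8 random hex characters (32 bits of entropy) to the
    # client-provided ID so the internal key is effectively unique.
    return f"{external_request_id}-{uuid.uuid4().hex[:8]}"
```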

We need to ensure we track the external->internal request ID mapping because abort() will be supplied an external request ID. In the case where an external request ID maps to multiple running requests, we assume the caller requires all of those requests to be aborted. The caller can use EngineCoreRequest.request_id as the request ID if they want to be more specific.
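
Roughly, the external->internal mapping could look like this (an illustrative sketch, not the actual implementation):

```python
from collections import defaultdict

class RequestIdTracker:
    """Track which internal request IDs back each external request ID."""

    def __init__(self) -> None:
        self._internal: defaultdict[str, set[str]] = defaultdict(set)

    def add(self, external_id: str, internal_id: str) -> None:
        self._internal[external_id].add(internal_id)

    def abort_all(self, external_id: str) -> set[str]:
        # An external ID may map to several in-flight requests; the
        # caller gets back every internal ID so all of them can be aborted.
        return self._internal.pop(external_id, set())
```

abort() would then abort every ID returned by abort_all(), and the same map would make it trivial to detect (or reject) a duplicate external ID at submission time.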

A full 32-character random UUID would be overkill as a suffix, so how many random characters would be sufficient? 8 hex characters give us 32 bits of entropy, or 16^8 possible suffixes.

Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities:

With N = 16^8 possible suffixes and k generated suffixes, the probability of a collision is approximately (k^2)/(2N). So if a client somehow caused vLLM to hold 10k in-flight requests that all reuse the same client-provided ID, there would be a 1.16% chance of a collision:

```
>>> k, N = 10_000, 16**8
>>> (k**2) / (2 * N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses the important issue of request ID collisions when clients provide their own IDs. The solution of prepending a random prefix to client-provided IDs is sound and effectively mitigates the risk of collisions, especially in scenarios with race conditions or concurrent requests using the same ID. The implementation is well-executed, centralizing the new logic in the _internal_request_id method within OpenAIServing. This change is consistently applied across all relevant OpenAI entrypoints, replacing the old _base_request_id logic. The new method correctly handles precedence for request IDs from headers versus request bodies. I've reviewed the changes and found no critical or high-severity issues. The code is clean, correct, and a clear improvement for the system's robustness.

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 3, 2025
@markmc markmc requested a review from njhill November 17, 2025 07:54

markmc commented Nov 25, 2025

Rebased onto #29375


markmc commented Nov 25, 2025

Discussed this with @njhill in the context of #27614 (enabling async scheduling by default) and a duplicate request ID issue it identified (#29355).

This PR currently addresses the issue at the OpenAI server frontend level - it ensures that any request which comes through this path will have a unique request ID, while continuing to include the client-provided request ID for log correlation. However, this is probably something we should guard against even at the AsyncLLM and LLM API level, as follows:

  1. Every request has a unique request ID generated for it when it is received by e.g. generate() or add_request() - this would probably happen when we allocate the EngineCoreRequest in the Processor
  2. This "internal request ID" would be the request_id we use everywhere as a unique key
  3. The client-provided request ID would be retained and passed around so that it can be the one logged anywhere we log the request ID

Nick has a prototype and believes (3) isn't nearly as invasive as it sounds.

One concern is what to do in the abort() API - we will now be keying on an internal unique ID, and it may be the case that the request ID passed to abort() corresponds to many requests. We would have no choice but to abort all matching requests. And there isn't even a mechanism in the API to know what internal request ID was allocated and pass that instead. This is probably fine though - API users will learn to use unique request IDs.


markmc commented Nov 25, 2025

> One concern is what to do in the abort() API - we will now be keying on an internal unique ID, and it may be the case that the request ID passed to abort() corresponds to many requests. We would have no choice but to abort all matching requests. And there isn't even a mechanism in the API to know what internal request ID was allocated and pass that instead. This is probably fine though - API users will learn to use unique request IDs.

To avoid this abort() issue, we could choose to reject duplicate request IDs on submission, since we'll be maintaining a mapping of client-provided request ID to internal request ID. This sounds pretty circular though - if we can reject duplicate request IDs reliably, we don't need an internal request ID to disambiguate them.

However, the prefill case is where maintaining this mapping accurately is going to be difficult - the request ID continues to be used as a key long after the request is considered "finished" at the API level. But, in this case, abort() won't do anything anyway.

And async scheduling is another case where the request ID continues to be used (but much more briefly) after the request is finished?

(Conclusion - the internal request ID is important for async scheduling and P/D. We will need a mapping to implement abort(), and we can use that mapping to reject duplicate request IDs)


markmc commented Nov 25, 2025

> [...] we can use that mapping to reject duplicate request IDs

If we do reject duplicate IDs at the AsyncLLM level, it's arguably better if the OpenAI frontend allows duplicate client-provided request IDs and suffixes them with some randomness, as this PR does.


markmc commented Nov 25, 2025

Wow, you were correct on the NIXL P/D case @njhill - D constructs the notification to send to P (see here) and so it is D's request_id that gets sent to P.

That's easy to miss when testing, because P will do:

DEBUG 11-25 12:06:15 [distributed/.../v1/nixl_connector.py:648] NIXLConnector request_finished(cmpl-98f9b902-43400e8d-81eb-451a-906e-770ec6c7a376-0) waiting for 480 seconds for remote decode to fetch blocks
ERROR 11-25 12:06:16 [distributed/.../v1/nixl_connector.py:1805] Potentially invalid KV blocks for unrecognized request cmpl-8ea01eec-43400e8d-81eb-451a-906e-770ec6c7a376-0 were retrieved by a decode worker. They may have expired.
...
WARNING 11-25 12:14:51 [distributed/.../v1/nixl_connector.py:1778] Releasing expired KV blocks for request cmpl-98f9b902-43400e8d-81eb-451a-906e-770ec6c7a376-0 which were retrieved by 0 decode worker(s) within 480 seconds.

i.e. P waited for a notification for P_req_id, immediately got a notification for D_req_id, and 8 minutes later timed out waiting for the P_req_id notification.

First thought ... wow, we should add a test to catch this!

Second thought ... it seems we would need P to include its internal request ID as e.g. kv_transfer_params["notif_id"], and for D to use that to send the notification.
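
Something along these lines, purely as a sketch (the function names are hypothetical; only the `notif_id` key comes from the suggestion above):

```python
def prefill_request_finished(internal_req_id: str, kv_transfer_params: dict) -> dict:
    # P advertises the notification ID it will wait on.
    kv_transfer_params["notif_id"] = internal_req_id
    return kv_transfer_params

def decode_notification_id(kv_transfer_params: dict, decode_req_id: str) -> str:
    # D echoes back P's ID if present, instead of sending its own request ID.
    return kv_transfer_params.get("notif_id", decode_req_id)
```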


njhill commented Nov 25, 2025

Thanks @markmc! Yes I feel this is more critical to fix now, given that the id collision can result in pernicious correctness issues and that the exposure for this is greater with async scheduling.

  • I think we agree that the key thing is to always use an internally generated id, and this should be done in the LLM/AsyncLLM add_request level.
  • But we want to still accept an (optional) external request id for correlation and have this visible in logs, etc.

To summarize the two options we discussed:

  1. Use internally generated id as the "primary" request id, plumb external req id through to appropriate places for logging etc. External id defaults to internal id if not provided.
  2. Keep a single internally-generated request id, but augment it with external req id if provided (as a string prefix or suffix). Similar to what's currently implemented in this PR but within the llm engine.

In both cases we can allow duplicate concurrent external ids, but could also choose to reject them.
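
For concreteness, the two options might look roughly like this (purely illustrative; names and format are assumptions):

```python
import uuid

external_id = "client-req-42"  # hypothetical client-provided ID
random_part = uuid.uuid4().hex[:8]

# Option 1: two separate ids; external defaults to internal if absent.
internal_id = random_part
log_id = external_id or internal_id

# Option 2: one id, with the external id folded in as a prefix.
request_id = f"{external_id}-{random_part}" if external_id else random_part
```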

I was initially favoring (1) and started hacking here, but now thinking that (2) might be better.

A possible downside of (2) is that the id that we include in various places (such as structured log fields, tracing span metadata, etc.) will be a superstring of the provided id, which might be inconvenient (and possibly a breaking change) from an observability tooling/correlation pov. The main advantage though is that it would be quicker/less complex/invasive to implement properly, and I think less fragile from an ongoing maintenance pov.

For aborting:

  • It's a non-issue for API server requests which are aborted by closing the http request (typical case), also non-issue for async generate() calls that are cancelled via standard asyncio mechanism.
  • For "out-of-band" aborts via the AsyncLLM/LLM abort() method we'll have to retain a mapping of external->internal req ids, and abort all internal ids corresponding to particular external id (I think this is fine, happy to elaborate on why)

As you covered above, some changes would also be needed for NIXL P/D and likely other kv connectors. I haven't thought through it completely yet but your suggestion of propagating the internal id via the kv_transfer_params sounds good to me.

> However, the prefill case is where maintaining this mapping accurately is going to be difficult - the request ID continues to be used as a key long after the request is considered "finished" at the API level. But, in this case, abort() won't do anything anyway.
>
> And async scheduling is another case where the request ID continues to be used (but much more briefly) after the request is finished?

I don't think either of these is a problem since, as you say, the request is considered finished at the API level; it's only the internal id that can live on temporarily.

> (Conclusion - the internal request ID is important for async scheduling and P/D. We will need a mapping to implement abort(), and we can use that mapping to reject duplicate request IDs)

Yes exactly, nicely summarized!

@njhill njhill mentioned this pull request Nov 25, 2025
markmc added a commit to markmc/vllm that referenced this pull request Nov 28, 2025
Include the internal request ID that the prefill instance is
expecting the decode instance to send it in the NIXL notification.

Right now, we rely on the proxy supplying the ID via X-Request-ID
and on prefill and decode mangling this ID in identical ways.
This is obviously quite brittle, and P should be explicit about what
ID it expects from D.

Relates to vllm-project#27987 - adding a random prefix to client-provided
request IDs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the random-request-id branch from 5b2a479 to 649ef01 Compare December 1, 2025 15:12
@markmc markmc changed the title [Frontend] Add a random prefix to client-provided request IDs [Core] Add a random suffix to frontend-provided request IDs Dec 1, 2025
@mergify mergify bot added the v1 label Dec 1, 2025

markmc commented Dec 1, 2025

> 2. Keep a single internally-generated request id, but augment it with external req id if provided (as a string prefix or suffix). Similar to what's currently implemented in this PR but within the llm engine.
>
> [...] now thinking that (2) might be better.

Ok, updated now to do this!

> As you covered above, some changes would also be needed for NIXL P/D and likely other kv connectors. I haven't thought through it completely yet but your suggestion of propagating the internal id via the kv_transfer_params sounds good to me.

See #29665

@markmc markmc force-pushed the random-request-id branch from 649ef01 to 7bfeb00 Compare December 1, 2025 15:54
# This is necessary because some requests may finish earlier than
# requests submitted before them.

# Extract the original request ID prefix for sorting.
markmc commented:

I considered adding external_request_id to RequestOutput instead of this hack ...

@njhill njhill left a comment

Thanks very much for this @markmc!

Some thoughts in addition to inline comments:

  • I was thinking in AsyncLLM we should have the add_request method return the internal request id as well as the output collector, and then use this in the aborts in the except blocks in generate()
  • Wondering whether we should set the request id in the outputs from the engine to be the external id (if provided). This would be more backwards compatible and makes more sense to me logically. It also means we wouldn't need the new sorting "hack" in llm.py

if arrival_time is None:
    arrival_time = time.time()

external_req_id = request_id
njhill commented:

Perhaps we can do this as a follow-on, but it might be good to make the external req id optional so in the default case we can then have a more compact internal id.

We could then also avoid generating multiple uuids per request in the api server case.

markmc replied:

Yeah, agree. I'm trying to limit the invasiveness of this PR though. Making request_id optional at the AsyncLLM level, then removing the UUID generation in the API server, and figuring out what to do about the per-endpoint prefix (e.g. "cmpl-") ... it's a relatively heavy lift that could slow down fixing the issue.

markmc and others added 4 commits December 2, 2025 08:55
@markmc markmc force-pushed the random-request-id branch from 064a690 to f1edb91 Compare December 2, 2025 14:19

markmc commented Dec 2, 2025

> Thanks very much for this @markmc!
>
> Some thoughts in addition to inline comments:
>
> • I was thinking in AsyncLLM we should have the add_request method return the internal request id as well as the output collector, and then use this in the aborts in the except blocks in generate()

Yeah, RequestOutputCollector has this ID, I've used it there

> • Wondering whether we should set the request id in the outputs from the engine to be the external id (if provided). This would be more backwards compatible and makes more sense to me logically. It also means we wouldn't need the new sorting "hack" in llm.py

ok, I'll take a look at that 👍

@markmc
Copy link
Member Author

markmc commented Dec 2, 2025

> • I was thinking in AsyncLLM we should have the add_request method return the internal request id as well as the output collector, and then use this in the aborts in the except blocks in generate()
>
> Yeah, RequestOutputCollector has this ID, I've used it there

Doh ... it only has it when it has output! So, that's not going to work. Will look for an alternative.
