
Conversation

@markmc markmc commented Nov 3, 2025

Since #9550 and #10968 we support clients supplying a custom request ID. The motivation is that this can be very helpful when you need to correlate vLLM logs with the logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique key, it is obviously problematic if we ever have multiple in-flight requests using the same client-provided request ID.

We saw this happen recently when `vllm serve bench` started including a request ID and the request IDs from multiple concurrent instances collided. See #27723

We currently try to guard against request ID collisions in the frontend, in OutputProcessor:

    def add_request(...):
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")

however, this is not always effective:

  1. We can have abort race conditions where a request is no longer tracked by the frontend but is still not completed in the engine. See #15326 for an attempt to fix this.
  2. We can have async scheduling race conditions where a request ID is removed from the output processor and a new request with that ID is scheduled while the older request is still being completed by the model runner. See #29355
  3. With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has been completed in the frontend, while we wait for the decode side to fetch the KV blocks. See #20139

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by appending a short random suffix to any request ID provided by the frontend.
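
As a minimal sketch of the idea (the helper name and exact format here are illustrative, not the actual implementation):

```python
import uuid

def make_internal_request_id(external_request_id: str) -> str:
    # Append 8 random hex characters (32 bits of entropy) to the
    # client-provided ID so the internal key is effectively unique.
    return f"{external_request_id}-{uuid.uuid4().hex[:8]}"
```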

We need to ensure we track the external->internal request ID mapping because abort() will be supplied an external request ID. In the case where an external request ID maps to multiple running requests, we assume the caller requires all of those requests to be aborted. The caller can use EngineCoreRequest.request_id as the request ID if they want to be more specific.
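
Roughly, the external->internal mapping could look like this (an illustrative sketch, not the actual implementation):

```python
from collections import defaultdict

class RequestIdTracker:
    """Track which internal request IDs back each external request ID."""

    def __init__(self) -> None:
        self._internal: defaultdict[str, set[str]] = defaultdict(set)

    def add(self, external_id: str, internal_id: str) -> None:
        self._internal[external_id].add(internal_id)

    def abort_all(self, external_id: str) -> set[str]:
        # An external ID may map to several in-flight requests; the
        # caller gets back every internal ID so all of them can be aborted.
        return self._internal.pop(external_id, set())
```

abort() would then abort every ID returned by abort_all(), and the same map would make it trivial to detect (or reject) a duplicate external ID at submission time.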

A full 32-character random UUID would be overkill as a suffix, so how many random characters would be sufficient? 8 hex characters give us 32 bits of entropy, or 16^8 possible suffixes.

Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities:

With N = 16^8 possible suffixes and k generated suffixes, the probability of a collision is approximately (k^2)/(2N). So if a client somehow caused vLLM to hold 10k in-flight requests that all reuse the same client-provided ID, there would be a 1.16% chance of a collision:

```
>>> k, N = 10_000, 16**8
>>> (k**2) / (2 * N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses the important issue of request ID collisions when clients provide their own IDs. The solution of prepending a random prefix to client-provided IDs is sound and effectively mitigates the risk of collisions, especially in scenarios with race conditions or concurrent requests using the same ID. The implementation is well-executed, centralizing the new logic in the _internal_request_id method within OpenAIServing. This change is consistently applied across all relevant OpenAI entrypoints, replacing the old _base_request_id logic. The new method correctly handles precedence for request IDs from headers versus request bodies. I've reviewed the changes and found no critical or high-severity issues. The code is clean, correct, and a clear improvement for the system's robustness.

@markmc markmc added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 3, 2025
@markmc markmc requested a review from njhill November 17, 2025 07:54

markmc commented Nov 25, 2025

Rebased onto #29375


markmc commented Nov 25, 2025

Discussed this with @njhill in the context of #27614 (enabling async scheduling by default) and a duplicate request ID issue it identified (#29355).

This PR currently addresses the issue at the OpenAI server frontend level - it ensures that any request which comes through this path will have a unique request ID, while continuing to include the client-provided request ID for log correlation. However, this is probably something we should guard against even at the AsyncLLM and LLM API level, as follows:

  1. Every request has a unique request ID generated for it when it is received by e.g. generate() or add_request() - this would probably happen when we allocate the EngineCoreRequest in the Processor
  2. This "internal request ID" would be the request_id we use everywhere as a unique key
  3. The client-provided request ID would be retained and passed around so that it can be the one logged anywhere we log the request ID

Nick has a prototype and believes (3) isn't nearly as invasive as it sounds.

One concern is what to do in the abort() API - we will now be keying on an internal unique ID, and it may be the case that the request ID passed to abort() corresponds to many requests. We would have no choice but to abort all matching requests. And there isn't even a mechanism in the API to know what internal request ID was allocated and pass that instead. This is probably fine though - API users will learn to use unique request IDs.


markmc commented Nov 25, 2025

> One concern is what to do in the abort() API - we will now be keying on an internal unique ID, and it may be the case that the request ID passed to abort() corresponds to many requests. We would have no choice but to abort all matching requests. And there isn't even a mechanism in the API to know what internal request ID was allocated and pass that instead. This is probably fine though - API users will learn to use unique request IDs.

To avoid this abort() issue, we could choose to reject duplicate request IDs on submission, since we'll be maintaining a mapping of client-provided request ID to internal request ID. This sounds pretty circular though - if we can reject duplicate request IDs reliably, we don't need an internal request ID to disambiguate them.

However, the prefill case is where maintaining this mapping accurately is going to be difficult - the request ID continues to be used as a key long after the request is considered "finished" at the API level. But, in this case, abort() won't do anything anyway.

And async scheduling is another case where the request ID continues to be used (but much more briefly) after the request is finished?

(Conclusion - the internal request ID is important for async scheduling and P/D. We will need a mapping to implement abort(), and we can use that mapping to reject duplicate request IDs)


markmc commented Nov 25, 2025

> [...] we can use that mapping to reject duplicate request IDs

If we do reject duplicate IDs at the AsyncLLM level, it's arguably better if the OpenAI frontend allows duplicate client-provided request IDs and suffixes them with some randomness, as this PR does.


markmc commented Nov 25, 2025

Wow, you were correct on the NIXL P/D case @njhill - D constructs the notification to send to P (see here) and so it is D's request_id that gets sent to P.

That's easy to miss when testing, because P will do:

DEBUG 11-25 12:06:15 [distributed/.../v1/nixl_connector.py:648] NIXLConnector request_finished(cmpl-98f9b902-43400e8d-81eb-451a-906e-770ec6c7a376-0) waiting for 480 seconds for remote decode to fetch blocks
ERROR 11-25 12:06:16 [distributed/.../v1/nixl_connector.py:1805] Potentially invalid KV blocks for unrecognized request cmpl-8ea01eec-43400e8d-81eb-451a-906e-770ec6c7a376-0 were retrieved by a decode worker. They may have expired.
...
WARNING 11-25 12:14:51 [distributed/.../v1/nixl_connector.py:1778] Releasing expired KV blocks for request cmpl-98f9b902-43400e8d-81eb-451a-906e-770ec6c7a376-0 which were retrieved by 0 decode worker(s) within 480 seconds.

i.e. P waited for a notification for P_req_id, immediately got a notification for D_req_id, and 8 minutes later timed out waiting for the P_req_id notification.

First thought ... wow, we should add a test to catch this!

Second thought ... it seems we would need P to include its internal request ID as e.g. kv_transfer_params["notif_id"], and for D to use that to send the notification.
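
Something along these lines, purely as a sketch (the function names are hypothetical; only the `notif_id` key comes from the suggestion above):

```python
def prefill_request_finished(internal_req_id: str, kv_transfer_params: dict) -> dict:
    # P advertises the notification ID it will wait on.
    kv_transfer_params["notif_id"] = internal_req_id
    return kv_transfer_params

def decode_notification_id(kv_transfer_params: dict, decode_req_id: str) -> str:
    # D echoes back P's ID if present, instead of sending its own request ID.
    return kv_transfer_params.get("notif_id", decode_req_id)
```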


njhill commented Nov 25, 2025

Thanks @markmc! Yes I feel this is more critical to fix now, given that the id collision can result in pernicious correctness issues and that the exposure for this is greater with async scheduling.

  • I think we agree that the key thing is to always use an internally generated id, and this should be done in the LLM/AsyncLLM add_request level.
  • But we want to still accept an (optional) external request id for correlation and have this visible in logs, etc.

To summarize the two options we discussed:

  1. Use internally generated id as the "primary" request id, plumb external req id through to appropriate places for logging etc. External id defaults to internal id if not provided.
  2. Keep a single internally-generated request id, but augment it with external req id if provided (as a string prefix or suffix). Similar to what's currently implemented in this PR but within the llm engine.

In both cases we can allow duplicate concurrent external ids, but could also choose to reject them.
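
For concreteness, the two options might look roughly like this (purely illustrative; names and format are assumptions):

```python
import uuid

external_id = "client-req-42"  # hypothetical client-provided ID
random_part = uuid.uuid4().hex[:8]

# Option 1: two separate ids; external defaults to internal if absent.
internal_id = random_part
log_id = external_id or internal_id

# Option 2: one id, with the external id folded in as a prefix.
request_id = f"{external_id}-{random_part}" if external_id else random_part
```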

I was initially favoring (1) and started hacking here, but now thinking that (2) might be better.

A possible downside of (2) is that the id that we include in various places (such as structured log fields, tracing span metadata, etc.) will be a superstring of the provided id, which might be inconvenient (and possibly a breaking change) from an observability tooling/correlation pov. The main advantage though is that it would be quicker/less complex/invasive to implement properly, and I think less fragile from an ongoing maintenance pov.

For aborting:

  • It's a non-issue for API server requests which are aborted by closing the http request (typical case), also non-issue for async generate() calls that are cancelled via standard asyncio mechanism.
  • For "out-of-band" aborts via the AsyncLLM/LLM abort() method we'll have to retain a mapping of external->internal req ids, and abort all internal ids corresponding to particular external id (I think this is fine, happy to elaborate on why)

As you covered above, some changes would also be needed for NIXL P/D and likely other kv connectors. I haven't thought through it completely yet but your suggestion of propagating the internal id via the kv_transfer_params sounds good to me.

> However, the prefill case is where maintaining this mapping accurately is going to be difficult - the request ID continues to be used as a key long after the request is considered "finished" at the API level. But, in this case, abort() won't do anything anyway.
>
> And async scheduling is another case where the request ID continues to be used (but much more briefly) after the request is finished?

I don't think either of these is a problem since, as you say, the request is considered finished at the API level; it's only the internal id that can live on temporarily.

> (Conclusion - the internal request ID is important for async scheduling and P/D. We will need a mapping to implement abort(), and we can use that mapping to reject duplicate request IDs)

Yes exactly, nicely summarized!

@njhill njhill mentioned this pull request Nov 25, 2025
markmc added a commit to markmc/vllm that referenced this pull request Nov 28, 2025
Include the internal request ID that the prefill instance is
expecting the decode instance to send it in the NIXL notification.

Right now, we rely on the proxy supplying the ID via X-Request-ID
and on prefill and decode mangling this ID in identical ways.
This is obviously quite brittle, and P should be explicit about what
ID it expects from D.

Relates to vllm-project#27987 - adding a random prefix to client-provided
request IDs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
@markmc markmc force-pushed the random-request-id branch from 5b2a479 to 649ef01 Compare December 1, 2025 15:12
@markmc markmc changed the title [Frontend] Add a random prefix to client-provided request IDs [Core] Add a random suffix to frontend-provided request IDs Dec 1, 2025
@mergify mergify bot added the v1 label Dec 1, 2025

markmc commented Dec 1, 2025

> 2. Keep a single internally-generated request id, but augment it with external req id if provided (as a string prefix or suffix). Similar to what's currently implemented in this PR but within the llm engine.
>
> [...] now thinking that (2) might be better.

Ok, updated now to do this!

> As you covered above, some changes would also be needed for NIXL P/D and likely other kv connectors. I haven't thought through it completely yet but your suggestion of propagating the internal id via the kv_transfer_params sounds good to me.

See #29665

@markmc markmc force-pushed the random-request-id branch from 649ef01 to 7bfeb00 Compare December 1, 2025 15:54
# This is necessary because some requests may finish earlier than
# requests submitted before them.

# Extract the original request ID prefix for sorting.
markmc commented:

I considered adding external_request_id to RequestOutput instead of this hack ...

@njhill njhill left a comment

Thanks very much for this @markmc!

Some thoughts in addition to inline comments:

  • I was thinking in AsyncLLM we should have the add_request method return the internal request id as well as the output collector, and then use this in the aborts in the except blocks in generate()
  • Wondering whether we should set the request id in the outputs from the engine to be the external id (if provided). This would be more backwards compatible and makes more sense to me logically. It also means we wouldn't need the new sorting "hack" in llm.py

if arrival_time is None:
    arrival_time = time.time()

external_req_id = request_id
njhill commented:

Perhaps we can do this as a follow-on, but it might be good to make the external req id optional so in the default case we can then have a more compact internal id.

We could then also avoid generating multiple uuids per request in the api server case.

markmc replied:

Yeah, agree. I'm trying to limit the invasiveness of this PR though. Making request_id optional at the AsyncLLM level, then removing the UUID generation in the API server, and figuring out what to do about the per-endpoint prefix (e.g. "cmpl-") ... it's a relatively heavy lift that could slow down fixing the issue.

markmc and others added 4 commits December 2, 2025 08:55
@markmc markmc force-pushed the random-request-id branch from 064a690 to f1edb91 Compare December 2, 2025 14:19

markmc commented Dec 2, 2025

> Thanks very much for this @markmc!
>
> Some thoughts in addition to inline comments:
>
> • I was thinking in AsyncLLM we should have the add_request method return the internal request id as well as the output collector, and then use this in the aborts in the except blocks in generate()

Yeah, RequestOutputCollector has this ID, I've used it there

> • Wondering whether we should set the request id in the outputs from the engine to be the external id (if provided). This would be more backwards compatible and makes more sense to me logically. It also means we wouldn't need the new sorting "hack" in llm.py

ok, I'll take a look at that 👍

@markmc
Copy link
Member Author

markmc commented Dec 2, 2025

> • I was thinking in AsyncLLM we should have the add_request method return the internal request id as well as the output collector, and then use this in the aborts in the except blocks in generate()
>
> Yeah, RequestOutputCollector has this ID, I've used it there

Doh ... it only has it when it has output! So, that's not going to work. Will look for an alternative.
