[Core] Add a random suffix to frontend-provided request IDs #27987
Conversation
Code Review
This pull request addresses the important issue of request ID collisions when clients provide their own IDs. The solution of appending a random suffix to client-provided IDs is sound and effectively mitigates the risk of collisions, especially in scenarios with race conditions or concurrent requests using the same ID. The implementation is well-executed, centralizing the new logic in the _internal_request_id method within OpenAIServing. This change is consistently applied across all relevant OpenAI entrypoints, replacing the old _base_request_id logic. The new method correctly handles precedence for request IDs from headers versus request bodies. I've reviewed the changes and found no critical or high-severity issues. The code is clean, correct, and a clear improvement for the system's robustness.
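For illustration only, a minimal sketch of what suffixing a client-provided ID could look like. The method name `_internal_request_id` comes from the review above; the body, the precedence order shown, and the separator/suffix format are assumptions, not the PR's actual code:

```
import uuid
from typing import Optional


def _internal_request_id(
    header_request_id: Optional[str],
    body_request_id: Optional[str],
) -> str:
    """Sketch of building the request ID used internally by the engine.

    Assumption for illustration: the header-provided ID wins over a
    body-provided one, and a short random suffix keeps the internal ID
    unique even when clients reuse IDs.
    """
    external_id = header_request_id or body_request_id
    if external_id is None:
        # No client-provided ID: a full random ID is fine.
        return uuid.uuid4().hex
    # Client-provided ID: keep it for log correlation, but make it unique.
    return f"{external_id}-{uuid.uuid4().hex[:8]}"
```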
Rebased onto #29375
Discussed this with @njhill in the context of #27614 (enabling async scheduling by default) and a duplicate request ID issue it identified (#29355). This PR currently addresses the issue at the OpenAI server frontend level - it ensures that any request which comes through this will have a unique request ID, while continuing to include the client-provided request ID for log correlation. However, this is probably something we should guard against even at the `AsyncLLM` level.
Nick has a prototype and believes (3) isn't nearly as invasive as it sounds. One concern is what to do in the …
To avoid this … However, the prefill case is where maintaining this mapping accurately is going to be difficult - the request ID continues to be used as a key long after the request is considered "finished" at the API level. But, in this case, … And async scheduling is another case where the request ID continues to be used (but much more briefly) after the request is finished? (Conclusion - the internal request ID is important for async scheduling and P/D. We will need a mapping to implement abort().)
If we do reject duplicate IDs at the …
Wow, you were correct on the NIXL P/D case @njhill - D constructs the notification to send to P (see here) and so it is D's internal request ID that ends up in the notification. That's easy to miss when testing, because P will do: … i.e. P waited for a notification for its own internal request ID, which never arrives. First thought ... wow, we should add a test to catch this! Second thought ... it seems we would need P to include its internal request ID as e.g. a kv_transfer_params field
Thanks @markmc! Yes I feel this is more critical to fix now, given that the id collision can result in pernicious correctness issues and that the exposure for this is greater with async scheduling.
To summarize the two options we discussed:

1) Keep the client-provided ID as the external ID and generate a fully separate internal request ID, maintaining a mapping between the two.
2) Build the internal request ID by appending a random suffix to the client-provided ID, so the external ID remains a recoverable prefix of the internal one.

In both cases we can allow duplicate concurrent external ids, but could also choose to reject them. I was initially favoring (1) and started hacking here, but now thinking that (2) might be better. A possible downside of (2) is that the id that we include in various places (such as the structured log field, tracing span metadata, etc.) will be a superstring of the provided id, which might be inconvenient (and possibly a breaking change) from an observability tooling/correlation pov. The main advantage though is that it would be quicker/less complex/invasive to implement properly, and I think less fragile from an ongoing maintenance pov.

For aborting: we would need to map the supplied external id back to the internal id(s) of any matching in-flight requests and abort all of them.
As you covered above, some changes would also be needed for NIXL P/D and likely other kv connectors. I haven't thought through it completely yet but your suggestion of propagating the internal id via the kv_transfer_params sounds good to me.
I don't think either of these are problems since like you say the request is considered finished at the API level, it's only the internal id that can live on temporarily.
Yes exactly, nicely summarized!
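A minimal sketch of the kind of external-to-internal ID mapping discussed above for aborts; the class and method names are hypothetical, not vLLM's implementation:

```
from collections import defaultdict


class RequestIdMap:
    """Track which internal IDs belong to each client-provided (external) ID,
    so that abort(external_id) can cancel every matching in-flight request."""

    def __init__(self) -> None:
        self._external_to_internal: dict[str, set[str]] = defaultdict(set)

    def add(self, external_id: str, internal_id: str) -> None:
        self._external_to_internal[external_id].add(internal_id)

    def finish(self, external_id: str, internal_id: str) -> None:
        # Drop the mapping once a request completes normally.
        ids = self._external_to_internal.get(external_id)
        if ids is not None:
            ids.discard(internal_id)
            if not ids:
                del self._external_to_internal[external_id]

    def internal_ids_to_abort(self, external_id: str) -> set[str]:
        # Abort every request that was submitted with this external ID.
        return set(self._external_to_internal.get(external_id, ()))
```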
Include the internal request ID that the prefill instance is expecting the decode instance to send it in the NIXL notification.

Right now, we rely on the proxy supplying the ID via X-Request-ID and on prefill and decode mangling this ID in identical ways. This is obviously quite brittle, and P should be explicit about what ID it expects from D.

Relates to vllm-project#27987 - adding a random suffix to client-provided request IDs.

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
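Conceptually, the change boils down to P telling D exactly which ID to echo back, e.g. via `kv_transfer_params`. The key names and values below are illustrative assumptions, not the actual connector fields:

```
# Prefill side: return the ID it expects in the NIXL notification alongside
# the other KV transfer parameters (key names and values are illustrative).
kv_transfer_params = {
    "remote_block_ids": [3, 4, 5],                 # existing transfer metadata
    "prefill_request_id": "client-id-7bfeb00a",    # P's internal ID, suffix included
}

# Decode side: echo back exactly what prefill asked for, instead of
# re-deriving (and re-mangling) the ID from the X-Request-ID header.
notification_id = kv_transfer_params["prefill_request_id"]
```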
Ok, updated now to do this!
See #29665
```
# This is necessary because some requests may be finished earlier than
# its previous requests.

# Extract the original request ID prefix for sorting.
```
I considered adding external_request_id to RequestOutput instead of this hack ...
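For context, a sketch of what the sorting workaround amounts to, assuming (purely for illustration) that `LLM` assigns sequential numeric IDs and the internal ID has the form `<n>-<random suffix>`:

```
def original_index(internal_request_id: str) -> int:
    # Strip the random suffix and recover the numeric ID assigned at
    # submission time, so outputs can be returned in prompt order.
    prefix, _, _suffix = internal_request_id.rpartition("-")
    return int(prefix)


# e.g. outputs arriving out of order
ids = ["2-1f3a9c0d", "0-9b2e4d11", "1-77aa02fe"]
print(sorted(ids, key=original_index))
# ['0-9b2e4d11', '1-77aa02fe', '2-1f3a9c0d']
```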
njhill left a comment
Thanks very much for this @markmc!
Some thoughts in addition to inline comments:
- I was thinking in `AsyncLLM` we should have the `add_request` method return the internal request id as well as the output collector, and then use this in the aborts in the except blocks in `generate()` (see the sketch after this list)
- Wondering whether we should set the request id in the outputs from the engine to be the external id (if provided). This would be more backwards compatible and makes more sense to me logically. It also means we wouldn't need the new sorting "hack" in `llm.py`
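A rough sketch of the first suggestion; the class and signatures below are toy stand-ins for illustration, not `AsyncLLM`'s real API:

```
import asyncio
from collections.abc import AsyncIterator


class AsyncLLMSketch:
    """Toy stand-in for AsyncLLM, only to show the suggested control flow."""

    async def add_request(self, request_id: str, prompt: str) -> tuple[str, AsyncIterator[str]]:
        # Return the internal ID along with the output collector, so callers
        # can later abort the exact request they submitted.
        internal_id = f"{request_id}-deadbeef"  # suffix value is illustrative

        async def collector() -> AsyncIterator[str]:
            yield f"output for {internal_id}"

        return internal_id, collector()

    async def abort(self, internal_id: str) -> None:
        print(f"aborting {internal_id}")

    async def generate(self, prompt: str, request_id: str) -> AsyncIterator[str]:
        internal_id, output_collector = await self.add_request(request_id, prompt)
        try:
            async for output in output_collector:
                yield output
        except (asyncio.CancelledError, GeneratorExit):
            # Abort by the internal ID returned above, not the external ID.
            await self.abort(internal_id)
            raise
```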
```
if arrival_time is None:
    arrival_time = time.time()

external_req_id = request_id
```
Perhaps we can do this as a follow-on, but it might be good to make the external req id optional so in the default case we can then have a more compact internal id.
We could then also avoid generating multiple uuids per request in the api server case.
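A sketch of that follow-on idea, with a hypothetical helper (not the PR's code): generate exactly one UUID per request and skip the suffix entirely when no client ID was supplied:

```
import uuid
from typing import Optional


def make_internal_id(external_id: Optional[str] = None) -> str:
    # Hypothetical follow-on: one uuid per request, and a compact internal
    # ID when there is no client-provided ID to preserve for correlation.
    if external_id is None:
        return uuid.uuid4().hex
    return f"{external_id}-{uuid.uuid4().hex[:8]}"
```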
Yeah, agree. I'm trying to limit the invasiveness of this PR though. Making request_id optional at the AsyncLLM level, then removing the UUID generation in the API server, and figuring out what to do about the per-endpoint prefix (e.g. "cmpl-") ... it's a relatively heavy lift that could slow down fixing the issue.
Since vllm-project#9550 and vllm-project#10968 we support clients supplying a custom request ID. The motivation for this is that it can be very helpful when you need to correlate vLLM logs with the logs of a related service.

Since the request ID is used ubiquitously across vLLM as a unique key, it is obviously problematic if we ever have multiple in-flight requests using the same client-provided request ID. We saw this happening recently when `vllm bench serve` started including a request ID and the request IDs from multiple concurrent instances caused collisions. See vllm-project#27723

We currently try to guard against request ID collisions in the frontend, in OutputProcessor:

```
def add_request(...):
    if request_id in self.request_states:
        raise ValueError(f"Request id {request_id} already running.")
```

However, this is not always effective:

1) We can have abort race conditions where a request is no longer tracked by the frontend but is still not completed in the engine. See vllm-project#15326 for an attempt to fix this.
2) We can have async scheduling race conditions where a request ID is removed from the output processor and being scheduled while the older request with that ID is still being completed by the model runner. See vllm-project#29355
3) With P/D, a request will continue to be tracked by the prefill engine long after the prefill request has been completed in the frontend, while we wait for the decode side to fetch the KV blocks. See vllm-project#20139

Let's instead ensure we use a unique request ID internally, even when a client provides a custom request ID. We can do this simply by appending a short random suffix to any request ID provided by the frontend.

We need to ensure we track the external->internal request ID mapping because abort() will be supplied an external request ID. In the case where an external request ID maps to multiple running requests, we assume the caller requires all of those requests to be aborted. The caller can use EngineCoreRequest.request_id as the request ID if they want to be more specific.

A full 32 character random UUID would be overkill as a suffix, so how many random characters would be sufficient? 8 hex characters give us 32 bits of entropy, or 16^8 possible suffixes.

Using the collision probability approximation from https://preshing.com/20110504/hash-collision-probabilities: with N = 16^8 and k the number of generated suffixes, the probability of collision is roughly (k^2)/(2N). So if a client somehow caused vLLM to hold 10k requests that reuse the same client-provided ID, there would be about a 1.16% chance of collision:

```
>>> N = 16**8
>>> k = 10_000
>>> (k**2) / (2*N)
0.011641532182693481
```

That seems [super good enough](https://hownot2.com/products/hownot2-super-good-enough-t-shirt).

Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Nick Hill <nhill@redhat.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Signed-off-by: Mark McLoughlin <markmc@redhat.com>
Yeah, ok, I'll take a look at that 👍
Doh ... it only has it when it has output! So, that's not going to work. Will look for an alternative