[Bugfix][Spec Decode] Fix DP hang when some ranks do dummy runs #26217
Purpose
When using `EagleProposer` with DP, vLLM hangs because the communication behavior is mismatched between DP ranks. The communications for non-tree proposing have the pattern below:
where `set_forward_context` eventually leads to an `all_reduce` among the DP group.
vllm/vllm/forward_context.py
Lines 108 to 114 in d3d649e
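To illustrate why a mismatch in collective calls hangs, here is a toy sketch, not vLLM code. It assumes each simulated forward pass enters exactly one collective, and it uses a `threading.Barrier` with a timeout as a stand-in for the `all_reduce` among DP ranks. The names `run_rank` and `simulate` and the per-rank forward counts are hypothetical and chosen only for illustration.

```python
import threading


def run_rank(barrier, num_forwards, results, rank):
    """Enter the stand-in collective once per simulated forward pass."""
    try:
        for _ in range(num_forwards):
            # Stand-in for the all_reduce reached through the forward context.
            # A real collective has no timeout here and would block forever.
            barrier.wait(timeout=1.0)
        results[rank] = "ok"
    except threading.BrokenBarrierError:
        # The peer never arrived: this is where a real rank would hang.
        results[rank] = "stuck"


def simulate(forwards_per_rank):
    """Run one thread per simulated DP rank and report how each one ended."""
    barrier = threading.Barrier(len(forwards_per_rank))
    results = {}
    threads = [
        threading.Thread(target=run_rank, args=(barrier, n, results, rank))
        for rank, n in enumerate(forwards_per_rank)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Rank 0 does a single forward while rank 1 does three forwards.
print(simulate([1, 3]))  # {0: 'ok', 1: 'stuck'}
# Both ranks do the same number of forwards.
print(simulate([3, 3]))  # {0: 'ok', 1: 'ok'}
```

In the mismatched case, rank 1 waits for a partner that has already left the loop. The sketch simplifies what happens afterwards: a real rank 0 would move on to its next collective rather than exit, but the ranks would still be out of step.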
Therefore, the root cause is that `EagleProposer` implements `dummy_run` as a single forward invocation, but the actual runner does it `self.num_speculative_tokens` times. We should simply align its behavior with the `propose` method here.
Since the end-to-end tree sampler is still WIP (#22752), we don't eagerly integrate its logic into `dummy_run` now. An assertion about it is added. It should be easy to align `dummy_run` with it similarly after it lands.
Test Plan
vllm serve \
    meta-llama/Meta-Llama-3-8B-Instruct \
    --host 0.0.0.0 \
    --port 7000 \
    --seed 42 \
    --disable-log-requests \
    --no-enable-prefix-caching \
    -dp 2 \
    --max-model-len 8192 \
    --max-num-seqs 64 \
    --gpu_memory_utilization 0.8 \
    --speculative-config '{"model":"yuhuili/EAGLE-LLaMA3-Instruct-8B","num_speculative_tokens":8,"max_model_len": 2048}'
Basic functionality
Correctness
Test Result
Basic functionality
No more hang.
Correctness
Reasonable results for LLaMA3 8B.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.