Skip to content

Conversation

@Flechman
Copy link
Contributor

@Flechman Flechman commented Oct 2, 2025

Purpose

When using DP>1, the set_forward_context gathers the number of tokens across all DP ranks to get static batch size for all ranks for CudaGRAPH EP compatibility.
Even though CudaGRAPH is not yet implemented for the newly padded-speculative design, this DP-all-reduce/gather already serves for the target model and will serve for the draft model later.

When using EAGLE and DP>1 with num_speculative_tokens=k, and sending a single request, one DP rank will handle that request (going through propose()) and all other DP ranks will do a dummy forward. The dummy forward in the eagle module only performs a single forward-pass (so single DP all-reduce via set_forward_context), and we get k eagle forward-passes during propose(). This leads to a "phase shift" between the DP rank doing propose() (group A) and the DP ranks doing the dummy forward (group B): first step in group A will DP-all-reduce k times, and group B will DP-all-reduce a single time but doing k steps. If the finish-sync happens every k steps (in our case finish-sync happens every 32 steps), group B will enter a finish-sync all-reduce while group A is still doing the eagle DP-all-reduce because it's only entering step 2. This results in hanging.

Test Plan

Add an e2e correctness test with eagle and DP=2.

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Rémi Delacourt <remi@mistral.ai>
@Flechman Flechman changed the title Add eagle dp>1 EAGLE Support DP>1 Oct 2, 2025
@Flechman Flechman marked this pull request as ready for review October 2, 2025 11:14
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
@ekagra-ranjan
Copy link
Contributor

ekagra-ranjan commented Oct 2, 2025

Nice work! Can you also add a quality correctness test by using test_eagle_correctness() in DP > 1 setting? The added test just checks if the process hangs or not.

Signed-off-by: Rémi Delacourt <remi@mistral.ai>
@mergify mergify bot added the ci/build label Oct 3, 2025
@Flechman
Copy link
Contributor Author

Flechman commented Oct 3, 2025

Nice work! Can you also add a quality correctness test by using test_eagle_correctness() in DP > 1 setting? The added test just checks if the process hangs or not.

Thank you for pointing this out! I've just modified the test to compare outputs with an engine without eagle.

@Flechman
Copy link
Contributor Author

Flechman commented Oct 5, 2025

Hey @benchislett @ekagra-ranjan @luccafong, any updates on this ?

@mergify
Copy link

mergify bot commented Oct 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Flechman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 5, 2025
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
@mergify mergify bot removed the needs-rebase label Oct 5, 2025
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
@youkaichao
Copy link
Member

cc @benchislett @luccafong

Signed-off-by: Rémi Delacourt <remi@mistral.ai>
@mergify
Copy link

mergify bot commented Oct 10, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @Flechman.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 10, 2025
Signed-off-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
@mergify mergify bot removed the needs-rebase label Oct 13, 2025
Signed-off-by: Rémi Delacourt <remi@mistral.ai>
@Flechman
Copy link
Contributor Author

@benchislett this one is ready for review 🙏

Copy link
Collaborator

@benchislett benchislett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@benchislett benchislett added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 14, 2025
@benchislett
Copy link
Collaborator

@Flechman please fix the broken test.

Flechman and others added 2 commits November 1, 2025 11:08
Signed-off-by: remi <remi@mistral.ai>
Signed-off-by: Rémi Delacourt <54138269+Flechman@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build ready ONLY add when PR is ready to merge/full CI is needed speculative-decoding v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants