Fixes for Megatron Expert Parallel, GroupedMLP and SequentialMLP #831
base: main
Conversation
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Walkthrough: This PR refactors MoE calibration handling in the quantization module. It removes the specialized _QuantMoELayer class from the megatron plugin and introduces generalized MoE calibration validation helpers in the core calibration module. MoE amax synchronization is moved from the TP-specific synchronization path to a dedicated conditional sync location.
```python
# Sync amax across local experts within each rank (for SequentialMLP)
for name, module in model.named_modules():
    if hasattr(module, "sync_moe_local_experts_amax"):
        module.sync_moe_local_experts_amax()
```
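For context on the discussion below, here is a minimal sketch of what an amax sync across local experts could look like. The `local_experts` attribute, the quantizer discovery via an `amax` attribute, and the write-back semantics are assumptions for illustration, not the actual ModelOpt implementation of `sync_moe_local_experts_amax`.

```python
import torch

def sync_local_experts_amax_sketch(sequential_mlp):
    """Hypothetical sketch: share one amax per quantizer position across all
    local experts of a SequentialMLP-style module."""
    # Group quantizers by their name within each expert,
    # e.g. "linear_fc1.input_quantizer" (assumed layout).
    quantizer_groups = {}
    for expert in sequential_mlp.local_experts:  # assumed attribute name
        for name, module in expert.named_modules():
            if hasattr(module, "amax"):  # TensorQuantizer-like module (assumption)
                quantizer_groups.setdefault(name, []).append(module)

    for name, quantizers in quantizer_groups.items():
        amaxes = [q.amax for q in quantizers if q.amax is not None]
        if not amaxes:
            continue  # no expert in this group was calibrated at all
        shared = torch.stack([a.detach().float() for a in amaxes]).max(dim=0).values
        for q in quantizers:
            # In this sketch the shared value is also written to experts whose
            # amax was None, i.e. experts that never saw a calibration token.
            q.amax = shared.clone()
```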
I see you removed the all-expert routing patch above. So will sync_moe_local_experts_amax make sure those un-calibrated experts all have amax?
Since the amax is synced across experts, the incomplete calibration shouldn't happen, right?
And another thing: can the sync account for the amax of an unseen expert? With the current logic, it seems weight_quantizer.amax will just be the max over all seen experts.
And for GroupedMLP, we don't have this issue, right?
> Since the amax is synced across experts, the incomplete calibration shouldn't happen

I'm not entirely sure this is true, because the amax is only "synced" between experts within a local layer, and we ran into deadlocks before _QuantMoELayer.forward was introduced.
I think it is best to know what sync_moe_local_experts_amax does and doesn't do:
1. Does it create amax for input quantizers that didn't have amax before, i.e. ones that were never calibrated?
2. It sets all quantizers that already have amax to the same max value.
It is doing 2) for sure. The question is whether this function is also doing 1). If it is not creating amax for quantizers that didn't have one before, then we likely still have the problem.
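One hypothetical way to answer 1) empirically is to count quantizers whose amax is still unset before and after the sync (assuming ModelOpt-style quantizer modules expose an `amax` attribute that is None until calibrated):

```python
def quantizers_missing_amax(model):
    """Return the names of quantizer modules whose amax is still None."""
    return [
        name
        for name, module in model.named_modules()
        if hasattr(module, "amax") and module.amax is None
    ]

# Usage sketch: compare before and after the sync.
# before = quantizers_missing_amax(model)
# for _, m in model.named_modules():
#     if hasattr(m, "sync_moe_local_experts_amax"):
#         m.sync_moe_local_experts_amax()
# after = quantizers_missing_amax(model)
# print(f"missing before: {len(before)}, after: {len(after)}")
```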
> asking user to increase --calib-size

Do we know if this will always succeed in producing non-null amax values for large MoEs or layers with many experts? The calibration time could be very high in these cases.
As long as at least one expert on the EP rank sees a token, we should be good, I think.
The problem arises if the EP degree is very high, say 128: then it is possible that some ranks have experts that never receive any tokens.
But I guess we never do PTQ with such extreme EP.
Wdyt?
ChenhanYu left a comment:
Approved; with some questions.
Codecov Report
❌ Patch coverage is
Additional details and impacted files:
```
@@            Coverage Diff             @@
##             main     #831      +/-   ##
==========================================
- Coverage   73.82%   73.34%   -0.49%
==========================================
  Files         193      193
  Lines       19745    19913     +168
==========================================
+ Hits        14577    14605      +28
- Misses       5168     5308     +140
```
jenchen13 left a comment:
Needs more testing, especially on Nemotron workflows.
> I'm not entirely sure this is true because the amax is only "synced" between experts in a local layer and we ran into deadlocks before _QuantMoELayer.forward was introduced
@jenchen13 This is why I added _check_moe_calibration_complete: it detects when some experts have amax and others don't, and raises a clear error before the sync. We fail fast with an actionable message rather than deadlocking.

Tested backward compatibility: I saved a model before this PR and restored the same checkpoint with this PR's code. The only difference is QuantMoELayer → MoELayer. All quantized children (QuantSequentialMLP, QuantMLP, QuantColumnParallelLinear, QuantRowParallelLinear) are preserved. QuantMoELayer had no quantizers or modelopt state; it was just a no-op wrapper that only modified behavior during calibration. The actual quantization happens in the child layers.
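The actual body of _check_moe_calibration_complete isn't shown in this thread, but a sketch of the pre-sync validation described above could look like the following; the expert-name heuristic, the process-group argument, and the exact error text are assumptions:

```python
import torch
import torch.distributed as dist

def check_moe_calibration_complete_sketch(model, ep_group=None):
    """Fail fast if only some expert quantizers were calibrated (amax is None).

    Raising before any collective avoids the deadlock where calibrated ranks
    enter the amax all-reduce and uncalibrated ranks never do.
    """
    expert_quantizers = [
        m
        for name, m in model.named_modules()
        if "experts" in name and hasattr(m, "amax")  # assumed naming convention
    ]
    calibrated = [m.amax is not None for m in expert_quantizers]
    incomplete = any(calibrated) and not all(calibrated)

    # Agree on the verdict across the expert-parallel group so every rank raises
    # together instead of some ranks proceeding into the sync alone.
    if dist.is_available() and dist.is_initialized():
        device = "cuda" if torch.cuda.is_available() else "cpu"  # NCCL needs CUDA tensors
        flag = torch.tensor([int(incomplete)], device=device)
        dist.all_reduce(flag, op=dist.ReduceOp.MAX, group=ep_group)
        incomplete = bool(flag.item())

    if incomplete:
        raise RuntimeError(
            "MoE calibration incomplete: some experts never received calibration "
            "tokens. Increase --calib-size or use a more diverse calibration set."
        )
```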
@jenchen13 @ChenhanYu I have updated the PR description. Let me know if this makes sense.
jenchen13 left a comment:
LGTM
What does this PR do?
Type of change: Bug fix / Improvement
Overview:
Fix MoE quantization calibration sync by removing the force-routing workaround and adding explicit validation for incomplete calibration.
Problem: During MoE calibration, some experts may not receive any tokens (the router never selects them). This leaves amax=None on some ranks while others have valid values, leading to hangs or failures during distributed amax sync.
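To make the failure mode concrete, here is a schematic (not the actual sync code) of how a data-dependent guard around a collective can produce such a hang:

```python
import torch.distributed as dist

def broken_amax_sync(quantizer, group=None):
    # Schematic only: participation in the collective depends on local data.
    # A rank whose experts saw no calibration tokens has quantizer.amax is None,
    # skips the all_reduce, and every other rank in the group blocks forever.
    if quantizer.amax is not None:
        dist.all_reduce(quantizer.amax, op=dist.ReduceOp.MAX, group=group)
```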
Previous workaround: force all tokens through all experts during calibration. However, this was causing errors during calibration.
This is probably because, when all tokens are forced through all experts, the expert inputs become garbage (possibly inf/NaN), causing calibration to fail.
Solution:
- Remove the _QuantMoELayer force-routing workaround.
- Add validation before sync: detect whether some ranks have amax=None while others have values.
- Raise a clear error ("MoE calibration incomplete: increase --calib-size"). This is a cleaner solution: in the case of under-calibration we simply raise a clear, actionable error (see the sketch after this list).
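Put together, the intended flow might be wired roughly like this. This is a sketch following the PR description, not the actual module layout; `forward_fn`, the dataloader, and `check_moe_calibration_complete_sketch` (from the earlier sketch) are hypothetical stand-ins for the real calibration entry points.

```python
def calibrate_moe_model(model, calib_dataloader, forward_fn):
    # 1) Normal calibration forward passes: the router decides which experts see
    #    tokens, so some experts may legitimately end up with amax=None.
    for batch in calib_dataloader:
        forward_fn(model, batch)

    # 2) Validate before any sync so under-calibration fails fast with a clear
    #    error instead of hanging in a collective.
    check_moe_calibration_complete_sketch(model)

    # 3) Sync amax across local experts within each rank (SequentialMLP case),
    #    mirroring the hunk under review.
    for _, module in model.named_modules():
        if hasattr(module, "sync_moe_local_experts_amax"):
            module.sync_moe_local_experts_amax()
```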
Testing
Before your PR is "Ready for review"
Additional Information