
Conversation

@realAsma (Contributor) commented Jan 30, 2026

What does this PR do?

Type of change: Bug fix / Improvement

Overview:

Fix MoE quantization calibration sync by removing the force-routing workaround and adding explicit validation for incomplete calibration.

Problem: During MoE calibration, some experts may not receive tokens (router doesn't select them). This causes amax=None on some ranks while others have valid values, leading to hangs or failures during distributed amax sync.

Previous workaround: Force all tokens through all experts during calibration. However, this caused the following error:

File "/opt/TensorRT-Model-Optimizer/modelopt/torch/quantization/plugins/transformer_engine.py", line 152, in te_grouped_quantized_linear_fn
    quantized_inputs = self.input_quantizer(inp)

File "/opt/TensorRT-Model-Optimizer/modelopt/torch/quantization/calib/max.py", line 69, in collect
    assert torch.all(local_amax >= 0), (

torch.AcceleratorError: CUDA error: an illegal memory access was encountered

This is probably because, when all tokens are forced through all experts, the inputs become garbage (possibly inf/nan), causing calibration to fail.

Solution:

  • Remove the _QuantMoELayer force-routing workaround.
  • Add validation before sync: detect whether some ranks have amax=None while others have values.
  • Raise a clear error ("MoE calibration incomplete: increase --calib-size") instead of hanging or crashing during sync. Under-calibration now surfaces as an actionable message; a sketch of this validation idea follows below.
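A minimal sketch of the validation idea, not the PR's actual _check_moe_calibration_complete implementation (the function name check_moe_calibration_complete, the group argument, and the exact message below are illustrative assumptions): each rank reports whether a quantizer has an amax, and a mismatch is turned into an actionable error before the distributed amax sync can hang or fail.

# Sketch only (assumed names): verify, across the participating ranks, that a
# quantizer is either calibrated everywhere or nowhere before its amax is synced.
import torch
import torch.distributed as dist


def check_moe_calibration_complete(quantizer, name: str, group=None):
    """Raise a clear error if only some ranks calibrated this quantizer."""
    # Assumes torch.distributed is already initialized, as it is during distributed PTQ.
    has_amax = torch.tensor(
        [1.0 if getattr(quantizer, "amax", None) is not None else 0.0],
        device="cuda",
    )
    calibrated_ranks = has_amax.clone()
    dist.all_reduce(calibrated_ranks, op=dist.ReduceOp.SUM, group=group)
    world_size = dist.get_world_size(group=group)
    # All ranks calibrated, or none calibrated: both are consistent states for the sync.
    if 0 < int(calibrated_ranks.item()) < world_size:
        raise RuntimeError(
            f"MoE calibration incomplete for {name}: some ranks never routed tokens "
            "to this expert. Increase --calib-size so every expert is calibrated."
        )

This fail-fast check is what replaces the force-routing workaround: an under-calibrated expert now produces a readable error rather than a hang or an illegal memory access.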

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes (Verified backward compatibility by loading a MoE model saved before this change)
  • Did you write any new necessary tests?: No - existing MoE tests cover the sync behavior
  • Did you add or update any necessary documentation?: Not needed, low-level change
  • Did you update Changelog?: Not needed, low-level change

Additional Information

Summary by CodeRabbit

  • Improvements

    • Enhanced Mixture of Experts (MoE) quantization with comprehensive calibration validation to ensure consistent synchronization across distributed experts.
  • Refactor

    • Streamlined MoE quantization architecture by consolidating internal handling mechanisms.


Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma realAsma requested a review from a team as a code owner January 30, 2026 00:06
@realAsma realAsma requested review from ChenhanYu, jenchen13 and mxinO and removed request for jenchen13 January 30, 2026 00:06
coderabbitai bot (Contributor) commented Jan 30, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

📝 Walkthrough

This PR refactors MoE calibration handling in the quantization module. It removes the specialized _QuantMoELayer class from the megatron plugin and introduces generalized MoE calibration validation helpers in the core calibration module. MOE amax synchronization is moved from the TP-specific synchronization path to a dedicated conditional sync location.

Changes

MoE Calibration Validation: modelopt/torch/quantization/model_calib.py
  Adds _is_moe_submodule() and _check_moe_calibration_complete() helpers to detect and validate MoE submodules. Integrates MoE amax synchronization via sync_moe_local_experts_amax() and performs a validation pass in max_calibrate. Removes MoE sync from the TP synchronization path.

MoE Plugin Removal: modelopt/torch/quantization/plugins/megatron.py
  Removes the megatron_moe_layer import, deletes the _QuantMoELayer class and its QuantModuleRegistry registration entry.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
  • Title check: ❓ Inconclusive. The PR title 'Fixes for Megatron Expert Parallel, GroupedMLP and SequentialMLP' is partially related to the actual changes but does not clearly reflect the main modifications, which involve MoE calibration validation restructuring and removal of specialized MoE layer quantization handling. Resolution: clarify the title to be more specific about the core changes; consider revising it to highlight the MoE calibration validation refactoring and the removal of the _QuantMoELayer class.

✅ Passed checks (2 passed)
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient; the required threshold is 80.00%.
  • Description Check: ✅ Passed. Check skipped - CodeRabbit's high-level summary is enabled.


# Sync amax across local experts within each rank (for SequentialMLP)
for name, module in model.named_modules():
    if hasattr(module, "sync_moe_local_experts_amax"):
        module.sync_moe_local_experts_amax()
Collaborator:

I see you removed the all-expert routing patch above. So this function will make sure those un-calibrated experts all have amax?

Contributor:

Since the amax is synced across experts, the incomplete calibration shouldn't happen, right?

Contributor:

And another thing: can the sync account for the amax of an unseen expert? The current logic seems to make weight_quantizer.amax the max over all seen experts.
And for GroupedMLP, we don't have this issue, right?

Contributor @jenchen13 commented Jan 30, 2026:

> Since the amax is synced across experts, the incomplete calibration shouldn't happen,

I'm not entirely sure this is true because the amax is only "synced" between experts in a local layer and we ran into deadlocks before when _QuantMoELayer.forward was not introduced

Collaborator:

I think it is best to know what sync_moe_local_experts_amax does and doesn't do.

  1. Does it create amax for input quantizers that didn't have amax before (i.e., were not calibrated)?
  2. It sets all quantizers that already have amax to the same max value.

It is doing 2) for sure. The question is whether this function is also doing 1). If it is not creating amax for quantizers that didn't have amax before, then we likely still have the problem.
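For reference, a hypothetical sketch of behavior 2) only, not the real SequentialMLP sync_moe_local_experts_amax (the linear_fc1/linear_fc2 submodule names and the input_quantizer/weight_quantizer attributes are borrowed from the model printout later in this thread and are assumptions about the structure): quantizers that already have an amax are set to their common max, and quantizers whose amax is still None are left untouched.

# Hypothetical illustration of behavior 2) under assumed module/attribute names.
import torch


def sync_local_experts_amax_sketch(local_experts):
    """Share the max amax among local experts that already have one."""
    for linear_name in ("linear_fc1", "linear_fc2"):
        for quant_name in ("input_quantizer", "weight_quantizer"):
            quantizers = [getattr(getattr(e, linear_name), quant_name) for e in local_experts]
            # Assumes per-tensor amax values of identical shape.
            amaxes = [q.amax for q in quantizers if q.amax is not None]
            if not amaxes:
                continue  # no expert on this rank was calibrated for this quantizer
            shared = torch.stack(amaxes).amax(dim=0)
            for q in quantizers:
                if q.amax is not None:       # behavior 2): only already-calibrated quantizers
                    q.amax = shared.clone()  # behavior 1) would also fill in the None cases

If the real method only does 2), an expert that never saw a token still ends up with amax=None, which is exactly the state the new _check_moe_calibration_complete turns into a clear error instead of a hang.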

Contributor:

> asking user to increase --calib-size.

Do we know if this will always succeed in producing non-null amax values for large MoEs or layers with lots of experts? The calibration time could be very high in these cases.

Contributor Author:

As long as at least one expert in the EP rank sees a token we should be good I think

Contributor Author:

The problem arises if EP distributed parallelism is very high, such as 128: then it is possible that some ranks have experts that receive no tokens at all.
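For context, a rough back-of-envelope for this scenario, assuming uniform routing (which real routers do not satisfy, so treat it as a sanity check rather than a guarantee): with $T$ routed calibration tokens (calibration samples times sequence length), top-$k$ routing, and expert-parallel size $P$, the experts on one rank collectively receive about $Tk/P$ tokens, so the probability that a rank sees no tokens at all is roughly $e^{-Tk/P}$. Even at $P = 128$ this is negligible for typical calibration sizes under uniform routing; heavily skewed routing is the case the new completeness check is meant to surface with a clear error.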

Contributor Author:

But I guess we never do that extreme EP PTQ

Contributor Author:

Wdyt?

Collaborator @ChenhanYu left a comment:

Approved; with some questions.


codecov bot commented Jan 30, 2026

Codecov Report

❌ Patch coverage is 52.17391% with 11 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.34%. Comparing base (81b67dd) to head (8880392).
⚠️ Report is 1 commits behind head on main.

Files with missing lines | Patch % | Lines
modelopt/torch/quantization/model_calib.py | 52.17% | 11 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #831      +/-   ##
==========================================
- Coverage   73.82%   73.34%   -0.49%     
==========================================
  Files         193      193              
  Lines       19745    19913     +168     
==========================================
+ Hits        14577    14605      +28     
- Misses       5168     5308     +140     


Contributor @jenchen13 left a comment:

needs more testing especially on nemotron workflows


@realAsma (Contributor Author) commented Jan 30, 2026

#831 (comment)

@jenchen13 This is why I added _check_moe_calibration_complete - it detects when some experts have amax and others don't, and raises a clear error before sync. We fail fast with an actionable message rather than deadlocking.

Tested backward compatibility: saved model before this PR, restored after.
Before this PR (checkpoint saved using main branch):


GPTModel(
  (embedding): LanguageModelEmbedding(
    (word_embeddings): VocabParallelEmbedding()
    (embedding_dropout): Dropout(p=0.1, inplace=False)
  )
  (rotary_pos_emb): RotaryEmbedding()
  (decoder): TransformerBlock(
    (layers): ModuleList(
      (0-3): 4 x TransformerLayer(
        (input_layernorm): FusedLayerNorm()
        (self_attention): SelfAttention(
          (core_attention): DotProductAttention(
            (scale_mask_softmax): FusedScaleMaskSoftmax()
            (attention_dropout): Dropout(p=0.1, inplace=False)
          )
          (linear_proj): QuantRowParallelLinear(in_features=256, out_features=256, bias=False, TP=1)
          (linear_qkv): QuantColumnParallelLinear(in_features=256, out_features=768, bias=False, TP=1)
          (q_layernorm): IdentityOp()
          (k_layernorm): IdentityOp()
        )
        (pre_cross_attn_layernorm): IdentityOp()
        (cross_attention): IdentityOp()
        (cross_attn_bda): IdentityFuncOp()
        (pre_mlp_layernorm): FusedLayerNorm()
        (mlp): QuantMoELayer(
          (router): TopKRouter()
          (experts): QuantSequentialMLP(
            (local_experts): ModuleList(
              (0-1): 2 x QuantMLP(
                (linear_fc1): QuantColumnParallelLinear(in_features=256, out_features=1024, bias=False, TP=1)
                (linear_fc2): QuantRowParallelLinear(in_features=1024, out_features=256, bias=False, TP=1)
              )
            )
          )
        )
      )
    )
    (final_layernorm): LayerNorm()
  )
  (output_layer): QuantColumnParallelLinear(in_features=256, out_features=256, bias=False, TP=1)
)

After this PR (same checkpoint restored with this PR's code):


GPTModel(
  (embedding): LanguageModelEmbedding(
    (word_embeddings): VocabParallelEmbedding()
    (embedding_dropout): Dropout(p=0.1, inplace=False)
  )
  (rotary_pos_emb): RotaryEmbedding()
  (decoder): TransformerBlock(
    (layers): ModuleList(
      (0-3): 4 x TransformerLayer(
        (input_layernorm): FusedLayerNorm()
        (self_attention): SelfAttention(
          (core_attention): DotProductAttention(
            (scale_mask_softmax): FusedScaleMaskSoftmax()
            (attention_dropout): Dropout(p=0.1, inplace=False)
          )
          (linear_proj): QuantRowParallelLinear(in_features=256, out_features=256, bias=False, TP=1)
          (linear_qkv): QuantColumnParallelLinear(in_features=256, out_features=768, bias=False, TP=1)
          (q_layernorm): IdentityOp()
          (k_layernorm): IdentityOp()
        )
        (pre_cross_attn_layernorm): IdentityOp()
        (cross_attention): IdentityOp()
        (cross_attn_bda): IdentityFuncOp()
        (pre_mlp_layernorm): FusedLayerNorm()
        (mlp): MoELayer(
          (router): TopKRouter()
          (experts): QuantSequentialMLP(
            (local_experts): ModuleList(
              (0-1): 2 x QuantMLP(
                (linear_fc1): QuantColumnParallelLinear(in_features=256, out_features=1024, bias=False, TP=1)
                (linear_fc2): QuantRowParallelLinear(in_features=1024, out_features=256, bias=False, TP=1)
              )
            )
          )
        )
      )
    )
    (final_layernorm): LayerNorm()
  )
  (output_layer): QuantColumnParallelLinear(in_features=256, out_features=256, bias=False, TP=1)
)

Only difference: QuantMoELayer → MoELayer. All quantized children (QuantSequentialMLP, QuantMLP, QuantColumnParallelLinear, QuantRowParallelLinear) are preserved. QuantMoELayer had no quantizers or modelopt state - it was just a no-op wrapper that only modified behavior during calibration. The actual quantization happens in the child layers.

Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma realAsma requested a review from jenchen13 January 30, 2026 19:33
@realAsma (Contributor Author):

@jenchen13 @ChenhanYu I have updated the PR description. Let me know if this makes sense.

Contributor @jenchen13 left a comment:

LGTM
