Conversation

@jenchen13 jenchen13 commented Jan 29, 2026

What does this PR do?

Type of change: New feature and bug fix

Overview:

Support Latent MOE and Repeated MTP for NemotronH models

  • Enable latent MOE modules during Megatron import/export
  • Fix KV cache quantization export: remove the old qkv_layer.output_quantizer export and replace it with the proper k_bmm_quantizer/v_bmm_quantizer logic
  • Remove clamping of KV quantization scales to 1
  • Improve expert-parallel (EP) amax synchronization
  • Support repeated MTP import/export for NemotronH models

Usage

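A rough usage sketch (not taken from this PR; the model, calibration loop, quantization config, and paths are placeholders):

# Hedged sketch: quantize a Megatron-Core NemotronH model (latent MoE / repeated MTP
# modules included) and export it as a HuggingFace checkpoint. `model` and
# `calib_dataloader` are assumed to already exist; FP8 is just one example config.
import modelopt.torch.quantization as mtq
from modelopt.torch.export.unified_export_megatron import export_mcore_gpt_to_hf

def forward_loop(m):
    # Run a few calibration batches through the Megatron model.
    for batch in calib_dataloader:
        m(**batch)

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

export_mcore_gpt_to_hf(
    model,
    pretrained_model_name_or_path="nvidia/NemotronH-example",  # placeholder HF repo
    export_dir="/tmp/nemotronh_fp8_hf",
)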

Testing

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for grouped MLP and self-attention scaling operations in model export workflows
    • Enhanced model parallel training capabilities with improved component mapping
    • Expanded quantization configuration handling with dynamic module exclusion across distributed ranks
    • Improved support for additional transformer engine components
  • Refactor

    • Reorganized internal export and import logic for improved maintainability and specialist model architecture support


Signed-off-by: jenchen13 <jennifchen@nvidia.com>
@jenchen13 jenchen13 requested review from a team as code owners January 29, 2026 17:59

coderabbitai bot commented Jan 29, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


📝 Walkthrough


This PR extends Megatron Core model export and quantization with support for grouped MLP merging, self-attention scaling, and MTP (multi-token prediction) import/export. It refactors the importer into modular helper methods, improves KV cache scaling factor handling, adds HuggingFace quantization config support, enhances MOE expert synchronization, and integrates TELinear quantization.

Changes

Custom Module Mappings (modelopt/torch/export/plugins/mcore_custom.py, modelopt/torch/export/plugins/mcore_nemotron.py):
Introduces two new custom module mapping classes: GroupedMLPMerging and SelfAttentionScaling. Integrates them into Nemotron export/import mappings alongside latent MoE projections, repeated MTP module entries, and grouped local expert configurations.

Importer Refactoring with MTP Support (modelopt/torch/export/plugins/megatron_importer.py):
Replaces direct torch.distributed usage with dist alias. Adds is_mtp parameters to _name_remapping and _qkv_merging. Introduces _grouped_mlp_merging, _import_mamba_layer, and _import_transformer_layer helper methods. Expands MTP import logic to handle both repeated and non-repeated MTP layers with appropriate dispatch based on layer type.

KV Cache Scaling Utilities (modelopt/torch/export/quant_utils.py):
Refactors get_kv_cache_scaling_factor: renames parameter from kv_module to self_attention_module, changes return type from list to single torch.Tensor, removes FP8-specific warning/clamping logic. Introduces new _compute_kv_cache_dtype helper to consolidate dtype detection from num_bits list.

HuggingFace Export and State Dict Handling (modelopt/torch/export/unified_export_megatron.py):
Adds HF quant config generation with convert_hf_quant_config_format and dynamic exclude_modules aggregation. Introduces comprehensive _get_state_dict family of helpers (_get_transformer_layer_state_dict, _get_mtp_state_dict, _get_mamba_layer_state_dict, etc.). Implements _self_attention_scaling for KV cache scaling factor population. Changes moe_router_dtype parameter type from torch.dtype to str. Enhances save/export flow with per-prefix quantized state extraction.

MOE Expert Synchronization (modelopt/torch/quantization/model_calib.py):
Adds post-stats-collection synchronization pass calling sync_moe_local_experts_amax() on modules that expose it. Reorganizes step numbering in data-parallel amax synchronization path. Removes previous in-loop MOE amax sync from TP synchronization block.

TELinear Quantization Support (modelopt/torch/quantization/plugins/megatron.py):
Adds TELinear import from transformer_engine. Introduces new _QuantTEMCoreLinear quantization class registered via QuantModuleRegistry for TELinear mapping. Removes synchronization barrier in sync_moe_local_experts_amax.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, which is insufficient; the required threshold is 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title clearly summarizes the main changes: latent MOE support, repeated MTP support for NemotronH, and KV cache quantization export fixes.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modelopt/torch/export/quant_utils.py (1)

387-422: KV cache dtype now treats all FP4 as affine.
Line 396 computes is_affine, but _compute_kv_cache_dtype ignores it and defaults to True, so (2, 1) always maps to KV_CACHE_NVFP4_AFFINE even when bias is missing. This changes export semantics.

🛠️ Suggested fix
-    return _compute_kv_cache_dtype(num_bits_list)
+    return _compute_kv_cache_dtype(num_bits_list, is_affine)

-def _compute_kv_cache_dtype(num_bits_list: list[int]) -> str | None:
+def _compute_kv_cache_dtype(num_bits_list: list[int], is_affine: bool) -> str | None:
     """Returns the kv_cache dtype.

     If num_bits of output_quantizer is (4, 3) then returns FP8; if it is 8, returns int8,
     otherwise returns None.

     Args:
-        modules: The module or list of modules to inspect.
+        num_bits_list: List of num_bits from KV quantizers.
+        is_affine: Whether KV quantizers have bias/affine params.

     Returns:
         The kv_cache dtype.
     """
-    is_affine = True
-
     if (4, 3) in num_bits_list:
         return KV_CACHE_FP8
modelopt/torch/export/unified_export_megatron.py (1)

745-803: exclude_modules never populates due to qformat is None check.
get_quantization_format returns QUANTIZATION_NONE, not None, so Line 763 never triggers and the exclude list stays empty.

🛠️ Suggested fix
-        if qformat is None and "norm" not in prefix:  # Add exclude layers for hf_quant_config
+        if qformat == QUANTIZATION_NONE and "norm" not in prefix:  # Add exclude layers
             self.exclude_modules.append(prefix)
🤖 Fix all issues with AI agents
In `@modelopt/torch/export/plugins/mcore_nemotron.py`:
- Line 103: The inline comment "# TODO ADD MTP export" is stale because the
export mapping in this module now contains MTP entries; remove the TODO or
replace it with a precise note about any remaining work (e.g., which specific
MTP formats or tests are still missing) so readers aren’t misled—update the
comment near the export mapping and any related MTP export code references to
reflect current status.

In `@modelopt/torch/export/plugins/megatron_importer.py`:
- Around line 264-290: The grouped MLP loader (_grouped_mlp_merging) currently
skips any per-expert quantization scales (TODO weight_scale), so when importing
quantized grouped MLPs you must detect and copy the scale tensors into the
module state_dict before calling module.load_state_dict. Modify
_grouped_mlp_merging to (1) inspect the existing state_dict keys for the
quantizer scale key pattern used by TEGroupedMLP (e.g.,
"weight_quantizer._scale" or similar) and (2) for each expert_id call
self._get_safetensor with the matching safetensor key for that expert's scale
(e.g., use the same prefix.format(expert_id) + the scale suffix used in the
checkpoint) and assign that tensor into state_dict under the exact key name
expected by the module (mirroring how you already set
state_dict[f"weight{expert_id}"] = tensor for weights); finally preserve
dtype/shape checks and then call module.load_state_dict(state_dict) as before.
- Around line 693-774: The non-repeated MTP branch never increments or
initializes layer_id so every mtp in the for mtp in model.mtp.layers loop reuses
the same prefix; fix by initializing layer_id (e.g., layer_id = 0) before the
loop and incrementing layer_id at the end of each iteration of the for mtp in
model.mtp.layers loop so subsequent calls to self.rules["mtp.*"](...) use unique
layer indices.

In `@modelopt/torch/export/quant_utils.py`:
- Around line 348-367: The function get_kv_cache_scaling_factor currently
returns a list of two scalars but is annotated as torch.Tensor; update its type
hint to reflect the actual return type (for example List[Optional[torch.Tensor]]
or Tuple[Optional[torch.Tensor], Optional[torch.Tensor]]) and add necessary
typing imports (List, Optional, Tuple) if not present, and adjust the docstring
return description accordingly; keep the implementation that calls
get_scaling_factor on attributes k_bmm_quantizer and v_bmm_quantizer unchanged.

In `@modelopt/torch/export/unified_export_megatron.py`:
- Around line 1248-1255: The type annotation for moe_router_dtype is
inconsistent: export_mcore_gpt_to_hf declares it as str | None but
GPTModelExporter.__init__ and import_mcore_gpt_from_hf use torch.dtype | None;
change GPTModelExporter.__init__ signature to use str | None to match
export_mcore_gpt_to_hf and the implementation that expects string values like
"fp32"/"fp64", update import_mcore_gpt_from_hf to also use str | None for
moe_router_dtype, and add a brief docstring entry for moe_router_dtype in
export_mcore_gpt_to_hf describing allowed string values and meaning.
🧹 Nitpick comments (3)
modelopt/torch/quantization/plugins/megatron.py (1)

49-58: Make TELinear import optional to handle older Megatron-Core versions gracefully.

TELinear is available in Megatron-Core v0.9.0+ (Oct 2024), but the current import structure causes the entire TE try-except block to fail if any module is missing, disabling all TE quantization support. While TELinear is now stable, making it optional with a nested try-except is defensive programming that preserves TE support for environments using older Megatron-Core versions. Similarly, guard the _QuantTEMCoreLinear registration with an additional check for TELinear availability.

Suggested change
 try:
     from megatron.core.extensions.transformer_engine import (
         TEColumnParallelGroupedLinear,
         TEColumnParallelLinear,
         TEDotProductAttention,
         TELayerNormColumnParallelLinear,
-        TELinear,
         TERowParallelGroupedLinear,
         TERowParallelLinear,
     )
+    try:
+        from megatron.core.extensions.transformer_engine import TELinear
+    except ImportError:
+        TELinear = None
 
     from .transformer_engine import _QuantTEGroupedLinear, _QuantTELayerNormLinear, _QuantTELinear
 
     HAS_TE = True
 except ImportError:
     HAS_TE = False
+    TELinear = None
-if HAS_TE:
+if HAS_TE and TELinear is not None:
 
     @QuantModuleRegistry.register({TELinear: "te_mcore_Linear"})
     class _QuantTEMCoreLinear(_QuantTELinear):
         pass
modelopt/torch/export/unified_export_megatron.py (2)

113-158: Handle moe_router_dtype inputs consistently and avoid unconditional prints.
Line 113 still types torch.dtype, but the implementation only handles string inputs; callers passing torch.float32 now silently get None. Also, Line 129 prints unconditionally. Consider accepting both types and gating logs.

♻️ Suggested refactor
-        moe_router_dtype: torch.dtype | None = None,
+        moe_router_dtype: str | torch.dtype | None = None,
@@
-        self.moe_router_dtype = None
-        if moe_router_dtype == "fp32":
+        self.moe_router_dtype = None
+        if isinstance(moe_router_dtype, torch.dtype):
+            self.moe_router_dtype = moe_router_dtype
+        elif moe_router_dtype == "fp32":
             self.moe_router_dtype = torch.float32
         elif moe_router_dtype == "fp64":
             self.moe_router_dtype = torch.float64
-        print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
+        if torch.distributed.is_initialized():
+            if torch.distributed.get_rank() == 0:
+                print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")
+        else:
+            print(f"Exporting model with moe_router_dtype: {self.moe_router_dtype}")

835-870: Gate debug prints behind a logger/flag.
Lines 838–870 print per‑tensor dtype for every module; this is noisy and can slow large exports.

Comment on lines 264 to 289

    def _grouped_mlp_merging(
        self,
        module,
        prefix,
        parallel_config: ParallelConfig | None = None,
        is_mtp: bool = False,
        init_expert_id: int = 0,
        num_local_experts: int = 1,
    ):
        if is_mtp:
            if "backbone" in prefix:
                prefix = prefix.replace("backbone", "mtp")
            else:
                prefix = prefix.replace("model", "mtp")

        state_dict = module.state_dict()
        # TODO handle weight_scale
        # weight_scale = state_dict.get("weight_quantizer._scale", None)

        assert module.num_gemms == num_local_experts, (
            "num_gemms must be equal to num_local_experts in TEGroupedMLP"
        )
        for expert_id in range(init_expert_id, init_expert_id + num_local_experts):
            tensor = self._get_safetensor(prefix.format(expert_id) + ".weight")
            state_dict[f"weight{expert_id}"] = tensor

        module.load_state_dict(state_dict)

⚠️ Potential issue | 🟡 Minor

Grouped MLP import skips weight scales.
Line 280 notes a TODO for weight_scale handling; if grouped MLPs can be quantized, this will leave scales stale.

🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 264 - 290,
The grouped MLP loader (_grouped_mlp_merging) currently skips any per-expert
quantization scales (TODO weight_scale), so when importing quantized grouped
MLPs you must detect and copy the scale tensors into the module state_dict
before calling module.load_state_dict. Modify _grouped_mlp_merging to (1)
inspect the existing state_dict keys for the quantizer scale key pattern used by
TEGroupedMLP (e.g., "weight_quantizer._scale" or similar) and (2) for each
expert_id call self._get_safetensor with the matching safetensor key for that
expert's scale (e.g., use the same prefix.format(expert_id) + the scale suffix
used in the checkpoint) and assign that tensor into state_dict under the exact
key name expected by the module (mirroring how you already set
state_dict[f"weight{expert_id}"] = tensor for weights); finally preserve
dtype/shape checks and then call module.load_state_dict(state_dict) as before.
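
For illustration, a minimal sketch of how the prompt above could be realized (the per-expert scale key and checkpoint suffix are assumed names, not verified against TEGroupedMLP):

# Hypothetical sketch only: also copy per-expert weight scales when present.
# The "weight_quantizer.{i}._scale" key and ".weight_scale" suffix are assumptions.
for expert_id in range(init_expert_id, init_expert_id + num_local_experts):
    state_dict[f"weight{expert_id}"] = self._get_safetensor(prefix.format(expert_id) + ".weight")

    scale_key = f"weight_quantizer.{expert_id}._scale"
    if scale_key in state_dict:
        scale = self._get_safetensor(prefix.format(expert_id) + ".weight_scale")
        state_dict[scale_key] = scale.to(dtype=state_dict[scale_key].dtype)

module.load_state_dict(state_dict)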

Comment on lines 693 to 781

        # MTP
        if hasattr(model, "mtp"):
            print("Importing MTP", flush=True)
            # MTP is the last layer in DeepSeek V3/R1
            layer_id += 1
            if len(model.mtp.layers) == 1:  # Repeated MTP
                layer_id = 0  # reset layer_id for repeated MTP
                mtp = model.mtp.layers[0]
                self.rules["mtp.fc"](mtp.fc, layer_id)

                mtp_model_layers = mtp.mtp_model_layer.layers
                for mtp_model_layer in mtp_model_layers:
                    if isinstance(mtp_model_layer, MambaLayer):
                        self._import_mamba_layer(mtp_model_layer, layer_id, layer_pbar)
                    elif isinstance(mtp_model_layer, TransformerLayer):
                        self._import_transformer_layer(
                            mtp_model_layer, layer_id, layer_pbar, is_mtp=True
                        )
                    else:
                        raise ValueError(
                            f"Unsupported layer type during MTP import: {type(mtp_model_layer)}"
                        )

                    layer_id += 1
            else:  # non-repeated MTP
                for mtp in model.mtp.layers:
                    self.rules["mtp.eh_proj"](mtp.eh_proj, layer_id)
                    self.rules["mtp.enorm"](mtp.enorm, layer_id)
                    self.rules["mtp.hnorm"](mtp.hnorm, layer_id)
                    self.rules["mtp.input_layernorm"](
                        mtp.decoder.layers[0].input_layernorm, layer_id
                    )
                    if hasattr(mtp.decoder.layers[0].self_attention, "linear_q_proj"):
                        self.rules["mtp.linear_q_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_proj, layer_id
                        )
                    else:
                        self.rules["mtp.linear_q_down_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_down_proj, layer_id
                        )
                        self.rules["mtp.linear_q_layernorm"](
                            mtp.decoder.layers[0].self_attention.q_layernorm, layer_id
                        )
                        self.rules["mtp.linear_q_up_proj"](
                            mtp.decoder.layers[0].self_attention.linear_q_up_proj, layer_id
                        )
                    self.rules["mtp.linear_kv_down_proj"](
                        mtp.decoder.layers[0].self_attention.linear_kv_down_proj, layer_id
                    )
                    self.rules["mtp.linear_kv_layernorm"](
                        mtp.decoder.layers[0].self_attention.kv_layernorm, layer_id
                    )
                    self.rules["mtp.linear_kv_up_proj"](
                        mtp.decoder.layers[0].self_attention.linear_kv_up_proj, layer_id
                    )
                    self.rules["mtp.linear_proj"](
                        mtp.decoder.layers[0].self_attention.linear_proj, layer_id
                    )
                    self.rules["mtp.pre_mlp_layernorm"](
                        mtp.decoder.layers[0].pre_mlp_layernorm, layer_id
                    )
                    self.rules["mtp.router"](mtp.decoder.layers[0].mlp.router, layer_id)
                    self.rules["mtp.shared_experts.linear_fc1"](
                        mtp.decoder.layers[0].mlp.shared_experts.linear_fc1, layer_id
                    )
                    self.rules["mtp.shared_experts.linear_fc2"](
                        mtp.decoder.layers[0].mlp.shared_experts.linear_fc2, layer_id
                    )
                    for expert_id, expert in tqdm(
                        enumerate(mtp.decoder.layers[0].mlp.experts.local_experts),
                        desc="Importing MoE local experts",
                        leave=False,
                        disable=self.disable_tqdm,
                    ):
                        self.rules["mtp.local_experts.linear_fc1"](
                            expert.linear_fc1, layer_id, expert_id
                        )
                        self.rules["mtp.local_experts.linear_fc2"](
                            expert.linear_fc2, layer_id, expert_id
                        )

⚠️ Potential issue | 🟠 Major

Non‑repeated MTP import reuses the same layer_id for every layer.
layer_id is never updated in the non‑repeated branch, so each iteration overwrites the same prefix.

🛠️ Suggested fix
-            else:  # non-repeated MTP
-                for mtp in model.mtp.layers:
+            else:  # non-repeated MTP
+                for mtp_id, mtp in enumerate(model.mtp.layers):
+                    layer_id = mtp_id
                     self.rules["mtp.eh_proj"](mtp.eh_proj, layer_id)
                     self.rules["mtp.enorm"](mtp.enorm, layer_id)
                     self.rules["mtp.hnorm"](mtp.hnorm, layer_id)
🤖 Prompt for AI Agents
In `@modelopt/torch/export/plugins/megatron_importer.py` around lines 693 - 774,
The non-repeated MTP branch never increments or initializes layer_id so every
mtp in the for mtp in model.mtp.layers loop reuses the same prefix; fix by
initializing layer_id (e.g., layer_id = 0) before the loop and incrementing
layer_id at the end of each iteration of the for mtp in model.mtp.layers loop so
subsequent calls to self.rules["mtp.*"](...) use unique layer indices.

Comment on lines 1248 to 1249

 def export_mcore_gpt_to_hf(
     model: torch.nn.Module,
     pretrained_model_name_or_path: str | os.PathLike | None = None,
     export_extra_modules: bool = False,
     dtype: torch.dtype = torch.bfloat16,
     export_dir: Path | str = tempfile.gettempdir(),
-    moe_router_dtype: torch.dtype | None = None,
+    moe_router_dtype: str | None = None,
 ):

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify call sites and parameter usage for the signature change.
rg -n --type=py 'export_mcore_gpt_to_hf\('
rg -n --type=py 'GPTModelExporter\('

Repository: NVIDIA/Model-Optimizer

Length of output: 528


🏁 Script executed:

#!/bin/bash
# Examine the test call site
echo "=== Test call site (line 94) ==="
sed -n '85,105p' tests/gpu/torch/export/test_unified_export_megatron.py

# Examine the function and GPTModelExporter instantiation
echo "=== Function definition and GPTModelExporter instantiation ==="
sed -n '1248,1280p' modelopt/torch/export/unified_export_megatron.py

# Check if there's a docstring with parameter documentation
echo "=== Full function with docstring ==="
sed -n '1248,1300p' modelopt/torch/export/unified_export_megatron.py

Repository: NVIDIA/Model-Optimizer

Length of output: 4205


🏁 Script executed:

#!/bin/bash
# Find GPTModelExporter class definition
echo "=== GPTModelExporter class definition ==="
rg -n 'class GPTModelExporter' --type=py

# Get the class definition and __init__ method
echo "=== GPTModelExporter __init__ ==="
rg -A 20 'class GPTModelExporter' --type=py

Repository: NVIDIA/Model-Optimizer

Length of output: 2142


🏁 Script executed:

#!/bin/bash
# Get the full __init__ method of GPTModelExporter
sed -n '100,150p' modelopt/torch/export/unified_export_megatron.py

Repository: NVIDIA/Model-Optimizer

Length of output: 2574


Fix type annotation inconsistency for moe_router_dtype parameter.

The export_mcore_gpt_to_hf function signature correctly uses str | None, but GPTModelExporter.__init__ incorrectly declares it as torch.dtype | None. The implementation in GPTModelExporter (lines 127-130) expects string values like "fp32" and "fp64", not torch.dtype objects. Update GPTModelExporter.__init__ to use str | None to match the actual implementation and the public API.

Additionally, the moe_router_dtype parameter is missing from the docstring of export_mcore_gpt_to_hf, and the type signature for moe_router_dtype in the related import_mcore_gpt_from_hf function uses torch.dtype | None, creating inconsistency across the export/import interface.

🤖 Prompt for AI Agents
In `@modelopt/torch/export/unified_export_megatron.py` around lines 1248 - 1255,
The type annotation for moe_router_dtype is inconsistent: export_mcore_gpt_to_hf
declares it as str | None but GPTModelExporter.__init__ and
import_mcore_gpt_from_hf use torch.dtype | None; change
GPTModelExporter.__init__ signature to use str | None to match
export_mcore_gpt_to_hf and the implementation that expects string values like
"fp32"/"fp64", update import_mcore_gpt_from_hf to also use str | None for
moe_router_dtype, and add a brief docstring entry for moe_router_dtype in
export_mcore_gpt_to_hf describing allowed string values and meaning.

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
@jenchen13 jenchen13 force-pushed the jennifchen/nmh_latent_moe branch from a87e6a9 to e41783a on January 29, 2026 18:23

codecov bot commented Jan 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.38%. Comparing base (770962b) to head (9ad2d9f).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #830      +/-   ##
==========================================
- Coverage   73.82%   73.38%   -0.45%     
==========================================
  Files         193      193              
  Lines       19745    19894     +149     
==========================================
+ Hits        14577    14599      +22     
- Misses       5168     5295     +127     

☔ View full report in Codecov by Sentry.

# Step 1: Sync amax across local experts in a SequentialMLP
for name, module in model.named_modules():
    if hasattr(module, "sync_moe_local_experts_amax"):
        module.sync_moe_local_experts_amax()
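
For context, the per-module sync invoked here amounts to copying a shared amax across the local experts of a SequentialMLP; a hypothetical sketch follows (quantizer attribute names are assumptions, not the actual implementation):

import torch

# Hypothetical sketch of a local-expert amax sync; note there is no collective communication.
# Assumes each local expert exposes linear_fc1/linear_fc2 with a weight_quantizer.amax buffer.
def sync_moe_local_experts_amax(self):
    for proj in ("linear_fc1", "linear_fc2"):
        quantizers = [getattr(expert, proj).weight_quantizer for expert in self.local_experts]
        amaxes = [q.amax.detach() for q in quantizers if q.amax is not None]
        if not amaxes:
            continue
        shared_amax = torch.stack(amaxes).amax(dim=0)  # elementwise max across experts
        for q in quantizers:
            if q.amax is not None:
                q.amax.copy_(shared_amax)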
Collaborator

This function says it syncs across EP and ETP, but I didn't see any collective communication happen in the function. I think the description needs to be changed.

Contributor Author

I can change the name of the method. What it's actually doing is just looping through the local experts and assigning them the same amax.

Contributor

It is syncing local experts for a sequential MoE.

Contributor

The name does not say or indicate that it is a collective comm.

@jenchen13 jenchen13 changed the title Latent MOE & Repeated MTP support; fix KV cache quant export Latent MOE & Repeated MTP support for NemotronH; fix KV cache quant export Jan 29, 2026
@cjluo-nv
Collaborator

I have concerns about "Remove KV quantization scale clamping to 1" without accuracy data to support it. This also deviates from the HF PTQ code path. I think we need to have a discussion about the impact.


@realAsma realAsma left a comment


LGTM! I did not look at export code.

Signed-off-by: jenchen13 <jennifchen@nvidia.com>
