[contrib] Add MiniMax-M2 (229B / ~10B active MoE, TP=64, EP=64) #138
Open
whn09 wants to merge 16 commits into aws-neuron:main
Conversation
MiniMax-M2 in NxDI contrib format. All code lives under
contrib/models/MiniMax-M2/, with zero changes to the upstream src/ tree.
Architecture: 229B total / ~10B active. 62 decoder layers, 256 MoE
experts (top-8), sigmoid routing with e_score_correction_bias, partial
RoPE (64/128 head dim), QK normalization (RMSNorm before reshape),
GQA 48Q/8KV heads, SwiGLU experts.
SDK 2.28 features used:
- Fused MoE NKI kernels (router_topk, moe_cte, moe_tkg)
- ModuleMarker wrappers for compiler optimization
- Fused QKV support
- Shard-on-intermediate padding for blockwise matmul
- RouterTopKWithBias preserving e_score_correction_bias for accuracy
Structure:
src/modeling_minimax_m2.py - full modeling code
src/configuration_minimax_m2.py - HF config class
src/config.json - bundled architecture snapshot
(used by unit tests w/o weights)
test/integration/test_model.py - config/import tests
perf_test/0_setup.sh - vllm-neuron install + weight fetch
perf_test/bench_minimax_m2.sh - vLLM serving benchmark
perf_test/vllm-neuron-patch.patch - hf_config + snapshot_download
plumbing in vllm-neuron
Import pattern: tests/examples add src/ to sys.path and import the
flat module name, matching upstream contrib/Qwen2-Audio-7B convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
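The flat-import convention above can be sketched as follows; the temp directory and `modeling_demo` module are stand-ins for the real `contrib/models/MiniMax-M2/src/` and `modeling_minimax_m2.py`:

```python
import sys
import tempfile
from pathlib import Path

# Stand-in for the contrib src/ directory: a directory containing a flat
# module, added to sys.path and imported by bare module name, exactly as
# the integration tests do with modeling_minimax_m2.py.
src_dir = Path(tempfile.mkdtemp())
(src_dir / "modeling_demo.py").write_text("MODEL_NAME = 'MiniMax-M2'\n")

sys.path.insert(0, str(src_dir))
import modeling_demo  # flat import, no package prefix

print(modeling_demo.MODEL_NAME)  # MiniMax-M2
```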
Same refactor as the MiMo-V2-Flash PR: the previous patch targeted a stale
fork branch and could not apply to upstream vllm-neuron. Replaced with a
30-line patch against upstream vllm-project/vllm-neuron release-0.5.0 that
adds a `_register_contrib_models()` hook to `vllm_neuron.register()`:
- If NXDI_CONTRIB_MINIMAX_M2_SRC is set, import NeuronMiniMaxM2ForCausalLM
from that directory.
- Register it into NxDI's MODEL_TYPES under key "minimaxm2" (matching the
lowercased architecture that vllm-neuron's _get_neuron_model_cls
computes from "MiniMaxM2ForCausalLM").
No vLLM-side registration is needed: vLLM's ModelRegistry already
recognizes MiniMaxM2ForCausalLM as a supported architecture in 0.16+.
Updated accordingly:
- perf_test/0_setup.sh now clones release-0.5.0 and `git apply`s the patch.
- perf_test/bench_minimax_m2.sh exports NXDI_CONTRIB_MINIMAX_M2_SRC
defaulting to this package's own src/.
- README serving instructions document the new env var.
Verified on trn2.48xlarge (NxDI 2.29, vLLM 0.16, vllm-neuron 0.5.0):
MODEL_TYPES["minimaxm2"]["causal-lm"] lookup -> NeuronMiniMaxM2ForCausalLM
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous version hooked into vllm_neuron.register(), which only runs
in the parent APIServer process. vLLM V1 spawns EngineCore workers via
multiprocessing.spawn and those child processes start a fresh Python
interpreter; vLLM's plugin discovery does run there, but the
module-level state (in particular the NxDI MODEL_TYPES dict) is a fresh
copy so the parent's registration does not carry over.
Move _register_contrib_models() into the loader itself and call it at
the top of _get_neuron_model_cls(). Every process that tries to look up
an architecture now gets a fresh idempotent registration attempt driven
by NXDI_CONTRIB_MIMO_V2_FLASH_SRC / NXDI_CONTRIB_MINIMAX_M2_SRC.
Also correct the MODEL_TYPES key: release-0.5.0's loader does not have
the mimov2flash->mimo_v2_flash rewrite, so we must register under
"mimov2flash" (matches architecture.lower()) and "minimaxm2".
Verified on trn2.48xlarge:
_get_neuron_model_cls("MiMoV2FlashForCausalLM") -> NeuronMiMoV2ForCausalLM
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
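A minimal sketch of the env-gated, idempotent registration shape described in the commits above. The `MODEL_TYPES` dict and the stub module here stand in for NxDI's real registry and modeling file; only the hook's structure (env-var gate, sys.path insert, lowercased key, safe re-entry) follows the commit:

```python
import importlib
import os
import sys
import tempfile
from pathlib import Path

MODEL_TYPES = {}  # stand-in for NxDI's MODEL_TYPES registry

def _register_contrib_models():
    """Idempotent, env-driven hook: safe to call at the top of every
    model-class lookup, so each spawned worker registers for itself."""
    src = os.environ.get("NXDI_CONTRIB_MINIMAX_M2_SRC")
    if not src or "minimaxm2" in MODEL_TYPES:
        return
    if src not in sys.path:
        sys.path.insert(0, src)
    mod = importlib.import_module("modeling_minimax_m2")
    # Key matches the lowercased architecture name that vllm-neuron's
    # _get_neuron_model_cls computes from "MiniMaxM2ForCausalLM".
    MODEL_TYPES["minimaxm2"] = {"causal-lm": mod.NeuronMiniMaxM2ForCausalLM}

# Demo with a stub module standing in for the real modeling file.
src = Path(tempfile.mkdtemp())
(src / "modeling_minimax_m2.py").write_text(
    "class NeuronMiniMaxM2ForCausalLM:\n    pass\n")
os.environ["NXDI_CONTRIB_MINIMAX_M2_SRC"] = str(src)
_register_contrib_models()
_register_contrib_models()  # second call is a no-op
print(MODEL_TYPES["minimaxm2"]["causal-lm"].__name__)
```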
Same fix as on the MiMo-V2-Flash branch. NxDI's hf_adapter.load_config calls AutoConfig.from_pretrained(path) without trust_remote_code=True, which crashes on checkpoints that ship a configuration_*.py (both MiMo-V2-Flash and MiniMax-M2 ship one). Add a lazy _patch_autoconfig_trust_remote_code() inside the loader patch so every process hitting _get_neuron_model_cls installs the shim. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same workaround as the MiMo-V2-Flash bench. Neuron SDK 2.29's neuronxcc does not export blockwise_mm_baseline_shard_hidden from any of the blockwise_mm module paths that neuronx_distributed 0.17 searches, so the MoE blockwise forward reaches _call_shard_hidden_kernel and raises NotImplementedError. Setting use_torch_block_wise=true routes through the torch reference implementation. Slower than the NKI path but unblocks end-to-end vLLM benchmarking. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same fix as on the MiMo-V2-Flash branch. First-time compilation of a 256-expert MoE model takes 30-90 minutes; the previous 600s timeout aborts the benchmark while the background compile is still running. Bump to 7200s (2h) and emit a progress blip every minute. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
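The bump-and-heartbeat behavior can be sketched in Python (the actual bench driver is a shell script; the probe callable and the short timings in the second demo call are illustrative, while the 7200/60 defaults are the values from the commit):

```python
import time

def wait_with_heartbeat(is_ready, timeout_s=7200, heartbeat_s=60):
    """Poll is_ready() until it returns True or timeout_s elapses,
    printing a progress blip every heartbeat_s seconds."""
    start = time.monotonic()
    last_blip = start
    while True:
        if is_ready():
            return True
        now = time.monotonic()
        if now - start >= timeout_s:
            return False
        if now - last_blip >= heartbeat_s:
            print(f"still compiling... {int(now - start)}s elapsed")
            last_blip = now
        time.sleep(min(1.0, heartbeat_s / 10))

print(wait_with_heartbeat(lambda: True))                # ready immediately
print(wait_with_heartbeat(lambda: False, 0.05, 0.02))   # deadline hit
```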
Mirrors the same two helpers just added on the MiMo-V2-Flash branch.
sanity_check.sh
- One-shot chat completion against localhost:$PORT.
- Prints the JSON response plus a one-line summary of the reply.
- Health-checks /health first and exits fast if the server isn't up.
run_bench_single.sh
- One 'vllm bench serve' pass with configurable
CONCURRENCY / NUM_PROMPTS / INPUT_LEN / OUTPUT_LEN.
- Does NOT launch or kill the server; you bring your own.
- Writes the transcript to $RESULTS_DIR/${CONFIG_NAME}_c${CONCURRENCY}.txt.
Useful when bench_minimax_m2.sh's driver times out during first-compile
but the server is still up and reachable.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Port of the Flash preprocess design to MiniMax-M2 / M2.7. The HF
checkpoint ships as 130 sharded safetensors with weights of one
decoder layer scattered across several shards; this script keeps one
safe_open handle live at a time and emits one output file per layer
(plus model_extras.safetensors for embed/norm/lm_head). Peak RAM ~15 GB,
total runtime ~20 minutes on trn2.48xlarge.
Rescaling:
- Attention q/k/v/o: OCP FP8 blockwise -> Neuron FP8 per-row
(PER_CHANNEL_SYMMETRIC at inference time; per-rank out-dim < 128
would otherwise collapse the blockwise scale to a singleton).
- MoE expert w1/w3 fused into packed gate_up_proj
[num_experts, H, 2*IM]; w2 stacked into down_proj [num_experts, IM, H].
Scales stay blockwise (128x128), transposed in lockstep with the
weight so the modeling code can load them directly.
- Router gate.weight and e_score_correction_bias renamed into the
NxDI router namespace (block_sparse_moe.router.linear_router.weight
and block_sparse_moe.router.e_score_correction_bias).
- All norms (input_layernorm, post_attention_layernorm, q_norm,
k_norm, model.norm) and embed_tokens / lm_head passed through BF16.
Unlike Flash, M2 has uniform GQA (head_dim=128 for Q/K/V) and no
attention_sink_bias, so the preprocess is simpler — no asymmetric-V
handling, no sink bias routing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
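The one-file-per-layer bucketing at the heart of the streaming preprocess can be sketched as follows. The helper is illustrative: it shows only the key grouping, not the `safe_open` I/O or any of the rescaling:

```python
import re
from collections import defaultdict

def group_keys_by_layer(keys):
    """Bucket checkpoint tensor names by decoder layer so the preprocess
    can emit one output file per layer; everything else (embed, final
    norm, lm_head) goes into the model_extras bucket."""
    buckets = defaultdict(list)
    for key in keys:
        m = re.match(r"model\.layers\.(\d+)\.", key)
        bucket = f"layer_{int(m.group(1))}" if m else "model_extras"
        buckets[bucket].append(key)
    return dict(buckets)

keys = [
    "model.embed_tokens.weight",
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.block_sparse_moe.gate.weight",
    "model.layers.61.post_attention_layernorm.weight",
    "model.norm.weight",
    "lm_head.weight",
]
groups = group_keys_by_layer(keys)
print(sorted(groups))  # ['layer_0', 'layer_61', 'model_extras']
```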
Wire the runtime pieces needed to run the preprocessed Neuron-FP8
checkpoint (via the new preprocess_minimax_m2_fp8.py streaming
preprocess). All additions are gated on neuron_config.quantized so the
existing BF16 path is untouched.
Changes:
- convert_minimax_m2_hf_to_neuron_state_dict:
* Skip maybe_dequantize_layer when quantized=True (preserving FP8
bytes + .scale tensors for NxDI's QuantizedColumnParallel layers).
* Guard router rename against re-pop'ing keys that the preprocess
already renamed under the NxDI router namespace.
* Skip expert stacking when the preprocessed checkpoint already has
the fused layout (block_sparse_moe.expert_mlps.mlp_op.gate_up_proj.weight).
* Rename Q/K/V .scale tensors into the nested qkv_proj namespace in
lockstep with the existing .weight rename.
* Expand MoE blockwise gate_up / down .scale along the TP-partitioned
dim so per_partition_size == 1 after NxDI sharding (ported from the
Flash fix; preserves the gate/up boundary by expanding each half
independently).
- NeuronMiniMaxM2ForCausalLM:
* Install three FP8 monkey-patches up-front in __init__ (plus again
in compile() / load() as a belt-and-braces) — all gated on
quantized=True and idempotent so the BF16 path is not affected:
- _apply_ep_scale_fix: don't EP-shard [1,1,W] singleton scales.
- _apply_blockwise_scale_stride_fix: force partition_stride=1 for
BLOCKWISE_SYMMETRIC when per-rank weight is below the 128-row
scale block.
- _apply_2d_per_channel_fix: flip q_config from
BLOCKWISE_SYMMETRIC to PER_CHANNEL_SYMMETRIC for 2D attention
weights (they use per-row scales, not block scales).
* save_quantized_state_dict override: skip HF's re-quantize path
(which requires CUDA for the finegrained_fp8 quantizer and
materializes a ~600 GB BF16 copy) when the Neuron-FP8 index is
already on disk.
Notes vs Flash:
- M2 doesn't need a router bias monkey-patch — the existing
RouterTopKWithBias in this file already initializes
e_score_correction_bias as torch.arange(num_experts, dtype=bfloat16),
which is the same XLA-constant-fold-surviving init Flash ultimately
landed on.
- M2 doesn't need attention_value_scale (no such field in the config).
- M2 doesn't need asymmetric V-head handling (Q=K=V head_dim=128).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
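The expand-each-half-independently idea behind the gate_up scale fix can be illustrated with a simplified numpy sketch. This shows only the halving-and-repeat structure; the exact expansion factor in NxDI depends on the sharding layout, and the shapes and helper name here are illustrative:

```python
import numpy as np

def expand_gate_up_scale(scale, tp_degree):
    """Expand a fused gate_up blockwise scale along its last
    (TP-partitioned) dim, repeating the gate half and the up half
    independently so the gate/up boundary is preserved after the
    expansion. Simplified sketch of the fix described above."""
    half = scale.shape[-1] // 2
    gate, up = scale[..., :half], scale[..., half:]
    reps = tp_degree // half  # assume tp_degree is a multiple of half
    gate = np.repeat(gate, reps, axis=-1)
    up = np.repeat(up, reps, axis=-1)
    return np.concatenate([gate, up], axis=-1)

scale = np.array([[1., 2., 10., 20.]])  # 2 gate blocks, 2 up blocks
out = expand_gate_up_scale(scale, tp_degree=4)
print(out.tolist())  # [[1.0, 1.0, 2.0, 2.0, 10.0, 10.0, 20.0, 20.0]]
```

Note how the gate values (1, 2) and up values (10, 20) stay in their own halves after expansion, which is the property the commit calls out.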
Port of the Flash smoke scripts, adapted to M2:
- import modeling_minimax_m2 / MiniMaxM2InferenceConfig / NeuronMiniMaxM2ForCausalLM
- default paths under /opt/dlami/nvme/models/MiniMax-M2.7-Neuron-FP8 /
/opt/dlami/nvme/compiled/minimax_m2_tp64_moetp1_ep64_fp8/
- modules_to_not_convert drops "o_proj" (unlike Flash, M2's o_proj is
part of the native FP8 quantization — HF's ignored_layers only lists
gate, e_score_correction_bias, lm_head).
Defaults match the Flash recipe: MOE_TP=1, MOE_EP=64 (the only moe_tp/ep
ratio that keeps per-rank expert intermediate above the 128-row
blockwise scale block on a 64-NC Trn2). BASE_COMPILE_WORK_DIR is pinned
to a per-COMPILED_PATH subdir so multiple compiles can run in parallel
without overwriting each other's HLO staging.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
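One way to pin HLO staging to a per-COMPILED_PATH subdir, as described above, is to key the work dir on a hash of the compiled path. This helper and its naming scheme are illustrative, not the script's actual implementation:

```python
import hashlib
from pathlib import Path

def compile_work_dir(base, compiled_path):
    """Derive a per-COMPILED_PATH staging subdir so parallel compiles
    don't overwrite each other's HLO staging (illustrative scheme)."""
    tag = hashlib.sha1(str(compiled_path).encode()).hexdigest()[:8]
    return Path(base) / tag

print(compile_work_dir(
    "/tmp/compile_work",
    "/opt/dlami/nvme/compiled/minimax_m2_tp64_moetp1_ep64_fp8"))
```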
Two small bugs that showed up on first run:
- The compile-time config summary borrowed Flash's attribute names (n_routed_experts, layer_uses_moe, layer_attention_types), which don't exist on MiniMaxM2InferenceConfig. Switch to the M2 field names (num_local_experts, num_key_value_heads, attn_type_list).
- The M2 tokenizer doesn't ship a pad_token, so tokenizing [PROMPT] * BATCH_SIZE with padding=True raises in the adapter. Fall back to eos_token when pad_token is None.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
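The pad_token fallback amounts to a one-line guard; the stub class here just stands in for the real tokenizer:

```python
class TokStub:
    """Stand-in for a tokenizer that ships no pad_token, as the M2
    tokenizer does; only the two attributes used here are modeled."""
    pad_token = None
    eos_token = "</s>"

tok = TokStub()
# padding=True needs a pad token, so borrow eos_token when it is absent.
if tok.pad_token is None:
    tok.pad_token = tok.eos_token
print(tok.pad_token)  # </s>
```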
The previous bench defaulted to a BF16 checkpoint and carried MoE-kernel workarounds (use_torch_block_wise=true) that are specific to the BF16 path. Switch to the FP8 recipe that actually produces fluent output on Trn2:
- MODEL_PATH defaults to /opt/dlami/nvme/models/MiniMax-M2.7-Neuron-FP8 (the preprocess_minimax_m2_fp8.py output dir).
- COMMON_MINIMAX_CONFIG carries all FP8 quantization fields inline so every config inherits them (quantized=true, blockwise_symmetric, 128x128 blocks, moe_mask_padded_tokens=true). Unlike Flash, M2's o_proj IS FP8 — do not add it to modules_to_not_convert.
- Config 1: BS=32, moe_tp=1, moe_ep=64 (smallest BS the FP8 + EP path supports: NxDI refuses EP>1 under num_experts/top_k = 32).
- Config 2: BS=128, moe_tp=1, moe_ep=64 (throughput-leaning).
- Drops sequence_parallel_enabled=true from COMMON and switches the blockwise matmul kernel to use_shard_on_block_dynamic_while (the FP8 path uses the native NKI kernel instead of the torch fallback).
0_setup.sh:
- Clone vllm-neuron into $HOME instead of /tmp so the install survives AMI /tmp wipes.
- Step [2/2] now fetches the FP8 HF checkpoint from HuggingFace (no more private S3 BF16 dependency) and prints the preprocess command the user should run next, instead of assuming they already have a BF16 directory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rewrite to match the FP8-first posture the branch now supports:
- Target M2.7 by default (validated checkpoint) while noting M2 / any minor MiniMaxM2ForCausalLM variant shares the same config schema and the same preprocess/inference recipe.
- Add a 6-step Quick Start that walks from a fresh trn2.48xlarge to a working vLLM server — download, preprocess, smoke-verify, install vllm-neuron, bench — with approximate timings.
- Document the three FP8 configuration constraints inherited from the Flash FP8 work (moe_tp=1/ep=64, batch_size>=32, outer ep_degree=1) and the underlying block-size-collapse reason.
- Checkpoint Preparation section describes the per-layer streaming preprocess in detail (attention per-row, MoE blockwise, router rename, norms pass-through).
- Key Implementation Notes enumerates the Neuron-native RmsNorm, the arange+bf16 router-bias init trick (with the two XLA traps it dodges), and the three FP8 monkey-patches.
- Serving example uses the actual working config (BS=32, moe_ep=64, trust_remote_code, NEURON_COMPILED_ARTIFACTS isolation).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Quantized)
Symptom: on the first vLLM run against the preprocessed FP8 checkpoint,
NxDI's shard_checkpoint pass logged
"Removing redundant keys from checkpoint:
['layers.0.self_attn.o_proj.scale', 'layers.0.self_attn.o_proj.weight',
...]"
for every one of the 62 decoder layers. The o_proj weights and scales
were dropped entirely, leaving the o_proj projection effectively zero
and attention output as garbage.
Root cause: NeuronMiniMaxM2Attention binds `self_attn.o_proj` to a
plain `RowParallelLinear` (no scale parameter), not the auto-swapped
`QuantizedRowParallel`. NxDI's convert() does not re-swap the layer
to the quantized class (the auto-swap skips modules that the modeling
code has already concretely instantiated as non-Quantized). When the
loader sees FP8 bytes + a .scale tensor landing on a module with no
.scale attribute, it flags them as "redundant" and drops them.
This is exactly the same trap MiMo-V2-Flash's preprocess hit for its
o_proj (listed in HF quantization_config.ignored_layers); M2.7's HF
o_proj is FP8, but the NxDI binding pattern is the same, so the fix
is the same:
- preprocess_minimax_m2_fp8.py: dequantize o_proj to BF16 (apply the
blockwise scale, cast to bf16) and emit only the BF16 .weight, no
.scale. q/k/v still go through the per-row FP8 path.
- smoke + bench configs: add "o_proj" to modules_to_not_convert so
NxDI's convert() does not try to swap the layer to
QuantizedRowParallel at load time.
- README: document the o_proj BF16 exception in the Checkpoint
Preparation section and update the example config.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
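The o_proj dequantization step can be illustrated with a simplified numpy sketch: apply the 128x128 blockwise scale and emit only a float weight. The real preprocess reads FP8 bytes via safetensors and casts to bf16; dtypes and shapes here are simplified:

```python
import numpy as np

def dequantize_blockwise(w_q, scale, block=128):
    """Apply a [ceil(R/block), ceil(C/block)] blockwise scale to a
    quantized weight, broadcasting each scale entry over its
    block x block tile, and return a float result (no .scale emitted)."""
    r, c = w_q.shape
    s = np.repeat(np.repeat(scale, block, axis=0), block, axis=1)[:r, :c]
    return w_q.astype(np.float32) * s

w = np.ones((4, 4), dtype=np.float32)
scale = np.array([[0.5]])  # a single 128x128 block covers this 4x4 demo
print(dequantize_blockwise(w, scale)[0, 0])  # 0.5
```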
RowParallelLinear's preshard hook usually renames self_attn.o_proj.weight -> self_attn.o_proj.o_proj.weight to match the traced module layout, but that hook only fires via maybe_dequantize_layer — which we skip when quantized=True. As a result, NxDI's shard_checkpoint reported the preprocessed o_proj.weight as a redundant key and dropped it, effectively zeroing the attention output. Do the rename explicitly in the FP8 branch of the converter. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
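A minimal sketch of that explicit rename on the FP8 branch; the helper name and the flat demo dict are illustrative, while the real code operates on the NxDI state dict inside the converter:

```python
def rename_o_proj_keys(state_dict):
    """Do the rename the skipped preshard hook would have done:
    ...self_attn.o_proj.weight -> ...self_attn.o_proj.o_proj.weight,
    so shard_checkpoint no longer flags the key as redundant."""
    for key in list(state_dict):
        if key.endswith("self_attn.o_proj.weight"):
            new_key = key.replace("o_proj.weight", "o_proj.o_proj.weight")
            state_dict[new_key] = state_dict.pop(key)
    return state_dict

sd = {"layers.0.self_attn.o_proj.weight": "W"}
print(sorted(rename_o_proj_keys(sd)))
# ['layers.0.self_attn.o_proj.o_proj.weight']
```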
…8 path"
This reverts commit d607f6c.
Description
Adds MiniMax-M2 to contrib/. All code lives under contrib/models/MiniMax-M2/ — git diff upstream/main..HEAD -- src/ is empty.
Model Information
Model Name: MiniMax-M2
Model Architecture: 229B total / ~10B active. 62 decoder layers, 256 MoE experts with top-8 routing, sigmoid router with e_score_correction_bias, partial RoPE (64/128 head dim), QK normalization (RMSNorm before reshape), GQA with 48 query heads / 8 KV heads, SwiGLU experts.
Purpose: Text generation.
HuggingFace Checkpoint: MiniMaxAI/MiniMax-M2
This port uses SDK 2.28 features: fused MoE NKI kernels (router_topk, moe_cte, moe_tkg), ModuleMarker wrappers for compiler optimization, fused QKV, shard-on-intermediate padding for blockwise matmul, and a custom RouterTopKWithBias that preserves e_score_correction_bias for accuracy.
Checklist
Required Components
contrib/models/MiniMax-M2/test/integration/test_model.py: import, required-attribute, and MoENeuronConfig class tests. The required-attribute test loads the bundled src/config.json so it can validate the config pipeline without downloading weights.
- Tests import via sys.path.insert + from modeling_minimax_m2 import ..., mirroring upstream contrib/Qwen2-Audio-7B
- Run with: pytest contrib/models/MiniMax-M2/test/integration/test_model.py
contrib/models/MiniMax-M2/src/:
- modeling_minimax_m2.py — full modeling (1388 lines)
- configuration_minimax_m2.py — HF config class
- config.json — bundled architecture snapshot so tests can run without weights
Optional Components
perf_test/0_setup.sh (install vllm-neuron, fetch BF16 weights), perf_test/bench_minimax_m2.sh (BS=1/256), perf_test/vllm-neuron-patch.patch (plumbs hf_config + snapshot_download through vllm-neuron).
Folder Structure
Testing
How did you test this change?
Validated on trn2.48xlarge with Neuron SDK 2.28 / PyTorch 2.9 / Python 3.12:
- Integration tests (MoENeuronConfig resolution, required attributes derived from the bundled config.json)
- bench_minimax_m2.sh covers two vLLM configurations: fused_qkv=true (baseline), and the use_shard_on_intermediate optimizations
Test Results:
Compatibility
Tested with:
Additional Information
Import convention follows upstream contrib/Qwen2-Audio-7B: tests and examples add this model's src/ directory to sys.path and import modeling files by their flat module name. This keeps the contrib package self-contained (no registration in utils/constants.py) and leaves the upstream package untouched.
For the BS=1 config, fused_qkv=true and use_shard_on_intermediate_dynamic_while=false are required — with EP=1, the intermediate dimension is not divisible by 256, so shard-on-intermediate padding would blow up memory. For BS=256 the EP=64 split makes I_TP = 1536/1 = 1536, which % 256 == 0, so shard-on-intermediate is safe.
Related Issues
Part of a cleanup of my earlier whn09:contrib/llm-models branch, which originally combined MiMo-V2-Flash and MiniMax-M2 and also touched src/neuronx_distributed_inference/utils/constants.py. The branch has been split into two zero-invasion PRs (this one, and a companion MiMo-V2-Flash PR).
vLLM Integration
perf_test/vllm-neuron-patch.patch)
By submitting this PR, I confirm that: