
[contrib] Add MiniMax-M2 (229B / ~10B active MoE, TP=64, EP=64)#138

Open
whn09 wants to merge 16 commits into aws-neuron:main from whn09:contrib/MiniMax-M2

Conversation


@whn09 whn09 commented Apr 22, 2026

Description

Adds MiniMax-M2 to contrib/. All code lives under contrib/models/MiniMax-M2/; git diff upstream/main..HEAD -- src/ is empty.

Model Information

Model Name: MiniMax-M2

Model Architecture: 229B total / ~10B active. 62 decoder layers, 256 MoE experts with top-8 routing, sigmoid router with e_score_correction_bias, partial RoPE (64/128 head dim), QK normalization (RMSNorm before reshape), GQA with 48 query heads / 8 KV heads, SwiGLU experts.

Purpose: Text generation.

HuggingFace Checkpoint: MiniMaxAI/MiniMax-M2

This port uses SDK 2.28 features: fused MoE NKI kernels (router_topk, moe_cte, moe_tkg), ModuleMarker wrappers for compiler optimization, fused QKV, shard-on-intermediate padding for blockwise matmul, and a custom RouterTopKWithBias that preserves e_score_correction_bias for accuracy.
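For reference, a minimal PyTorch sketch of the bias-aware routing that RouterTopKWithBias implements (illustrative only: the production path runs the fused router_topk NKI kernel, and the shapes and final renormalization here are assumptions):

```python
import torch

def router_topk_with_bias(hidden, gate_weight, e_score_correction_bias, top_k=8):
    # hidden: [tokens, H]; gate_weight: [num_experts, H] (shapes assumed)
    logits = hidden.float() @ gate_weight.float().T      # [tokens, num_experts]
    scores = torch.sigmoid(logits)                       # sigmoid router
    # The correction bias steers which experts are *selected*...
    _, expert_idx = torch.topk(scores + e_score_correction_bias, top_k, dim=-1)
    # ...while the combine weights come from the unbiased scores.
    weights = scores.gather(-1, expert_idx)
    return expert_idx, weights / weights.sum(dim=-1, keepdim=True)
```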

Checklist

Required Components

  • Accuracy Test (contrib/models/MiniMax-M2/test/integration/test_model.py): import, required-attribute, and MoENeuronConfig class tests. The required-attribute test loads the bundled src/config.json so it can validate the config pipeline without downloading weights.
  • README.md
    • Usage Example — uses the flat-import bootstrap (sys.path.insert + from modeling_minimax_m2 import ...), mirroring upstream contrib/Qwen2-Audio-7B
    • Compatibility Matrix — Trn2 (trn2.48xlarge), Neuron SDK 2.28, PyTorch 2.9
    • Example Checkpoints — HuggingFace link
    • Testing Instructions: pytest contrib/models/MiniMax-M2/test/integration/test_model.py
  • Source Code under contrib/models/MiniMax-M2/src/:
    • modeling_minimax_m2.py — full modeling (1388 lines)
    • configuration_minimax_m2.py — HF config class
    • config.json — bundled architecture snapshot so tests can run without weights

Optional Components

  • Performance benchmarks: perf_test/0_setup.sh (install vllm-neuron, fetch BF16 weights), perf_test/bench_minimax_m2.sh (BS=1/256), perf_test/vllm-neuron-patch.patch (plumbs hf_config + snapshot_download through vllm-neuron).

Folder Structure

contrib/models/MiniMax-M2/
  README.md
  src/
    __init__.py
    modeling_minimax_m2.py
    configuration_minimax_m2.py
    config.json
  test/
    integration/
      test_model.py
    unit/
  perf_test/
    0_setup.sh
    bench_minimax_m2.sh
    vllm-neuron-patch.patch

Testing

How did you test this change?

Validated on trn2.48xlarge with Neuron SDK 2.28 / PyTorch 2.9 / Python 3.12:

  1. Imports and config — integration tests pass on the NxDI venv (import, MoENeuronConfig resolution, required attributes derived from the bundled config.json)
  2. End-to-end serving — vllm-neuron with the included patch boots the model at TP=64/EP=64 and responds to chat-completion requests
  3. Throughput benchmarking: bench_minimax_m2.sh covers two vLLM configurations:
    • BS=1, TP=64/EP=1, non-CB, fused_qkv=true (baseline)
    • BS=256, TP=1/EP=64, CB + use_shard_on_intermediate optimizations

Test Results:

[ok] modeling_minimax_m2 imported
[ok] configuration_minimax_m2 imported

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28
  • Instance Type(s): Trn2 (trn2.48xlarge, logical_nc_config=2 → 64 logical cores)
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

Import convention follows upstream contrib/Qwen2-Audio-7B: tests and examples add this model's src/ directory to sys.path and import modeling files by their flat module name. This keeps the contrib package self-contained (no registration in utils/constants.py) and leaves the upstream package untouched.
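A minimal sketch of that bootstrap (the path is illustrative; the class names are from this PR's src/):

```python
import sys
from pathlib import Path

# Make the flat modules importable without registering anything upstream.
sys.path.insert(0, str(Path("contrib/models/MiniMax-M2/src")))  # adjust to your checkout

from modeling_minimax_m2 import NeuronMiniMaxM2ForCausalLM     # noqa: E402
from configuration_minimax_m2 import MiniMaxM2InferenceConfig  # noqa: E402
```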

For the BS=1 config, fused_qkv=true and use_shard_on_intermediate_dynamic_while=false are required: with EP=1 the experts are TP-sharded 64 ways, so the per-rank intermediate dimension (1536 / 64 = 24) is not divisible by 256 and shard-on-intermediate padding would blow up memory. For BS=256, the EP=64 split leaves each expert's intermediate dimension unsharded (I_TP = 1536 / 1 = 1536, which is divisible by 256), so shard-on-intermediate is safe.
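The arithmetic, as a quick check (1536 is the per-expert intermediate size used above; reading TP-within-expert off the two configs is an assumption):

```python
INTERMEDIATE = 1536  # per-expert intermediate size

for name, ep, tp in [("BS=1 baseline", 1, 64), ("BS=256", 64, 1)]:
    i_tp = INTERMEDIATE // tp    # per-rank intermediate after TP sharding
    print(f"{name}: EP={ep} TP={tp} I_TP={i_tp} "
          f"shard-on-intermediate {'safe' if i_tp % 256 == 0 else 'unsafe'}")
# BS=1 baseline: I_TP=24   -> unsafe (padding to 256 blows up memory)
# BS=256:        I_TP=1536 -> safe
```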

Related Issues

Part of a cleanup of my earlier whn09:contrib/llm-models branch, which originally combined MiMo-V2-Flash and MiniMax-M2 and also touched src/neuronx_distributed_inference/utils/constants.py. The branch has been split into two zero-invasion PRs (this one, and a companion MiMo-V2-Flash PR).

vLLM Integration

  • This model is intended for use with vLLM
  • Documentation includes vLLM registration instructions (see README "vLLM Integration" section and perf_test/vllm-neuron-patch.patch)

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

whn09 and others added 16 commits April 22, 2026 11:58
MiniMax-M2 in NxDI contrib format. All code lives under
contrib/models/MiniMax-M2/, with zero changes to the upstream src/ tree.

Architecture: 229B total / ~10B active. 62 decoder layers, 256 MoE
experts (top-8), sigmoid routing with e_score_correction_bias, partial
RoPE (64/128 head dim), QK normalization (RMSNorm before reshape),
GQA 48Q/8KV heads, SwiGLU experts.

SDK 2.28 features used:
  - Fused MoE NKI kernels (router_topk, moe_cte, moe_tkg)
  - ModuleMarker wrappers for compiler optimization
  - Fused QKV support
  - Shard-on-intermediate padding for blockwise matmul
  - RouterTopKWithBias preserving e_score_correction_bias for accuracy

Structure:
  src/modeling_minimax_m2.py           - full modeling code
  src/configuration_minimax_m2.py      - HF config class
  src/config.json                      - bundled architecture snapshot
                                          (used by unit tests w/o weights)
  test/integration/test_model.py       - config/import tests
  perf_test/0_setup.sh                 - vllm-neuron install + weight fetch
  perf_test/bench_minimax_m2.sh        - vLLM serving benchmark
  perf_test/vllm-neuron-patch.patch    - hf_config + snapshot_download
                                          plumbing in vllm-neuron

Import pattern: tests/examples add src/ to sys.path and import the
flat module name, matching upstream contrib/Qwen2-Audio-7B convention.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same refactor as the MiMo-V2-Flash PR: the previous patch targeted a stale
fork branch and could not apply to upstream vllm-neuron. Replaced with a
30-line patch against upstream vllm-project/vllm-neuron release-0.5.0 that
adds a `_register_contrib_models()` hook to `vllm_neuron.register()`:

  - If NXDI_CONTRIB_MINIMAX_M2_SRC is set, import NeuronMiniMaxM2ForCausalLM
    from that directory.
  - Register it into NxDI's MODEL_TYPES under key "minimaxm2" (matching the
    lowercased architecture that vllm-neuron's _get_neuron_model_cls
    computes from "MiniMaxM2ForCausalLM").

No vLLM-side registration is needed: vLLM's ModelRegistry already
recognizes MiniMaxM2ForCausalLM as a supported architecture in 0.16+.

Updated accordingly:
  - perf_test/0_setup.sh now clones release-0.5.0 and `git apply`s the patch.
  - perf_test/bench_minimax_m2.sh exports NXDI_CONTRIB_MINIMAX_M2_SRC
    defaulting to this package's own src/.
  - README serving instructions document the new env var.

Verified on trn2.48xlarge (NxDI 2.29, vLLM 0.16, vllm-neuron 0.5.0):
  MODEL_TYPES["minimaxm2"]["causal-lm"] lookup -> NeuronMiniMaxM2ForCausalLM

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous version hooked into vllm_neuron.register(), which only runs
in the parent APIServer process. vLLM V1 spawns EngineCore workers via
multiprocessing.spawn and those child processes start a fresh Python
interpreter; vLLM's plugin discovery does run there, but the
module-level state (in particular the NxDI MODEL_TYPES dict) is a fresh
copy so the parent's registration does not carry over.

Move _register_contrib_models() into the loader itself and call it at
the top of _get_neuron_model_cls(). Every process that tries to look up
an architecture now gets a fresh idempotent registration attempt driven
by NXDI_CONTRIB_MIMO_V2_FLASH_SRC / NXDI_CONTRIB_MINIMAX_M2_SRC.

Also correct the MODEL_TYPES key: release-0.5.0's loader does not have
the mimov2flash->mimo_v2_flash rewrite, so we must register under
"mimov2flash" (matches architecture.lower()) and "minimaxm2".

Verified on trn2.48xlarge:
  _get_neuron_model_cls("MiMoV2FlashForCausalLM") -> NeuronMiMoV2ForCausalLM
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
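A hedged sketch of what the loader-level hook described above looks like (env var, dict key, and class name follow the commit message; the surrounding vllm-neuron internals are assumptions):

```python
import importlib, os, sys

def _register_contrib_models(model_types: dict) -> None:
    """Idempotent, so calling it at the top of _get_neuron_model_cls is safe
    in the parent APIServer and in every spawned EngineCore worker."""
    src = os.environ.get("NXDI_CONTRIB_MINIMAX_M2_SRC")
    if not src or "minimaxm2" in model_types:
        return
    if src not in sys.path:
        sys.path.insert(0, src)
    mod = importlib.import_module("modeling_minimax_m2")
    # Key must match architecture.lower() for "MiniMaxM2ForCausalLM".
    model_types["minimaxm2"] = {"causal-lm": mod.NeuronMiniMaxM2ForCausalLM}
```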
Same fix as on the MiMo-V2-Flash branch. NxDI's hf_adapter.load_config
calls AutoConfig.from_pretrained(path) without trust_remote_code=True,
which crashes on checkpoints that ship a configuration_*.py (both
MiMo-V2-Flash and MiniMax-M2 ship one). Add a lazy
_patch_autoconfig_trust_remote_code() inside the loader patch so every
process hitting _get_neuron_model_cls installs the shim.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
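A minimal sketch of such a shim (assumed implementation; only the behavior described above is from the patch):

```python
from transformers import AutoConfig

def _patch_autoconfig_trust_remote_code() -> None:
    if getattr(AutoConfig, "_contrib_trc_patched", False):
        return  # idempotent: every process hitting the loader may call this
    original = AutoConfig.from_pretrained.__func__

    def from_pretrained(cls, *args, **kwargs):
        kwargs.setdefault("trust_remote_code", True)  # checkpoints ship configuration_*.py
        return original(cls, *args, **kwargs)

    AutoConfig.from_pretrained = classmethod(from_pretrained)
    AutoConfig._contrib_trc_patched = True
```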
Same workaround as the MiMo-V2-Flash bench. Neuron SDK 2.29's neuronxcc
does not export blockwise_mm_baseline_shard_hidden from any of the
blockwise_mm module paths that neuronx_distributed 0.17 searches, so the
MoE blockwise forward reaches _call_shard_hidden_kernel and raises
NotImplementedError. Setting use_torch_block_wise=true routes through
the torch reference implementation. Slower than the NKI path but unblocks
end-to-end vLLM benchmarking.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same fix as on the MiMo-V2-Flash branch. First-time compilation of a
256-expert MoE model takes 30-90 minutes; the previous 600s timeout
aborts the benchmark while the background compile is still running.
Bump to 7200s (2h) and emit a progress blip every minute.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the same two helpers just added on the MiMo-V2-Flash branch.

sanity_check.sh
  - One-shot chat completion against localhost:$PORT.
  - Prints the JSON response plus a one-line summary of the reply.
  - Health-checks /health first and exits fast if the server isn't up.

run_bench_single.sh
  - One 'vllm bench serve' pass with configurable
    CONCURRENCY / NUM_PROMPTS / INPUT_LEN / OUTPUT_LEN.
  - Does NOT launch or kill the server; you bring your own.
  - Writes the transcript to $RESULTS_DIR/${CONFIG_NAME}_c${CONCURRENCY}.txt.

Useful when bench_minimax_m2.sh's driver times out during first-compile
but the server is still up and reachable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
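The sanity-check flow in Python, for context (the shell scripts are authoritative; the request payload details here are assumptions):

```python
import os, sys
import requests

base = f"http://localhost:{os.environ.get('PORT', '8000')}"

# Health-check first and exit fast if the server isn't up.
try:
    requests.get(f"{base}/health", timeout=5).raise_for_status()
except requests.RequestException:
    sys.exit(f"server not reachable at {base}")

resp = requests.post(
    f"{base}/v1/chat/completions",
    json={"model": "MiniMax-M2", "messages": [{"role": "user", "content": "ping"}]},
    timeout=300,
)
print(resp.json())  # the script also prints a one-line summary of the reply
```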
Port of the Flash preprocess design to MiniMax-M2 / M2.7. The HF
checkpoint ships as 130 sharded safetensors with weights of one
decoder layer scattered across several shards; this script keeps one
safe_open handle live at a time and emits one output file per layer
(plus model_extras.safetensors for embed/norm/lm_head). Peak RAM ~15 GB,
total runtime ~20 minutes on trn2.48xlarge.

Rescaling:
  - Attention q/k/v/o: OCP FP8 blockwise -> Neuron FP8 per-row
    (PER_CHANNEL_SYMMETRIC at inference time; per-rank out-dim < 128
    would otherwise collapse the blockwise scale to a singleton).
  - MoE expert w1/w3 fused into packed gate_up_proj
    [num_experts, H, 2*IM]; w2 stacked into down_proj [num_experts, IM, H].
    Scales stay blockwise (128x128), transposed in lockstep with the
    weight so the modeling code can load them directly.
  - Router gate.weight and e_score_correction_bias renamed into the
    NxDI router namespace (block_sparse_moe.router.linear_router.weight
    and block_sparse_moe.router.e_score_correction_bias).
  - All norms (input_layernorm, post_attention_layernorm, q_norm,
    k_norm, model.norm) and embed_tokens / lm_head passed through BF16.

Unlike Flash, M2 has uniform GQA (head_dim=128 for Q/K/V) and no
attention_sink_bias, so the preprocess is simpler — no asymmetric-V
handling, no sink bias routing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
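An illustrative sketch of the blockwise-to-per-row attention rescale described above (the 128x128 block layout and the FP8-E4M3 max of 448 are standard; the scale tensor layout is an assumption):

```python
import torch

def blockwise_fp8_to_per_row(w_fp8: torch.Tensor, block_scale: torch.Tensor,
                             block: int = 128):
    # Dequantize: broadcast each 128x128 block scale over its block.
    s = block_scale.repeat_interleave(block, dim=0)[: w_fp8.shape[0]]
    s = s.repeat_interleave(block, dim=1)[:, : w_fp8.shape[1]]
    w = w_fp8.to(torch.float32) * s
    # Requantize per output row (PER_CHANNEL_SYMMETRIC); E4M3 max is 448.
    row_scale = (w.abs().amax(dim=1, keepdim=True) / 448.0).clamp(min=1e-12)
    return (w / row_scale).to(torch.float8_e4m3fn), row_scale
```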
Wire the runtime pieces needed to run the preprocessed Neuron-FP8
checkpoint (via the new preprocess_minimax_m2_fp8.py streaming
preprocess). All additions are gated on neuron_config.quantized so the
existing BF16 path is untouched.

Changes:
- convert_minimax_m2_hf_to_neuron_state_dict:
  * Skip maybe_dequantize_layer when quantized=True (preserving FP8
    bytes + .scale tensors for NxDI's QuantizedColumnParallel layers).
  * Guard router rename against re-pop'ing keys that the preprocess
    already renamed under the NxDI router namespace.
  * Skip expert stacking when the preprocessed checkpoint already has
    the fused layout (block_sparse_moe.expert_mlps.mlp_op.gate_up_proj.weight).
  * Rename Q/K/V .scale tensors into the nested qkv_proj namespace in
    lockstep with the existing .weight rename.
  * Expand MoE blockwise gate_up / down .scale along the TP-partitioned
    dim so per_partition_size == 1 after NxDI sharding (ported from the
    Flash fix; preserves the gate/up boundary by expanding each half
    independently).

- NeuronMiniMaxM2ForCausalLM:
  * Install three FP8 monkey-patches up-front in __init__ (plus again
    in compile() / load() as a belt-and-braces) — all gated on
    quantized=True and idempotent so the BF16 path is not affected:
      - _apply_ep_scale_fix: don't EP-shard [1,1,W] singleton scales.
      - _apply_blockwise_scale_stride_fix: force partition_stride=1 for
        BLOCKWISE_SYMMETRIC when per-rank weight is below the 128-row
        scale block.
      - _apply_2d_per_channel_fix: flip q_config from
        BLOCKWISE_SYMMETRIC to PER_CHANNEL_SYMMETRIC for 2D attention
        weights (they use per-row scales, not block scales).
  * save_quantized_state_dict override: skip HF's re-quantize path
    (which requires CUDA for the finegrained_fp8 quantizer and
    materializes a ~600 GB BF16 copy) when the Neuron-FP8 index is
    already on disk.

Notes vs Flash:
- M2 doesn't need a router bias monkey-patch — the existing
  RouterTopKWithBias in this file already initializes
  e_score_correction_bias as torch.arange(num_experts, dtype=bfloat16),
  which is the same XLA-constant-fold-surviving init Flash ultimately
  landed on.
- M2 doesn't need attention_value_scale (no such field in the config).
- M2 doesn't need asymmetric V-head handling (Q=K=V head_dim=128).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
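The arange init mentioned in the notes above, for reference (the exact expression is from the commit; wrapping it as a frozen Parameter is an assumption, and checkpoint loading overwrites the values):

```python
import torch

num_experts = 256
# arange is not a constant the XLA pipeline folds away, so the bias
# survives tracing until the real checkpoint values are loaded over it.
e_score_correction_bias = torch.nn.Parameter(
    torch.arange(num_experts, dtype=torch.bfloat16), requires_grad=False
)
```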
Port of the Flash smoke scripts, adapted to M2:
  - import modeling_minimax_m2 / MiniMaxM2InferenceConfig / NeuronMiniMaxM2ForCausalLM
  - default paths under /opt/dlami/nvme/models/MiniMax-M2.7-Neuron-FP8 /
    /opt/dlami/nvme/compiled/minimax_m2_tp64_moetp1_ep64_fp8/
  - modules_to_not_convert drops "o_proj" (unlike Flash, M2's o_proj is
    part of the native FP8 quantization — HF's ignored_layers only lists
    gate, e_score_correction_bias, lm_head).

Defaults match the Flash recipe: MOE_TP=1, MOE_EP=64 (the only moe_tp/ep
ratio that keeps per-rank expert intermediate above the 128-row
blockwise scale block on a 64-NC Trn2). BASE_COMPILE_WORK_DIR is pinned
to a per-COMPILED_PATH subdir so multiple compiles can run in parallel
without overwriting each other's HLO staging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two small bugs that showed up on first run:

- The compile-time config summary borrowed Flash's attribute names
  (n_routed_experts, layer_uses_moe, layer_attention_types), which don't
  exist on MiniMaxM2InferenceConfig. Switch to the M2 field names
  (num_local_experts, num_key_value_heads, attn_type_list).
- The M2 tokenizer doesn't ship a pad_token, so tokenizing
  [PROMPT] * BATCH_SIZE with padding=True raises in the adapter. Fall
  back to eos_token when pad_token is None.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
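The fallback is the standard HF tokenizer pattern (sketch; PROMPT and BATCH_SIZE are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2", trust_remote_code=True)
if tok.pad_token is None:          # M2's tokenizer ships no pad_token
    tok.pad_token = tok.eos_token  # fall back to EOS so padding=True works
PROMPT, BATCH_SIZE = "hello", 4
batch = tok([PROMPT] * BATCH_SIZE, padding=True, return_tensors="pt")
```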
The previous bench defaulted to a BF16 checkpoint and carried
MoE-kernel workarounds (use_torch_block_wise=true) that are specific to
the BF16 path. Switch to the FP8 recipe that actually produces fluent
output on Trn2:

- MODEL_PATH defaults to /opt/dlami/nvme/models/MiniMax-M2.7-Neuron-FP8
  (the preprocess_minimax_m2_fp8.py output dir).
- COMMON_MINIMAX_CONFIG carries all FP8 quantization fields inline so
  every config inherits them (quantized=true, blockwise_symmetric,
  128x128 blocks, moe_mask_padded_tokens=true). Unlike Flash, M2's
  o_proj IS FP8 — do not add it to modules_to_not_convert.
- Config 1: BS=32, moe_tp=1, moe_ep=64 (smallest BS the FP8 + EP path
  supports: NxDI refuses EP>1 under num_experts/top_k = 32).
- Config 2: BS=128, moe_tp=1, moe_ep=64 (throughput-leaning).
- Drops sequence_parallel_enabled=true from COMMON and switches the
  blockwise matmul kernel to use_shard_on_block_dynamic_while (the
  FP8 path uses the native NKI kernel instead of the torch fallback).

0_setup.sh:
- Clone vllm-neuron into $HOME instead of /tmp so the install survives
  AMI /tmp wipes.
- Step [2/2] now fetches the FP8 HF checkpoint from HuggingFace (no
  more private S3 BF16 dependency) and prints the preprocess command
  the user should run next, instead of assuming they already have a
  BF16 directory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rewrite to match the FP8-first posture the branch now supports:

- Target M2.7 by default (validated checkpoint) while noting M2 / any
  minor MiniMaxM2ForCausalLM variant shares the same config schema and
  the same preprocess/inference recipe.
- Add a 6-step Quick Start that walks from a fresh trn2.48xlarge to a
  working vLLM server — download, preprocess, smoke-verify, install
  vllm-neuron, bench — with approximate timings.
- Document the three FP8 configuration constraints inherited from the
  Flash FP8 work (moe_tp=1/ep=64, batch_size>=32, outer ep_degree=1)
  and the underlying block-size-collapse reason.
- Checkpoint Preparation section describes the per-layer streaming
  preprocess in detail (attention per-row, MoE blockwise, router
  rename, norms pass-through).
- Key Implementation Notes enumerates the Neuron-native RmsNorm, the
  arange+bf16 router-bias init trick (with the two XLA traps it
  dodges), and the three FP8 monkey-patches.
- Serving example uses the actual working config (BS=32, moe_ep=64,
  trust_remote_code, NEURON_COMPILED_ARTIFACTS isolation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Quantized)

Symptom: on the first vLLM run against the preprocessed FP8 checkpoint,
NxDI's shard_checkpoint pass logged
  "Removing redundant keys from checkpoint:
   ['layers.0.self_attn.o_proj.scale', 'layers.0.self_attn.o_proj.weight',
    ...]"
for every one of the 62 decoder layers. The o_proj weights and scales
were dropped entirely, leaving the o_proj projection effectively zero
and attention output as garbage.

Root cause: NeuronMiniMaxM2Attention binds `self_attn.o_proj` to a
plain `RowParallelLinear` (no scale parameter), not the auto-swapped
`QuantizedRowParallel`. NxDI's convert() does not re-swap the layer
to the quantized class (the auto-swap skips modules that the modeling
code has already concretely instantiated as non-Quantized). When the
loader sees FP8 bytes + a .scale tensor landing on a module with no
.scale attribute, it flags them as "redundant" and drops them.

This is exactly the same trap MiMo-V2-Flash's preprocess hit for its
o_proj (listed in HF quantization_config.ignored_layers); M2.7's HF
o_proj is FP8, but the NxDI binding pattern is the same, so the fix
is the same:

- preprocess_minimax_m2_fp8.py: dequantize o_proj to BF16 (apply the
  blockwise scale, cast to bf16) and emit only the BF16 .weight, no
  .scale. q/k/v still go through the per-row FP8 path.
- smoke + bench configs: add "o_proj" to modules_to_not_convert so
  NxDI's convert() does not try to swap the layer to
  QuantizedRowParallel at load time.
- README: document the o_proj BF16 exception in the Checkpoint
  Preparation section and update the example config.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
RowParallelLinear's preshard hook usually renames self_attn.o_proj.weight ->
self_attn.o_proj.o_proj.weight to match the traced module layout, but that
hook only fires via maybe_dequantize_layer — which we skip when quantized=True.
As a result, NxDI's shard_checkpoint reported the preprocessed o_proj.weight
as a redundant key and dropped it, effectively zeroing the attention output.

Do the rename explicitly in the FP8 branch of the converter.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
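A sketch of that explicit rename (the key pattern is taken from the log excerpt above; 62 layers per the model config):

```python
def rename_o_proj_for_fp8(state_dict: dict, num_layers: int = 62) -> dict:
    # Mirror the preshard-hook rename that maybe_dequantize_layer would do.
    for layer in range(num_layers):
        old = f"layers.{layer}.self_attn.o_proj.weight"
        new = f"layers.{layer}.self_attn.o_proj.o_proj.weight"
        if old in state_dict:
            state_dict[new] = state_dict.pop(old)
    return state_dict
```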