
[contrib] Add MiniMax-M2 (229B / ~10B active MoE, TP=64, EP=64)#138

Open
whn09 wants to merge 16 commits into aws-neuron:main from whn09:contrib/MiniMax-M2

Conversation


@whn09 whn09 commented Apr 22, 2026

Description

Adds MiniMax-M2 to contrib/. All code lives under contrib/models/MiniMax-M2/; git diff upstream/main..HEAD -- src/ is empty.

Model Information

Model Name: MiniMax-M2

Model Architecture: 229B total / ~10B active. 62 decoder layers, 256 MoE experts with top-8 routing, sigmoid router with e_score_correction_bias, partial RoPE (64/128 head dim), QK normalization (RMSNorm before reshape), GQA with 48 query heads / 8 KV heads, SwiGLU experts.

Purpose: Text generation.

HuggingFace Checkpoint: MiniMaxAI/MiniMax-M2

This port uses SDK 2.28 features: fused MoE NKI kernels (router_topk, moe_cte, moe_tkg), ModuleMarker wrappers for compiler optimization, fused QKV, shard-on-intermediate padding for blockwise matmul, and a custom RouterTopKWithBias that preserves e_score_correction_bias for accuracy.
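For reference, a minimal PyTorch sketch of the bias-aware routing that RouterTopKWithBias implements (illustrative only: the production path runs the fused router_topk NKI kernel, and the shapes and final renormalization here are assumptions):

```python
import torch

def router_topk_with_bias(hidden, gate_weight, e_score_correction_bias, top_k=8):
    # hidden: [tokens, H]; gate_weight: [num_experts, H] (shapes assumed)
    logits = hidden.float() @ gate_weight.float().T      # [tokens, num_experts]
    scores = torch.sigmoid(logits)                       # sigmoid router
    # The correction bias steers which experts are *selected*...
    _, expert_idx = torch.topk(scores + e_score_correction_bias, top_k, dim=-1)
    # ...while the combine weights come from the unbiased scores.
    weights = scores.gather(-1, expert_idx)
    return expert_idx, weights / weights.sum(dim=-1, keepdim=True)
```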

Checklist

Required Components

  • Accuracy Test (contrib/models/MiniMax-M2/test/integration/test_model.py): import, required-attribute, and MoENeuronConfig class tests. The required-attribute test loads the bundled src/config.json so it can validate the config pipeline without downloading weights.
  • README.md
    • Usage Example — uses the flat-import bootstrap (sys.path.insert + from modeling_minimax_m2 import ...), mirroring upstream contrib/Qwen2-Audio-7B
    • Compatibility Matrix — Trn2 (trn2.48xlarge), Neuron SDK 2.28, PyTorch 2.9
    • Example Checkpoints — HuggingFace link
    • Testing Instructions: pytest contrib/models/MiniMax-M2/test/integration/test_model.py
  • Source Code under contrib/models/MiniMax-M2/src/:
    • modeling_minimax_m2.py — full modeling (1388 lines)
    • configuration_minimax_m2.py — HF config class
    • config.json — bundled architecture snapshot so tests can run without weights

Optional Components

  • Performance benchmarks: perf_test/0_setup.sh (install vllm-neuron, fetch BF16 weights), perf_test/bench_minimax_m2.sh (BS=1/256), perf_test/vllm-neuron-patch.patch (plumbs hf_config + snapshot_download through vllm-neuron).

Folder Structure

contrib/models/MiniMax-M2/
  README.md
  src/
    __init__.py
    modeling_minimax_m2.py
    configuration_minimax_m2.py
    config.json
  test/
    integration/
      test_model.py
    unit/
  perf_test/
    0_setup.sh
    bench_minimax_m2.sh
    vllm-neuron-patch.patch

Testing

How did you test this change?

Validated on trn2.48xlarge with Neuron SDK 2.28 / PyTorch 2.9 / Python 3.12:

  1. Imports and config — integration tests pass on the NxDI venv (import, MoENeuronConfig resolution, required attributes derived from the bundled config.json)
  2. End-to-end serving — vllm-neuron with the included patch boots the model at TP=64/EP=64 and responds to chat-completion requests
  3. Throughput benchmarking: bench_minimax_m2.sh covers two vLLM configurations:
    • BS=1, TP=64/EP=1, non-CB, fused_qkv=true (baseline)
    • BS=256, TP=1/EP=64, CB + use_shard_on_intermediate optimizations

Test Results:

[ok] modeling_minimax_m2 imported
[ok] configuration_minimax_m2 imported

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.28
  • Instance Type(s): Trn2 (trn2.48xlarge, logical_nc_config=2 → 64 logical cores)
  • PyTorch Version: 2.9
  • Python Version: 3.12

Additional Information

Import convention follows upstream contrib/Qwen2-Audio-7B: tests and examples add this model's src/ directory to sys.path and import modeling files by their flat module name. This keeps the contrib package self-contained (no registration in utils/constants.py) and leaves the upstream package untouched.
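A minimal sketch of that bootstrap (the path is illustrative; the class names are from this PR's src/):

```python
import sys
from pathlib import Path

# Make the flat modules importable without registering anything upstream.
sys.path.insert(0, str(Path("contrib/models/MiniMax-M2/src")))  # adjust to your checkout

from modeling_minimax_m2 import NeuronMiniMaxM2ForCausalLM     # noqa: E402
from configuration_minimax_m2 import MiniMaxM2InferenceConfig  # noqa: E402
```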

For the BS=1 config, fused_qkv=true and use_shard_on_intermediate_dynamic_while=false are required: with EP=1 the experts are TP-sharded 64 ways, so the per-rank intermediate dimension (1536 / 64 = 24) is not divisible by 256 and shard-on-intermediate padding would blow up memory. For BS=256, the EP=64 split leaves each expert's intermediate dimension unsharded (I_TP = 1536 / 1 = 1536, which is divisible by 256), so shard-on-intermediate is safe.
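The arithmetic, as a quick check (1536 is the per-expert intermediate size used above; reading TP-within-expert off the two configs is an assumption):

```python
INTERMEDIATE = 1536  # per-expert intermediate size

for name, ep, tp in [("BS=1 baseline", 1, 64), ("BS=256", 64, 1)]:
    i_tp = INTERMEDIATE // tp    # per-rank intermediate after TP sharding
    print(f"{name}: EP={ep} TP={tp} I_TP={i_tp} "
          f"shard-on-intermediate {'safe' if i_tp % 256 == 0 else 'unsafe'}")
# BS=1 baseline: I_TP=24   -> unsafe (padding to 256 blows up memory)
# BS=256:        I_TP=1536 -> safe
```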

Related Issues

Part of a cleanup of my earlier whn09:contrib/llm-models branch, which originally combined MiMo-V2-Flash and MiniMax-M2 and also touched src/neuronx_distributed_inference/utils/constants.py. The branch has been split into two zero-invasion PRs (this one, and a companion MiMo-V2-Flash PR).

vLLM Integration

  • This model is intended for use with vLLM
  • Documentation includes vLLM registration instructions (see README "vLLM Integration" section and perf_test/vllm-neuron-patch.patch)

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

whn09 and others added 16 commits April 22, 2026 11:58
MiniMax-M2 in NxDI contrib format. All code lives under
contrib/models/MiniMax-M2/, with zero changes to the upstream src/ tree.

Architecture: 229B total / ~10B active. 62 decoder layers, 256 MoE
experts (top-8), sigmoid routing with e_score_correction_bias, partial
RoPE (64/128 head dim), QK normalization (RMSNorm before reshape),
GQA 48Q/8KV heads, SwiGLU experts.

SDK 2.28 features used:
  - Fused MoE NKI kernels (router_topk, moe_cte, moe_tkg)
  - ModuleMarker wrappers for compiler optimization
  - Fused QKV support
  - Shard-on-intermediate padding for blockwise matmul
  - RouterTopKWithBias preserving e_score_correction_bias for accuracy

Structure:
  src/modeling_minimax_m2.py           - full modeling code
  src/configuration_minimax_m2.py      - HF config class
  src/config.json                      - bundled architecture snapshot
                                          (used by unit tests w/o weights)
  test/integration/test_model.py       - config/import tests
  perf_test/0_setup.sh                 - vllm-neuron install + weight fetch
  perf_test/bench_minimax_m2.sh        - vLLM serving benchmark
  perf_test/vllm-neuron-patch.patch    - hf_config + snapshot_download
                                          plumbing in vllm-neuron

Import pattern: tests/examples add src/ to sys.path and import the
flat module name, matching upstream contrib/Qwen2-Audio-7B convention.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same refactor as the MiMo-V2-Flash PR: the previous patch targeted a stale
fork branch and could not apply to upstream vllm-neuron. Replaced with a
30-line patch against upstream vllm-project/vllm-neuron release-0.5.0 that
adds a `_register_contrib_models()` hook to `vllm_neuron.register()`:

  - If NXDI_CONTRIB_MINIMAX_M2_SRC is set, import NeuronMiniMaxM2ForCausalLM
    from that directory.
  - Register it into NxDI's MODEL_TYPES under key "minimaxm2" (matching the
    lowercased architecture that vllm-neuron's _get_neuron_model_cls
    computes from "MiniMaxM2ForCausalLM").

No vLLM-side registration is needed: vLLM's ModelRegistry already
recognizes MiniMaxM2ForCausalLM as a supported architecture in 0.16+.

Updated accordingly:
  - perf_test/0_setup.sh now clones release-0.5.0 and `git apply`s the patch.
  - perf_test/bench_minimax_m2.sh exports NXDI_CONTRIB_MINIMAX_M2_SRC
    defaulting to this package's own src/.
  - README serving instructions document the new env var.

Verified on trn2.48xlarge (NxDI 2.29, vLLM 0.16, vllm-neuron 0.5.0):
  MODEL_TYPES["minimaxm2"]["causal-lm"] lookup -> NeuronMiniMaxM2ForCausalLM

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The previous version hooked into vllm_neuron.register(), which only runs
in the parent APIServer process. vLLM V1 spawns EngineCore workers via
multiprocessing.spawn and those child processes start a fresh Python
interpreter; vLLM's plugin discovery does run there, but the
module-level state (in particular the NxDI MODEL_TYPES dict) is a fresh
copy so the parent's registration does not carry over.

Move _register_contrib_models() into the loader itself and call it at
the top of _get_neuron_model_cls(). Every process that tries to look up
an architecture now gets a fresh idempotent registration attempt driven
by NXDI_CONTRIB_MIMO_V2_FLASH_SRC / NXDI_CONTRIB_MINIMAX_M2_SRC.

Also correct the MODEL_TYPES key: release-0.5.0's loader does not have
the mimov2flash->mimo_v2_flash rewrite, so we must register under
"mimov2flash" (matches architecture.lower()) and "minimaxm2".

Verified on trn2.48xlarge:
  _get_neuron_model_cls("MiMoV2FlashForCausalLM") -> NeuronMiMoV2ForCausalLM
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
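A hedged sketch of what the loader-level hook described above looks like (env var, dict key, and class name follow the commit message; the surrounding vllm-neuron internals are assumptions):

```python
import importlib, os, sys

def _register_contrib_models(model_types: dict) -> None:
    """Idempotent, so calling it at the top of _get_neuron_model_cls is safe
    in the parent APIServer and in every spawned EngineCore worker."""
    src = os.environ.get("NXDI_CONTRIB_MINIMAX_M2_SRC")
    if not src or "minimaxm2" in model_types:
        return
    if src not in sys.path:
        sys.path.insert(0, src)
    mod = importlib.import_module("modeling_minimax_m2")
    # Key must match architecture.lower() for "MiniMaxM2ForCausalLM".
    model_types["minimaxm2"] = {"causal-lm": mod.NeuronMiniMaxM2ForCausalLM}
```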
Same fix as on the MiMo-V2-Flash branch. NxDI's hf_adapter.load_config
calls AutoConfig.from_pretrained(path) without trust_remote_code=True,
which crashes on checkpoints that ship a configuration_*.py (both
MiMo-V2-Flash and MiniMax-M2 ship one). Add a lazy
_patch_autoconfig_trust_remote_code() inside the loader patch so every
process hitting _get_neuron_model_cls installs the shim.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
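A minimal sketch of such a shim (assumed implementation; only the behavior described above is from the patch):

```python
from transformers import AutoConfig

def _patch_autoconfig_trust_remote_code() -> None:
    if getattr(AutoConfig, "_contrib_trc_patched", False):
        return  # idempotent: every process hitting the loader may call this
    original = AutoConfig.from_pretrained.__func__

    def from_pretrained(cls, *args, **kwargs):
        kwargs.setdefault("trust_remote_code", True)  # checkpoints ship configuration_*.py
        return original(cls, *args, **kwargs)

    AutoConfig.from_pretrained = classmethod(from_pretrained)
    AutoConfig._contrib_trc_patched = True
```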
Same workaround as the MiMo-V2-Flash bench. Neuron SDK 2.29's neuronxcc
does not export blockwise_mm_baseline_shard_hidden from any of the
blockwise_mm module paths that neuronx_distributed 0.17 searches, so the
MoE blockwise forward reaches _call_shard_hidden_kernel and raises
NotImplementedError. Setting use_torch_block_wise=true routes through
the torch reference implementation. Slower than the NKI path but unblocks
end-to-end vLLM benchmarking.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Same fix as on the MiMo-V2-Flash branch. First-time compilation of a
256-expert MoE model takes 30-90 minutes; the previous 600s timeout
aborts the benchmark while the background compile is still running.
Bump to 7200s (2h) and emit a progress blip every minute.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Mirrors the same two helpers just added on the MiMo-V2-Flash branch.

sanity_check.sh
  - One-shot chat completion against localhost:$PORT.
  - Prints the JSON response plus a one-line summary of the reply.
  - Health-checks /health first and exits fast if the server isn't up.

run_bench_single.sh
  - One 'vllm bench serve' pass with configurable
    CONCURRENCY / NUM_PROMPTS / INPUT_LEN / OUTPUT_LEN.
  - Does NOT launch or kill the server; you bring your own.
  - Writes the transcript to $RESULTS_DIR/${CONFIG_NAME}_c${CONCURRENCY}.txt.

Useful when bench_minimax_m2.sh's driver times out during first-compile
but the server is still up and reachable.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
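The sanity-check flow in Python, for context (the shell scripts are authoritative; the request payload details here are assumptions):

```python
import os, sys
import requests

base = f"http://localhost:{os.environ.get('PORT', '8000')}"

# Health-check first and exit fast if the server isn't up.
try:
    requests.get(f"{base}/health", timeout=5).raise_for_status()
except requests.RequestException:
    sys.exit(f"server not reachable at {base}")

resp = requests.post(
    f"{base}/v1/chat/completions",
    json={"model": "MiniMax-M2", "messages": [{"role": "user", "content": "ping"}]},
    timeout=300,
)
print(resp.json())  # the script also prints a one-line summary of the reply
```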
Port of the Flash preprocess design to MiniMax-M2 / M2.7. The HF
checkpoint ships as 130 sharded safetensors with weights of one
decoder layer scattered across several shards; this script keeps one
safe_open handle live at a time and emits one output file per layer
(plus model_extras.safetensors for embed/norm/lm_head). Peak RAM ~15 GB,
total runtime ~20 minutes on trn2.48xlarge.

Rescaling:
  - Attention q/k/v/o: OCP FP8 blockwise -> Neuron FP8 per-row
    (PER_CHANNEL_SYMMETRIC at inference time; per-rank out-dim < 128
    would otherwise collapse the blockwise scale to a singleton).
  - MoE expert w1/w3 fused into packed gate_up_proj
    [num_experts, H, 2*IM]; w2 stacked into down_proj [num_experts, IM, H].
    Scales stay blockwise (128x128), transposed in lockstep with the
    weight so the modeling code can load them directly.
  - Router gate.weight and e_score_correction_bias renamed into the
    NxDI router namespace (block_sparse_moe.router.linear_router.weight
    and block_sparse_moe.router.e_score_correction_bias).
  - All norms (input_layernorm, post_attention_layernorm, q_norm,
    k_norm, model.norm) and embed_tokens / lm_head passed through BF16.

Unlike Flash, M2 has uniform GQA (head_dim=128 for Q/K/V) and no
attention_sink_bias, so the preprocess is simpler — no asymmetric-V
handling, no sink bias routing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
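An illustrative sketch of the blockwise-to-per-row attention rescale described above (the 128x128 block layout and the FP8-E4M3 max of 448 are standard; the scale tensor layout is an assumption):

```python
import torch

def blockwise_fp8_to_per_row(w_fp8: torch.Tensor, block_scale: torch.Tensor,
                             block: int = 128):
    # Dequantize: broadcast each 128x128 block scale over its block.
    s = block_scale.repeat_interleave(block, dim=0)[: w_fp8.shape[0]]
    s = s.repeat_interleave(block, dim=1)[:, : w_fp8.shape[1]]
    w = w_fp8.to(torch.float32) * s
    # Requantize per output row (PER_CHANNEL_SYMMETRIC); E4M3 max is 448.
    row_scale = (w.abs().amax(dim=1, keepdim=True) / 448.0).clamp(min=1e-12)
    return (w / row_scale).to(torch.float8_e4m3fn), row_scale
```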
Wire the runtime pieces needed to run the preprocessed Neuron-FP8
checkpoint (via the new preprocess_minimax_m2_fp8.py streaming
preprocess). All additions are gated on neuron_config.quantized so the
existing BF16 path is untouched.

Changes:
- convert_minimax_m2_hf_to_neuron_state_dict:
  * Skip maybe_dequantize_layer when quantized=True (preserving FP8
    bytes + .scale tensors for NxDI's QuantizedColumnParallel layers).
  * Guard router rename against re-pop'ing keys that the preprocess
    already renamed under the NxDI router namespace.
  * Skip expert stacking when the preprocessed checkpoint already has
    the fused layout (block_sparse_moe.expert_mlps.mlp_op.gate_up_proj.weight).
  * Rename Q/K/V .scale tensors into the nested qkv_proj namespace in
    lockstep with the existing .weight rename.
  * Expand MoE blockwise gate_up / down .scale along the TP-partitioned
    dim so per_partition_size == 1 after NxDI sharding (ported from the
    Flash fix; preserves the gate/up boundary by expanding each half
    independently).

- NeuronMiniMaxM2ForCausalLM:
  * Install three FP8 monkey-patches up-front in __init__ (plus again
    in compile() / load() as a belt-and-braces) — all gated on
    quantized=True and idempotent so the BF16 path is not affected:
      - _apply_ep_scale_fix: don't EP-shard [1,1,W] singleton scales.
      - _apply_blockwise_scale_stride_fix: force partition_stride=1 for
        BLOCKWISE_SYMMETRIC when per-rank weight is below the 128-row
        scale block.
      - _apply_2d_per_channel_fix: flip q_config from
        BLOCKWISE_SYMMETRIC to PER_CHANNEL_SYMMETRIC for 2D attention
        weights (they use per-row scales, not block scales).
  * save_quantized_state_dict override: skip HF's re-quantize path
    (which requires CUDA for the finegrained_fp8 quantizer and
    materializes a ~600 GB BF16 copy) when the Neuron-FP8 index is
    already on disk.

Notes vs Flash:
- M2 doesn't need a router bias monkey-patch — the existing
  RouterTopKWithBias in this file already initializes
  e_score_correction_bias as torch.arange(num_experts, dtype=bfloat16),
  which is the same XLA-constant-fold-surviving init Flash ultimately
  landed on.
- M2 doesn't need attention_value_scale (no such field in the config).
- M2 doesn't need asymmetric V-head handling (Q=K=V head_dim=128).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
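The arange init mentioned in the notes above, for reference (the exact expression is from the commit; wrapping it as a frozen Parameter is an assumption, and checkpoint loading overwrites the values):

```python
import torch

num_experts = 256
# arange is not a constant the XLA pipeline folds away, so the bias
# survives tracing until the real checkpoint values are loaded over it.
e_score_correction_bias = torch.nn.Parameter(
    torch.arange(num_experts, dtype=torch.bfloat16), requires_grad=False
)
```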
Port of the Flash smoke scripts, adapted to M2:
  - import modeling_minimax_m2 / MiniMaxM2InferenceConfig / NeuronMiniMaxM2ForCausalLM
  - default paths under /opt/dlami/nvme/models/MiniMax-M2.7-Neuron-FP8 /
    /opt/dlami/nvme/compiled/minimax_m2_tp64_moetp1_ep64_fp8/
  - modules_to_not_convert drops "o_proj" (unlike Flash, M2's o_proj is
    part of the native FP8 quantization — HF's ignored_layers only lists
    gate, e_score_correction_bias, lm_head).

Defaults match the Flash recipe: MOE_TP=1, MOE_EP=64 (the only moe_tp/ep
ratio that keeps per-rank expert intermediate above the 128-row
blockwise scale block on a 64-NC Trn2). BASE_COMPILE_WORK_DIR is pinned
to a per-COMPILED_PATH subdir so multiple compiles can run in parallel
without overwriting each other's HLO staging.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two small bugs that showed up on first run:

- The compile-time config summary borrowed Flash's attribute names
  (n_routed_experts, layer_uses_moe, layer_attention_types), which don't
  exist on MiniMaxM2InferenceConfig. Switch to the M2 field names
  (num_local_experts, num_key_value_heads, attn_type_list).
- The M2 tokenizer doesn't ship a pad_token, so tokenizing
  [PROMPT] * BATCH_SIZE with padding=True raises in the adapter. Fall
  back to eos_token when pad_token is None.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
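The fallback is the standard HF tokenizer pattern (sketch; PROMPT and BATCH_SIZE are placeholders):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("MiniMaxAI/MiniMax-M2", trust_remote_code=True)
if tok.pad_token is None:          # M2's tokenizer ships no pad_token
    tok.pad_token = tok.eos_token  # fall back to EOS so padding=True works
PROMPT, BATCH_SIZE = "hello", 4
batch = tok([PROMPT] * BATCH_SIZE, padding=True, return_tensors="pt")
```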
The previous bench defaulted to a BF16 checkpoint and carried
MoE-kernel workarounds (use_torch_block_wise=true) that are specific to
the BF16 path. Switch to the FP8 recipe that actually produces fluent
output on Trn2:

- MODEL_PATH defaults to /opt/dlami/nvme/models/MiniMax-M2.7-Neuron-FP8
  (the preprocess_minimax_m2_fp8.py output dir).
- COMMON_MINIMAX_CONFIG carries all FP8 quantization fields inline so
  every config inherits them (quantized=true, blockwise_symmetric,
  128x128 blocks, moe_mask_padded_tokens=true). Unlike Flash, M2's
  o_proj IS FP8 — do not add it to modules_to_not_convert.
- Config 1: BS=32, moe_tp=1, moe_ep=64 (smallest BS the FP8 + EP path
  supports: NxDI refuses EP>1 under num_experts/top_k = 32).
- Config 2: BS=128, moe_tp=1, moe_ep=64 (throughput-leaning).
- Drops sequence_parallel_enabled=true from COMMON and switches the
  blockwise matmul kernel to use_shard_on_block_dynamic_while (the
  FP8 path uses the native NKI kernel instead of the torch fallback).

0_setup.sh:
- Clone vllm-neuron into $HOME instead of /tmp so the install survives
  AMI /tmp wipes.
- Step [2/2] now fetches the FP8 HF checkpoint from HuggingFace (no
  more private S3 BF16 dependency) and prints the preprocess command
  the user should run next, instead of assuming they already have a
  BF16 directory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Rewrite to match the FP8-first posture the branch now supports:

- Target M2.7 by default (validated checkpoint) while noting M2 / any
  minor MiniMaxM2ForCausalLM variant shares the same config schema and
  the same preprocess/inference recipe.
- Add a 6-step Quick Start that walks from a fresh trn2.48xlarge to a
  working vLLM server — download, preprocess, smoke-verify, install
  vllm-neuron, bench — with approximate timings.
- Document the three FP8 configuration constraints inherited from the
  Flash FP8 work (moe_tp=1/ep=64, batch_size>=32, outer ep_degree=1)
  and the underlying block-size-collapse reason.
- Checkpoint Preparation section describes the per-layer streaming
  preprocess in detail (attention per-row, MoE blockwise, router
  rename, norms pass-through).
- Key Implementation Notes enumerates the Neuron-native RmsNorm, the
  arange+bf16 router-bias init trick (with the two XLA traps it
  dodges), and the three FP8 monkey-patches.
- Serving example uses the actual working config (BS=32, moe_ep=64,
  trust_remote_code, NEURON_COMPILED_ARTIFACTS isolation).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…Quantized)

Symptom: on the first vLLM run against the preprocessed FP8 checkpoint,
NxDI's shard_checkpoint pass logged
  "Removing redundant keys from checkpoint:
   ['layers.0.self_attn.o_proj.scale', 'layers.0.self_attn.o_proj.weight',
    ...]"
for every one of the 62 decoder layers. The o_proj weights and scales
were dropped entirely, leaving the o_proj projection effectively zero
and attention output as garbage.

Root cause: NeuronMiniMaxM2Attention binds `self_attn.o_proj` to a
plain `RowParallelLinear` (no scale parameter), not the auto-swapped
`QuantizedRowParallel`. NxDI's convert() does not re-swap the layer
to the quantized class (the auto-swap skips modules that the modeling
code has already concretely instantiated as non-Quantized). When the
loader sees FP8 bytes + a .scale tensor landing on a module with no
.scale attribute, it flags them as "redundant" and drops them.

This is exactly the same trap MiMo-V2-Flash's preprocess hit for its
o_proj (listed in HF quantization_config.ignored_layers); M2.7's HF
o_proj is FP8, but the NxDI binding pattern is the same, so the fix
is the same:

- preprocess_minimax_m2_fp8.py: dequantize o_proj to BF16 (apply the
  blockwise scale, cast to bf16) and emit only the BF16 .weight, no
  .scale. q/k/v still go through the per-row FP8 path.
- smoke + bench configs: add "o_proj" to modules_to_not_convert so
  NxDI's convert() does not try to swap the layer to
  QuantizedRowParallel at load time.
- README: document the o_proj BF16 exception in the Checkpoint
  Preparation section and update the example config.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
RowParallelLinear's preshard hook usually renames self_attn.o_proj.weight ->
self_attn.o_proj.o_proj.weight to match the traced module layout, but that
hook only fires via maybe_dequantize_layer — which we skip when quantized=True.
As a result, NxDI's shard_checkpoint reported the preprocessed o_proj.weight
as a redundant key and dropped it, effectively zeroing the attention output.

Do the rename explicitly in the FP8 branch of the converter.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
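A sketch of that explicit rename (the key pattern is taken from the log excerpt above; 62 layers per the model config):

```python
def rename_o_proj_for_fp8(state_dict: dict, num_layers: int = 62) -> dict:
    # Mirror the preshard-hook rename that maybe_dequantize_layer would do.
    for layer in range(num_layers):
        old = f"layers.{layer}.self_attn.o_proj.weight"
        new = f"layers.{layer}.self_attn.o_proj.o_proj.weight"
        if old in state_dict:
            state_dict[new] = state_dict.pop(old)
    return state_dict
```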