Add GLM-5 (754B MoE) contrib model for trn2.48xlarge#143

Open
jimburtoft wants to merge 8 commits into aws-neuron:main from jimburtoft:contrib/GLM-5

Conversation

@jimburtoft
Contributor

Summary

  • Adds NxDI contrib model for GLM-5 (zai-org/GLM-5-FP8), a 754B parameter Mixture-of-Experts model with 40B active parameters per token
  • Targets trn2.48xlarge with TP=64, FP8 expert weights, BF16 attention/dense layers, SDK 2.29
  • Includes modeling code (2541 lines), README with usage/benchmarks/compatibility matrix, and integration tests with logit validation

Model Architecture

GLM-5 is architecturally identical to DeepSeek-V3 with:

  • 78 layers (3 dense + 75 MoE), 256 routed experts, top-8 sigmoid routing
  • MLA (Multi-head Latent Attention) with compressed KV cache (576 values/token)
  • e_score_correction_bias for expert selection, with routed_scaling_factor=2.5 applied to the routing weights (see the sketch after this list)
  • FP8 blockwise quantized checkpoint (~705 GB)
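
For reference, a minimal PyTorch sketch of the sigmoid routing described above (function and argument names are illustrative, not the PR's code; the actual nkilib/NKI implementation differs in layout and fusion):

```python
import torch

def glm5_route(hidden, router_weight, e_score_correction_bias,
               top_k=8, routed_scaling_factor=2.5):
    # Sigmoid scores over the routed experts (no softmax).
    scores = torch.sigmoid(hidden @ router_weight.T)        # [tokens, 256]
    # The correction bias influences which experts are selected...
    _, topk_idx = torch.topk(scores + e_score_correction_bias, top_k, dim=-1)
    # ...but the mixing weights come from the unbiased scores.
    topk_scores = scores.gather(-1, topk_idx)
    weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, weights * routed_scaling_factor
```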

Benchmark Results (trn2.48xlarge, SDK 2.29)

| Batch Size | CTE seq_len | Total tok/s | Per-req tok/s |
|-----------:|------------:|------------:|--------------:|
| 1          | 2048        | 2.1         | 2.1           |
| 4          | 512         | 12.3        | 3.1           |
| 8          | 256         | 23.4        | 2.9           |

Key Implementation Details

  • Fused MoE NKI kernel: Uses nkilib's fused TKG mega-kernel with a runtime patch (_patch_fused_tkg_with_nkilib) to inject GLM-5's sigmoid routing with selection_bias and routed_scaling_factor
  • FP8 handling: Blockwise FP8 dequantization for non-expert weights; per-tensor symmetric FP8 re-quantization for expert weights with NaN clamping (Neuron FP8 E4M3 max = 240; see the sketch after this list)
  • Single-process SPMD: Compiled with local_ranks_size=64, loaded in single-process mode (not torchrun)
  • DSA (Sparse Attention): Architecture defined but uses full-attention fallback
  • Shared expert: Separate module outside the fused NKI kernel
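
As an illustration of the expert-weight path above, per-tensor symmetric FP8 re-quantization with NaN clamping could look like the following (a sketch, not the PR's code; torch.float8_e4m3fn stands in for Neuron's FP8 dtype, whose usable max is 240 rather than OCP's 448):

```python
import torch

NEURON_FP8_MAX = 240.0  # Trainium E4M3 dynamic range (OCP E4M3FN reaches 448)

def per_tensor_fp8_requant(w: torch.Tensor):
    """Symmetric per-tensor FP8 re-quantization with NaN clamping."""
    w = torch.nan_to_num(w, nan=0.0)                  # drop NaNs from dequant
    scale = w.abs().amax().clamp(min=1e-12) / NEURON_FP8_MAX
    q = (w / scale).clamp(-NEURON_FP8_MAX, NEURON_FP8_MAX)
    return q.to(torch.float8_e4m3fn), scale           # matmul applies q * scale
```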

Dependencies

  • Neuron SDK 2.29 (NxD Inference 0.9.17334+, neuronx-cc 2.24, NKI 0.3.0)
  • Modified nkilib with GLM-5 routing support (3 files: router_topk.py, router_topk_torch.py, moe_block_tkg.py)

Test Plan

Integration tests in test/integration/test_model.py:

  1. test_model_loads — Smoke test for model loading
  2. test_model_generates — Verifies non-empty token generation
  3. test_output_coherence — Anti-repetition check on generated text
  4. test_logit_validation — Validates logit distributions (no NaN/Inf, reasonable entropy; see the sketch below)
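
A minimal version of that logit check (illustrative only; the repo's test may differ):

```python
import torch

def check_logits(logits: torch.Tensor, min_entropy=0.05, max_entropy=12.0):
    """Sanity-check next-token logits: finite values, plausible entropy."""
    assert torch.isfinite(logits).all(), "logits contain NaN/Inf"
    probs = torch.softmax(logits.float(), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    assert (entropy > min_entropy).all() and (entropy < max_entropy).all()
```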

Requires trn2.48xlarge with pre-compiled model and pre-sharded weights.

GLM-5 (zai-org/GLM-5-FP8) is a 754B parameter MoE model with 40B active
per token, 256 routed experts (top-8), MLA attention, and sigmoid routing
with selection_bias. This contrib adds NxDI support targeting trn2.48xlarge
(TP=64, FP8 experts, BF16 attention/dense layers).

Benchmarks on trn2.48xlarge (SDK 2.29):
- BS=1: 2.1 tok/s, BS=4: 12.3 tok/s, BS=8: 23.4 tok/s

Includes modeling code (2541 lines), README with usage/benchmarks/compatibility,
and integration tests with logit validation.

Integrate nkilib SwiGLU MLP kernel into GLM5DenseMLP for the 3 dense
decoder layers. Weights are transposed at init and the kernel handles
both CTE and TKG dispatch internally. Benchmarked at 2.18 tok/s (BS=1)
vs 2.1 baseline, a 4% improvement from a config flag change.
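
For context, the SwiGLU computation the kernel fuses (a plain PyTorch sketch; the NKI kernel performs the same math with transposed weight layouts and internal CTE/TKG dispatch):

```python
import torch
import torch.nn.functional as F

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP: down_proj(silu(gate_proj(x)) * up_proj(x))
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down
```
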
…docs

- Update nkilib clone instructions to use fork with NKI 0.3.0 fixes
  (feature/selection-bias-routing branch, includes tensor_reduce axis fix)
- Fix benchmark results: 2.27 tok/s verified on clean instance (was 2.18)
- Fix compile docs: single-process SPMD (not torchrun)
- Fix inference example: pad prompt to (seq_len - max_new_tokens)
- Add validated-on section with exact SDK/instance/date
- Update test default COMPILED_MODEL_PATH to match README
… path

Disable moe_fused_nki_kernel_enabled to use the non-fused MoEFusedTKG
fallback path (RMSNorm -> patched PyTorch router -> ExpertMLPsV2), matching
the approach that gives DeepSeek-V3 48.7 tok/s on the same hardware.

Task-018 profiling showed the fused NKI kernel is overhead-bound at TP=64
(per-core intermediate dim = 32; ~300x gap between the raw DMA+compute
estimate and observed latency). The non-fused path gives the compiler full
cross-layer visibility.

Changes:
- Set moe_fused_nki_kernel_enabled=False in GLM5InferenceConfig
- Remove _patch_fused_tkg_with_nkilib call from NeuronGLM5Model.init_model()
- Fix weight conversion guard: on_cpu check replaces kernel-enabled check
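
A sketch of the fallback forward path this enables (ExpertMLPsV2 and the patched router are named in this commit; the other module names are illustrative):

```python
import torch

def moe_block_forward(hidden, rmsnorm, router, expert_mlps_v2, shared_expert):
    # Non-fused path: each stage is a separate op the compiler can
    # schedule with full cross-layer visibility.
    normed = rmsnorm(hidden)                             # standalone RMSNorm
    topk_idx, weights = router(normed)                   # patched PyTorch router
    routed = expert_mlps_v2(normed, topk_idx, weights)   # blockwise expert matmul
    return routed + shared_expert(normed)                # shared expert module
```
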
Benchmark on trn2.48xlarge (SDK 2.29, TP=64, BS=1, FP8):
- Fused NKI mega-kernel: 2.27 tok/s, 440 ms TPOT (baseline)
- Non-fused ExpertMLPsV2: 1.89 tok/s, 529 ms TPOT (-17%)

The non-fused path produced coherent output (routing logic works
correctly via patched PyTorch router), but the compiler's global
optimization of the ExpertMLPsV2 blockwise matmul did not overcome
the overhead of separate RMSNorm + router + expert dispatch calls
at TP=64 with per-core intermediate dim=32.

The DeepSeek-V3 non-fused path achieves 48.7 tok/s, but DS-V3 has
different architecture parameters that may benefit more from the
compiler's cross-layer optimization.

This reverts commit 22dafc1.