Add GLM-5 (754B MoE) contrib model for trn2.48xlarge #143
Open
jimburtoft wants to merge 8 commits into aws-neuron:main from
Conversation
GLM-5 (zai-org/GLM-5-FP8) is a 754B parameter MoE model with 40B active parameters per token, 256 routed experts (top-8), MLA attention, and sigmoid routing with selection_bias. This contrib adds NxDI support targeting trn2.48xlarge (TP=64, FP8 experts, BF16 attention/dense layers).

Benchmarks on trn2.48xlarge (SDK 2.29):
- BS=1: 2.1 tok/s
- BS=4: 12.3 tok/s
- BS=8: 23.4 tok/s

Includes modeling code (2541 lines), a README with usage, benchmarks, and compatibility notes, and integration tests with logit validation.
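The sigmoid-routing-with-selection_bias scheme mentioned above can be sketched in plain Python. This is a minimal sketch of the DeepSeek-V3-style router the PR says GLM-5 mirrors: the bias influences which experts are selected, while the gating weights are assumed to come from the unbiased sigmoid scores, normalized and scaled. Function name and the selection-vs-weighting split are illustrative assumptions, not the repo's actual API.

```python
import math

def sigmoid_route(logits, selection_bias, top_k=8, routed_scaling_factor=2.5):
    """Sketch of sigmoid routing with a selection bias (DeepSeek-V3 style).

    The bias affects WHICH experts are picked; the gating weights are
    taken from the unbiased sigmoid scores (assumption), normalized over
    the selected experts and scaled by routed_scaling_factor.
    """
    scores = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    biased = [s + b for s, b in zip(scores, selection_bias)]
    # select top_k expert indices by the biased score
    topk = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)[:top_k]
    # normalize the unbiased scores of the selected experts, then scale
    denom = sum(scores[i] for i in topk)
    return {i: routed_scaling_factor * scores[i] / denom for i in topk}
```

Note that the weights over the selected experts sum to routed_scaling_factor (2.5) rather than 1, which is why the downstream combine must not re-normalize.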
Integrate the nkilib SwiGLU MLP kernel into GLM5DenseMLP for the 3 dense decoder layers. Weights are transposed at init, and the kernel handles both CTE and TKG dispatch internally. Benchmarked at 2.18 tok/s (BS=1) vs. the 2.1 baseline, a ~4% improvement from what amounts to a config flag change.
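For reference, the computation the SwiGLU kernel fuses is down( silu(x @ W_gate) * (x @ W_up) ). A minimal pure-Python sketch of that reference math (shapes and names are illustrative; this is not the nkilib kernel interface):

```python
import math

def silu(x):
    """SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Reference SwiGLU MLP on plain lists: down(silu(x@Wg) * (x@Wu)).

    Illustrative only; the nkilib kernel fuses these matmuls on-device,
    with weights pre-transposed at init as the commit describes.
    """
    def matvec(w, v):  # w: list of rows, each row has len(v) entries
        return [sum(wi * vi for wi, vi in zip(row, v)) for row in w]
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)
```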
…docs

- Update nkilib clone instructions to use the fork with NKI 0.3.0 fixes (feature/selection-bias-routing branch, includes the tensor_reduce axis fix)
- Fix benchmark results: 2.27 tok/s verified on a clean instance (was 2.18)
- Fix compile docs: single-process SPMD (not torchrun)
- Fix inference example: pad the prompt to (seq_len - max_new_tokens)
- Add a validated-on section with exact SDK/instance/date
- Update the test's default COMPILED_MODEL_PATH to match the README
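The prompt-padding fix above reserves room in the compiled sequence length for generation. A hedged sketch of that bookkeeping, assuming left-padding and a flat list of token ids (the helper name and padding side are illustrative, not the repo's exact code):

```python
def pad_prompt(token_ids, seq_len, max_new_tokens, pad_id=0):
    """Left-pad a prompt to (seq_len - max_new_tokens) so the compiled
    bucket has room for max_new_tokens of generation.

    Assumption: left-padding with pad_id; the actual example may differ.
    """
    target = seq_len - max_new_tokens
    if len(token_ids) > target:
        raise ValueError(
            f"prompt length {len(token_ids)} exceeds budget {target}"
        )
    return [pad_id] * (target - len(token_ids)) + list(token_ids)
```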
… path

Disable moe_fused_nki_kernel_enabled to use the non-fused MoEFusedTKG fallback path (RMSNorm -> patched PyTorch router -> ExpertMLPsV2), matching the approach that gives DeepSeek-V3 48.7 tok/s on the same hardware. Task-018 profiling showed the fused NKI kernel is overhead-bound at TP=64 (per-core intermediate dim = 32; a 300x gap between raw DMA+compute and observed latency). The non-fused path gives the compiler full cross-layer visibility.

Changes:
- Set moe_fused_nki_kernel_enabled=False in GLM5InferenceConfig
- Remove the _patch_fused_tkg_with_nkilib call from NeuronGLM5Model.init_model()
- Fix the weight conversion guard: an on_cpu check replaces the kernel-enabled check
Benchmark on trn2.48xlarge (SDK 2.29, TP=64, BS=1, FP8):
- Fused NKI mega-kernel: 2.27 tok/s, 440 ms TPOT (baseline)
- Non-fused ExpertMLPsV2: 1.89 tok/s, 529 ms TPOT (-17%)

The non-fused path produced coherent output (the routing logic works correctly via the patched PyTorch router), but the compiler's global optimization of the ExpertMLPsV2 blockwise matmul did not overcome the overhead of separate RMSNorm + router + expert dispatch calls at TP=64 with a per-core intermediate dim of 32. The DeepSeek-V3 non-fused path achieves 48.7 tok/s, but DS-V3 has different architecture parameters that may benefit more from the compiler's cross-layer optimization.

This reverts commit 22dafc1.
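As a sanity check on the figures above: at BS=1 decode, TPOT and throughput are reciprocals, so the reported pairs are internally consistent (2.27 tok/s implies roughly 440 ms per token, 1.89 tok/s implies roughly 529 ms).

```python
def tpot_ms(tokens_per_s):
    """Time-per-output-token (ms) implied by BS=1 decode throughput."""
    return 1000.0 / tokens_per_s

# 2.27 tok/s -> ~440 ms TPOT (fused); 1.89 tok/s -> ~529 ms TPOT (non-fused)
```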
Summary
This PR adds GLM-5 (zai-org/GLM-5-FP8), a 754B parameter Mixture-of-Experts model with 40B active parameters per token.

Model Architecture
GLM-5 is architecturally identical to DeepSeek-V3 with:
- 256 routed experts with top-8 selection and MLA attention
- sigmoid routing with e_score_correction_bias + routed_scaling_factor=2.5

Benchmark Results (trn2.48xlarge, SDK 2.29)
- BS=1: 2.1 tok/s
- BS=4: 12.3 tok/s
- BS=8: 23.4 tok/s
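A quick derivation of per-sequence throughput from the PR's aggregate benchmark figures (BS=1: 2.1, BS=4: 12.3, BS=8: 23.4 tok/s), useful for judging how well batching amortizes the per-step cost:

```python
def per_sequence(bench):
    """Per-sequence tok/s from {batch_size: aggregate tok/s}."""
    return {bs: tps / bs for bs, tps in bench.items()}

# Figures reported in this PR (trn2.48xlarge, SDK 2.29)
bench = {1: 2.1, 4: 12.3, 8: 23.4}
```

Per-sequence throughput stays near 3 tok/s at BS=4 and BS=8, so aggregate throughput scales close to linearly over that range.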
Key Implementation Details
- Fused TKG router patched (_patch_fused_tkg_with_nkilib) to inject GLM-5's sigmoid routing with selection_bias and routed_scaling_factor
- local_ranks_size=64, loaded in single-process mode (not torchrun)

Dependencies
- Patched nkilib routing files (router_topk.py, router_topk_torch.py, moe_block_tkg.py) from the feature/selection-bias-routing fork

Test Plan
Integration tests in test/integration/test_model.py:
- test_model_loads: smoke test for model loading
- test_model_generates: verifies non-empty token generation
- test_output_coherence: anti-repetition check on generated text
- test_logit_validation: validates logit distributions (no NaN/Inf, reasonable entropy)

Requires trn2.48xlarge with a pre-compiled model and pre-sharded weights.
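The logit-validation idea (finite values, "reasonable" entropy) can be sketched in pure Python. The repo's actual test presumably operates on torch tensors; the entropy thresholds here are illustrative assumptions, not the test's real bounds:

```python
import math

def validate_logits(logits):
    """Check a logit vector is finite and its softmax entropy is neither
    a degenerate spike nor pure uniform noise. Thresholds (1% / 99% of
    max entropy) are illustrative, not the repo's actual values.
    """
    if any(math.isnan(x) or math.isinf(x) for x in logits):
        return False
    m = max(logits)                      # numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    max_entropy = math.log(len(logits))
    return 0.01 * max_entropy < entropy < 0.99 * max_entropy
```

A perfectly uniform distribution hits the maximum entropy log(vocab_size) and is rejected by the upper bound, which is the "not just noise" half of the check.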