Add GLM-5 (754B MoE) contrib model for trn2.48xlarge#143

Open
jimburtoft wants to merge 8 commits into aws-neuron:main from jimburtoft:contrib/GLM-5

Conversation

@jimburtoft
Contributor

Summary

  • Adds NxDI contrib model for GLM-5 (zai-org/GLM-5-FP8), a 754B parameter Mixture-of-Experts model with 40B active parameters per token
  • Targets trn2.48xlarge with TP=64, FP8 expert weights, BF16 attention/dense layers, SDK 2.29
  • Includes modeling code (2541 lines), README with usage/benchmarks/compatibility matrix, and integration tests with logit validation

Model Architecture

GLM-5 is architecturally identical to DeepSeek-V3 with:

  • 78 layers (3 dense + 75 MoE), 256 routed experts, top-8 sigmoid routing
  • MLA (Multi-head Latent Attention) with compressed KV cache (576 values/token)
  • e_score_correction_bias for expert selection, with routed_scaling_factor=2.5 applied to the routing weights (see the sketch after this list)
  • FP8 blockwise quantized checkpoint (~705 GB)
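
For reference, a minimal PyTorch sketch of the sigmoid routing described above (function and argument names are illustrative, not the PR's code; the actual nkilib/NKI implementation differs in layout and fusion):

```python
import torch

def glm5_route(hidden, router_weight, e_score_correction_bias,
               top_k=8, routed_scaling_factor=2.5):
    # Sigmoid scores over the routed experts (no softmax).
    scores = torch.sigmoid(hidden @ router_weight.T)        # [tokens, 256]
    # The correction bias influences which experts are selected...
    _, topk_idx = torch.topk(scores + e_score_correction_bias, top_k, dim=-1)
    # ...but the mixing weights come from the unbiased scores.
    topk_scores = scores.gather(-1, topk_idx)
    weights = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, weights * routed_scaling_factor
```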

Benchmark Results (trn2.48xlarge, SDK 2.29)

| Batch Size | CTE seq_len | Total tok/s | Per-req tok/s |
|-----------:|------------:|------------:|--------------:|
| 1          | 2048        | 2.1         | 2.1           |
| 4          | 512         | 12.3        | 3.1           |
| 8          | 256         | 23.4        | 2.9           |

Key Implementation Details

  • Fused MoE NKI kernel: Uses nkilib's fused TKG mega-kernel with a runtime patch (_patch_fused_tkg_with_nkilib) to inject GLM-5's sigmoid routing with selection_bias and routed_scaling_factor
  • FP8 handling: Blockwise FP8 dequantization for non-expert weights; per-tensor symmetric FP8 re-quantization for expert weights with NaN clamping (Neuron FP8 E4M3 max = 240; see the sketch after this list)
  • Single-process SPMD: Compiled with local_ranks_size=64, loaded in single-process mode (not torchrun)
  • DSA (Sparse Attention): Architecture defined but uses full-attention fallback
  • Shared expert: Separate module outside the fused NKI kernel
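
As an illustration of the expert-weight path above, per-tensor symmetric FP8 re-quantization with NaN clamping could look like the following (a sketch, not the PR's code; torch.float8_e4m3fn stands in for Neuron's FP8 dtype, whose usable max is 240 rather than OCP's 448):

```python
import torch

NEURON_FP8_MAX = 240.0  # Trainium E4M3 dynamic range (OCP E4M3FN reaches 448)

def per_tensor_fp8_requant(w: torch.Tensor):
    """Symmetric per-tensor FP8 re-quantization with NaN clamping."""
    w = torch.nan_to_num(w, nan=0.0)                  # drop NaNs from dequant
    scale = w.abs().amax().clamp(min=1e-12) / NEURON_FP8_MAX
    q = (w / scale).clamp(-NEURON_FP8_MAX, NEURON_FP8_MAX)
    return q.to(torch.float8_e4m3fn), scale           # matmul applies q * scale
```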

Dependencies

  • Neuron SDK 2.29 (NxD Inference 0.9.17334+, neuronx-cc 2.24, NKI 0.3.0)
  • Modified nkilib with GLM-5 routing support (3 files: router_topk.py, router_topk_torch.py, moe_block_tkg.py)

Test Plan

Integration tests in test/integration/test_model.py:

  1. test_model_loads — Smoke test for model loading
  2. test_model_generates — Verifies non-empty token generation
  3. test_output_coherence — Anti-repetition check on generated text
  4. test_logit_validation — Validates logit distributions (no NaN/Inf, reasonable entropy; see the sketch below)
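
A minimal version of that logit check (illustrative only; the repo's test may differ):

```python
import torch

def check_logits(logits: torch.Tensor, min_entropy=0.05, max_entropy=12.0):
    """Sanity-check next-token logits: finite values, plausible entropy."""
    assert torch.isfinite(logits).all(), "logits contain NaN/Inf"
    probs = torch.softmax(logits.float(), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    assert (entropy > min_entropy).all() and (entropy < max_entropy).all()
```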

Requires trn2.48xlarge with pre-compiled model and pre-sharded weights.

GLM-5 (zai-org/GLM-5-FP8) is a 754B parameter MoE model with 40B active
per token, 256 routed experts (top-8), MLA attention, and sigmoid routing
with selection_bias. This contrib adds NxDI support targeting trn2.48xlarge
(TP=64, FP8 experts, BF16 attention/dense layers).

Benchmarks on trn2.48xlarge (SDK 2.29):
- BS=1: 2.1 tok/s, BS=4: 12.3 tok/s, BS=8: 23.4 tok/s

Includes modeling code (2541 lines), README with usage/benchmarks/compatibility,
and integration tests with logit validation.

Integrate nkilib SwiGLU MLP kernel into GLM5DenseMLP for the 3 dense
decoder layers. Weights are transposed at init and the kernel handles
both CTE and TKG dispatch internally. Benchmarked at 2.18 tok/s (BS=1)
vs 2.1 baseline, a 4% improvement from a config flag change.
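
For context, the SwiGLU computation the kernel fuses (a plain PyTorch sketch; the NKI kernel performs the same math with transposed weight layouts and internal CTE/TKG dispatch):

```python
import torch
import torch.nn.functional as F

def swiglu_mlp(x, w_gate, w_up, w_down):
    # SwiGLU MLP: down_proj(silu(gate_proj(x)) * up_proj(x))
    return (F.silu(x @ w_gate) * (x @ w_up)) @ w_down
```
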
…docs

- Update nkilib clone instructions to use fork with NKI 0.3.0 fixes
  (feature/selection-bias-routing branch, includes tensor_reduce axis fix)
- Fix benchmark results: 2.27 tok/s verified on clean instance (was 2.18)
- Fix compile docs: single-process SPMD (not torchrun)
- Fix inference example: pad prompt to (seq_len - max_new_tokens)
- Add validated-on section with exact SDK/instance/date
- Update test default COMPILED_MODEL_PATH to match README
… path

Disable moe_fused_nki_kernel_enabled to use the non-fused MoEFusedTKG
fallback path (RMSNorm -> patched PyTorch router -> ExpertMLPsV2), matching
the approach that gives DeepSeek-V3 48.7 tok/s on the same hardware.

Task-018 profiling showed the fused NKI kernel is overhead-bound at TP=64
(per-core intermediate dim = 32; ~300x gap between the raw DMA+compute
estimate and observed latency). The non-fused path gives the compiler full
cross-layer visibility.

Changes:
- Set moe_fused_nki_kernel_enabled=False in GLM5InferenceConfig
- Remove _patch_fused_tkg_with_nkilib call from NeuronGLM5Model.init_model()
- Fix weight conversion guard: on_cpu check replaces kernel-enabled check
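
A sketch of the fallback forward path this enables (ExpertMLPsV2 and the patched router are named in this commit; the other module names are illustrative):

```python
import torch

def moe_block_forward(hidden, rmsnorm, router, expert_mlps_v2, shared_expert):
    # Non-fused path: each stage is a separate op the compiler can
    # schedule with full cross-layer visibility.
    normed = rmsnorm(hidden)                             # standalone RMSNorm
    topk_idx, weights = router(normed)                   # patched PyTorch router
    routed = expert_mlps_v2(normed, topk_idx, weights)   # blockwise expert matmul
    return routed + shared_expert(normed)                # shared expert module
```
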
Benchmark on trn2.48xlarge (SDK 2.29, TP=64, BS=1, FP8):
- Fused NKI mega-kernel: 2.27 tok/s, 440 ms TPOT (baseline)
- Non-fused ExpertMLPsV2: 1.89 tok/s, 529 ms TPOT (-17%)

The non-fused path produced coherent output (routing logic works
correctly via patched PyTorch router), but the compiler's global
optimization of the ExpertMLPsV2 blockwise matmul did not overcome
the overhead of separate RMSNorm + router + expert dispatch calls
at TP=64 with per-core intermediate dim=32.

The DeepSeek-V3 non-fused path achieves 48.7 tok/s, but DS-V3 has
different architecture parameters that may benefit more from the
compiler's cross-layer optimization.

This reverts commit 22dafc1.