
Add Sarvam-30B (sarvam_moe) contrib model #144

Open
jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/sarvam-30b

Conversation

@jimburtoft
Contributor

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

NeuronX Distributed Inference implementation of sarvamai/sarvam-30b, a 32B total / 2.4B active parameter Mixture-of-Experts model with 128 routed experts + 1 shared expert, top-6 sigmoid routing, and GQA (64Q/4KV heads).
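For intuition, here is a minimal PyTorch sketch of top-6 sigmoid routing with a selection-only expert bias, as described above. This illustrates the routing scheme rather than the PR's actual code; function and tensor names are invented, and using the bias only for expert selection (not for the gate values) is an assumption based on common sigmoid-router designs.

```python
import torch

def sigmoid_topk_route(hidden, router_weight, expert_bias, top_k=6):
    """Illustrative sigmoid + expert-bias routing over 128 experts.

    hidden:        [tokens, hidden_dim]
    router_weight: [num_experts, hidden_dim] (num_experts = 128 here)
    expert_bias:   [num_experts], used only for expert *selection*
    Returns top-k expert indices and their normalized sigmoid gates.
    """
    logits = hidden @ router_weight.T                # [tokens, 128]
    scores = torch.sigmoid(logits)                   # per-expert affinity in (0, 1)
    # The bias influences which experts are picked, not the gate values.
    _, topk_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    topk_scores = torch.gather(scores, -1, topk_idx)
    # Renormalize the selected gates so they sum to 1 per token.
    gates = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, gates
```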

Key implementation details:

  • Shared expert extraction (Trinity pattern): the moe_block_tkg NKI kernel in SDK 2.29 doesn't support shared experts, so the shared expert is extracted as a standalone module on the decoder layer (see the sketch after this list)
  • Sigmoid routing patch: ISA fallback for sigmoid + expert-bias routing, since the NKI router kernel expects softmax (see the routing sketch above)
  • Dense first layer: layer 0 uses a standard MLP instead of MoE
  • NKI fused TKG: +21.5% token-generation throughput (100.0 vs. 82.3 tok/s)
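A minimal sketch of the extraction pattern referenced above, assuming the Trinity pattern means running the shared expert as an ordinary always-on MLP alongside the kernel-backed routed block and summing the two outputs; the class and attribute names below are hypothetical, not taken from modeling_sarvam_moe.py.

```python
import torch
import torch.nn as nn

class SharedExpertMoELayer(nn.Module):
    """Hypothetical shape of the decoder-layer MoE block after extraction.

    The routed path (128 experts, top-6) can then use the fused NKI
    moe_block_tkg kernel, which does not model shared experts itself.
    """
    def __init__(self, routed_moe: nn.Module, hidden_dim: int, shared_inter_dim: int):
        super().__init__()
        self.routed_moe = routed_moe  # kernel-backed routed-experts block
        # Standalone shared expert: a plain gated MLP applied to every token.
        self.shared_gate = nn.Linear(hidden_dim, shared_inter_dim, bias=False)
        self.shared_up = nn.Linear(hidden_dim, shared_inter_dim, bias=False)
        self.shared_down = nn.Linear(shared_inter_dim, hidden_dim, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        routed_out = self.routed_moe(hidden)
        shared_out = self.shared_down(
            nn.functional.silu(self.shared_gate(hidden)) * self.shared_up(hidden)
        )
        return routed_out + shared_out
```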

Model Information

Model Name: Sarvam-30B (SarvamMoEForCausalLM)

Model Architecture: Decoder-only transformer with Mixture-of-Experts (128 routed + 1 shared, top-6 sigmoid routing, GQA)

Purpose: Text generation (multilingual, instruction-tuned)
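Since the GQA layout above (64 query / 4 KV heads) implies each KV head serves 16 query heads, here is a small shape sketch of the KV-head expansion; head_dim and seq_len are assumed values, purely for illustration.

```python
import torch

num_q_heads, num_kv_heads = 64, 4          # from the architecture above
head_dim, seq_len = 128, 10                # assumed values, illustration only
groups = num_q_heads // num_kv_heads       # 16 query heads share each KV head

k = torch.randn(1, num_kv_heads, seq_len, head_dim)  # [batch, kv_heads, seq, dim]
# Expand KV heads so attention can be computed per query head.
k_expanded = k.repeat_interleave(groups, dim=1)      # [1, 64, seq, dim]
assert k_expanded.shape[1] == num_q_heads
```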

Checklist

Required Components

  • Accuracy Test (test/integration/test_sarvam_moe.py)

    • Uses check_accuracy_logits_v2 with teacher forcing against CPU reference logits (a conceptual sketch follows this checklist)
    • 5/5 prompts PASS with 20 tokens validated per prompt
    • Tolerance: rtol=1.2, divergence_difference_tol=0.30 (wider than dense models due to 128-expert BF16 MoE routing)
  • README.md with the following sections:

    • Usage Example: Complete code example showing compilation and generation
    • Compatibility Matrix: Tested on trn2.3xlarge TP=4 with SDK 2.29
    • Example Checkpoints: Link to sarvamai/sarvam-30b on HuggingFace
    • Testing Instructions: Commands for running tests with environment variables
  • Source Code (src/)

    • modeling_sarvam_moe.py: Full NxDI model implementation (~1180 lines)
    • __init__.py: Exports all public classes

Optional Components

  • Unit Tests (CPU or Neuron-based)
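As referenced in the checklist, the accuracy test teacher-forces the reference tokens and compares per-step logits. check_accuracy_logits_v2 is an NxDI utility whose exact semantics aren't reproduced here; the stand-in below is only meant to convey what an rtol plus a divergence tolerance might check, and is an assumption-laden sketch rather than the library's logic.

```python
import torch

def teacher_forced_logit_check(neuron_logits: torch.Tensor,
                               cpu_logits: torch.Tensor,
                               rtol: float = 1.2,
                               div_tol: float = 0.30) -> bool:
    """Simplified stand-in for a check_accuracy_logits_v2-style comparison.

    Both inputs are [steps, vocab] logits produced for the same teacher-forced
    token sequence (Neuron device vs. CPU reference). A step "diverges" if its
    worst relative error exceeds rtol; the check passes if the fraction of
    diverging steps stays within div_tol.
    """
    rel_err = (neuron_logits - cpu_logits).abs() / cpu_logits.abs().clamp(min=1e-6)
    diverged = rel_err.max(dim=-1).values > rtol  # per-step divergence flag
    return diverged.float().mean().item() <= div_tol
```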

Folder Structure

/contrib/models/sarvam-30b/
  README.md
  /src
    __init__.py
    modeling_sarvam_moe.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_sarvam_moe.py

Testing

How did you test this change?

Tested on trn2.3xlarge (LNC=2, TP=4) with SDK 2.29. The model was compiled, loaded, and validated through the steps below (a rough compile-and-generate sketch follows the list):

  1. Logit validation: check_accuracy_logits_v2 with 5 diverse prompts × 20 tokens — 5/5 PASS
  2. Sequence length sweep: tested sequence lengths from 256 to 16384; all compile and run successfully
  3. Batch size sweep: BS=1,2,4,8 at seq_len=4096 and BS=1,2,4 at seq_len=8192
  4. NKI fused TKG: Validated +21.5% throughput improvement with shared expert extraction
  5. GPU comparison: Compared against 1x H100 (vLLM 0.19.1, BF16)
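For readers new to the NxDI contrib flow, the compile-then-generate path exercised above looks roughly like this. The class name and import path are assumptions based on this PR's folder layout, not verified against its README; compile() and load() follow the usual NxDI application pattern.

```python
# Illustrative NxDI-style flow; the class name and import path are assumptions.
from transformers import AutoTokenizer
from src.modeling_sarvam_moe import NeuronSarvamMoEForCausalLM  # hypothetical export

model_path = "sarvamai/sarvam-30b"         # requires trust_remote_code=True
compiled_path = "/tmp/sarvam-30b-compiled"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = NeuronSarvamMoEForCausalLM(model_path)  # TP=4, BF16, etc. via its config
model.compile(compiled_path)  # trace and compile NEFFs for trn2
model.load(compiled_path)     # load compiled artifacts onto Neuron cores
# Generation then proceeds through the HF adapter or vLLM (see below).
```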

Test Results:

| Test | Result |
| --- | --- |
| Logit validation (5 prompts × 20 tokens) | 5/5 PASS |
| Generation coherence | PASS |
| TTFT (seq_len=256, BS=1) | 152.8 ms |
| Throughput (seq_len=256, BS=1, NKI) | 102.2 tok/s |
| Max seq_len (compiles) | 16384 |
| Max batch size tested | 8 |

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (neuronx-cc 2.24.5133, NxDI 0.9.17334)
  • Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

  • Model requires trust_remote_code=True for HuggingFace loading
  • BF16 weights total ~62 GB, fitting on trn2.3xlarge TP=4 (~15.5 GB/core)
  • The implementation follows the Trinity contrib model pattern for shared expert extraction
  • Sigmoid routing uses the ISA fallback (router_topk_nki_kernel_enabled=False), since the NKI router kernel expects softmax; see the config sketch below
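The routing fallback noted above is a plain config knob. A sketch of where it would sit follows; the flag name is taken from this PR's text, while the surrounding fields are illustrative NeuronConfig-style settings rather than the contrib source.

```python
# Flag name from this PR; surrounding fields are illustrative, not copied code.
neuron_config_kwargs = dict(
    tp_degree=4,                           # trn2.3xlarge with LNC=2
    torch_dtype="bfloat16",                # matches the ~62 GB BF16 checkpoint
    router_topk_nki_kernel_enabled=False,  # ISA fallback: NKI router assumes softmax
)
```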

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Commit notes

  • Set neuron_config.fused_qkv = True so the NxDI GQA preshard_hook correctly handles the fused query_key_value checkpoint layout
  • Set blockwise_matmul_config.use_shard_on_intermediate_dynamic_while = True to work around the missing shard_hidden NKI kernel in SDK 2.29
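Both commit notes are one-line config changes; they are sketched below with stand-in objects so the snippet runs. The attribute names come from the commit notes above, while the enclosing NxDI config classes are not reproduced here.

```python
from types import SimpleNamespace

# Stand-ins for NxDI's NeuronConfig / blockwise matmul config, shape only.
neuron_config = SimpleNamespace(blockwise_matmul_config=SimpleNamespace())
neuron_config.fused_qkv = True  # fused query_key_value layout for the GQA preshard_hook
neuron_config.blockwise_matmul_config.use_shard_on_intermediate_dynamic_while = True
# Works around the missing shard_hidden NKI kernel in SDK 2.29.
```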
