
Add Sarvam-30B (sarvam_moe) contrib model #144

Open
jimburtoft wants to merge 2 commits into aws-neuron:main from jimburtoft:contrib/sarvam-30b

Conversation

@jimburtoft
Contributor

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

NeuronX Distributed Inference implementation of sarvamai/sarvam-30b, a 32B total / 2.4B active parameter Mixture-of-Experts model with 128 routed experts + 1 shared expert, top-6 sigmoid routing, and GQA (64Q/4KV heads).
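For intuition, here is a minimal PyTorch sketch of top-6 sigmoid routing with a selection-only expert bias, as described above. This illustrates the routing scheme rather than the PR's actual code; function and tensor names are invented, and using the bias only for expert selection (not for the gate values) is an assumption based on common sigmoid-router designs.

```python
import torch

def sigmoid_topk_route(hidden, router_weight, expert_bias, top_k=6):
    """Illustrative sigmoid + expert-bias routing over 128 experts.

    hidden:        [tokens, hidden_dim]
    router_weight: [num_experts, hidden_dim] (num_experts = 128 here)
    expert_bias:   [num_experts], used only for expert *selection*
    Returns top-k expert indices and their normalized sigmoid gates.
    """
    logits = hidden @ router_weight.T                # [tokens, 128]
    scores = torch.sigmoid(logits)                   # per-expert affinity in (0, 1)
    # The bias influences which experts are picked, not the gate values.
    _, topk_idx = torch.topk(scores + expert_bias, top_k, dim=-1)
    topk_scores = torch.gather(scores, -1, topk_idx)
    # Renormalize the selected gates so they sum to 1 per token.
    gates = topk_scores / topk_scores.sum(dim=-1, keepdim=True)
    return topk_idx, gates
```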

Key implementation details:

  • Shared expert extraction (Trinity pattern): the moe_block_tkg NKI kernel in SDK 2.29 doesn't support shared experts, so the shared expert is extracted as a standalone module on the decoder layer (see the sketch after this list)
  • Sigmoid routing patch: ISA fallback for sigmoid + expert-bias routing, since the NKI router kernel expects softmax (see the routing sketch above)
  • Dense first layer: layer 0 uses a standard MLP instead of MoE
  • NKI fused TKG: +21.5% token-generation throughput (100.0 vs. 82.3 tok/s)
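A minimal sketch of the extraction pattern referenced above, assuming the Trinity pattern means running the shared expert as an ordinary always-on MLP alongside the kernel-backed routed block and summing the two outputs; the class and attribute names below are hypothetical, not taken from modeling_sarvam_moe.py.

```python
import torch
import torch.nn as nn

class SharedExpertMoELayer(nn.Module):
    """Hypothetical shape of the decoder-layer MoE block after extraction.

    The routed path (128 experts, top-6) can then use the fused NKI
    moe_block_tkg kernel, which does not model shared experts itself.
    """
    def __init__(self, routed_moe: nn.Module, hidden_dim: int, shared_inter_dim: int):
        super().__init__()
        self.routed_moe = routed_moe  # kernel-backed routed-experts block
        # Standalone shared expert: a plain gated MLP applied to every token.
        self.shared_gate = nn.Linear(hidden_dim, shared_inter_dim, bias=False)
        self.shared_up = nn.Linear(hidden_dim, shared_inter_dim, bias=False)
        self.shared_down = nn.Linear(shared_inter_dim, hidden_dim, bias=False)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        routed_out = self.routed_moe(hidden)
        shared_out = self.shared_down(
            nn.functional.silu(self.shared_gate(hidden)) * self.shared_up(hidden)
        )
        return routed_out + shared_out
```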

Model Information

Model Name: Sarvam-30B (SarvamMoEForCausalLM)

Model Architecture: Decoder-only transformer with Mixture-of-Experts (128 routed + 1 shared, top-6 sigmoid routing, GQA)

Purpose: Text generation (multilingual, instruction-tuned)
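Since the GQA layout above (64 query / 4 KV heads) implies each KV head serves 16 query heads, here is a small shape sketch of the KV-head expansion; head_dim and seq_len are assumed values, purely for illustration.

```python
import torch

num_q_heads, num_kv_heads = 64, 4          # from the architecture above
head_dim, seq_len = 128, 10                # assumed values, illustration only
groups = num_q_heads // num_kv_heads       # 16 query heads share each KV head

k = torch.randn(1, num_kv_heads, seq_len, head_dim)  # [batch, kv_heads, seq, dim]
# Expand KV heads so attention can be computed per query head.
k_expanded = k.repeat_interleave(groups, dim=1)      # [1, 64, seq, dim]
assert k_expanded.shape[1] == num_q_heads
```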

Checklist

Required Components

  • Accuracy Test (test/integration/test_sarvam_moe.py)

    • Uses check_accuracy_logits_v2 with teacher forcing against CPU reference logits (a conceptual sketch follows this checklist)
    • 5/5 prompts PASS with 20 tokens validated per prompt
    • Tolerance: rtol=1.2, divergence_difference_tol=0.30 (wider than dense models due to 128-expert BF16 MoE routing)
  • README.md with the following sections:

    • Usage Example: Complete code example showing compilation and generation
    • Compatibility Matrix: Tested on trn2.3xlarge TP=4 with SDK 2.29
    • Example Checkpoints: Link to sarvamai/sarvam-30b on HuggingFace
    • Testing Instructions: Commands for running tests with environment variables
  • Source Code (src/)

    • modeling_sarvam_moe.py: Full NxDI model implementation (~1180 lines)
    • __init__.py: Exports all public classes

Optional Components

  • Unit Tests (CPU or Neuron-based)
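As referenced in the checklist, the accuracy test teacher-forces the reference tokens and compares per-step logits. check_accuracy_logits_v2 is an NxDI utility whose exact semantics aren't reproduced here; the stand-in below is only meant to convey what an rtol plus a divergence tolerance might check, and is an assumption-laden sketch rather than the library's logic.

```python
import torch

def teacher_forced_logit_check(neuron_logits: torch.Tensor,
                               cpu_logits: torch.Tensor,
                               rtol: float = 1.2,
                               div_tol: float = 0.30) -> bool:
    """Simplified stand-in for a check_accuracy_logits_v2-style comparison.

    Both inputs are [steps, vocab] logits produced for the same teacher-forced
    token sequence (Neuron device vs. CPU reference). A step "diverges" if its
    worst relative error exceeds rtol; the check passes if the fraction of
    diverging steps stays within div_tol.
    """
    rel_err = (neuron_logits - cpu_logits).abs() / cpu_logits.abs().clamp(min=1e-6)
    diverged = rel_err.max(dim=-1).values > rtol  # per-step divergence flag
    return diverged.float().mean().item() <= div_tol
```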

Folder Structure

/contrib/models/sarvam-30b/
  README.md
  /src
    __init__.py
    modeling_sarvam_moe.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_sarvam_moe.py

Testing

How did you test this change?

Tested on trn2.3xlarge (LNC=2, TP=4) with SDK 2.29. The model was compiled, loaded, and validated through the steps below (a rough compile-and-generate sketch follows the list):

  1. Logit validation: check_accuracy_logits_v2 with 5 diverse prompts × 20 tokens — 5/5 PASS
  2. Sequence length sweep: tested sequence lengths from 256 to 16384; all compile and run successfully
  3. Batch size sweep: BS=1,2,4,8 at seq_len=4096 and BS=1,2,4 at seq_len=8192
  4. NKI fused TKG: Validated +21.5% throughput improvement with shared expert extraction
  5. GPU comparison: Compared against 1x H100 (vLLM 0.19.1, BF16)
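For readers new to the NxDI contrib flow, the compile-then-generate path exercised above looks roughly like this. The class name and import path are assumptions based on this PR's folder layout, not verified against its README; compile() and load() follow the usual NxDI application pattern.

```python
# Illustrative NxDI-style flow; the class name and import path are assumptions.
from transformers import AutoTokenizer
from src.modeling_sarvam_moe import NeuronSarvamMoEForCausalLM  # hypothetical export

model_path = "sarvamai/sarvam-30b"         # requires trust_remote_code=True
compiled_path = "/tmp/sarvam-30b-compiled"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = NeuronSarvamMoEForCausalLM(model_path)  # TP=4, BF16, etc. via its config
model.compile(compiled_path)  # trace and compile NEFFs for trn2
model.load(compiled_path)     # load compiled artifacts onto Neuron cores
# Generation then proceeds through the HF adapter or vLLM (see below).
```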

Test Results:

| Test | Result |
| --- | --- |
| Logit validation (5 prompts × 20 tokens) | 5/5 PASS |
| Generation coherence | PASS |
| TTFT (seq_len=256, BS=1) | 152.8 ms |
| Throughput (seq_len=256, BS=1, NKI) | 102.2 tok/s |
| Max seq_len (compiles) | 16384 |
| Max batch size tested | 8 |

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (neuronx-cc 2.24.5133, NxDI 0.9.17334)
  • Instance Type(s): trn2.3xlarge (TP=4, LNC=2)
  • PyTorch Version: 2.9.0
  • Python Version: 3.12

Additional Information

  • Model requires trust_remote_code=True for HuggingFace loading
  • BF16 weights total ~62 GB, fitting on trn2.3xlarge TP=4 (~15.5 GB/core)
  • The implementation follows the Trinity contrib model pattern for shared expert extraction
  • Sigmoid routing uses the ISA fallback (router_topk_nki_kernel_enabled=False), since the NKI router kernel expects softmax; see the config sketch below
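The routing fallback noted above is a plain config knob. A sketch of where it would sit follows; the flag name is taken from this PR's text, while the surrounding fields are illustrative NeuronConfig-style settings rather than the contrib source.

```python
# Flag name from this PR; surrounding fields are illustrative, not copied code.
neuron_config_kwargs = dict(
    tp_degree=4,                           # trn2.3xlarge with LNC=2
    torch_dtype="bfloat16",                # matches the ~62 GB BF16 checkpoint
    router_topk_nki_kernel_enabled=False,  # ISA fallback: NKI router assumes softmax
)
```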

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

Commit notes

  • Set neuron_config.fused_qkv = True so the NxDI GQA preshard_hook correctly handles the fused query_key_value checkpoint layout
  • Set blockwise_matmul_config.use_shard_on_intermediate_dynamic_while = True to work around the missing shard_hidden NKI kernel in SDK 2.29
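Both commit notes are one-line config changes; they are sketched below with stand-in objects so the snippet runs. The attribute names come from the commit notes above, while the enclosing NxDI config classes are not reproduced here.

```python
from types import SimpleNamespace

# Stand-ins for NxDI's NeuronConfig / blockwise matmul config, shape only.
neuron_config = SimpleNamespace(blockwise_matmul_config=SimpleNamespace())
neuron_config.fused_qkv = True  # fused query_key_value layout for the GQA preshard_hook
neuron_config.blockwise_matmul_config.use_shard_on_intermediate_dynamic_while = True
# Works around the missing shard_hidden NKI kernel in SDK 2.29.
```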
