diff --git a/contrib/models/Mistral-Small-4-119B-2603/README.md b/contrib/models/Mistral-Small-4-119B-2603/README.md new file mode 100644 index 00000000..0380a087 --- /dev/null +++ b/contrib/models/Mistral-Small-4-119B-2603/README.md @@ -0,0 +1,146 @@ +# Contrib Model: Mistral-Small-4-119B-2603 + +Custom NeuronX Distributed Inference implementation for Mistral-Small-4-119B on trn2.48xlarge. + +## Model Information + +- **HuggingFace ID:** `mistralai/Mistral-Small-4-119B-2603` +- **Model Type:** MoE + Multi-Latent Attention (MLA), text decoder extracted from multimodal +- **Parameters:** 119B total, ~4B active per token (4/128 experts) +- **License:** Check HuggingFace model card + +## Architecture Details + +This model uses DeepSeek-V3 architecture with: +- **Multi-Latent Attention (MLA)**: kv_lora_rank=256, q_lora_rank=1024, qk_rope_head_dim=64, v_head_dim=128 +- **Mixture of Experts**: 128 routed experts + 1 shared expert, top-4 routing (sigmoid) +- 36 transformer layers, hidden_size=4096 +- 32 attention heads, 32 KV heads +- Compressed KV cache: 320-dim per head (256 latent + 64 rope) +- Original: FP8 multimodal model, dequantized to BF16 text-only (238 GB) + +## Performance (SDK 2.29, vLLM 0.16.0, trn2.48xlarge, TP=16) + +| Workload | tok/s (conc=1) | TPOT | TTFT | +|----------|:--------------:|:----:|:----:| +| short-short (128/128) | **74.5** | 13.5ms | 307ms | +| short-long (128/512) | **68.5** | 14.6ms | 308ms | +| long-short (2048/128) | **67.9** | 14.7ms | 474ms | +| long-long (2048/512) | **63.0** | 15.9ms | 474ms | + +**GPU baseline**: BLOCKED (transformers 5.x `mistral4` config type incompatibility) + +## Bug Fix: MLA Attention + +**CRITICAL**: NxDI's stock `DeepseekV3Attention` has a bug that affects this model. + +The `out_absorb` slicing in `modeling_deepseek.py` line 230 uses `wkv_b[:, self.v_head_dim:, :]`. +This is only correct when `v_head_dim == qk_nope_head_dim` (as in stock DeepSeek V3 where both = 128). +For Mistral-Small-4 (`v_head_dim=128, qk_nope_head_dim=64`), it causes a shape mismatch crash. + +Without this fix, the model either crashes or produces garbage output with ~10 tok/s. +With the fix, performance is **74.5 tok/s** (6.9x improvement). + +## Setup Steps + +### 1. Download FP8 Model + +```bash +source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate + +# Download (113 GB FP8, skip consolidated format) +huggingface-cli download mistralai/Mistral-Small-4-119B-2603 \ + --token YOUR_HF_TOKEN \ + --local-dir /mnt/nvme/models/Mistral-Small-4-119B-2603 \ + --include "model-*.safetensors" "*.json" "tokenizer*" +``` + +### 2. Extract Text Model + Dequantize FP8→BF16 + +```bash +# Edit src/extract_text_model.py to set correct SRC_DIR and DST_DIR +python src/extract_text_model.py +# Output: 238 GB BF16, 543 tensors, 35 shards (~5 minutes) +``` + +### 3. Fix Tokenizer + +```bash +python src/fix_tokenizer.py /mnt/nvme/models/Mistral-Small-4-119B-text-only +``` + +### 4. Apply Patches + +```bash +# Fix MLA attention bug +python src/fix_mla_attention.py + +# Fix MoE torch_block_wise forwarding (SDK 2.29) +python src/patch_moe.py + +# Register custom model class +python src/register_model.py + +# Install custom model class +SITE_PKGS=$(python -c "import neuronx_distributed_inference; print(neuronx_distributed_inference.__path__[0])") +cp src/modeling_deepseekv3_full.py $SITE_PKGS/models/deepseek/modeling_deepseekv3_full.py +``` + +### 5. 
Start vLLM Server
+
+```bash
+python -m vllm.entrypoints.openai.api_server \
+    --model /mnt/nvme/models/Mistral-Small-4-119B-text-only \
+    --tensor-parallel-size 16 \
+    --max-model-len 4096 \
+    --max-num-seqs 4 \
+    --no-enable-prefix-caching \
+    --trust-remote-code \
+    --additional-config '{"override_neuron_config": {"blockwise_matmul_config": {"use_torch_block_wise": true}}}' \
+    --port 8000
+```
+
+Compilation takes ~28 minutes.
+
+## Instance Requirements
+
+| Resource | Minimum |
+|----------|---------|
+| Instance type | trn2.48xlarge |
+| TP degree | 16 |
+| LNC | 2 (default) |
+| HBM | ~238 GB (model) + KV cache |
+| Storage | 1.5 TB (NVMe required for model + extraction workspace) |
+| EBS | 300 GB |
+| Compile time | ~28 minutes |
+
+## Files
+
+| File | Purpose |
+|------|---------|
+| `src/modeling_deepseekv3_full.py` | Custom 429-line NeuronDeepseekV3ForCausalLM model class |
+| `src/extract_text_model.py` | FP8→BF16 dequantization + text extraction from multimodal |
+| `src/fix_mla_attention.py` | Patch for MLA out_absorb slicing bug |
+| `src/fix_tokenizer.py` | Tokenizer compatibility fix |
+| `src/patch_moe.py` | MoE torch_block_wise forwarding (SDK 2.29) |
+| `src/register_model.py` | Register the `deepseekv3` model type in NxDI constants |
+
+## Compatibility
+
+| Instance/SDK | SDK 2.29 | SDK 2.28 |
+|--------------|----------|----------|
+| trn2.48xlarge (TP=16) | ✅ Working (all patches) | ✅ Working (no moe patch needed) |
+| trn2.48xlarge (TP=32) | ✅ Working but slower | ✅ Working but slower |
+
+## Known Issues
+
+1. **HuggingFace token required**: Model is gated, needs `--token` for download
+2. **Model name changed**: Was `Mistral-Small-4-119B-Instruct-2507`, now `-2603`
+3. **NVMe required**: 238 GB model + 113 GB source = 351 GB minimum working space
+4. **`--trust-remote-code` needed**: For tokenizer loading compatibility
+
+## Maintainer
+
+Agent Andretti - Mistral Family Benchmark Project
+
+**Last Updated:** 2026-04-20
diff --git a/contrib/models/Mistral-Small-4-119B-2603/src/extract_text_model.py b/contrib/models/Mistral-Small-4-119B-2603/src/extract_text_model.py
new file mode 100644
index 00000000..9f226314
--- /dev/null
+++ b/contrib/models/Mistral-Small-4-119B-2603/src/extract_text_model.py
@@ -0,0 +1,328 @@
+#!/usr/bin/env python3
+"""
+Extract text-only weights from Mistral-Small-4-119B (multimodal) and dequantize FP8 -> BF16.
+Creates a standalone DeepSeek-V3-compatible text model directory.
+
+Input: /mnt/nvme/models/Mistral-Small-4-119B-2603 (HF format with model-*.safetensors; see SRC_DIR below)
+Output: /mnt/nvme/models/Mistral-Small-4-119B-text-only/ (see DST_DIR below)
+
+Weight key mapping (HF Mistral-Small-4 -> standalone text model):
+    language_model.model.layers.N.* -> model.layers.N.*
+    language_model.model.embed_tokens.weight -> model.embed_tokens.weight
+    language_model.model.norm.weight -> model.norm.weight
+    language_model.lm_head.weight -> lm_head.weight
+    (vision_tower.* and multi_modal_projector.* are dropped)
+
+FP8 dequantization:
+    weight_bf16 = weight_fp8.float() * weight_scale_inv.float()
+    then .to(bfloat16)
+    Scale keys (*_scale_inv, *activation_scale) are dropped after dequant.
+""" + +import json +import os +import shutil +import torch +from safetensors import safe_open +from safetensors.torch import save_file +from pathlib import Path + +SRC_DIR = "/mnt/nvme/models/Mistral-Small-4-119B-2603" +DST_DIR = "/mnt/nvme/models/Mistral-Small-4-119B-text-only" + + +def dequant_fp8(weight_fp8, scale_inv): + """Dequantize FP8 weight to BF16 using per-tensor scale.""" + return (weight_fp8.float() * scale_inv.float()).to(torch.bfloat16) + + +def process_weights(): + os.makedirs(DST_DIR, exist_ok=True) + + # Load HF-format index + with open(f"{SRC_DIR}/model.safetensors.index.json") as f: + idx = json.load(f) + weight_map = idx["weight_map"] + + # Group keys by source file + file_keys = {} + for key, fname in weight_map.items(): + if fname not in file_keys: + file_keys[fname] = [] + file_keys[fname].append(key) + + new_weight_map = {} + shard_idx = 0 + current_shard = {} + current_shard_size = 0 + MAX_SHARD_SIZE = 10 * 1024 * 1024 * 1024 # 10 GB per shard + + def flush_shard(): + nonlocal shard_idx, current_shard, current_shard_size + if not current_shard: + return + shard_idx += 1 + shard_name = f"model-{shard_idx:05d}-of-PLACEHOLDER.safetensors" + print( + f" Saving shard {shard_idx}: {len(current_shard)} tensors, {current_shard_size / 1e9:.2f} GB" + ) + save_file(current_shard, f"{DST_DIR}/{shard_name}") + for k in current_shard: + new_weight_map[k] = shard_name + current_shard = {} + current_shard_size = 0 + + def add_tensor(new_key, tensor): + nonlocal current_shard, current_shard_size + tensor_size = tensor.numel() * tensor.element_size() + if current_shard_size + tensor_size > MAX_SHARD_SIZE and current_shard: + flush_shard() + current_shard[new_key] = tensor + current_shard_size += tensor_size + + # Process each source file + for fname in sorted(file_keys.keys()): + keys = file_keys[fname] + print(f"\nProcessing {fname} ({len(keys)} keys)...") + + f = safe_open(f"{SRC_DIR}/{fname}", framework="pt") + + # Build lookup of all tensors in this file for dequant + # We need to handle several patterns: + # 1. Regular FP8 weight: key.weight + key.weight_scale_inv -> dequant + # 2. Expert grouped FP8: key (no .weight suffix) + key_scale_inv -> dequant + # 3. BF16 weight: key.weight -> keep as-is + # 4. Scale-only keys: *_scale_inv, *activation_scale -> skip (consumed by dequant) + + # First pass: identify all keys and their roles + processed = set() + + for key in sorted(keys): + if key in processed: + continue + + # Skip vision and multimodal keys + if key.startswith("vision_tower.") or key.startswith( + "multi_modal_projector." + ): + processed.add(key) + continue + + # Skip scale keys (will be consumed during dequant) + if key.endswith("_scale_inv") or key.endswith("activation_scale"): + processed.add(key) + continue + + # Map key: strip language_model prefix + if key.startswith("language_model.model."): + new_key = key[len("language_model.") :] # keep "model.layers.N..." 
elif key.startswith("language_model."):
+                new_key = key[len("language_model.") :]  # "lm_head.weight"
+            else:
+                print(f"  WARNING: unexpected key prefix: {key}")
+                processed.add(key)
+                continue
+
+            tensor = f.get_tensor(key)
+            processed.add(key)
+
+            if tensor.dtype == torch.float8_e4m3fn:
+                # Find the scale
+                # Pattern 1: key ends with .weight -> scale is key_base.weight_scale_inv
+                # Pattern 2: expert grouped (no .weight) -> scale is key + "_scale_inv"
+                if key.endswith(".weight"):
+                    scale_key = key.replace(".weight", ".weight_scale_inv")
+                else:
+                    # Grouped expert tensors: e.g., mlp.experts.gate_up_proj
+                    scale_key = key + "_scale_inv"
+
+                if scale_key in weight_map:
+                    scale_fname = weight_map[scale_key]
+                    if scale_fname == fname:
+                        scale = f.get_tensor(scale_key)
+                    else:
+                        sf = safe_open(f"{SRC_DIR}/{scale_fname}", framework="pt")
+                        scale = sf.get_tensor(scale_key)
+                    processed.add(scale_key)
+
+                    # Also mark activation_scale as processed
+                    if key.endswith(".weight"):
+                        act_scale_key = key.replace(".weight", ".activation_scale")
+                    else:
+                        act_scale_key = key + "_activation_scale"
+                    if act_scale_key in weight_map:
+                        processed.add(act_scale_key)
+
+                    tensor = dequant_fp8(tensor, scale)
+                    print(f"  Dequant: {key} {list(tensor.shape)} -> {new_key}")
+                else:
+                    print(
+                        f"  WARNING: no scale found for FP8 weight {key}, casting directly"
+                    )
+                    tensor = tensor.to(torch.bfloat16)
+            else:
+                print(
+                    f"  Copy: {key} {list(tensor.shape)} dtype={tensor.dtype} -> {new_key}"
+                )
+
+            # Ensure BF16
+            if tensor.dtype != torch.bfloat16:
+                tensor = tensor.to(torch.bfloat16)
+
+            add_tensor(new_key, tensor)
+
+    # Flush remaining
+    flush_shard()
+
+    # Fix shard names (replace PLACEHOLDER with actual count)
+    total_shards = shard_idx
+    final_weight_map = {}
+    for k, shard_name in new_weight_map.items():
+        final_name = shard_name.replace("PLACEHOLDER", f"{total_shards:05d}")
+        if shard_name != final_name:
+            old_path = f"{DST_DIR}/{shard_name}"
+            new_path = f"{DST_DIR}/{final_name}"
+            if os.path.exists(old_path):
+                os.rename(old_path, new_path)
+        final_weight_map[k] = final_name
+
+    # Write the new index. "metadata" is deliberately left empty: the loader
+    # only needs "weight_map", and __main__ reports the total on-disk size
+    # after saving, so there is no need to re-open every shard to compute it.
+    new_index = {
+        "metadata": {},
+        "weight_map": final_weight_map,
+    }
+    with open(f"{DST_DIR}/model.safetensors.index.json", "w") as f:
+        json.dump(new_index, f, indent=2)
+
+    print(f"\n=== Done! 
{len(final_weight_map)} tensors in {total_shards} shards ===") + return total_shards + + +def create_config(): + """Create a DeepSeek-V3 compatible config for the text model.""" + with open(f"{SRC_DIR}/config.json") as f: + orig_config = json.load(f) + + text_config = orig_config["text_config"] + + # Build DeepSeek-V3 compatible config + # The NxDI DeepseekV3InferenceConfig needs these fields + config = { + "architectures": ["DeepseekV3ForCausalLM"], + "model_type": "deepseek_v3", + "torch_dtype": "bfloat16", + # From text_config + "hidden_size": text_config["hidden_size"], # 4096 + "intermediate_size": text_config["intermediate_size"], # 12288 (shared expert) + "num_hidden_layers": text_config["num_hidden_layers"], # 36 + "num_attention_heads": text_config["num_attention_heads"], # 32 + "num_key_value_heads": text_config.get( + "num_key_value_heads", text_config["num_attention_heads"] + ), # 32 + "head_dim": text_config.get("head_dim", 128), # 128 + "vocab_size": text_config["vocab_size"], # 131072 + "max_position_embeddings": text_config["max_position_embeddings"], # 1048576 + "rms_norm_eps": text_config["rms_norm_eps"], # 1e-6 + "hidden_act": text_config.get("hidden_act", "silu"), + "tie_word_embeddings": text_config.get("tie_word_embeddings", False), + "attention_bias": text_config.get("attention_bias", False), + "attention_dropout": text_config.get("attention_dropout", 0.0), + "bos_token_id": text_config.get("bos_token_id", 1), + "eos_token_id": text_config.get("eos_token_id", 2), + "pad_token_id": text_config.get("pad_token_id", 11), + # MLA (Multi-head Latent Attention) config + "q_lora_rank": text_config["q_lora_rank"], # 1024 + "qk_rope_head_dim": text_config["qk_rope_head_dim"], # 64 + "qk_nope_head_dim": text_config["qk_nope_head_dim"], # 64 + "kv_lora_rank": text_config["kv_lora_rank"], # 256 + "v_head_dim": text_config["v_head_dim"], # 128 + # MoE config + "n_routed_experts": text_config["n_routed_experts"], # 128 + "num_experts_per_tok": text_config["num_experts_per_tok"], # 4 + "n_shared_experts": text_config["n_shared_experts"], # 1 + "moe_intermediate_size": text_config["moe_intermediate_size"], # 2048 + "first_k_dense_replace": text_config.get("first_k_dense_replace", 0), # 0 + "norm_topk_prob": text_config.get("norm_topk_prob", True), + "routed_scaling_factor": text_config.get("routed_scaling_factor", 1.0), + "n_group": text_config.get("n_group", 1), + "topk_group": text_config.get("topk_group", 1), + # RoPE config + "rope_theta": text_config.get("rope_parameters", {}).get("rope_theta", 10000.0), + "rope_scaling": { + "type": "yarn", + "rope_type": "yarn", + "factor": text_config["rope_parameters"]["factor"], # 128.0 + "original_max_position_embeddings": text_config["rope_parameters"][ + "original_max_position_embeddings" + ], # 8192 + "beta_fast": text_config["rope_parameters"]["beta_fast"], # 32.0 + "beta_slow": text_config["rope_parameters"]["beta_slow"], # 1.0 + "mscale": text_config["rope_parameters"].get("mscale", 1.0), + "mscale_all_dim": text_config["rope_parameters"].get("mscale_all_dim", 1.0), + }, + } + + with open(f"{DST_DIR}/config.json", "w") as f: + json.dump(config, f, indent=2) + print(f"Config written to {DST_DIR}/config.json") + + +def copy_tokenizer(): + """Copy tokenizer files from source.""" + tokenizer_files = [ + "tokenizer_config.json", + "generation_config.json", + "chat_template.jinja", + "SYSTEM_PROMPT.txt", + ] + # Also copy any tokenizer model files + for fname in os.listdir(SRC_DIR): + if fname.startswith("tokenizer") or fname == 
"special_tokens_map.json": + tokenizer_files.append(fname) + + for fname in set(tokenizer_files): + src = f"{SRC_DIR}/{fname}" + if os.path.exists(src): + shutil.copy2(src, f"{DST_DIR}/{fname}") + print(f"Copied {fname}") + + +if __name__ == "__main__": + print("=== Mistral-Small-4-119B Text Extraction + FP8 Dequantization ===") + print(f"Source: {SRC_DIR}") + print(f"Destination: {DST_DIR}") + print() + + os.makedirs(DST_DIR, exist_ok=True) + create_config() + copy_tokenizer() + total_shards = process_weights() + + # Calculate total size + total_bytes = 0 + for shard_file in Path(DST_DIR).glob("model-*.safetensors"): + total_bytes += shard_file.stat().st_size + print(f"\nTotal model size: {total_bytes / 1e9:.2f} GB") + print(f"Output directory: {DST_DIR}") diff --git a/contrib/models/Mistral-Small-4-119B-2603/src/fix_mla_attention.py b/contrib/models/Mistral-Small-4-119B-2603/src/fix_mla_attention.py new file mode 100644 index 00000000..2141825d --- /dev/null +++ b/contrib/models/Mistral-Small-4-119B-2603/src/fix_mla_attention.py @@ -0,0 +1,68 @@ +#!/usr/bin/env python3 +""" +Fix MLA attention out_absorb slicing bug in NxDI's stock DeepseekV3Attention. + +Bug: Line 230 of modeling_deepseek.py uses: + out_absorb = wkv_b[:, self.v_head_dim:, :] + +This is correct ONLY when v_head_dim == qk_nope_head_dim (as in stock DeepSeek V3). +For Mistral-Small-4-119B (v_head_dim=128, qk_nope_head_dim=64), it produces a shape +mismatch causing RuntimeError during attention output reshape. + +Fix: Change to: + out_absorb = wkv_b[:, self.qk_nope_head_dim:, :] + +The kv_b_proj weight is structured as [qk_nope_head_dim | v_head_dim] per head. +- First qk_nope_head_dim elements: used for Q nope absorption +- Remaining v_head_dim elements: used for V output absorption + +Usage: + python fix_mla_attention.py +""" + +import os +import sys + + +def get_modeling_deepseek_path(): + """Find the modeling_deepseek.py file.""" + try: + import neuronx_distributed_inference.models.deepseek.modeling_deepseek as mod + + return mod.__file__ + except ImportError: + venv_paths = [ + "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/lib/python3.12/site-packages/neuronx_distributed_inference/models/deepseek/modeling_deepseek.py", + "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/lib/python3.12/site-packages/neuronx_distributed_inference/models/deepseek/modeling_deepseek.py", + ] + for p in venv_paths: + if os.path.exists(p): + return p + raise FileNotFoundError("Cannot find modeling_deepseek.py") + + +def fix_mla(): + fpath = get_modeling_deepseek_path() + content = open(fpath).read() + + old = "out_absorb = wkv_b[:, self.v_head_dim:, :]" + new = "out_absorb = wkv_b[:, self.qk_nope_head_dim:, :]" + + if new in content: + print(f"Already fixed: {fpath}") + return True + + if old in content: + content = content.replace(old, new) + open(fpath, "w").write(content) + print(f"Fixed MLA out_absorb slicing: {fpath}") + return True + else: + print(f"ERROR: Pattern not found in {fpath}") + print(" Expected: out_absorb = wkv_b[:, self.v_head_dim:, :]") + return False + + +if __name__ == "__main__": + success = fix_mla() + sys.exit(0 if success else 1) diff --git a/contrib/models/Mistral-Small-4-119B-2603/src/fix_tokenizer.py b/contrib/models/Mistral-Small-4-119B-2603/src/fix_tokenizer.py new file mode 100644 index 00000000..3483eb3b --- /dev/null +++ b/contrib/models/Mistral-Small-4-119B-2603/src/fix_tokenizer.py @@ -0,0 +1,60 @@ +#!/usr/bin/env python3 +""" +Fix tokenizer_config.json for Mistral-Small-4-119B-2603 compatibility. 
+ +The HuggingFace tokenizer uses: +- tokenizer_class: "TokenizersBackend" (not available in transformers <5.3) +- extra_special_tokens: [...] (list format, but transformers expects dict) + +This script fixes both issues for compatibility with SDK 2.29's transformers version. + +Usage: + python fix_tokenizer.py /path/to/model-dir +""" + +import json +import os +import sys + + +def fix_tokenizer(model_dir): + fpath = os.path.join(model_dir, "tokenizer_config.json") + if not os.path.exists(fpath): + print(f"ERROR: {fpath} not found") + return False + + with open(fpath) as f: + cfg = json.load(f) + + modified = False + + # Fix tokenizer_class + if cfg.get("tokenizer_class") == "TokenizersBackend": + cfg["tokenizer_class"] = "PreTrainedTokenizerFast" + cfg.pop("backend", None) + modified = True + print(" Fixed tokenizer_class: TokenizersBackend -> PreTrainedTokenizerFast") + + # Fix extra_special_tokens (list -> remove) + if "extra_special_tokens" in cfg and isinstance(cfg["extra_special_tokens"], list): + cfg.pop("extra_special_tokens") + modified = True + print(" Removed incompatible extra_special_tokens list") + + if modified: + with open(fpath, "w") as f: + json.dump(cfg, f, indent=2) + print(f"Tokenizer config fixed: {fpath}") + else: + print(f"Tokenizer config already compatible: {fpath}") + + return True + + +if __name__ == "__main__": + if len(sys.argv) < 2: + print("Usage: python fix_tokenizer.py /path/to/model-dir") + sys.exit(1) + + success = fix_tokenizer(sys.argv[1]) + sys.exit(0 if success else 1) diff --git a/contrib/models/Mistral-Small-4-119B-2603/src/modeling_deepseekv3_full.py b/contrib/models/Mistral-Small-4-119B-2603/src/modeling_deepseekv3_full.py new file mode 100644 index 00000000..9842dbfd --- /dev/null +++ b/contrib/models/Mistral-Small-4-119B-2603/src/modeling_deepseekv3_full.py @@ -0,0 +1,429 @@ +# coding=utf-8 +# Copyright 2023 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" +NeuronDeepseekV3ForCausalLM - Full model implementation for DeepSeek-V3 / Mistral-Small-4 +on AWS Neuron (NxDI). 
+ +Combines: +- MLA (Multi-head Latent Attention) from existing NxDI DeepseekV3Attention +- MoE with shared experts from NxDI moe_v2 (same as Qwen3_moe/Llama4 pattern) +- NeuronBaseForCausalLM framework + +Based on: +- neuronx_distributed_inference/models/deepseek/modeling_deepseek.py (MLA attention) +- neuronx_distributed_inference/models/qwen3_moe/modeling_qwen3_moe.py (MoE + shared experts pattern) +- neuronx_distributed_inference/models/mixtral/modeling_mixtral.py (base MoE pattern) +""" + +import gc +import logging +import warnings +from typing import List, Optional, Tuple, Union + +import torch + +from neuronx_distributed_inference.models.model_base import ( + NeuronBaseForCausalLM, + NeuronBaseModel, +) +from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm + +from neuronx_distributed.parallel_layers import parallel_state +from neuronx_distributed.parallel_layers.layers import ( + ColumnParallelLinear, + ParallelEmbedding, +) +from neuronx_distributed.utils import cpu_mode +from torch import nn +from transformers import AutoModelForCausalLM +from transformers.generation import SampleDecoderOnlyOutput, SampleEncoderDecoderOutput + +from neuronx_distributed_inference.models.config import InferenceConfig, MoENeuronConfig +from neuronx_distributed_inference.models.deepseek.modeling_deepseek import ( + DeepseekV3Attention, + DeepseekV3InferenceConfig, + DeepseekV3RMSNorm, + get_rmsnorm_cls, +) +from neuronx_distributed_inference.modules.moe_v2 import initialize_moe_module + +logger = logging.getLogger(__name__) + +SampleOutput = Union[SampleEncoderDecoderOutput, SampleDecoderOnlyOutput] + + +class DeepseekV3MoEInferenceConfig(InferenceConfig): + """ + Config class for DeepSeek-V3 / Mistral-Small-4 models with MoE + MLA. + Extends InferenceConfig with MoE-specific fields. 
+ """ + + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + # Map n_routed_experts -> num_local_experts for MoE module compatibility + if hasattr(self, "n_routed_experts") and not hasattr(self, "num_local_experts"): + self.num_local_experts = self.n_routed_experts + + # Ensure n_shared_experts is set (default 0) + if not hasattr(self, "n_shared_experts"): + self.n_shared_experts = 0 + + # MoE intermediate size for experts + # moe_v2's initialize_moe_module reads config.intermediate_size for expert FFN size + # But we need to preserve the original intermediate_size for shared experts + self.shared_expert_intermediate_size = getattr( + self, "intermediate_size", self.hidden_size * 3 + ) + # Set intermediate_size to moe_intermediate_size for the MoE module + if hasattr(self, "moe_intermediate_size"): + self.intermediate_size = self.moe_intermediate_size + + # Router config + self.neuron_config.router_config.dtype = torch.float32 + self.neuron_config.router_config.act_fn = "softmax" + + # Normalize top-k affinities + if hasattr(self, "norm_topk_prob") and self.norm_topk_prob: + self.neuron_config.normalize_top_k_affinities = True + + # GLU MLP for SiLU activation + self.neuron_config.glu_mlp = True + + # Set disable_numeric_cc_token as workaround (same as Qwen3_moe) + self.neuron_config.disable_numeric_cc_token = True + + def add_derived_config(self): + self.num_cores_per_group = 1 + # Ensure rope_scaling is accessible as dict + if hasattr(self, "rope_scaling") and isinstance(self.rope_scaling, dict): + # Already a dict, good + pass + + def get_required_attributes(self) -> List[str]: + return [ + "hidden_size", + "num_attention_heads", + "num_hidden_layers", + "vocab_size", + "max_position_embeddings", + "rms_norm_eps", + # MLA fields + "q_lora_rank", + "qk_rope_head_dim", + "qk_nope_head_dim", + "kv_lora_rank", + "v_head_dim", + # MoE fields + "n_routed_experts", + "num_experts_per_tok", + "moe_intermediate_size", + # RoPE + "rope_scaling", + "rope_theta", + ] + + @classmethod + def get_neuron_config_cls(cls): + return MoENeuronConfig + + +def convert_deepseekv3_to_neuron_state_dict(neuron_state_dict, config): + """ + Convert DeepSeek-V3 / Mistral-Small-4 HF state dict to NxDI format. + + Key transformations: + 1. Router: mlp.gate.weight -> mlp.router.linear_router.weight + 2. Experts: mlp.experts.gate_up_proj [N,H,2I] -> mlp.expert_mlps.mlp_op.gate_up_proj.weight + 3. Experts: mlp.experts.down_proj [N,H,I] -> mlp.expert_mlps.mlp_op.down_proj.weight + 4. Shared experts: mlp.shared_experts.{gate,up,down}_proj.weight -> mlp.shared_experts.* + 5. MLA attention keys are already compatible (q_a_proj, kv_a_proj_with_mqa, etc.) + 6. Add rank_util tensors + + Input state dict keys (after stripping "model." 
prefix by HF adapter): + layers.N.self_attn.q_a_proj.weight + layers.N.self_attn.q_a_layernorm.weight + layers.N.self_attn.q_b_proj.weight + layers.N.self_attn.kv_a_proj_with_mqa.weight + layers.N.self_attn.kv_a_layernorm.weight + layers.N.self_attn.kv_b_proj.weight + layers.N.self_attn.o_proj.weight + layers.N.input_layernorm.weight + layers.N.post_attention_layernorm.weight + layers.N.mlp.gate.weight (router) + layers.N.mlp.experts.gate_up_proj (grouped: [128, 4096, 4096]) + layers.N.mlp.experts.down_proj (grouped: [128, 4096, 2048]) + layers.N.mlp.shared_experts.gate_proj.weight + layers.N.mlp.shared_experts.up_proj.weight + layers.N.mlp.shared_experts.down_proj.weight + embed_tokens.weight + norm.weight + lm_head.weight + """ + assert config.neuron_config.glu_mlp is True, ( + "Only GLU MLP is supported for DeepSeek-V3" + ) + + # Add rank utility tensor + neuron_state_dict["rank_util.rank"] = torch.arange( + 0, config.neuron_config.tp_degree, dtype=torch.int32 + ) + + for l in range(config.num_hidden_layers): # noqa: E741 + # Add per-layer rank utility for attention + neuron_state_dict[f"layers.{l}.self_attn.rank_util.rank"] = torch.arange( + 0, config.neuron_config.tp_degree, dtype=torch.int32 + ) + + # ---- Router ---- + router_key = f"layers.{l}.mlp.gate.weight" + if router_key in neuron_state_dict: + neuron_state_dict[f"layers.{l}.mlp.router.linear_router.weight"] = ( + neuron_state_dict[router_key].detach().clone() + ) + del neuron_state_dict[router_key] + + # ---- Routed Experts ---- + # The HF checkpoint stores grouped expert weights following nn.Linear convention + # (weight shape = [out_features, in_features]): + # mlp.experts.gate_up_proj: [num_experts, 2*moe_intermediate_size, hidden_size] + # mlp.experts.down_proj: [num_experts, hidden_size, moe_intermediate_size] + # + # NxDI MoE einsum "e...h,ehi->e...i" expects weights as [E, in_dim, out_dim]: + # mlp.expert_mlps.mlp_op.gate_up_proj.weight: [num_experts, hidden_size, 2*intermediate_size] + # mlp.expert_mlps.mlp_op.down_proj.weight: [num_experts, intermediate_size, hidden_size] + # + # Both need transposition of dims 1 and 2. + # NOTE: For Mistral-Small-4, hidden_size == 2*moe_intermediate_size == 4096, + # so the shapes are square [E, 4096, 4096]. The transpose changes the DATA LAYOUT + # even though the shape appears unchanged. + + gate_up_key = f"layers.{l}.mlp.experts.gate_up_proj" + if gate_up_key in neuron_state_dict: + gate_up = neuron_state_dict[gate_up_key] + # gate_up is [E, 2I, H] (HF convention) -> transpose to [E, H, 2I] (NxDI convention) + neuron_state_dict[ + f"layers.{l}.mlp.expert_mlps.mlp_op.gate_up_proj.weight" + ] = gate_up.transpose(1, 2).contiguous() + del neuron_state_dict[gate_up_key] + + down_key = f"layers.{l}.mlp.experts.down_proj" + if down_key in neuron_state_dict: + down = neuron_state_dict[down_key] + # down is [E, H, I] (HF convention) -> transpose to [E, I, H] (NxDI convention) + neuron_state_dict[f"layers.{l}.mlp.expert_mlps.mlp_op.down_proj.weight"] = ( + down.transpose(1, 2).contiguous() + ) + del neuron_state_dict[down_key] + + # ---- Shared Experts ---- + # SharedExperts in NxDI moe_v2 expects: + # mlp.shared_experts.gate_proj.weight (or fused gate_up_proj) + # mlp.shared_experts.up_proj.weight + # mlp.shared_experts.down_proj.weight + # The HF keys already match! No renaming needed for shared experts. + # But we need to check if NxDI SharedExperts uses fused gate+up or separate. + # From moe_v2.py: SharedExperts takes fused_gate_up_projection param. 
+ # If fused, it expects: shared_experts.gate_up_proj.weight + # If not fused, it expects separate gate_proj and up_proj. + # We'll use non-fused (separate) for simplicity since that matches the HF format. + + # ---- Remove FP8 scale keys if any remain ---- + keys_to_delete = [] + for key in list(neuron_state_dict.keys()): + if key.startswith(f"layers.{l}.") and ( + "_scale_inv" in key or "activation_scale" in key + ): + keys_to_delete.append(key) + for key in keys_to_delete: + del neuron_state_dict[key] + + gc.collect() + + return neuron_state_dict + + +class NeuronDeepseekV3DecoderLayer(nn.Module): + """ + DeepSeek-V3 decoder layer: MLA attention + MoE with shared experts. + """ + + def __init__(self, config: DeepseekV3MoEInferenceConfig, layer_idx: int): + super().__init__() + self.hidden_size = config.hidden_size + + # MLA Attention + self.self_attn = DeepseekV3Attention( + config=config, + layer_idx=layer_idx, + tensor_model_parallel_group=parallel_state.get_tensor_model_parallel_group(), + ) + + # MoE with shared experts + self.mlp = initialize_moe_module(config=config) + + # Layer norms + self.input_layernorm = get_rmsnorm_cls()( + config.hidden_size, + eps=config.rms_norm_eps, + ) + self.post_attention_layernorm = get_rmsnorm_cls()( + config.hidden_size, + eps=config.rms_norm_eps, + ) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + **kwargs, + ) -> Tuple[ + torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]] + ]: + if "padding_mask" in kwargs: + warnings.warn( + "Passing `padding_mask` is deprecated. Please use `attention_mask` instead." + ) + + # Convert from KV cache format (batch, 1, seq, 320) back to MLA format + # past_key_value from cache manager is (k_cache, v_cache) tuple + # k_cache has shape (batch, 1, seq_len, qk_rope_head_dim + kv_lora_rank) + # We need to pass a single concatenated tensor to the attention + mla_past_kv = None + if past_key_value is not None: + if isinstance(past_key_value, (tuple, list)): + # From KV cache manager: (k_cache, v_cache) + k_cache = past_key_value[0] + # k_cache shape: (batch, 1, seq_len, 320) + mla_past_kv = k_cache.squeeze(1) # (batch, seq_len, 320) + else: + mla_past_kv = past_key_value + + residual = hidden_states + + hidden_states = self.input_layernorm(hidden_states) + + # MLA Self Attention + hidden_states, present_key_value_raw, cos_cache, sin_cache = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=mla_past_kv, + **kwargs, + ) + hidden_states = residual + hidden_states + + # Convert MLA KV output to standard 4D KV cache format + # present_key_value_raw = (k_pe[batch, seq, rope_dim], compressed_kv[batch, seq, kv_rank]) + k_pe, compressed_kv = present_key_value_raw + # Concatenate along last dim: (batch, seq_len, qk_rope_head_dim + kv_lora_rank) + concat_kv = torch.cat([k_pe, compressed_kv], dim=-1) + # Add head dimension: (batch, 1, seq_len, 320) + concat_kv = concat_kv.unsqueeze(1) + # Create dummy V cache with same shape + dummy_v = torch.zeros_like(concat_kv) + present_key_value = (concat_kv, dummy_v) + + # MoE FFN (with shared experts handled internally by moe_v2) + residual = hidden_states + hidden_states = self.post_attention_layernorm(hidden_states) + hidden_states = self.mlp(hidden_states)[0] + hidden_states = residual + hidden_states + + outputs = (hidden_states, 
present_key_value, cos_cache, sin_cache, None) + return outputs + + +class NeuronDeepseekV3Model(NeuronBaseModel): + """ + NeuronDeepseekV3Model - traceable model for NxD compilation. + """ + + def setup_attr_for_model(self, config: DeepseekV3MoEInferenceConfig): + self.on_device_sampling = ( + config.neuron_config.on_device_sampling_config is not None + ) + self.tp_degree = config.neuron_config.tp_degree + self.hidden_size = config.hidden_size + self.num_attention_heads = config.num_attention_heads + # MLA: KV cache stores 1 compressed "head" per position + # The compressed KV dim = qk_rope_head_dim + kv_lora_rank + self.num_key_value_heads = 1 + # Override head_dim for KV cache sizing (not used by attention module) + config.head_dim = config.qk_rope_head_dim + config.kv_lora_rank + self.max_batch_size = config.neuron_config.max_batch_size + self.buckets = config.neuron_config.buckets + + def init_model(self, config: DeepseekV3MoEInferenceConfig): + self.padding_idx = config.pad_token_id + self.vocab_size = config.vocab_size + + self.embed_tokens = ParallelEmbedding( + config.vocab_size, + config.hidden_size, + self.padding_idx, + dtype=config.neuron_config.torch_dtype, + shard_across_embedding=True, + ) + self.layers = nn.ModuleList( + [ + NeuronDeepseekV3DecoderLayer(config, layer_idx) + for layer_idx in range(config.num_hidden_layers) + ] + ) + self.norm = get_rmsnorm_cls()(self.hidden_size, eps=config.rms_norm_eps) + self.lm_head = ColumnParallelLinear( + config.hidden_size, + config.vocab_size, + gather_output=False if self.on_device_sampling else True, + bias=False, + ) + + +class NeuronDeepseekV3ForCausalLM(NeuronBaseForCausalLM): + """ + NeuronDeepseekV3ForCausalLM - Entry point for DeepSeek-V3 / Mistral-Small-4 + inference on Neuron. + """ + + _model_cls = NeuronDeepseekV3Model + + @staticmethod + def load_hf_model(model_path, **kwargs): + return AutoModelForCausalLM.from_pretrained(model_path, **kwargs) + + @classmethod + def get_config_cls(cls): + return DeepseekV3MoEInferenceConfig + + @staticmethod + def convert_hf_to_neuron_state_dict( + state_dict: dict, config: DeepseekV3MoEInferenceConfig + ) -> dict: + return convert_deepseekv3_to_neuron_state_dict(state_dict, config) + + def get_compiler_args(self): + compiler_args = "--enable-saturate-infinity --enable-mixed-precision-accumulation --model-type transformer -O1" + compiler_args += " --tensorizer-options='--enable-ccop-compute-overlap --cc-pipeline-tiling-factor=2'" + compiler_args += ( + " --auto-cast=none --internal-hlo2tensorizer-options='--verify-hlo=true'" + ) + # Enable vector-offset DGE + compiler_args += " --internal-enable-dge-levels vector_dynamic_offsets" + # DMA optimization from DeepSeek attention + compiler_args += " --tensorizer-options='--vectorize-strided-dma'" + return compiler_args diff --git a/contrib/models/Mistral-Small-4-119B-2603/src/patch_moe.py b/contrib/models/Mistral-Small-4-119B-2603/src/patch_moe.py new file mode 100644 index 00000000..f311a89f --- /dev/null +++ b/contrib/models/Mistral-Small-4-119B-2603/src/patch_moe.py @@ -0,0 +1,133 @@ +#!/usr/bin/env python3 +""" +Patch NxDI moe.py to forward blockwise_matmul_config to ExpertMLPs. + +Required for SDK 2.29+ where the NKI blockwise kernel (neuronxcc.nki._private.blockwise_mm) +was removed in NKI 0.3.0 GA. 
Without this patch, MoE models crash with: + NotImplementedError: _call_shard_hidden_kernel is not available + +The patch checks if `use_torch_block_wise=True` is set in the NeuronConfig's +blockwise_matmul_config and passes it to the ExpertMLPs constructor. + +Usage: + python patch_moe.py + +Must be run BEFORE starting vLLM. Only needed on SDK 2.29+ (NKI 0.3.0 GA). +""" + +import importlib +import os +import re +import sys + + +def get_moe_path(): + """Find the moe.py file in the installed NxDI package.""" + try: + import neuronx_distributed_inference.modules.moe as moe_module + + return moe_module.__file__ + except ImportError: + # Fallback: search common venv locations + venv_paths = [ + "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/lib/python3.12/site-packages/neuronx_distributed_inference/modules/moe.py", + "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/lib/python3.12/site-packages/neuronx_distributed_inference/modules/moe.py", + ] + for p in venv_paths: + if os.path.exists(p): + return p + raise FileNotFoundError( + "Cannot find neuronx_distributed_inference/modules/moe.py" + ) + + +def patch_moe(): + """Apply the torch_block_wise forwarding patch to moe.py.""" + fpath = get_moe_path() + content = open(fpath).read() + + # Check if already patched + if "use_torch_block_wise" in content: + print(f"Already patched: {fpath}") + return True + + # Find the initialize_moe_module function and add the forwarding logic + # We need to find where ExpertMLPs is instantiated and add the kwarg + # The pattern: look for "def initialize_moe_module" and add our check before ExpertMLPs(...) + + # Strategy: Insert a block at the top of initialize_moe_module that sets up extra kwargs + pattern = r"(def initialize_moe_module\([^)]*\):)" + match = re.search(pattern, content) + if not match: + print("ERROR: Could not find initialize_moe_module function") + return False + + # Find the line after the function def to insert our patch + func_start = match.end() + # Skip docstring if present + rest = content[func_start:] + + # Insert patch: check blockwise_matmul_config and add use_torch_block_wise + patch_code = """ + # --- SDK 2.29 MoE patch: forward use_torch_block_wise to ExpertMLPs --- + extra_kwargs = {} + if hasattr(config, 'neuron_config') and hasattr(config.neuron_config, 'blockwise_matmul_config'): + bmc = config.neuron_config.blockwise_matmul_config + use_torch = getattr(bmc, "use_torch_block_wise", False) + if use_torch: + extra_kwargs["use_torch_block_wise"] = True + # --- end patch --- +""" + + # Find the first newline after the function signature + newline_pos = rest.find("\n") + if newline_pos == -1: + print("ERROR: Unexpected file structure") + return False + + # Check if there's a docstring + stripped = rest[newline_pos:].lstrip() + if stripped.startswith('"""') or stripped.startswith("'''"): + # Skip past the closing docstring + quote = stripped[:3] + docstring_end = rest.find( + quote, newline_pos + rest[newline_pos:].find(quote) + 3 + ) + if docstring_end > 0: + insert_pos = func_start + docstring_end + 3 + # Find next newline + next_nl = content.find("\n", insert_pos) + insert_pos = next_nl + 1 + else: + insert_pos = func_start + newline_pos + 1 + else: + insert_pos = func_start + newline_pos + 1 + + content = content[:insert_pos] + patch_code + content[insert_pos:] + + # Now add **extra_kwargs to the ExpertMLPs constructor call + # Find "ExpertMLPs(" and add extra_kwargs before the closing ) + expert_pattern = r"(ExpertMLPs\([^)]+)\)" + matches = list(re.finditer(expert_pattern, 
content)) + if matches: + # Patch the last match (most likely the real one inside initialize_moe_module) + for m in reversed(matches): + old = m.group(0) + new = old[:-1] + ", **extra_kwargs)" + content = content[: m.start()] + new + content[m.end() :] + + open(fpath, "w").write(content) + print(f"Patched successfully: {fpath}") + return True + else: + print( + "WARNING: Could not find ExpertMLPs constructor. Manual patch may be needed." + ) + print(f"File: {fpath}") + open(fpath, "w").write(content) + return False + + +if __name__ == "__main__": + success = patch_moe() + sys.exit(0 if success else 1) diff --git a/contrib/models/Mistral-Small-4-119B-2603/src/register_model.py b/contrib/models/Mistral-Small-4-119B-2603/src/register_model.py new file mode 100644 index 00000000..a5fcc47d --- /dev/null +++ b/contrib/models/Mistral-Small-4-119B-2603/src/register_model.py @@ -0,0 +1,36 @@ +#!/usr/bin/env python3 +"""Patch constants.py to add deepseekv3 to MODEL_TYPES.""" + +fpath = "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/lib/python3.12/site-packages/neuronx_distributed_inference/utils/constants.py" +content = open(fpath).read() + +# Check if already added +if '"deepseekv3"' in content: + print("deepseekv3 already in MODEL_TYPES") +else: + # Find the closing brace of MODEL_TYPES dict + # It's the last "}" before some constants or end of MODEL_TYPES block + # Strategy: find MODEL_TYPES = { ... } and add entry before the closing } + + # Find the last } that closes MODEL_TYPES + mt_start = content.index("MODEL_TYPES = {") + # Find matching closing brace + brace_count = 0 + mt_end = -1 + for i in range(mt_start, len(content)): + if content[i] == "{": + brace_count += 1 + elif content[i] == "}": + brace_count -= 1 + if brace_count == 0: + mt_end = i + break + + if mt_end > 0: + # Insert new entry before the closing } + new_entry = ' "deepseekv3": {"causal-lm": NeuronDeepseekV3ForCausalLM},\n' + content = content[:mt_end] + new_entry + content[mt_end:] + open(fpath, "w").write(content) + print("Added deepseekv3 to MODEL_TYPES") + else: + print("ERROR: Could not find MODEL_TYPES closing brace") diff --git a/contrib/models/Mixtral-8x22B-Instruct-v0.1/README.md b/contrib/models/Mixtral-8x22B-Instruct-v0.1/README.md new file mode 100644 index 00000000..5c13b823 --- /dev/null +++ b/contrib/models/Mixtral-8x22B-Instruct-v0.1/README.md @@ -0,0 +1,128 @@ +# Contrib Model: Mixtral 8x22B Instruct v0.1 + +NeuronX Distributed Inference implementation of Mixtral 8x22B Instruct v0.1 on trn2.48xlarge. 
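As a sanity check on the parameter counts quoted below, the totals can be re-derived from the numbers in the Architecture Details section. A back-of-the-envelope sketch (router and norm weights omitted, so the result is approximate):
+
+```python
+# Rough Mixtral-8x22B parameter count from the architecture shape.
+layers, hidden, inter = 56, 6144, 16384
+kv_heads, head_dim, vocab = 8, 128, 32768
+
+attn = 2 * hidden * hidden + 2 * hidden * kv_heads * head_dim  # q/o plus k/v projections
+expert = 3 * hidden * inter                                    # gate, up, down per expert
+embed = 2 * vocab * hidden                                     # embed_tokens + lm_head
+
+total = layers * (attn + 8 * expert) + embed
+active = layers * (attn + 2 * expert) + embed                  # top-2 routing
+print(f"total ~{total / 1e9:.0f}B, active ~{active / 1e9:.0f}B")  # total ~141B, active ~39B
+```
+
+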
+ +## Model Information + +- **HuggingFace ID:** `mistralai/Mixtral-8x22B-Instruct-v0.1` +- **Model Type:** Mixture-of-Experts (8 experts, top-2 routing) +- **Parameters:** 141B total, ~39B active per token +- **License:** Apache 2.0 + +## Architecture Details + +- 56 transformer layers +- hidden_size: 6144, intermediate_size: 16384 +- 48 attention heads, 8 KV heads (GQA) +- 8 experts per layer, top-2 routing (softmax) +- head_dim: 128, vocab_size: 32768 + +## Performance (SDK 2.29, vLLM 0.16.0, trn2.48xlarge) + +| Workload | tok/s (conc=1) | TPOT | TTFT | GPU tok/s | GPU/Neuron | +|----------|:--------------:|:----:|:----:|:---------:|:----------:| +| short-short (128/128) | **25.8** | 38.8ms | 185ms | 97.3 | 3.8x | +| short-long (128/512) | **25.4** | 39.4ms | 185ms | 96.8 | 3.8x | +| long-short (2048/128) | **25.4** | 39.4ms | 555ms | 97.2 | 3.8x | +| long-long (2048/512) | **25.1** | 39.8ms | 555ms | 96.5 | 3.8x | + +**GPU baseline**: 4x H100 (TP=4), vLLM 0.19.0 + +### SDK 2.29 vs 2.28 + +| Metric | SDK 2.28 | SDK 2.29 | Improvement | +|--------|:--------:|:--------:|:-----------:| +| tok/s (short-short) | 24.9 | **25.8** | +4% | +| tok/s (long-short) | 21.6 | **25.4** | +18% | +| TPOT (short) | 40.2ms | 38.8ms | -3% | + +Notable improvement for long inputs: SDK 2.28 showed significant long-input degradation (24.9→21.6 tok/s, -13%), while SDK 2.29 is nearly flat (25.8→25.4, -2%). + +## SDK 2.29 Required Workaround + +Same as Mixtral 8x7B. See `src/patch_moe.py`. + +**Two-part fix:** + +1. **Patch moe.py** to forward `use_torch_block_wise` to ExpertMLPs: +```bash +python src/patch_moe.py +``` + +2. **Pass torch_block_wise config to vLLM**: +```bash +--additional-config '{"override_neuron_config": {"blockwise_matmul_config": {"use_torch_block_wise": true}}}' +``` + +## TKG Optimization: Not Applicable + +TKG cannot be used on this model: +- kv_heads/TP = 8/16 = 0.5 → SHARD_OVER_HEADS mode +- Stock TKG kernel requires kv_heads/TP >= 1 +- Even if applicable, MoE expert dispatch dominates TPOT (same finding as 8x7B) + +## Quick Start + +```bash +# Activate venv +source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate + +# Apply moe.py patch (SDK 2.29 only) +python src/patch_moe.py + +# Download model (262 GB -- use NVMe if available) +# Mount NVMe first on trn2.48xlarge: +# sudo mkfs.ext4 /dev/nvme1n1 +# sudo mount /dev/nvme1n1 /mnt/nvme +huggingface-cli download mistralai/Mixtral-8x22B-Instruct-v0.1 \ + --local-dir /mnt/nvme/models/Mixtral-8x22B-Instruct-v0.1 + +# Symlink HF cache to NVMe (vLLM saves a local copy during compilation) +ln -sf /mnt/nvme/local-models /home/ubuntu/local-models + +# Start vLLM server +python -m vllm.entrypoints.openai.api_server \ + --model /mnt/nvme/models/Mixtral-8x22B-Instruct-v0.1 \ + --tensor-parallel-size 16 \ + --max-model-len 8192 \ + --max-num-seqs 4 \ + --no-enable-prefix-caching \ + --additional-config '{"override_neuron_config": {"blockwise_matmul_config": {"use_torch_block_wise": true}}}' \ + --disable-log-requests \ + --port 8000 +``` + +## Instance Requirements + +| Resource | Minimum | +|----------|---------| +| Instance type | trn2.48xlarge | +| TP degree | 16 | +| LNC | 2 (default) | +| HBM | ~262 GB (model) + KV cache | +| Storage | 1 TB+ (model = 262 GB, use NVMe) | +| EBS | 500 GB minimum | +| Compile time | ~40 minutes | + +**Storage note**: The 262 GB model weights exceed typical EBS volumes. Use the NVMe instance store (4x 1.7 TB on trn2.48xlarge) for model storage. 
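The HBM figure in the table above can be cross-checked with quick arithmetic; this sketch ignores compiler scratch buffers and padding, which the runtime allocates on top:
+
+```python
+# BF16 weights: 2 bytes per parameter.
+print(f"weights ~{141e9 * 2 / 2**30:.0f} GiB")  # ~263 GiB, matching the ~262 GB row above
+
+# Per-token KV cache: layers * kv_heads * head_dim * 2 (K and V) * 2 bytes (BF16).
+kv_per_token = 56 * 8 * 128 * 2 * 2
+print(f"KV cache ~{kv_per_token / 2**10:.0f} KiB per token per sequence")  # ~224 KiB
+```
+
+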
+ +## Compatibility + +| Instance/SDK | SDK 2.29 | SDK 2.28 | +|--------------|----------|----------| +| trn2.48xlarge (TP=16) | ✅ Working (with patch) | ✅ Working (no patch needed) | +| trn2.48xlarge (TP=8) | ❌ OOM | ❌ OOM | +| trn2.3xlarge | ❌ OOM | ❌ OOM | + +## Key Flags + +- `--no-enable-prefix-caching`: Required to avoid OOB crash in block KV cache path +- `--additional-config '{"override_neuron_config": ...}'`: Required on SDK 2.29 +- `--max-num-seqs 4`: Recommended for stable performance +- `--tensor-parallel-size 16`: Required (model too large for TP=8) + +## Maintainer + +Annapurna Labs / Agent Andretti (Mistral Family Benchmark Project) + +**Last Updated:** 2026-04-20 diff --git a/contrib/models/Mixtral-8x22B-Instruct-v0.1/src/patch_moe.py b/contrib/models/Mixtral-8x22B-Instruct-v0.1/src/patch_moe.py new file mode 100644 index 00000000..f311a89f --- /dev/null +++ b/contrib/models/Mixtral-8x22B-Instruct-v0.1/src/patch_moe.py @@ -0,0 +1,133 @@ +#!/usr/bin/env python3 +""" +Patch NxDI moe.py to forward blockwise_matmul_config to ExpertMLPs. + +Required for SDK 2.29+ where the NKI blockwise kernel (neuronxcc.nki._private.blockwise_mm) +was removed in NKI 0.3.0 GA. Without this patch, MoE models crash with: + NotImplementedError: _call_shard_hidden_kernel is not available + +The patch checks if `use_torch_block_wise=True` is set in the NeuronConfig's +blockwise_matmul_config and passes it to the ExpertMLPs constructor. + +Usage: + python patch_moe.py + +Must be run BEFORE starting vLLM. Only needed on SDK 2.29+ (NKI 0.3.0 GA). +""" + +import importlib +import os +import re +import sys + + +def get_moe_path(): + """Find the moe.py file in the installed NxDI package.""" + try: + import neuronx_distributed_inference.modules.moe as moe_module + + return moe_module.__file__ + except ImportError: + # Fallback: search common venv locations + venv_paths = [ + "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/lib/python3.12/site-packages/neuronx_distributed_inference/modules/moe.py", + "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/lib/python3.12/site-packages/neuronx_distributed_inference/modules/moe.py", + ] + for p in venv_paths: + if os.path.exists(p): + return p + raise FileNotFoundError( + "Cannot find neuronx_distributed_inference/modules/moe.py" + ) + + +def patch_moe(): + """Apply the torch_block_wise forwarding patch to moe.py.""" + fpath = get_moe_path() + content = open(fpath).read() + + # Check if already patched + if "use_torch_block_wise" in content: + print(f"Already patched: {fpath}") + return True + + # Find the initialize_moe_module function and add the forwarding logic + # We need to find where ExpertMLPs is instantiated and add the kwarg + # The pattern: look for "def initialize_moe_module" and add our check before ExpertMLPs(...) 
+ + # Strategy: Insert a block at the top of initialize_moe_module that sets up extra kwargs + pattern = r"(def initialize_moe_module\([^)]*\):)" + match = re.search(pattern, content) + if not match: + print("ERROR: Could not find initialize_moe_module function") + return False + + # Find the line after the function def to insert our patch + func_start = match.end() + # Skip docstring if present + rest = content[func_start:] + + # Insert patch: check blockwise_matmul_config and add use_torch_block_wise + patch_code = """ + # --- SDK 2.29 MoE patch: forward use_torch_block_wise to ExpertMLPs --- + extra_kwargs = {} + if hasattr(config, 'neuron_config') and hasattr(config.neuron_config, 'blockwise_matmul_config'): + bmc = config.neuron_config.blockwise_matmul_config + use_torch = getattr(bmc, "use_torch_block_wise", False) + if use_torch: + extra_kwargs["use_torch_block_wise"] = True + # --- end patch --- +""" + + # Find the first newline after the function signature + newline_pos = rest.find("\n") + if newline_pos == -1: + print("ERROR: Unexpected file structure") + return False + + # Check if there's a docstring + stripped = rest[newline_pos:].lstrip() + if stripped.startswith('"""') or stripped.startswith("'''"): + # Skip past the closing docstring + quote = stripped[:3] + docstring_end = rest.find( + quote, newline_pos + rest[newline_pos:].find(quote) + 3 + ) + if docstring_end > 0: + insert_pos = func_start + docstring_end + 3 + # Find next newline + next_nl = content.find("\n", insert_pos) + insert_pos = next_nl + 1 + else: + insert_pos = func_start + newline_pos + 1 + else: + insert_pos = func_start + newline_pos + 1 + + content = content[:insert_pos] + patch_code + content[insert_pos:] + + # Now add **extra_kwargs to the ExpertMLPs constructor call + # Find "ExpertMLPs(" and add extra_kwargs before the closing ) + expert_pattern = r"(ExpertMLPs\([^)]+)\)" + matches = list(re.finditer(expert_pattern, content)) + if matches: + # Patch the last match (most likely the real one inside initialize_moe_module) + for m in reversed(matches): + old = m.group(0) + new = old[:-1] + ", **extra_kwargs)" + content = content[: m.start()] + new + content[m.end() :] + + open(fpath, "w").write(content) + print(f"Patched successfully: {fpath}") + return True + else: + print( + "WARNING: Could not find ExpertMLPs constructor. Manual patch may be needed." + ) + print(f"File: {fpath}") + open(fpath, "w").write(content) + return False + + +if __name__ == "__main__": + success = patch_moe() + sys.exit(0 if success else 1) diff --git a/contrib/models/Mixtral-8x7B-Instruct-v0.1/README.md b/contrib/models/Mixtral-8x7B-Instruct-v0.1/README.md index 3c033e8d..661b7a43 100644 --- a/contrib/models/Mixtral-8x7B-Instruct-v0.1/README.md +++ b/contrib/models/Mixtral-8x7B-Instruct-v0.1/README.md @@ -1,121 +1,133 @@ # Contrib Model: Mixtral 8x7B Instruct v0.1 -NeuronX Distributed Inference implementation of Mixtral 8x7B Instruct v0.1. +NeuronX Distributed Inference implementation of Mixtral 8x7B Instruct v0.1 on trn2.48xlarge. 
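Before deploying on SDK 2.29, it is worth probing whether the environment is affected by the removed NKI kernel (see the SDK 2.29 workaround section below). A minimal check, assuming the module path quoted in `src/patch_moe.py`:
+
+```python
+# Probe for the private NKI blockwise kernel that NKI 0.3.0 GA removed.
+try:
+    import neuronxcc.nki._private.blockwise_mm  # noqa: F401
+    print("blockwise_mm present: stock kernel path should work (SDK <= 2.28)")
+except ImportError:
+    print("blockwise_mm missing: apply src/patch_moe.py and use torch_block_wise")
+```
+
+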
## Model Information -- **HuggingFace ID:** `Mixtral-8x7B-Instruct-v0.1` -- **Model Type:** Decoder-only transformer -- **License:** Check HuggingFace model card +- **HuggingFace ID:** `mistralai/Mixtral-8x7B-Instruct-v0.1` +- **Model Type:** Mixture-of-Experts (8 experts, top-2 routing) +- **Parameters:** 46.7B total, ~12.9B active per token +- **License:** Apache 2.0 ## Architecture Details +- 32 transformer layers +- hidden_size: 4096, intermediate_size: 14336 +- 32 attention heads, 8 KV heads (GQA) +- 8 experts per layer, top-2 routing (softmax) +- head_dim: 128, vocab_size: 32000 -## Validation Results +## Performance (SDK 2.29, vLLM 0.16.0, trn2.48xlarge) -**Validated:** 2026-01-29 -**Configuration:** TP=5, batch_size=None, seq_len=None, None +| Workload | tok/s (conc=1) | TPOT | TTFT | GPU tok/s | GPU/Neuron | +|----------|:--------------:|:----:|:----:|:---------:|:----------:| +| short-short (128/128) | **40.4** | 24.9ms | 130ms | 123.8 | 3.1x | +| short-long (128/512) | **39.6** | 25.3ms | 130ms | 123.3 | 3.1x | +| long-short (2048/128) | **39.6** | 25.3ms | 370ms | 123.1 | 3.1x | +| long-long (2048/512) | **39.2** | 25.5ms | 370ms | 122.7 | 3.1x | -### Test Results +**GPU baseline**: 2x H100 (TP=2), vLLM 0.8.5 -| Test | Status | Result | -|------|--------|--------| -| Smoke Test | ✅ PASS | Model loads successfully | -| Token Matching | ✅ PASS | **100.0% match** | -| Throughput | ⚠️ SLOW | 5.28 tok/s (threshold: 10 tok/s) | - -### Performance Metrics - -| Metric | Value | -|--------|-------| -| Throughput | 5.28 tokens/s | +### SDK 2.29 vs 2.28 +| Metric | SDK 2.28 | SDK 2.29 | Improvement | +|--------|:--------:|:--------:|:-----------:| +| tok/s (short-short) | 38.5 | **40.4** | +5% | +| TPOT | 26.0ms | 24.9ms | -4% | -**Status:** ✅ EXCELLENT +The improvement comes from `torch_block_wise` MoE implementation replacing the broken NKI blockwise kernel. -### Device Profiling Metrics +## SDK 2.29 Required Workaround -**Configuration:** TP=8, batch_size=1, seq_len=128, bfloat16 -**Instance:** trn1.32xlarge | **Profiled:** 2026-03-20 +**CRITICAL**: NKI 0.3.0 GA (SDK 2.29) removed `neuronxcc.nki._private.blockwise_mm`. Without the workaround, MoE models crash with `NotImplementedError: _call_shard_hidden_kernel is not available`. -| Metric | Context Encoding | Token Generation | -|--------|-----------------|------------------| -| MFU (%) | 0.21 | 0.00 | -| MBU (%) | 0.38 | 0.99 | -| HFU (%) | 0.21 | 0.00 | -| Execution Time (us) | 0.08 | 0.01 | -| HBM Read | 11.94 GB | 6.01 GB | -| HBM Write | 260.47 MB | 2.21 MB | +**Two-part fix:** -**Throughput:** 12.64 tok/s | **Compile Time:** 1208.41s - -> Metrics from `neuron-profile capture` on compiled NEFFs. MFU = Model FLOPs Utilization, -> MBU = Memory Bandwidth Utilization, HFU = Hardware FLOPs Utilization. - -## Usage +1. **Patch moe.py** to forward `use_torch_block_wise` to ExpertMLPs: +```bash +python src/patch_moe.py +``` -```python -from transformers import AutoTokenizer, GenerationConfig -from neuronx_distributed_inference.models.config import NeuronConfig -from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config +2. **Pass torch_block_wise config to vLLM**: +```bash +--additional-config '{"override_neuron_config": {"blockwise_matmul_config": {"use_torch_block_wise": true}}}' +``` -# Import model classes from src -from src.modeling_mixtral_8x7b_instruct_v0_1 import NeuronMixtral8x7BInstructv01ForCausalLM, Mixtral8x7BInstructv01InferenceConfig +Both steps are required. 
The patch only needs to be applied once per environment. -model_path = "/path/to/Mixtral-8x7B-Instruct-v0.1/" -compiled_model_path = "/path/to/compiled/" +## TKG Optimization: Not Applicable -# Configure -neuron_config = NeuronConfig( - tp_degree=5, - batch_size=None, - seq_len=512, - torch_dtype=torch.None, -) +TKG was tested on this model (kv_heads/TP = 8/8 = 1, stock TKG eligible) but provides **no benefit**: +- Baseline: 40.4 tok/s, 24.9ms TPOT +- TKG: 40.3 tok/s, 25.0ms TPOT (+0%) -config = Mixtral8x7BInstructv01InferenceConfig( - neuron_config, - load_config=load_pretrained_config(model_path), -) +**Root cause**: MoE expert dispatch dominates TPOT (~60% of decode time). The attention kernel is not the bottleneck for MoE models. -# Compile and load -model = NeuronMixtral8x7BInstructv01ForCausalLM(model_path, config) -model.compile(compiled_model_path) -model.load(compiled_model_path) +## Quick Start -# Generate -tokenizer = AutoTokenizer.from_pretrained(model_path) -# ... (see integration test for full example) +```bash +# Activate venv +source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate + +# Apply moe.py patch (SDK 2.29 only) +python src/patch_moe.py + +# Download model +huggingface-cli download mistralai/Mixtral-8x7B-Instruct-v0.1 \ + --local-dir /mnt/models/Mixtral-8x7B-Instruct-v0.1 + +# Start vLLM server +python -m vllm.entrypoints.openai.api_server \ + --model /mnt/models/Mixtral-8x7B-Instruct-v0.1 \ + --tensor-parallel-size 8 \ + --max-model-len 8192 \ + --max-num-seqs 4 \ + --no-enable-prefix-caching \ + --additional-config '{"override_neuron_config": {"blockwise_matmul_config": {"use_torch_block_wise": true}}}' \ + --disable-log-requests \ + --port 8000 ``` -## Compatibility Matrix +## Instance Requirements -| Instance/Version | 2.20+ | 2.19 and earlier | -|------------------|-------|------------------| -| Trn1 | ✅ Working | Not tested | -| Inf2 | Not tested | Not tested | +| Resource | Minimum | +|----------|---------| +| Instance type | trn2.48xlarge | +| TP degree | 8 | +| LNC | 2 (default) | +| HBM | ~93 GB (model) + KV cache | +| EBS | 300 GB | +| Compile time | ~25 minutes | -## Testing +**Note**: This model does NOT fit on trn2.3xlarge (93 GB > 96 GB total HBM at TP=4). 
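The `--additional-config` JSON used in Quick Start above is easy to mangle with shell quoting; generating it programmatically is a safer option (a small sketch):
+
+```python
+import json
+
+# Mirrors the override passed to vLLM in Quick Start above.
+additional_config = {
+    "override_neuron_config": {
+        "blockwise_matmul_config": {"use_torch_block_wise": True}
+    }
+}
+print(json.dumps(additional_config))
+# {"override_neuron_config": {"blockwise_matmul_config": {"use_torch_block_wise": true}}}
+```
+
+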
-Run integration tests:
+## Compatibility
-```bash
-pytest nxdi_contrib_models/models/Mixtral-8x7B-Instruct-v0.1/test/integration/test_model.py --capture=tee-sys
-```
+| Instance/SDK | SDK 2.29 | SDK 2.28 |
+|--------------|----------|----------|
+| trn2.48xlarge | ✅ Working (with patch) | ✅ Working (no patch needed) |
+| trn2.3xlarge | ❌ OOM | ❌ OOM |
+| trn1.32xlarge | ✅ Working (TP=5-8) | ✅ Working |
-Or run manually:
+## Validation Results (Legacy, Trn1)
-```bash
-cd nxdi_contrib_models/models/Mixtral-8x7B-Instruct-v0.1
-python3 test/integration/test_model.py
-```
+**Validated:** 2026-01-29 on trn1.32xlarge
+**Configuration:** TP=5 (batch size and sequence length were not recorded)
+
+| Test | Status | Result |
+|------|--------|--------|
+| Smoke Test | ✅ PASS | Model loads successfully |
+| Token Matching | ✅ PASS | **100.0% match** |
+| Throughput | ⚠️ SLOW | 5.28 tok/s (trn1, TP=5) |
-## Example Checkpoints
+## Key Flags
-* Mixtral-8x7B-Instruct-v0.1
+- `--no-enable-prefix-caching`: Required to avoid an out-of-bounds crash in the block KV cache path
+- `--additional-config '{"override_neuron_config": ...}'`: Required on SDK 2.29
+- `--max-num-seqs 4`: Recommended for stable performance
## Maintainer
-Annapurna Labs
+Annapurna Labs / Agent Andretti (Mistral Family Benchmark Project)
-**Last Updated:** 2026-01-29
+**Last Updated:** 2026-04-20
diff --git a/contrib/models/Mixtral-8x7B-Instruct-v0.1/src/patch_moe.py b/contrib/models/Mixtral-8x7B-Instruct-v0.1/src/patch_moe.py
new file mode 100644
index 00000000..f311a89f
--- /dev/null
+++ b/contrib/models/Mixtral-8x7B-Instruct-v0.1/src/patch_moe.py
@@ -0,0 +1,133 @@
+#!/usr/bin/env python3
+"""
+Patch NxDI moe.py to forward blockwise_matmul_config to ExpertMLPs.
+
+Required for SDK 2.29+, where the NKI blockwise kernel (neuronxcc.nki._private.blockwise_mm)
+was removed in NKI 0.3.0 GA. Without this patch, MoE models crash with:
+    NotImplementedError: _call_shard_hidden_kernel is not available
+
+The patch checks whether `use_torch_block_wise=True` is set in the NeuronConfig's
+blockwise_matmul_config and, if so, passes it to the ExpertMLPs constructor.
+
+Usage:
+    python patch_moe.py
+
+Must be run BEFORE starting vLLM. Only needed on SDK 2.29+ (NKI 0.3.0 GA).
+"""
+
+import os
+import re
+import sys
+
+
+def get_moe_path():
+    """Find the moe.py file in the installed NxDI package."""
+    try:
+        import neuronx_distributed_inference.modules.moe as moe_module
+
+        return moe_module.__file__
+    except ImportError:
+        # Fallback: search common venv locations
+        venv_paths = [
+            "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/lib/python3.12/site-packages/neuronx_distributed_inference/modules/moe.py",
+            "/opt/aws_neuronx_venv_pytorch_inference_vllm_0_13/lib/python3.12/site-packages/neuronx_distributed_inference/modules/moe.py",
+        ]
+        for p in venv_paths:
+            if os.path.exists(p):
+                return p
+        raise FileNotFoundError(
+            "Cannot find neuronx_distributed_inference/modules/moe.py"
+        )
+
+
+def patch_moe():
+    """Apply the torch_block_wise forwarding patch to moe.py."""
+    fpath = get_moe_path()
+    with open(fpath) as f:
+        content = f.read()
+
+    # Check if already patched
+    if "use_torch_block_wise" in content:
+        print(f"Already patched: {fpath}")
+        return True
+
+    # Find the initialize_moe_module function and add the forwarding logic.
+    # We need to find where ExpertMLPs is instantiated and add the kwarg.
+    # The pattern: look for "def initialize_moe_module" and add our check before ExpertMLPs(...)
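+    # (Assumptions worth stating: this rewrites installed library source in
+    # place, so it must be re-applied after any pip reinstall or upgrade of
+    # NxDI, and it assumes initialize_moe_module is a top-level function whose
+    # body is indented by four spaces, matching the injected block below.)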
+
+    # Strategy: Insert a block at the top of initialize_moe_module that sets up extra kwargs
+    pattern = r"(def initialize_moe_module\([^)]*\):)"
+    match = re.search(pattern, content)
+    if not match:
+        print("ERROR: Could not find initialize_moe_module function")
+        return False
+
+    # Find the line after the function def to insert our patch
+    func_start = match.end()
+    # Skip docstring if present
+    rest = content[func_start:]
+
+    # Insert patch: check blockwise_matmul_config and add use_torch_block_wise
+    patch_code = """
+    # --- SDK 2.29 MoE patch: forward use_torch_block_wise to ExpertMLPs ---
+    extra_kwargs = {}
+    if hasattr(config, 'neuron_config') and hasattr(config.neuron_config, 'blockwise_matmul_config'):
+        bmc = config.neuron_config.blockwise_matmul_config
+        use_torch = getattr(bmc, "use_torch_block_wise", False)
+        if use_torch:
+            extra_kwargs["use_torch_block_wise"] = True
+    # --- end patch ---
+"""
+
+    # Find the first newline after the function signature
+    newline_pos = rest.find("\n")
+    if newline_pos == -1:
+        print("ERROR: Unexpected file structure")
+        return False
+
+    # Check if there's a docstring; if so, insert after its closing quotes
+    stripped = rest[newline_pos:].lstrip()
+    if stripped.startswith('"""') or stripped.startswith("'''"):
+        quote = stripped[:3]
+        open_q = rest.find(quote, newline_pos)  # opening quotes
+        close_q = rest.find(quote, open_q + 3)  # matching closing quotes
+        if close_q > 0:
+            # Move past the closing quotes to the start of the next line
+            insert_pos = content.find("\n", func_start + close_q + 3) + 1
+        else:
+            insert_pos = func_start + newline_pos + 1
+    else:
+        insert_pos = func_start + newline_pos + 1
+
+    content = content[:insert_pos] + patch_code + content[insert_pos:]
+
+    # Now add **extra_kwargs to the ExpertMLPs constructor call.
+    # NOTE: [^)]+ stops at the first ')', so this assumes the ExpertMLPs(...)
+    # argument list contains no nested parentheses before it closes.
+    expert_pattern = r"(ExpertMLPs\([^)]+)\)"
+    matches = list(re.finditer(expert_pattern, content))
+    if matches:
+        # Patch every match, iterating in reverse so the offsets of earlier
+        # matches stay valid as the text grows
+        for m in reversed(matches):
+            old = m.group(0)
+            new = old[:-1] + ", **extra_kwargs)"
+            content = content[: m.start()] + new + content[m.end():]
+
+        with open(fpath, "w") as f:
+            f.write(content)
+        print(f"Patched successfully: {fpath}")
+        return True
+    else:
+        print(
+            "WARNING: Could not find ExpertMLPs constructor. Manual patch may be needed."
+        )
+        print(f"File: {fpath}")
+        # Write the partial patch anyway: the extra_kwargs setup is in place,
+        # but the ExpertMLPs call itself must be edited by hand
+        with open(fpath, "w") as f:
+            f.write(content)
+        return False
+
+
+if __name__ == "__main__":
+    success = patch_moe()
+    sys.exit(0 if success else 1)
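+
+# Example run (the reported path depends on the active venv):
+#   $ python patch_moe.py
+#   Patched successfully: /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/lib/python3.12/site-packages/neuronx_distributed_inference/modules/moe.py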