diff --git a/contrib/models/Isaac-0.2-2B/README.md b/contrib/models/Isaac-0.2-2B/README.md
new file mode 100644
index 00000000..c9ee7e67
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/README.md
@@ -0,0 +1,215 @@
+# Contrib Model: PerceptronAI Isaac-0.2-2B-Preview VLM
+
+NeuronX Distributed Inference implementation for the PerceptronAI Isaac-0.2-2B-Preview Vision-Language Model. Isaac combines a Qwen3 text backbone with a SigLIP2 vision encoder and 2-layer MLP projector with pixel shuffle.
+
+## Model Information
+
+- **HuggingFace ID:** [`PerceptronAI/Isaac-0.2-2B-Preview`](https://huggingface.co/PerceptronAI/Isaac-0.2-2B-Preview)
+- **Model Type:** VLM with SigLIP2 vision encoder, pixel shuffle, MLP projector, and Qwen3 text decoder
+- **License:** CC-BY-NC-4.0 (non-commercial)
+- **Requires:** `trust_remote_code=True`
+
+## Architecture Details
+
+### Text Backbone (Qwen3)
+
+| Spec | Isaac 2B |
+|---|---:|
+| **Layers** | 28 |
+| **Hidden Size** | 2048 |
+| **Head Dim** | 128 |
+| **Attention Heads** | 16 |
+| **KV Heads** | 8 |
+| **Intermediate Size** | 6144 |
+| **Vocabulary Size** | 151,936 |
+| **Max Position Embeddings** | 40,960 |
+| **Position Encoding** | RoPE (mRoPE-capable) |
+| **Normalization** | RMSNorm |
+| **Activation** | SiLU |
+| **Total Parameters** | 2.57B |
+
+### SigLIP2 Vision Encoder
+
+| Spec | Value |
+|---|---:|
+| **Layers** | 27 |
+| **Hidden Size** | 1152 |
+| **Head Dim** | 72 |
+| **Attention Heads** | 16 |
+| **KV Heads** | 16 |
+| **Intermediate Size** | 4304 |
+| **Activation** | GELU (approximate) |
+| **Image Size** | 256×256 |
+| **Patch Size** | 16 |
+| **Pixel Shuffle Scale** | 2 |
+| **Vision Tokens per Image** | 64 |
+
+### MLP Projector
+
+| Spec | Value |
+|---|---:|
+| **Layer 1** | Linear(4608 → 18432, no bias) + SiLU |
+| **Layer 2** | Linear(18432 → 2048, no bias) |
+| **Parameters** | ~122M |
+
+## Validation Results
+
+**Validated:** 2026-04-30
+**Configuration:** trn2.3xlarge, TP=1, batch_size=1, seq_len=1024, bfloat16
+
+### Accuracy
+
+| Test | Status | Result |
+|------|--------|--------|
+| Text logit cosine (5 prompts) | PASS | avg 0.99998 vs CPU ref |
+| Top-1 token match | PASS | 100% match (8/8 prompts) |
+| Image+text generation | PASS | Coherent descriptions |
+| TP=2 accuracy | PASS | cosine 0.99997 |
+| TP=4 accuracy | PASS | cosine 0.99997 |
+
+### Performance (trn2.3xlarge, TP=1, BS=1)
+
+| Metric | seq_len=1024 | seq_len=4096 |
+|--------|-------------|-------------|
+| **TKG Throughput** | 110.7 tok/s | 94.0 tok/s |
+| **TPOT** | 9.0 ms | 10.6 ms |
+| **TTFT** | 9.0 ms | 10.6 ms |
+| **Image+text tok/s** | 108.7 tok/s | 93.1 tok/s |
+| **Projected DP=4** | ~443 tok/s | ~376 tok/s |
+
+**Compilation time:** ~196s (one-time, seq_len=1024)
+
+### GPU Comparison (L40S, vLLM 0.20.0, CUDA graphs enabled)
+
+| Metric | L40S GPU | trn2 Neuron (TP=1) | trn2 Neuron (DP=4) |
+|--------|----------|---------------------|---------------------|
+| **TPOT (short input)** | 5.75 ms | 9.0 ms | — |
+| **Throughput (short input)** | 174 tok/s | 111 tok/s | ~443 tok/s |
+| **TPOT (long input)** | 6.09 ms | 9.0 ms | — |
+| **Throughput (long input)** | 164 tok/s | 111 tok/s | ~443 tok/s |
+
+- **Per-core:** L40S is ~1.5x faster than a single NeuronCore
+- **Per-device (DP=4):** trn2.3xlarge is ~2.5x faster than L40S
+- GPU benchmark: L40S with vLLM 0.20.0, batch_size=1, CUDA graphs enabled (default)
+- Neuron benchmark: trn2.3xlarge, TP=1, batch_size=1, bfloat16, CTE flash attention
+
+## Usage
+
+```python
+import torch
+from transformers import AutoConfig, AutoTokenizer
+from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig
+from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
+
+from isaac_neuron.modeling_isaac import (
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+model_path = "/path/to/Isaac-0.2-2B-Preview"
+compiled_path = "/path/to/compiled/model"
+
+# Configure
+text_config = NeuronConfig(
+    batch_size=1,
+    seq_len=1024,
+    torch_dtype=torch.bfloat16,
+    tp_degree=1,
+    is_continuous_batching=True,
+    ctx_batch_size=1,
+    enable_bucketing=True,
+    context_encoding_buckets=[1024],
+    token_generation_buckets=[1024],
+    on_device_sampling_config=OnDeviceSamplingConfig(
+        dynamic=True, do_sample=True, deterministic=True,
+        top_k=1, global_topk=256, top_k_kernel_enabled=True,
+    ),
+    attn_kernel_enabled=True,  # CTE flash attention
+    fused_qkv=False,
+    mlp_kernel_enabled=False,
+)
+
+vision_config = NeuronConfig(
+    batch_size=1, seq_len=1024, torch_dtype=torch.bfloat16,
+    tp_degree=1, is_continuous_batching=True, ctx_batch_size=1,
+    enable_bucketing=True, buckets=[1],
+)
+
+hf_config = AutoConfig.from_pretrained(model_path, trust_remote_code=True)
+config = IsaacInferenceConfig(
+    text_neuron_config=text_config,
+    vision_neuron_config=vision_config,
+    load_config=load_pretrained_config(hf_config=hf_config),
+)
+config.image_token_index = 151655  # <|image_pad|>
+
+# Compile and load
+model = NeuronIsaacForConditionalGeneration(model_path, config)
+model.compile(compiled_path, debug=False)
+model.load(compiled_path)
+
+# Generate (see integration tests for full examples)
+```
+
+## Compatibility Matrix
+
+| Instance/Version | SDK 2.29 | SDK 2.28 and earlier |
+|------------------|----------|----------------------|
+| trn2.3xlarge (TP=1) | Tested | Not tested |
+| trn2.3xlarge (TP=2) | Tested | Not tested |
+| trn2.3xlarge (TP=4) | Tested | Not tested |
+| trn1 | Not tested | Not tested |
+| inf2 | Not tested | Not tested |
+
+## Known Limitations
+
+- **Batch size:** Only BS=1 supported (NxDI VLM framework limitation, shared with all VLM contribs)
+- **MLP NKI kernel:** Not compatible at TP=1 (intermediate=6144 exceeds SBUF capacity). Use default kernels.
+- **QKV NKI kernel:** Not compatible (Q/K layernorm incompatible with fused QKV kernel)
+- **Image size:** Fixed at 256×256 (64 vision tokens per image)
+- **License:** CC-BY-NC-4.0 — non-commercial use only
+
+## Testing
+
+Run integration tests:
+
+```bash
+# Set up environment
+source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+export PYTHONPATH=/path/to/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+
+# Run validation
+cd contrib/models/Isaac-0.2-2B
+python test/integration/run_isaac.py
+```
+
+## Module Structure
+
+```
+contrib/models/Isaac-0.2-2B/
+├── README.md
+├── src/
+│   └── isaac_neuron/
+│       ├── __init__.py
+│       ├── modeling_isaac.py          # VLM orchestrator + config + state dict mapping
+│       ├── modeling_isaac_text.py     # Text model (NeuronBaseModel + Qwen3 layers)
+│       ├── modeling_isaac_vision.py   # Vision wrapper + MLP projector + pixel shuffle
+│       ├── ndxi_patch.py             # SDK 2.29 compatibility patches
+│       ├── utils.py                  # QKV fusion + pixel shuffle utilities
+│       └── siglip/
+│           ├── modeling_siglip.py    # SigLIP2 vision encoder
+│           └── layers.py            # OutputChannelParallelConv2d
+└── test/
+    └── integration/
+        ├── run_isaac.py              # Main compilation + generation test
+        ├── benchmark.py              # Formal benchmark script
+        ├── test_tp.py                # TP=2/4 validation
+        ├── validate_text_logits.py   # Text logit validation vs CPU
+        ├── validate_tkg.py           # TKG multi-token validation
+        ├── validate_image_text.py    # Image+text E2E validation
+        └── validate_vision_encoder.py # Vision encoder sanity checks
+```
+
+## Example Checkpoint
+
+* [`PerceptronAI/Isaac-0.2-2B-Preview`](https://huggingface.co/PerceptronAI/Isaac-0.2-2B-Preview)
diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/benchmark_gpu.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/benchmark_gpu.py
new file mode 100644
index 00000000..aa6b938f
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/benchmark_gpu.py
@@ -0,0 +1,341 @@
+#!/usr/bin/env python3
+# Copyright 2025 (c) Amazon.com and Affiliates
+"""GPU benchmark for Isaac-0.2-2B-Preview using vLLM.
+
+Measures TTFT, TPOT, tok/s across multiple workloads to match Neuron benchmark.
+Follows GPU Benchmark Standard (steering/gpu-benchmark-standard.md).
+
+Usage:
+    pip install vllm transformers torch pillow
+    python benchmark_gpu.py [--model PerceptronAI/Isaac-0.2-2B-Preview] [--warmup 5] [--iterations 10]
+"""
+
+import argparse
+import json
+import os
+import statistics
+import time
+from pathlib import Path
+
+import torch
+from transformers import AutoTokenizer
+
+
+# ── Workload definitions matching Neuron benchmark ──────────────────────
+
+WORKLOADS = {
+    "short-short": {"input_tokens": 128, "output_tokens": 128},
+    "short-long": {"input_tokens": 128, "output_tokens": 512},
+    "long-short": {"input_tokens": 2048, "output_tokens": 128},
+    "long-long": {"input_tokens": 2048, "output_tokens": 512},
+}
+
+FILLER_TEXT = (
+    "The quick brown fox jumps over the lazy dog. "
+    "A journey of a thousand miles begins with a single step. "
+    "To be or not to be, that is the question. "
+    "All that glitters is not gold. "
+    "The only thing we have to fear is fear itself. "
+)
+
+
+def build_prompt(tokenizer, target_tokens: int) -> str:
+    """Build a synthetic prompt of approximately target_tokens length."""
+    repeated = FILLER_TEXT * (target_tokens // 10 + 10)
+    token_ids = tokenizer.encode(repeated)[:target_tokens]
+    return tokenizer.decode(token_ids, skip_special_tokens=True)
+
+
+def percentiles(values, pcts=(50, 95, 99)):
+    """Calculate percentiles."""
+    if not values:
+        return {f"p{p}": None for p in pcts}
+    s = sorted(values)
+    n = len(s)
+    return {f"p{p}": s[min(int(p / 100 * n), n - 1)] for p in pcts}
+
+
+def benchmark_vllm_offline(model_path, workloads, warmup, iterations, dtype):
+    """Run benchmark using vLLM offline (Python API)."""
+    from vllm import LLM, SamplingParams
+
+    print(f"Loading model: {model_path}")
+    print(f"dtype: {dtype}")
+
+    llm = LLM(
+        model=model_path,
+        dtype=dtype,
+        trust_remote_code=True,
+        max_model_len=4096,
+        gpu_memory_utilization=0.90,
+    )
+    tokenizer = llm.get_tokenizer()
+
+    results = {}
+
+    for wl_name, wl_config in workloads.items():
+        input_tokens = wl_config["input_tokens"]
+        output_tokens = wl_config["output_tokens"]
+        print(f"\n{'=' * 60}")
+        print(f"Workload: {wl_name} (input={input_tokens}, output={output_tokens})")
+        print(f"{'=' * 60}")
+
+        prompt = build_prompt(tokenizer, input_tokens)
+        actual_input = len(tokenizer.encode(prompt))
+        print(f"  Actual input tokens: {actual_input}")
+
+        sampling_params = SamplingParams(
+            temperature=0,  # Greedy for reproducibility
+            max_tokens=output_tokens,
+        )
+
+        # Warmup
+        print(f"  Warming up ({warmup} runs)...")
+        for _ in range(warmup):
+            llm.generate([prompt], sampling_params)
+
+        # Timed iterations
+        print(f"  Benchmarking ({iterations} runs)...")
+        ttfts = []
+        tpots = []
+        throughputs = []
+        e2e_latencies = []
+        output_lengths = []
+
+        for i in range(iterations):
+            t_start = time.perf_counter()
+            outputs = llm.generate([prompt], sampling_params)
+            t_end = time.perf_counter()
+
+            output = outputs[0]
+            n_output_tokens = len(output.outputs[0].token_ids)
+            e2e = t_end - t_start
+
+            # Extract TTFT from metrics if available
+            metrics = output.metrics
+            if (
+                metrics
+                and hasattr(metrics, "first_token_time")
+                and metrics.first_token_time
+            ):
+                ttft = metrics.first_token_time - metrics.arrival_time
+            else:
+                # Approximate: E2E - decode time
+                ttft = e2e / (n_output_tokens + 1) if n_output_tokens > 0 else e2e
+
+            # TPOT = decode time / (output tokens - 1)
+            decode_time = e2e - ttft
+            tpot = decode_time / max(n_output_tokens - 1, 1)
+            tps = n_output_tokens / e2e if e2e > 0 else 0
+
+            ttfts.append(ttft * 1000)  # to ms
+            tpots.append(tpot * 1000)  # to ms
+            throughputs.append(tps)
+            e2e_latencies.append(e2e * 1000)  # to ms
+            output_lengths.append(n_output_tokens)
+
+        results[wl_name] = {
+            "input_tokens": actual_input,
+            "target_output_tokens": output_tokens,
+            "avg_output_tokens": statistics.mean(output_lengths),
+            "ttft_ms": percentiles(ttfts),
+            "tpot_ms": percentiles(tpots),
+            "throughput_tok_s": percentiles(throughputs),
+            "e2e_latency_ms": percentiles(e2e_latencies),
+            "raw_ttfts": ttfts,
+            "raw_tpots": tpots,
+            "raw_throughputs": throughputs,
+            "raw_e2e": e2e_latencies,
+        }
+
+        print(f"  TTFT (P50): {percentiles(ttfts)['p50']:.1f} ms")
+        print(f"  TPOT (P50): {percentiles(tpots)['p50']:.2f} ms")
+        print(f"  Throughput (P50): {percentiles(throughputs)['p50']:.1f} tok/s")
+        print(f"  E2E (P50): {percentiles(e2e_latencies)['p50']:.1f} ms")
+        print(f"  Avg output tokens: {statistics.mean(output_lengths):.0f}")
+
+    return results
+
+
+def benchmark_image_text(model_path, warmup, iterations, dtype):
+    """Benchmark image+text workload."""
+    from vllm import LLM, SamplingParams
+
+    print(f"\n{'=' * 60}")
+    print("Image+Text Benchmark")
+    print(f"{'=' * 60}")
+
+    llm = LLM(
+        model=model_path,
+        dtype=dtype,
+        trust_remote_code=True,
+        max_model_len=4096,
+        gpu_memory_utilization=0.90,
+        limit_mm_per_prompt={"image": 1},
+    )
+
+    sampling_params = SamplingParams(temperature=0, max_tokens=128)
+
+    # Use a simple test prompt with image URL
+    image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
+
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": image_url}},
+                {"type": "text", "text": "Describe this image in detail."},
+            ],
+        }
+    ]
+
+    # Warmup
+    print(f"  Warming up ({warmup} runs)...")
+    for _ in range(warmup):
+        try:
+            llm.chat(messages, sampling_params)
+        except Exception as e:
+            print(f"  Warmup error (may be expected): {e}")
+            return None
+
+    # Timed iterations
+    print(f"  Benchmarking ({iterations} runs)...")
+    e2e_latencies = []
+    output_lengths = []
+
+    for i in range(iterations):
+        t_start = time.perf_counter()
+        outputs = list(llm.chat(messages, sampling_params))
+        t_end = time.perf_counter()
+
+        output = outputs[0]
+        n_tokens = len(output.outputs[0].token_ids)
+        e2e = (t_end - t_start) * 1000
+
+        e2e_latencies.append(e2e)
+        output_lengths.append(n_tokens)
+
+    avg_tokens = statistics.mean(output_lengths)
+    avg_e2e = statistics.mean(e2e_latencies)
+    avg_tps = avg_tokens / (avg_e2e / 1000) if avg_e2e > 0 else 0
+
+    result = {
+        "avg_output_tokens": avg_tokens,
+        "e2e_latency_ms": percentiles(e2e_latencies),
+        "throughput_tok_s": avg_tps,
+        "text_preview": outputs[0].outputs[0].text[:150] if outputs else "",
+    }
+
+    print(f"  Output tokens: {avg_tokens:.0f}")
+    print(f"  E2E (P50): {percentiles(e2e_latencies)['p50']:.1f} ms")
+    print(f"  Throughput: {avg_tps:.1f} tok/s")
+
+    return result
+
+
+def get_gpu_info():
+    """Get GPU information."""
+    info = {}
+    if torch.cuda.is_available():
+        info["gpu_name"] = torch.cuda.get_device_name(0)
+        info["gpu_count"] = torch.cuda.device_count()
+        props = torch.cuda.get_device_properties(0)
+        info["gpu_memory_gb"] = (
+            getattr(props, "total_memory", getattr(props, "total_mem", 0)) / 1e9
+        )
+    return info
+
+
+def main():
+    parser = argparse.ArgumentParser(description="GPU benchmark for Isaac-0.2-2B")
+    parser.add_argument(
+        "--model",
+        default="PerceptronAI/Isaac-0.2-2B-Preview",
+        help="HuggingFace model ID or local path",
+    )
+    parser.add_argument("--warmup", type=int, default=5)
+    parser.add_argument("--iterations", type=int, default=10)
+    parser.add_argument(
+        "--dtype", default="bfloat16", choices=["bfloat16", "float16", "auto"]
+    )
+    parser.add_argument(
+        "--workloads",
+        nargs="+",
+        default=["short-short", "short-long", "long-short", "long-long"],
+        choices=list(WORKLOADS.keys()),
+    )
+    parser.add_argument(
+        "--skip-image", action="store_true", help="Skip image+text benchmark"
+    )
+    parser.add_argument("--output", default="gpu_benchmark_results.json")
+    args = parser.parse_args()
+
+    gpu_info = get_gpu_info()
+    print(f"GPU: {gpu_info.get('gpu_name', 'unknown')}")
+    print(f"GPU Memory: {gpu_info.get('gpu_memory_gb', 0):.1f} GB")
+    print(f"Model: {args.model}")
+    print(f"dtype: {args.dtype}")
+    print(f"Workloads: {args.workloads}")
+    print(f"Warmup: {args.warmup}, Iterations: {args.iterations}")
+
+    # Select workloads
+    selected = {k: WORKLOADS[k] for k in args.workloads}
+
+    # Run text benchmarks
+    text_results = benchmark_vllm_offline(
+        args.model, selected, args.warmup, args.iterations, args.dtype
+    )
+
+    # Run image+text benchmark
+    image_result = None
+    if not args.skip_image:
+        image_result = benchmark_image_text(
+            args.model, args.warmup, args.iterations, args.dtype
+        )
+
+    # Compile all results
+    all_results = {
+        "metadata": {
+            "model": args.model,
+            "dtype": args.dtype,
+            "warmup": args.warmup,
+            "iterations": args.iterations,
+            "gpu": gpu_info,
+            "framework": "vLLM",
+            "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
+        },
+        "text_benchmarks": text_results,
+        "image_text_benchmark": image_result,
+    }
+
+    # Summary table
+    print(f"\n{'=' * 80}")
+    print("GPU BENCHMARK SUMMARY")
+    print(f"{'=' * 80}")
+    print(
+        f"{'Workload':<15} {'In':>5} {'Out':>5} {'TTFT P50':>10} {'TPOT P50':>10} "
+        f"{'tok/s P50':>10} {'E2E P50':>10}"
+    )
+    print("-" * 70)
+    for wl_name, r in text_results.items():
+        print(
+            f"{wl_name:<15} {r['input_tokens']:>5} {r['avg_output_tokens']:>5.0f} "
+            f"{r['ttft_ms']['p50']:>10.1f} {r['tpot_ms']['p50']:>10.2f} "
+            f"{r['throughput_tok_s']['p50']:>10.1f} {r['e2e_latency_ms']['p50']:>10.1f}"
+        )
+    if image_result:
+        print(
+            f"{'image+text':<15} {'N/A':>5} {image_result['avg_output_tokens']:>5.0f} "
+            f"{'N/A':>10} {'N/A':>10} "
+            f"{image_result['throughput_tok_s']:>10.1f} "
+            f"{image_result['e2e_latency_ms']['p50']:>10.1f}"
+        )
+
+    # Save
+    with open(args.output, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    print(f"\nResults saved to {args.output}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/fix_indent.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/fix_indent.py
new file mode 100644
index 00000000..442b9183
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/fix_indent.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+"""Remove leftover indented stubs from modular_isaac.py."""
+
+import sys
+
+paths = (
+    sys.argv[1:]
+    if len(sys.argv) > 1
+    else [
+        "/mnt/models/Isaac-0.2-2B-Preview/modular_isaac.py",
+        "/home/ubuntu/.cache/huggingface/modules/transformers_modules/"
+        "Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py",
+    ]
+)
+
+INDENTED_STUBS = (
+    "\n\n"
+    "    class Event: pass\n"
+    "    class Stream: pass\n"
+    "    class TensorStream: pass\n"
+    "    class TextType: pass\n"
+    "    class VisionType: pass\n"
+    "    def create_stream(*a, **kw): return None\n"
+    "    def group_streams(*a, **kw): return None\n"
+    "    def compute_mrope_pos_tensor(*a, **kw): return None\n"
+    "    def modality_mask(*a, **kw): return None\n"
+    "    def reconstruct_tensor_stream_from_compact_dict(*a, **kw): return None\n"
+    "    def tensor_stream_token_view(*a, **kw): return None\n"
+    "    def ts_slice(*a, **kw): return None"
+)
+
+for path in paths:
+    try:
+        with open(path, "r") as f:
+            content = f.read()
+    except FileNotFoundError:
+        print(f"SKIP: {path}")
+        continue
+
+    if INDENTED_STUBS in content:
+        content = content.replace(INDENTED_STUBS, "")
+        with open(path, "w") as f:
+            f.write(content)
+        print(f"FIXED: removed indented stubs from {path}")
+    else:
+        print(f"OK: no indented stubs found in {path}")
diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/gpu_benchmark_results.json b/contrib/models/Isaac-0.2-2B/gpu_benchmark/gpu_benchmark_results.json
new file mode 100644
index 00000000..5654fb81
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/gpu_benchmark_results.json
@@ -0,0 +1,310 @@
+{
+  "metadata": {
+    "model": "/home/ubuntu/Isaac-0.2-2B-Preview",
+    "dtype": "bfloat16",
+    "warmup": 5,
+    "iterations": 10,
+    "gpu": {
+      "gpu_name": "NVIDIA L40S",
+      "gpu_count": 1,
+      "gpu_memory_gb": 47.665709056
+    },
+    "framework": "vLLM",
+    "timestamp": "2026-05-01 02:03:04"
+  },
+  "text_benchmarks": {
+    "short-short": {
+      "input_tokens": 128,
+      "target_output_tokens": 128,
+      "avg_output_tokens": 128,
+      "ttft_ms": {
+        "p50": 5.725635930232787,
+        "p95": 5.727955108527073,
+        "p99": 5.727955108527073
+      },
+      "tpot_ms": {
+        "p50": 5.770719677714935,
+        "p95": 5.773057117255633,
+        "p99": 5.773057117255633
+      },
+      "throughput_tok_s": {
+        "p50": 173.30585065401496,
+        "p95": 173.38756231629444,
+        "p99": 173.38756231629444
+      },
+      "e2e_latency_ms": {
+        "p50": 738.6070350000296,
+        "p95": 738.9062089999925,
+        "p99": 738.9062089999925
+      },
+      "raw_ttfts": [
+        5.727955108527073,
+        5.7257079457365085,
+        5.725635930232787,
+        5.724229116279367,
+        5.724252604651145,
+        5.722717643411124,
+        5.726226426356489,
+        5.726308906976418,
+        5.725303279069792,
+        5.725415837209167
+      ],
+      "raw_tpots": [
+        5.773057117255633,
+        5.770792260269867,
+        5.770719677714935,
+        5.769301786486291,
+        5.769325459805879,
+        5.76777841225688,
+        5.771314823414414,
+        5.771397953488043,
+        5.770384407251444,
+        5.770497851675381
+      ],
+      "raw_throughputs": [
+        173.22902208824354,
+        173.29700910685713,
+        173.2991887898751,
+        173.34177962821394,
+        173.34106835349465,
+        173.38756231629444,
+        173.28131794586687,
+        173.27882203606558,
+        173.3092578069886,
+        173.30585065401496
+      ],
+      "raw_e2e": [
+        738.9062089999925,
+        738.6163250000095,
+        738.6070350000296,
+        738.4255560000383,
+        738.4285859999977,
+        738.230576000035,
+        738.6832089999871,
+        738.6938489999579,
+        738.5641230000033,
+        738.5786429999825
+      ]
+    },
+    "short-long": {
+      "input_tokens": 128,
+      "target_output_tokens": 512,
+      "avg_output_tokens": 512,
+      "ttft_ms": {
+        "p50": 5.739147309941538,
+        "p95": 5.742040413255263,
+        "p99": 5.742040413255263
+      },
+      "tpot_ms": {
+        "p50": 5.750378517984477,
+        "p95": 5.753277282948521,
+        "p99": 5.753277282948521
+      },
+      "throughput_tok_s": {
+        "p50": 173.90271333810932,
+        "p95": 173.90904504703894,
+        "p99": 173.90904504703894
+      },
+      "e2e_latency_ms": {
+        "p50": 2944.1825700000095,
+        "p95": 2945.6667319999497,
+        "p99": 2945.6667319999497
+      },
+      "raw_ttfts": [
+        5.739037317738782,
+        5.739147309941538,
+        5.739028274853798,
+        5.740372360623822,
+        5.739132317738796,
+        5.742040413255263,
+        5.7412690487330265,
+        5.739214384015546,
+        5.738923366471569,
+        5.739001226120805
+      ],
+      "raw_tpots": [
+        5.750268310532791,
+        5.750378517984477,
+        5.750259249951359,
+        5.751605966026217,
+        5.750363496442785,
+        5.753277282948521,
+        5.752504408906673,
+        5.7504457233189035,
+        5.75015413626897,
+        5.750232148285424
+      ],
+      "raw_throughputs": [
+        173.90559200169955,
+        173.9022590572562,
+        173.905866021654,
+        173.8651466422899,
+        173.90271333810932,
+        173.81463912327226,
+        173.8379918776767,
+        173.900226665327,
+        173.90904504703894,
+        173.90668566485508
+      ],
+      "raw_e2e": [
+        2944.126143999995,
+        2944.1825700000095,
+        2944.1215049999983,
+        2944.811021000021,
+        2944.174879000002,
+        2945.6667319999497,
+        2945.2710220000426,
+        2944.2169789999753,
+        2944.067686999915,
+        2944.107628999973
+      ]
+    },
+    "long-short": {
+      "input_tokens": 2048,
+      "target_output_tokens": 128,
+      "avg_output_tokens": 128,
+      "ttft_ms": {
+        "p50": 6.052418666666749,
+        "p95": 6.054631875969768,
+        "p99": 6.054631875969768
+      },
+      "tpot_ms": {
+        "p50": 6.100075506561763,
+        "p95": 6.102306142709688,
+        "p99": 6.102306142709688
+      },
+      "throughput_tok_s": {
+        "p50": 163.94669556515132,
+        "p95": 164.06837099200277,
+        "p99": 164.06837099200277
+      },
+      "e2e_latency_ms": {
+        "p50": 780.7620080000106,
+        "p95": 781.0475120001001,
+        "p99": 781.0475120001001
+      },
+      "raw_ttfts": [
+        6.054631875969768,
+        6.052931550387207,
+        6.052418666666749,
+        6.0518397131778165,
+        6.053834852713224,
+        6.052260209301938,
+        6.053383906977044,
+        6.051427062015693,
+        6.047771767441205,
+        6.0498680697673635
+      ],
+      "raw_tpots": [
+        6.102306142709688,
+        6.100592428736713,
+        6.100075506561763,
+        6.099491994383941,
+        6.10150284367947,
+        6.099915801501166,
+        6.101048347189462,
+        6.099076094000069,
+        6.095392017578537,
+        6.097504826222225
+      ],
+      "raw_throughputs": [
+        163.88247582047683,
+        163.9285119541835,
+        163.94240330402738,
+        163.9580869689748,
+        163.90405192021308,
+        163.94669556515132,
+        163.91626192283178,
+        163.96926738880532,
+        164.06837099200277,
+        164.01152067662508
+      ],
+      "raw_e2e": [
+        781.0475120001001,
+        780.8281699999498,
+        780.7620080000106,
+        780.6873229999383,
+        780.9446960000059,
+        780.74156699995,
+        780.8865240000387,
+        780.6340910000245,
+        780.1625579999154,
+        780.4329809999899
+      ]
+    },
+    "long-long": {
+      "input_tokens": 2048,
+      "target_output_tokens": 512,
+      "avg_output_tokens": 512,
+      "ttft_ms": {
+        "p50": 6.079098984405498,
+        "p95": 6.080241068226007,
+        "p99": 6.080241068226007
+      },
+      "tpot_ms": {
+        "p50": 6.090995459913141,
+        "p95": 6.092139778731342,
+        "p99": 6.092139778731342
+      },
+      "throughput_tok_s": {
+        "p50": 164.1837129551535,
+        "p95": 164.20382107301052,
+        "p99": 164.20382107301052
+      },
+      "e2e_latency_ms": {
+        "p50": 3118.57777900002,
+        "p95": 3119.163667999942,
+        "p99": 3119.163667999942
+      },
+      "raw_ttfts": [
+        6.0786242748539125,
+        6.078865341130544,
+        6.079913033138502,
+        6.0791121345028465,
+        6.078488198830433,
+        6.079595109161773,
+        6.078705553606205,
+        6.080241068226007,
+        6.079098984405498,
+        6.078120933723229
+      ],
+      "raw_tpots": [
+        6.090519821380045,
+        6.090761359410643,
+        6.091811101696503,
+        6.091008635744536,
+        6.090383479062978,
+        6.091492555559351,
+        6.090601259190561,
+        6.092139778731342,
+        6.090995459913141,
+        6.090015495237364
+      ],
+      "raw_throughputs": [
+        164.19022415811258,
+        164.1837129551535,
+        164.15542078009074,
+        164.17704759822297,
+        164.19389980114536,
+        164.16400505967496,
+        164.18802876035227,
+        164.14656443094012,
+        164.1774027403524,
+        164.20382107301052
+      ],
+      "raw_e2e": [
+        3118.3342530000573,
+        3118.457919999969,
+        3118.9953860000514,
+        3118.58452499996,
+        3118.264446000012,
+        3118.8322909999897,
+        3118.375948999983,
+        3119.163667999942,
+        3118.57777900002,
+        3118.076039000016
+      ]
+    }
+  },
+  "image_text_benchmark": null
+}
\ No newline at end of file
diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py
new file mode 100644
index 00000000..01ac91b4
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py
@@ -0,0 +1,97 @@
+#!/usr/bin/env python3
+"""Remove perceptron.tensorstream import entirely from modular_isaac.py.
+Replaces the try/except import block with direct stub definitions."""
+
+import sys
+
+paths = (
+    sys.argv[1:]
+    if len(sys.argv) > 1
+    else [
+        "/mnt/models/Isaac-0.2-2B-Preview/modular_isaac.py",
+        "/home/ubuntu/.cache/huggingface/modules/transformers_modules/"
+        "Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py",
+    ]
+)
+
+# Replacement: just the stubs, no try/except, no import
+REPLACEMENT = """# perceptron.tensorstream stubs (not available outside Perceptron environment)
+class Event: pass
+class Stream: pass
+class TensorStream: pass
+class TextType: pass
+class VisionType: pass
+def create_stream(*a, **kw): return None
+def group_streams(*a, **kw): return None
+def compute_mrope_pos_tensor(*a, **kw): return None
+def modality_mask(*a, **kw): return None
+def reconstruct_tensor_stream_from_compact_dict(*a, **kw): return None
+def tensor_stream_token_view(*a, **kw): return None
+def ts_slice(*a, **kw): return None"""
+
+for path in paths:
+    try:
+        with open(path, "r") as f:
+            lines = f.readlines()
+    except FileNotFoundError:
+        print(f"SKIP: {path} not found")
+        continue
+
+    # Find the try block that imports from perceptron
+    try_start = None
+    except_end = None
+    in_except = False
+
+    for i, line in enumerate(lines):
+        if (
+            line.strip() == "try:"
+            and i + 1 < len(lines)
+            and "perceptron" in lines[i + 1]
+        ):
+            try_start = i
+        if try_start is not None and line.strip().startswith(
+            "except ModuleNotFoundError"
+        ):
+            in_except = True
+        if in_except and try_start is not None:
+            # Find end of except block (next non-indented, non-blank line after except body)
+            if i > try_start + 5:  # we're past the except line itself
+                # Check if this line is NOT indented (new top-level statement)
+                stripped = line.strip()
+                if (
+                    stripped
+                    and not line.startswith(" ")
+                    and not line.startswith("\t")
+                    and "def " not in lines[i - 1]
+                    if i > 0
+                    else True
+                ):
+                    # But also check it's not a continuation of the except body
+                    pass
+
+    # Simpler approach: find by content markers
+    content = "".join(lines)
+
+    # Pattern 1: Original unpatched try/except
+    import re
+
+    # Match everything from "try:\n    from perceptron" to the end of the except block
+    pattern = r"try:\n    from perceptron\.tensorstream\.tensorstream import \(.*?\n(?:.*?\n)*?except ModuleNotFoundError.*?\n(?:    .*\n)*"
+    match = re.search(pattern, content)
+    if match:
+        old_block = match.group(0)
+        # Remove trailing newlines from old_block to be precise
+        content = content.replace(old_block, REPLACEMENT + "\n\n")
+        with open(path, "w") as f:
+            f.write(content)
+        print(f"SUCCESS: Replaced try/import block in {path}")
+    else:
+        # Check if already replaced
+        if "# perceptron.tensorstream stubs" in content:
+            print(f"ALREADY PATCHED: {path}")
+        else:
+            print(f"WARN: Could not find try/import block in {path}")
+            # Show perceptron references
+            for i, line in enumerate(lines):
+                if "perceptron" in line.lower():
+                    print(f"  Line {i + 1}: {line.rstrip()}")
diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/patch_gpu_modular.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/patch_gpu_modular.py
new file mode 100644
index 00000000..43b1457f
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/patch_gpu_modular.py
@@ -0,0 +1,86 @@
+#!/usr/bin/env python3
+"""Patch modular_isaac.py on GPU to handle missing imports."""
+
+import sys
+
+path = (
+    sys.argv[1]
+    if len(sys.argv) > 1
+    else (
+        "/home/ubuntu/.cache/huggingface/modules/transformers_modules/"
+        "Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py"
+    )
+)
+
+with open(path, "r") as f:
+    content = f.read()
+
+fixes = 0
+
+# Fix 1: DefaultFastImageProcessorKwargs
+old1 = (
+    "from transformers.image_processing_utils_fast import (\n"
+    "    BaseImageProcessorFast,\n"
+    "    DefaultFastImageProcessorKwargs,\n"
+    "    SizeDict,\n"
+    "    group_images_by_shape,\n"
+    "    reorder_images,\n"
+    ")"
+)
+new1 = (
+    "from transformers.image_processing_utils_fast import (\n"
+    "    BaseImageProcessorFast,\n"
+    "    SizeDict,\n"
+    "    group_images_by_shape,\n"
+    "    reorder_images,\n"
+    ")\n"
+    "try:\n"
+    "    from transformers.image_processing_utils_fast import DefaultFastImageProcessorKwargs\n"
+    "except ImportError:\n"
+    "    from typing import TypedDict\n"
+    "    class DefaultFastImageProcessorKwargs(TypedDict, total=False):\n"
+    "        pass"
+)
+if old1 in content:
+    content = content.replace(old1, new1)
+    fixes += 1
+    print("Fix 1 applied: DefaultFastImageProcessorKwargs")
+else:
+    print("Fix 1: not found (may already be patched)")
+
+# Fix 2: perceptron soft-fail
+old2 = (
+    "except ModuleNotFoundError as exc:  # pragma: no cover - import guard\n"
+    "    raise ModuleNotFoundError(\n"
+    '        "perceptron.tensorstream is required for the Isaac HuggingFace integration. "\n'
+    '        "Ensure the TensorStream package is installed and on PYTHONPATH."\n'
+    "    ) from exc"
+)
+new2 = (
+    "except ModuleNotFoundError:  # pragma: no cover - import guard\n"
+    "    import warnings as _warnings\n"
+    '    _warnings.warn("perceptron.tensorstream not available; TensorStream features disabled")\n'
+    "\n"
+    "    class Event: pass\n"
+    "    class Stream: pass\n"
+    "    class TensorStream: pass\n"
+    "    class TextType: pass\n"
+    "    class VisionType: pass\n"
+    "    def create_stream(*a, **kw): return None\n"
+    "    def group_streams(*a, **kw): return None\n"
+    "    def compute_mrope_pos_tensor(*a, **kw): return None\n"
+    "    def modality_mask(*a, **kw): return None\n"
+    "    def reconstruct_tensor_stream_from_compact_dict(*a, **kw): return None\n"
+    "    def tensor_stream_token_view(*a, **kw): return None\n"
+    "    def ts_slice(*a, **kw): return None"
+)
+if old2 in content:
+    content = content.replace(old2, new2)
+    fixes += 1
+    print("Fix 2 applied: perceptron soft-fail")
+else:
+    print("Fix 2: not found (may already be patched)")
+
+with open(path, "w") as f:
+    f.write(content)
+print(f"Done: {fixes} fixes applied to {path}")
diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/setup_gpu.sh b/contrib/models/Isaac-0.2-2B/gpu_benchmark/setup_gpu.sh
new file mode 100644
index 00000000..47549197
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/setup_gpu.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+# Setup script for GPU benchmark of Isaac-0.2-2B
+# Run on a fresh GPU DLAMI (g6e.xlarge with L40S)
+#
+# Usage:
+#   bash setup_gpu.sh
+
+set -e
+
+echo "=== Isaac GPU Benchmark Setup ==="
+
+# Use the PyTorch 2.7 virtual environment from DLAMI
+echo "Setting up Python environment..."
+source /opt/dlami/nvme/pytorch-2.7/bin/activate 2>/dev/null || {
+    echo "DLAMI venv not found, using system Python..."
+    python3 -m venv ~/gpu_bench_env
+    source ~/gpu_bench_env/bin/activate
+}
+
+# Install vLLM and dependencies
+echo "Installing vLLM..."
+pip install -U vllm transformers torch pillow requests 2>&1 | tail -5
+
+# Download model (Isaac requires trust_remote_code)
+echo "Downloading Isaac-0.2-2B-Preview..."
+pip install -U "huggingface_hub[cli]" 2>&1 | tail -3
+huggingface-cli download PerceptronAI/Isaac-0.2-2B-Preview --local-dir ~/Isaac-0.2-2B-Preview
+
+echo ""
+echo "=== Setup complete ==="
+echo "To run benchmark:"
+echo "  python benchmark_gpu.py --model ~/Isaac-0.2-2B-Preview"
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/__init__.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/__init__.py
new file mode 100644
index 00000000..667cd6a4
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/__init__.py
@@ -0,0 +1,23 @@
+# Copyright 2025 © Amazon.com and Affiliates
+
+from .modeling_isaac import (
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+from .modeling_isaac_vision import (
+    NeuronIsaacVisionModel,
+    NeuronIsaacMultiModalProjector,
+    IsaacVisionModelWrapper,
+)
+from .modeling_isaac_text import (
+    NeuronIsaacTextModel,
+)
+
+__all__ = [
+    "NeuronIsaacForConditionalGeneration",
+    "IsaacInferenceConfig",
+    "NeuronIsaacVisionModel",
+    "NeuronIsaacMultiModalProjector",
+    "IsaacVisionModelWrapper",
+    "NeuronIsaacTextModel",
+]
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac.py
new file mode 100644
index 00000000..826acb54
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac.py
@@ -0,0 +1,624 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Isaac NxDI orchestrator: VLM model combining vision encoder and Qwen3 text decoder.
+
+Isaac-0.2-2B-Preview architecture:
+- Vision: SigLIP2 (27 layers) -> pixel shuffle (2x2) -> 2-layer MLP projector
+- Text: Qwen3 (28 layers, 2048 hidden, GQA 16/8)
+- mRoPE: interleaved, section=(2,1,1) weighting -> ~[32,16,16]
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import copy  # noqa: E402
+import logging  # noqa: E402
+from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+import torch.nn.utils.rnn as rnn_utils  # noqa: E402
+from transformers.modeling_outputs import CausalLMOutputWithPast  # noqa: E402
+
+import neuronx_distributed_inference.modules.autobucketing as autobucketing  # noqa: E402
+from neuronx_distributed_inference.models.config import InferenceConfig, NeuronConfig  # noqa: E402
+from neuronx_distributed_inference.models.image_to_text_model_base import (  # noqa: E402
+    ImageToTextInferenceConfig,
+    NeuronBaseForImageToText,
+)
+from neuronx_distributed_inference.models.image_to_text_model_wrapper import (  # noqa: E402
+    ImageToTextModelWrapper,
+    IMAGE_TO_TEXT_MODEL_WRAPPER_INPUT_KEYS,
+)
+from neuronx_distributed_inference.models.llama4.utils.encoder_utils import (  # noqa: E402
+    pad_vision_embeddings,
+)
+from neuronx_distributed_inference.models.model_wrapper import (  # noqa: E402
+    CONTEXT_ENCODING_MODEL_TAG,
+    TOKEN_GENERATION_MODEL_TAG,
+    VISION_ENCODER_MODEL_TAG,
+)
+from neuronx_distributed_inference.modules.flashdecode.utils import (  # noqa: E402
+    calculate_num_cores_per_group,
+)
+
+from isaac_neuron.modeling_isaac_text import NeuronIsaacTextModel  # noqa: E402
+from isaac_neuron.modeling_isaac_vision import (  # noqa: E402
+    NeuronIsaacVisionModel,
+    IsaacVisionModelWrapper,
+)
+from isaac_neuron.utils import convert_state_dict_to_fused_qkv, StateDict  # noqa: E402
+
+logger = logging.getLogger("Neuron")
+
+
+class IsaacInferenceConfig(ImageToTextInferenceConfig):
+    """Isaac-specific inference configuration.
+
+    Extends ImageToTextInferenceConfig with:
+    - pixel_shuffle_scale from model config
+    - projector_intermediate_size from model config
+    - Isaac-specific required attributes
+    """
+
+    def __init__(
+        self,
+        text_neuron_config,
+        vision_neuron_config,
+        fused_spec_config=None,
+        load_config=None,
+        metadata: Optional[Dict] = None,
+        **kwargs,
+    ):
+        super().__init__(
+            text_neuron_config=text_neuron_config,
+            vision_neuron_config=vision_neuron_config,
+            fused_spec_config=fused_spec_config,
+            load_config=load_config,
+            metadata=metadata,
+            **kwargs,
+        )
+
+        # Isaac uses hidden_act for the text model MLP (SiLU)
+        if not hasattr(self.text_config, "hidden_act"):
+            self.text_config.hidden_act = "silu"
+
+        # Isaac's SigLIP2 encoder does NOT use a pooling head
+        # (no head weights in the checkpoint; features go to pixel shuffle + MLP projector)
+        if not hasattr(self.vision_config, "vision_use_head"):
+            self.vision_config.vision_use_head = False
+
+        # Extract Isaac-specific config values
+        # pixel_shuffle_scale is in the vision_config or top-level config
+        if not hasattr(self, "pixel_shuffle_scale"):
+            self.pixel_shuffle_scale = getattr(
+                self.vision_config, "pixel_shuffle_scale", 2
+            )
+
+        # Projector intermediate size
+        if not hasattr(self, "projector_intermediate_size"):
+            vision_hidden = self.vision_config.hidden_size  # 1152
+            self.projector_intermediate_size = (
+                vision_hidden * (self.pixel_shuffle_scale**2) * 4
+            )  # 18432
+
+        # Validation
+        if self.text_config.neuron_config.is_block_kv_layout:
+            raise ValueError("Isaac does not yet support block_kv_layout.")
+        if self.text_config.neuron_config.is_prefix_caching:
+            raise ValueError("Isaac does not yet support prefix_caching.")
+        if self.text_config.neuron_config.is_chunked_prefill:
+            raise ValueError("Isaac does not yet support chunked_prefill.")
+        if self.text_config.neuron_config.is_medusa:
+            raise ValueError("Isaac does not yet support medusa.")
+        if self.text_config.neuron_config.enable_fused_speculation:
+            raise ValueError("Isaac does not yet support fused speculation.")
+
+        if self.neuron_config.flash_decoding_enabled:
+            num_attn_heads = self.text_config.num_attention_heads
+            num_kv_heads = self.text_config.num_key_value_heads
+            num_attn_heads = (
+                num_attn_heads // self.neuron_config.tp_degree + 1
+            ) * self.neuron_config.tp_degree
+            self.text_config.num_cores_per_group = calculate_num_cores_per_group(
+                num_attn_heads, num_kv_heads, self.neuron_config.tp_degree
+            )
+
+    def get_required_attributes(self) -> List[str]:
+        return [
+            "text_config",
+            "vision_config",
+            "text_config.hidden_size",
+            "text_config.num_attention_heads",
+            "text_config.num_hidden_layers",
+            "text_config.num_key_value_heads",
+            "text_config.head_dim",
+            "text_config.rope_theta",
+            "text_config.rms_norm_eps",
+            "vision_config.hidden_size",
+            "vision_config.image_size",
+            "vision_config.num_attention_heads",
+            "vision_config.num_hidden_layers",
+            "vision_config.patch_size",
+        ]
+
+    @classmethod
+    def get_neuron_config_cls(cls) -> Type[NeuronConfig]:
+        return NeuronConfig
+
+
+class NeuronIsaacForConditionalGeneration(NeuronBaseForImageToText):
+    """Isaac VLM orchestrator for NxDI.
+
+    Combines:
+    - NeuronIsaacVisionModel (SigLIP2 + pixel shuffle + MLP projector)
+    - NeuronIsaacTextModel (Qwen3 decoder)
+    - ImageToTextModelWrapper (text model tracing wrapper)
+    - IsaacVisionModelWrapper (vision model tracing wrapper)
+    """
+
+    # Model classes
+    text_model_cls = NeuronIsaacTextModel
+    vision_model_cls = NeuronIsaacVisionModel
+
+    # Model wrappers
+    text_model_wrapper = ImageToTextModelWrapper
+    vision_model_wrapper = IsaacVisionModelWrapper
+
+    def __init__(self, *args, **kwargs):
+        super().__init__(
+            self.text_model_cls,
+            self.vision_model_cls,
+            self.text_model_wrapper,
+            self.vision_model_wrapper,
+            *args,
+            **kwargs,
+        )
+
+    @classmethod
+    def get_config_cls(cls):
+        return IsaacInferenceConfig
+
+    def enable_vision_encoder(
+        self, enable_wlt_optimization: bool = True, **model_init_kwargs
+    ):
+        """Enable and configure the vision encoder for compilation."""
+        self.compile_tag = VISION_ENCODER_MODEL_TAG
+
+        new_config = copy.deepcopy(self.config)
+        if new_config.vision_config.neuron_config.enable_bucketing:
+            if (
+                new_config.vision_config.neuron_config.buckets
+                == [new_config.vision_config.neuron_config.seq_len]
+                or new_config.vision_config.neuron_config.buckets is None
+            ):
+                if new_config.vision_config.neuron_config.seq_len > 1024:
+                    new_config.vision_config.neuron_config.buckets = (
+                        autobucketing.generate_buckets(
+                            1024, new_config.vision_config.neuron_config.seq_len
+                        )
+                    )
+                else:
+                    new_config.vision_config.neuron_config.buckets = [
+                        new_config.vision_config.neuron_config.seq_len
+                    ]
+
+        new_config.neuron_config = copy.deepcopy(new_config.vision_config.neuron_config)
+
+        self.vision_encoder_model = self.vision_model_wrapper(
+            config=new_config,
+            model_cls=self.vision_model_cls,
+            tag=VISION_ENCODER_MODEL_TAG,
+            compiler_args=self.get_compiler_args(),
+            model_init_kwargs=model_init_kwargs,
+            priority_model_idx=(0 if enable_wlt_optimization else None),
+            pipeline_execution=True,
+            return_ranked_to_cpu=True,
+        )
+        self.vision_models.append(self.vision_encoder_model)
+
+    @staticmethod
+    def update_state_dict_for_tied_weights(state_dict: StateDict) -> None:
+        """Isaac ties embed_tokens and lm_head weights."""
+        try:
+            state_dict["lm_head.weight"] = state_dict["embed_tokens.weight"].clone()
+        except KeyError:
+            state_dict["embed_tokens.weight"] = state_dict["lm_head.weight"].clone()
+
+    @staticmethod
+    def convert_hf_to_neuron_state_dict(
+        state_dict: StateDict, inference_config: InferenceConfig
+    ) -> StateDict:
+        """Convert HuggingFace Isaac state dict to NxDI format.
+
+        NOTE: The base class ApplicationBase.get_state_dict strips the leading
+        "model." prefix BEFORE calling this method. So incoming keys are:
+        - text_model.embed_tokens.weight (was model.text_model.embed_tokens.weight)
+        - text_model.layers.{i}.* (was model.text_model.layers.{i}.*)
+        - text_model.norm.weight (was model.text_model.norm.weight)
+        - lm_head.weight (unchanged)
+        - vision_embedding.0.* (was model.vision_embedding.0.*)
+        - vision_embedding.1.weight (was model.vision_embedding.1.weight)
+        - vision_embedding.3.weight (was model.vision_embedding.3.weight)
+        - rotary_emb.* (was model.rotary_emb.*)
+
+        Key mappings applied here:
+        - text_model.* -> * (strip text_model prefix)
+        - vision_embedding.0.* -> vision_encoder.vision_encoder.vision_model.*
+        - vision_embedding.1.weight -> vision_encoder.multi_modal_projector.fc1.weight
+        - vision_embedding.3.weight -> vision_encoder.multi_modal_projector.fc2.weight
+        - rotary_emb.* -> skipped
+
+        Also renames attention keys for NxDI format:
+        - .self_attn.q_proj. -> .self_attn.qkv_proj.q_proj.
+        - .self_attn.k_proj. -> .self_attn.qkv_proj.k_proj.
+        - .self_attn.v_proj. -> .self_attn.qkv_proj.v_proj.
+        - .self_attn.o_proj. -> .self_attn.o_proj.o_proj.
+        - .self_attn.q_norm. -> .self_attn.q_layernorm.
+        - .self_attn.k_norm. -> .self_attn.k_layernorm.
+        """
+        neuron_config = inference_config.neuron_config
+
+        attention_keys = {
+            ".self_attn.q_proj.": ".self_attn.qkv_proj.q_proj.",
+            ".self_attn.k_proj.": ".self_attn.qkv_proj.k_proj.",
+            ".self_attn.v_proj.": ".self_attn.qkv_proj.v_proj.",
+            ".self_attn.o_proj.": ".self_attn.o_proj.o_proj.",
+            ".self_attn.out_proj.": ".self_attn.o_proj.o_proj.",  # for siglip
+            ".self_attn.q_norm.": ".self_attn.q_layernorm.",
+            ".self_attn.k_norm.": ".self_attn.k_layernorm.",
+        }
+
+        new_state_dict = {}
+        for key, weights in state_dict.items():
+            new_key = key
+
+            # Text model weights: text_model.* -> *
+            # (base class already stripped leading "model." prefix)
+            if new_key.startswith("text_model."):
+                new_key = new_key.replace("text_model.", "", 1)
+                # Rename attention keys
+                for attn_key, replacement in attention_keys.items():
+                    if attn_key in new_key:
+                        new_key = new_key.replace(attn_key, replacement)
+                        break
+
+            # LM head: lm_head.weight -> lm_head.weight (no change)
+            # (already handled by tied weights)
+
+            # Vision encoder: vision_embedding.0.* -> vision_encoder.vision_model.*
+            # NeuronIsaacVisionModel.vision_encoder = NeuronSiglipVisionModel
+            # NeuronSiglipVisionModel.vision_model = NeuronSiglipVisionTransformer
+            elif new_key.startswith("vision_embedding.0."):
+                new_key = new_key.replace(
+                    "vision_embedding.0.",
+                    "vision_encoder.vision_model.",
+                    1,
+                )
+                # Rename attention keys for vision encoder
+                for attn_key, replacement in attention_keys.items():
+                    if attn_key in new_key:
+                        new_key = new_key.replace(attn_key, replacement)
+                        break
+
+            # MLP projector fc1: vision_embedding.1.weight
+            elif new_key == "vision_embedding.1.weight":
+                new_key = "multi_modal_projector.fc1.weight"
+
+            # MLP projector fc2: vision_embedding.3.weight
+            elif new_key == "vision_embedding.3.weight":
+                new_key = "multi_modal_projector.fc2.weight"
+
+            # Skip rotary_emb (handled by NxDI internally)
+            elif new_key.startswith("rotary_emb"):
+                continue
+
+            new_state_dict[new_key] = weights
+
+        # Reshape patch_embedding weight from HF 2D [out_ch, in_ch*kH*kW] to Conv2d 4D
+        patch_key = "vision_encoder.vision_model.embeddings.patch_embedding.weight"
+        if patch_key in new_state_dict:
+            w = new_state_dict[patch_key]
+            if w.dim() == 2:
+                patch_size = inference_config.vision_config.patch_size
+                num_channels = inference_config.vision_config.num_channels
+                out_channels = w.shape[0]
+                new_state_dict[patch_key] = w.reshape(
+                    out_channels, num_channels, patch_size, patch_size
+                )
+
+        # Add lm_head.bias if needed for LNC > 1
+        if (
+            "lm_head.bias" not in new_state_dict
+            and inference_config.neuron_config.lm_head_pad
+        ):
+            new_state_dict["lm_head.bias"] = torch.zeros(
+                new_state_dict["embed_tokens.weight"].shape[0],
+                dtype=torch.float32,
+            )
+
+        # Fuse QKV for text model
+        if inference_config.text_config.neuron_config.fused_qkv:
+            new_state_dict = convert_state_dict_to_fused_qkv(
+                state_dict=new_state_dict,
+                num_layers=inference_config.text_config.num_hidden_layers,
+                neuron_config=inference_config.text_config.neuron_config,
+                prefix="layers.{layer_num}.self_attn",
+            )
+
+        # Fuse QKV for vision model
+        if inference_config.vision_config.neuron_config.fused_qkv:
+            new_state_dict = convert_state_dict_to_fused_qkv(
+                state_dict=new_state_dict,
+                num_layers=inference_config.vision_config.num_hidden_layers,
+                neuron_config=inference_config.vision_config.neuron_config,
+                prefix="vision_encoder.vision_model.encoder.layers.{layer_num}.self_attn",
+            )
+
+        # Add rank utilities
+        if neuron_config.vocab_parallel:
+            new_state_dict["embed_tokens.rank_util.rank"] = torch.arange(
+                0, neuron_config.local_ranks_size
+            )
+
+        tp_degree = neuron_config.tp_degree
+        for i in range(inference_config.text_config.num_hidden_layers):
+            new_state_dict[f"layers.{i}.self_attn.rank_util.rank"] = torch.arange(
+                0, tp_degree, dtype=torch.int32
+            )
+
+        new_state_dict["rank_util.rank"] = torch.arange(0, tp_degree, dtype=torch.int32)
+
+        return new_state_dict
+
+    @staticmethod
+    def _convert_input_dict_to_ordered_tuple(input_dict: Dict[str, Any]):
+        """Convert input dictionary to ordered tuple for model wrapper."""
+        args = []
+        for key in IMAGE_TO_TEXT_MODEL_WRAPPER_INPUT_KEYS:
+            if key in input_dict and input_dict[key] is not None:
+                arg = input_dict[key]
+            else:
+                arg = torch.empty(0)
+            args.append(arg)
+        return tuple(args)
+
+    def _select_buckets_for_padding_length(self, position_ids):
+        """Select appropriate buckets based on whether prefill or decode."""
+        neuron_config = self.config.neuron_config
+        context_encoding_buckets = (
+            neuron_config.context_encoding_buckets
+            if neuron_config.context_encoding_buckets is not None
+            else neuron_config.buckets
+        )
+        token_generation_buckets = (
+            neuron_config.token_generation_buckets
+            if neuron_config.token_generation_buckets is not None
+            else neuron_config.buckets
+        )
+
+        if self._is_prefill(position_ids):
+            return context_encoding_buckets
+        return token_generation_buckets
+
+    @staticmethod
+    def get_padding_length(buckets, position_ids):
+        """Find the smallest bucket that fits the input."""
+        max_position_id = torch.max(position_ids).item()
+        for val in buckets:
+            if val > max_position_id:
+                return val
+        raise ValueError("No bucket found for provided input_ids!")
+
+    @staticmethod
+    def get_required_kwargs() -> List[str]:
+        """Additional kwargs for HuggingFaceGenerationAdapter."""
+        return [
+            "pixel_values",
+            "vision_mask",
+        ]
+
+    @staticmethod
+    def generate_positions_from_mask(mask: torch.Tensor) -> torch.Tensor:
+        """Generate position indices from a boolean vision mask."""
+        if mask.dim() == 1:
+            return torch.nonzero(mask).squeeze()
+        else:
+            rows, cols = torch.nonzero(mask, as_tuple=True)
+            row_counts = torch.bincount(rows, minlength=mask.shape[0])
+            cols_per_row = torch.split(cols, row_counts.tolist())
+            return rnn_utils.pad_sequence(
+                cols_per_row, batch_first=True, padding_value=0
+            )
+
+    @staticmethod
+    def pad_positions(
+        positions: torch.LongTensor, target_size: int, fill_value: float
+    ) -> torch.LongTensor:
+        """Pad positions tensor to target size."""
+        positions_2d = positions.unsqueeze(0) if positions.dim() == 1 else positions
+        padding_size = target_size - positions_2d.shape[1]
+        assert padding_size >= 0, (
+            "Text model sequence length is not enough to handle all vision embeddings"
+        )
+        positions_padded = F.pad(positions_2d, (0, padding_size), value=fill_value)
+        return positions_padded.unsqueeze(-1)
+
+    @staticmethod
+    def _create_position_ids(
+        attention_mask_2d: torch.LongTensor, is_prefill: bool
+    ) -> torch.LongTensor:
+        """Create position IDs from attention mask."""
+        position_ids = attention_mask_2d.long().cumsum(-1) - 1
+        position_ids.masked_fill_(attention_mask_2d == 0, 1)
+        if is_prefill:
+            return position_ids
+        else:
+            return torch.amax(position_ids, dim=1, keepdim=True) + 1
+
+    def forward(
+        self,
+        input_ids: torch.LongTensor = None,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        seq_ids: Optional[torch.LongTensor] = None,
+        sampling_params: Optional[torch.FloatTensor] = None,
+        pixel_values: Optional[torch.FloatTensor] = None,
+        vision_mask: Optional[torch.FloatTensor] = None,
+        image_sizes: Optional[torch.FloatTensor] = None,
+        adapter_ids: Optional[torch.LongTensor] = None,
+        past_key_values: Optional[List[torch.FloatTensor]] = None,
+        use_cache: Optional[bool] = None,
+        medusa_args=None,
+        input_capture_hook: Optional[Callable] = None,
+        tensor_capture_hook: Optional[Callable] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, CausalLMOutputWithPast]:
+        """Forward pass combining vision encoder and text decoder."""
+        is_prefill = input_ids.shape[-1] > 1
+        include_images = (
+            pixel_values is not None
+            and vision_mask is not None
+            and pixel_values.sum() != 0
+        )
+
+        if position_ids is None:
+            position_ids = self._create_position_ids(
+                attention_mask_2d=attention_mask, is_prefill=is_prefill
+            )
+
+        buckets = self._select_buckets_for_padding_length(position_ids=position_ids)
+        pad_target_size = self.get_padding_length(
+            buckets=buckets, position_ids=position_ids
+        )
+        pad_fill_value = pad_target_size - 1
+
+        if is_prefill and include_images:
+            assert vision_mask.dtype == torch.bool, (
+                f"vision_mask must be bool, got {vision_mask.dtype}"
+            )
+
+            # Run vision encoder
+            vision_embeddings = self.vision_encoder_model(
+                pixel_values.to(self.vision_config.neuron_config.torch_dtype),
+            ).to(self.text_config.neuron_config.torch_dtype)
+
+            # Flatten vision embeddings for multi-image support
+            batch_sz = 1 if vision_mask.dim() == 1 else vision_mask.shape[0]
+            num_images, seq_len, embedding_dim = vision_embeddings.shape
+            img_per_sample = num_images // batch_sz
+            vision_embeddings = vision_embeddings.view(
+                batch_sz, img_per_sample * seq_len, embedding_dim
+            )
+
+            # Pad to bucket size
+            vision_embeddings = pad_vision_embeddings(
+                vision_embeddings=vision_embeddings, pad_limit=pad_target_size
+            )
+
+            # Create scatter positions from vision mask
+            vision_mask = self.generate_positions_from_mask(mask=vision_mask.squeeze())
+            vision_mask = self.pad_positions(
+                positions=vision_mask,
+                target_size=pad_target_size,
+                fill_value=pad_fill_value,
+            )
+        else:
+            # Text-only or token generation -> dummy vision inputs
+            vision_embeddings, vision_mask = (
+                self.context_encoding_model.get_dummy_vision_inputs(
+                    config=self.text_config,
+                    input_ids=input_ids,
+                    n_active_tokens=pad_target_size,
+                    fill_value=pad_fill_value,
+                )
+            )
+
+        return super().forward(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            seq_ids=seq_ids,
+            sampling_params=sampling_params,
+            vision_embeddings=vision_embeddings,
+            vision_mask=vision_mask,
+        )
+
+    def enable_token_generation(self):
+        self.compile_tag = TOKEN_GENERATION_MODEL_TAG
+        super().enable_token_generation()
+
+    def enable_context_encoding(self):
+        self.compile_tag = CONTEXT_ENCODING_MODEL_TAG
+        super().enable_context_encoding()
+
+    def get_compiler_args(self) -> str:
+        """Get compiler arguments based on compilation phase."""
+        logical_nc_config = self.text_config.neuron_config.logical_nc_config
+
+        if self.compile_tag == CONTEXT_ENCODING_MODEL_TAG:
+            optimization_level = "-O1"
+        elif self.compile_tag == TOKEN_GENERATION_MODEL_TAG:
+            optimization_level = "-O2"
+        elif self.compile_tag == VISION_ENCODER_MODEL_TAG:
+            return (
+                f"-O1 --model-type=transformer "
+                f"--tensorizer-options='--enable-ccop-compute-overlap' "
+                f"--auto-cast=none --lnc={logical_nc_config}"
+            )
+        else:
+            raise ValueError(
+                f"get_compiler_args() Invalid compile tag: {self.compile_tag}"
+            )
+
+        args = (
+            f"--auto-cast=none --model-type=transformer "
+            f"--tensorizer-options='--enable-ccop-compute-overlap "
+            f"--cc-pipeline-tiling-factor=1 --vectorize-strided-dma "
+            f"--enable-scalar-dge-vectorization' "
+            f"--lnc={logical_nc_config} {optimization_level} "
+        )
+        return args
+
+    def _get_constructed_outputs(self, outputs, is_run_on_neuron):
+        """Process model outputs into the expected format."""
+        if (
+            self.on_device_sampling
+            and self.text_config.neuron_config.output_logits
+            and not (
+                self.text_config.neuron_config.enable_fused_speculation
+                or self.text_config.neuron_config.is_medusa
+            )
+        ):
+            logits_or_next_tokens = outputs[:2]
+            constructed_outputs = self._construct_output_with_tokens_and_logits(
+                next_tokens=logits_or_next_tokens[0],
+                logits=logits_or_next_tokens[1],
+            )
+        else:
+            if is_run_on_neuron:
+                logits_or_next_tokens = (
+                    outputs[0] if isinstance(outputs, (list, tuple)) else outputs
+                )
+            else:
+                logits_or_next_tokens, *_ = outputs
+            constructed_outputs = self._construct_output(logits_or_next_tokens)
+
+        if logging.root.isEnabledFor(logging.DEBUG):
+            logging.debug("---output---")
+            logging.debug(
+                f"{'tokens' if self.on_device_sampling else 'logits'} = %s",
+                logits_or_next_tokens,
+            )
+
+        return constructed_outputs
+
+    @staticmethod
+    def load_hf_model(model_path, **kwargs):
+        """Load the HuggingFace Isaac model for weight extraction."""
+        from transformers import AutoModelForCausalLM
+
+        model = AutoModelForCausalLM.from_pretrained(
+            model_path, trust_remote_code=True, **kwargs
+        ).eval()
+        return model
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_text.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_text.py
new file mode 100644
index 00000000..52f861ef
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_text.py
@@ -0,0 +1,576 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Isaac text model for NxDI: Qwen3 decoder layers adapted for VLM.
+
+Isaac's text backbone is a standard Qwen3 model (28 layers, 2048 hidden, GQA 16/8 heads).
+This module wraps Qwen3 decoder layers in the NeuronBaseModel VLM pattern, supporting:
+- Vision embedding injection via scatter_by_index_put
+- Standard NxDI KV cache management
+- On-device sampling
+"""
+
+import logging
+from typing import Optional, Tuple
+
+import torch
+import torch.nn as nn
+
+from neuronx_distributed.parallel_layers.layers import (
+    ColumnParallelLinear,
+    ParallelEmbedding,
+)
+from neuronx_distributed.parallel_layers.mappings import _gather_along_dim
+from neuronx_distributed.utils import cpu_mode
+from neuronx_distributed_inference.models.config import InferenceConfig
+from neuronx_distributed_inference.models.model_base import NeuronBaseModel
+from neuronx_distributed_inference.models.llama.modeling_llama import NeuronLlamaMLP
+from neuronx_distributed_inference.modules.attention.attention_base import (
+    NeuronAttentionBase,
+    QKNormPlacement,
+)
+from neuronx_distributed_inference.modules.attention.utils import RotaryEmbedding
+from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm
+from neuronx_distributed_inference.modules.flashdecode.utils import (
+    get_cache_size,
+    mask_util,
+    turn_2d_mask_to_4d,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (
+    Sampler,
+    mask_padded_logits,
+)
+from neuronx_distributed_inference.modules.kvcache.kv_cache_manager import (
+    KVCacheManager,
+)
+from neuronx_distributed_inference.modules.kvcache.block_kv_cache_manager import (
+    generate_tokengen_slot_mapping,
+)
+from neuronx_distributed_inference.modules.custom_calls import neuron_cumsum
+from neuronx_distributed_inference.utils.distributed import get_tp_group
+
+# Use HF Qwen3RMSNorm for CPU, CustomRMSNorm for Neuron
+from transformers.models.qwen3.modeling_qwen3 import Qwen3RMSNorm
+
+logger = logging.getLogger("Neuron")
+
+
+def get_rmsnorm_cls():
+    """Return appropriate RMSNorm class based on execution mode."""
+    return Qwen3RMSNorm if cpu_mode() else CustomRMSNorm
+
+
+class NeuronIsaacAttention(NeuronAttentionBase):
+    """Isaac attention: standard Qwen3 GQA with QK normalization.
+
+    Qwen3 applies QK norm BEFORE RoPE (pre-rope), same as NxDI built-in Qwen3.
+    Config: 16 attention heads, 8 KV heads, head_dim=128, rope_theta=1M
+    """
+
+    def __init__(self, config: InferenceConfig):
+        head_dim = getattr(
+            config, "head_dim", config.hidden_size // config.num_attention_heads
+        )
+        rotary_emb = RotaryEmbedding(
+            dim=head_dim,
+            max_position_embeddings=config.max_position_embeddings,
+            base=config.rope_theta,
+        )
+
+        super().__init__(
+            config=config,
+            hidden_size=config.hidden_size,
+            num_attention_heads=config.num_attention_heads,
+            num_key_value_heads=config.num_key_value_heads,
+            head_dim=head_dim,
+            rotary_emb=rotary_emb,
+            num_cores_per_group=getattr(config, "num_cores_per_group", 1),
+            rms_norm_eps=config.rms_norm_eps,
+            qk_norm_placement=QKNormPlacement.PRE_ROPE,
+            q_layernorm=get_rmsnorm_cls()(
+                hidden_size=head_dim, eps=config.rms_norm_eps
+            ),
+            k_layernorm=get_rmsnorm_cls()(
+                hidden_size=head_dim, eps=config.rms_norm_eps
+            ),
+        )
+
+
+class NeuronIsaacDecoderLayer(nn.Module):
+    """Isaac decoder layer: Qwen3 architecture (RMSNorm -> Attn -> RMSNorm -> MLP).
+
+    Identical to NeuronQwen3DecoderLayer from NxDI built-in, but adapted
+    for the VLM text model pattern.
+    """
+
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.config = config
+        self.neuron_config = config.neuron_config
+        self.hidden_size = config.hidden_size
+
+        self.self_attn = NeuronIsaacAttention(config)
+        self.mlp = NeuronLlamaMLP(config)  # Qwen3 MLP is compatible with LlamaMLP
+
+        self.input_layernorm = get_rmsnorm_cls()(
+            config.hidden_size, eps=config.rms_norm_eps
+        )
+        self.post_attention_layernorm = get_rmsnorm_cls()(
+            config.hidden_size, eps=config.rms_norm_eps
+        )
+
+        # Kernel enablement flags
+        self.qkv_kernel_enabled = config.neuron_config.qkv_kernel_enabled
+        self.mlp_kernel_enabled = config.neuron_config.mlp_kernel_enabled
+        self.quantized_mlp_kernel_enabled = (
+            config.neuron_config.quantized_mlp_kernel_enabled
+        )
+        self.rmsnorm_quantize_kernel_enabled = (
+            config.neuron_config.rmsnorm_quantize_kernel_enabled
+        )
+        self.sequence_parallel_enabled = config.neuron_config.sequence_parallel_enabled
+
+        # Fused rmsnorm only when sequence parallelism is disabled
+        self.qkv_kernel_fused_rmsnorm = not self.sequence_parallel_enabled
+        self.mlp_kernel_fused_rmsnorm = not self.sequence_parallel_enabled
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        position_ids: Optional[torch.LongTensor] = None,
+        past_key_value: Optional[Tuple[torch.Tensor]] = None,
+        adapter_ids=None,
+        **kwargs,
+    ) -> Tuple[torch.FloatTensor, ...]:
+        residual = hidden_states
+
+        # QKV kernel fusion with RMSNorm
+        if self.qkv_kernel_enabled and self.qkv_kernel_fused_rmsnorm:
+            qkv_fused_rmsnorm = self.input_layernorm
+        else:
+            hidden_states = self.input_layernorm(hidden_states)
+            qkv_fused_rmsnorm = None
+
+        # Self Attention
+        attn_output = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            past_key_value=past_key_value,
+            adapter_ids=adapter_ids,
+            rmsnorm=qkv_fused_rmsnorm,
+            **kwargs,
+        )
+        hidden_states = attn_output.hidden_states
+
+        # First residual
+        hidden_states = residual + hidden_states
+        residual = hidden_states
+
+        # MLP kernel fusion with RMSNorm
+        if self.mlp_kernel_enabled and self.mlp_kernel_fused_rmsnorm:
+            mlp_fused_rmsnorm = self.post_attention_layernorm
+        else:
+            hidden_states = self.post_attention_layernorm(hidden_states)
+            mlp_fused_rmsnorm = None
+
+        hidden_states, _ = self.mlp(
+            hidden_states,
+            rmsnorm=mlp_fused_rmsnorm,
+            adapter_ids=adapter_ids,
+        )
+
+        # Second residual
+        hidden_states = residual + hidden_states
+
+        return (
+            hidden_states,
+            attn_output.present_key_value,
+            attn_output.cos_cache,
+            attn_output.sin_cache,
+            None,  # residual (not used for Qwen3)
+        )
+
+
+class NeuronIsaacTextModel(NeuronBaseModel):
+    """Isaac text model for VLM: Qwen3 decoder with vision embedding injection.
+
+    Follows the same pattern as NeuronGemma3TextModel:
+    - Inherits from NeuronBaseModel
+    - Uses scatter_by_index_put for vision token injection
+    - Manages KV cache and on-device sampling
+    """
+
+    def scatter_by_index_put(self, h_image, encoded_patches_proj, positions):
+        """Scatter vision embeddings into the input embedding sequence.
+
+        Args:
+            h_image: (B, max_positions, hidden_dim) - text input embeddings
+            encoded_patches_proj: (num_patches, patch_size, hidden_dim) - vision embeddings
+            positions: (B, num_positions, 1) - scatter positions
+
+        Returns:
+            Updated h_image with vision embeddings scattered in.
+        """
+        B, max_positions, embedding_dim = h_image.shape
+        h_image_new = h_image.clone()
+        encoded_patches_flat = encoded_patches_proj.view(-1, embedding_dim)
+        positions = positions.view(-1)
+
+        num_updates_per_batch = positions.shape[0] // B
+        batch_idx = torch.arange(B, device=h_image.device, dtype=positions.dtype)
+        batch_idx = batch_idx.repeat_interleave(num_updates_per_batch)
+
+        h_image_new.index_put_(
+            (batch_idx.long(), positions.long()),
+            encoded_patches_flat,
+            accumulate=False,
+        )
+        return h_image_new
+
+    def encode_vision_to_input(
+        self, inputs_embeds, vision_embeddings, vision_mask
+    ) -> torch.Tensor:
+        """Inject vision embeddings into text input embeddings."""
+        return self.scatter_by_index_put(inputs_embeds, vision_embeddings, vision_mask)
+
+    def setup_attr_for_model(self, config: InferenceConfig):
+        """Set up model attributes needed for inference."""
+        self.on_device_sampling = (
+            config.neuron_config.on_device_sampling_config is not None
+        )
+        self.tp_degree = config.neuron_config.tp_degree
+        self.hidden_size = config.hidden_size
+        self.num_attention_heads = config.num_attention_heads
+        self.num_key_value_heads = config.num_key_value_heads
+        self.max_batch_size = config.neuron_config.max_batch_size
+        self.buckets = config.neuron_config.buckets
+        self.is_chunked_prefill = config.neuron_config.is_chunked_prefill
+
+    def init_model(self, config: InferenceConfig):
+        """Initialize the Qwen3 text model components."""
+        self.padding_idx = config.pad_token_id
+        self.vocab_size = config.vocab_size
+
+        # Embedding layer
+        if parallel_state_initialized():
+            self.embed_tokens = ParallelEmbedding(
+                config.vocab_size,
+                config.hidden_size,
+                self.padding_idx,
+                dtype=config.neuron_config.torch_dtype,
+                shard_across_embedding=True,
+                pad=True,
+                sequence_parallel_enabled=False,
+                tensor_model_parallel_group=get_tp_group(config),
+            )
+
+            lm_head_pad = config.neuron_config.lm_head_pad
+            lnc = config.neuron_config.logical_nc_config
+            lm_head_pad_alignment_size = (
+                config.neuron_config.lm_head_pad_alignment_size * lnc
+            )
+            self.lm_head = ColumnParallelLinear(
+                config.hidden_size,
+                config.vocab_size,
+                gather_output=not self.on_device_sampling,
+                bias=lm_head_pad,
+                pad=True,
+                pad_alignment_size_per_rank=lm_head_pad_alignment_size
+                if lm_head_pad
+                else 1,
+                keep_padded_output=lm_head_pad,
+                dtype=config.neuron_config.torch_dtype,
+                tensor_model_parallel_group=get_tp_group(config),
+            )
+        else:
+            from transformers.models.qwen3.modeling_qwen3 import (
+                Qwen3RMSNorm as HFQwen3RMSNorm,
+            )
+
+            self.embed_tokens = nn.Embedding(
+                config.vocab_size,
+                config.hidden_size,
+                self.padding_idx,
+            )
+            self.lm_head = nn.Linear(
+                config.hidden_size,
+                config.vocab_size,
+                bias=False,
+            )
+
+        # Decoder layers
+        self.layers = nn.ModuleList(
+            [NeuronIsaacDecoderLayer(config) for _ in range(config.num_hidden_layers)]
+        )
+
+        # Final norm
+        self.norm = get_rmsnorm_cls()(config.hidden_size, eps=config.rms_norm_eps)
+
+    def init_inference_optimization(self, config: InferenceConfig):
+        """Initialize KV cache and sampling for inference."""
+        super().init_inference_optimization(config)
+
+        if self.on_device_sampling:
+            self.sampler = Sampler(config.neuron_config)
+
+        self.kv_mgr = KVCacheManager(
+            config,
+            num_kv_head=self.num_key_value_heads,
+            global_rank=self.rank_util,
+        )
+
+    def forward(
+        self,
+        input_ids,
+        attention_mask,
+        position_ids,
+        seq_ids,
+        sampling_params,
+        prev_hidden=None,
+        adapter_ids=None,
+        accepted_indices=None,
+        current_length=None,
+        medusa_mask=None,
+        scatter_index=None,
+        slot_mapping=None,
+        active_block_table=None,
+        num_queries=None,
+        computed_context_lens=None,
+        tile_q_indices=None,
+        tile_block_tables=None,
+        tile_masks=None,
+        inputs_embeds: Optional[torch.FloatTensor] = None,
+        kv_cache: Optional[torch.Tensor] = None,
+        active_mask=None,
+        rotary_position_id=None,
+        vision_embeddings=None,
+        vision_mask=None,
+    ):
+        """Forward pass for Isaac text model with vision support.
+
+        This follows NeuronBaseModel.forward() pattern with vision embedding injection.
+        The 25 positional arguments match ImageToTextModelWrapper's expected interface.
+        """
+        # Handle optional empty tensors
+        prev_hidden = self.set_none_if_empty(prev_hidden)
+        adapter_ids = self.set_none_if_empty(adapter_ids)
+        accepted_indices = self.set_none_if_empty(accepted_indices)
+        current_length = self.set_none_if_empty(current_length)
+        medusa_mask = self.set_none_if_empty(medusa_mask)
+        scatter_index = self.set_none_if_empty(scatter_index)
+        slot_mapping = self.set_none_if_empty(slot_mapping)
+        active_block_table = self.set_none_if_empty(active_block_table)
+        num_queries = self.set_none_if_empty(num_queries)
+        computed_context_lens = self.set_none_if_empty(computed_context_lens)
+        tile_q_indices = self.set_none_if_empty(tile_q_indices)
+        tile_block_tables = self.set_none_if_empty(tile_block_tables)
+        tile_masks = self.set_none_if_empty(tile_masks)
+        inputs_embeds = self.set_none_if_empty(inputs_embeds)
+        kv_cache = self.set_none_if_empty(kv_cache)
+        active_mask = self.set_none_if_empty(active_mask)
+        rotary_position_id = self.set_none_if_empty(rotary_position_id)
+        vision_embeddings = self.set_none_if_empty(vision_embeddings)
+        vision_mask = self.set_none_if_empty(vision_mask)
+
+        is_for_token_gen = attention_mask.dim() == 4
+        is_for_context_encoding = self._is_context_encoding(input_ids)
+        is_for_speculation = self._is_for_speculation(input_ids)
+
+        # For non-speculative prefix caching, generate the slot mapping
+        if (
+            not is_for_context_encoding
+            and not self.neuron_config.enable_fused_speculation
+            and not self.neuron_config.enable_eagle_speculation
+            and self.is_prefix_caching
+            and active_block_table is not None
+        ):
+            block_size = torch.tensor(
+                self.neuron_config.pa_block_size,
+                device=position_ids.device,
+                dtype=torch.int32,
+            )
+            slot_mapping = generate_tokengen_slot_mapping(
+                position_ids, slot_mapping, active_block_table, block_size
+            )
+
+        cache_size = (
+            get_cache_size(
+                self.n_positions, self.num_cores_per_group, is_for_context_encoding
+            )
+            if self.neuron_config.flash_decoding_enabled
+            else self.n_positions
+        )
+
+        # Prepare attention mask
+        if self.is_chunked_prefill:
+            attn_mask = self.create_attn_mask(
+                attention_mask,
+                is_for_context_encoding,
+                is_for_speculation,
+                query_lens=num_queries,
+                key_lens=num_queries + computed_context_lens,
+            )
+        else:
+            attn_mask = self.create_attn_mask(
+                attention_mask,
+                is_for_context_encoding,
+                is_for_speculation,
+                position_ids=position_ids,
+            )
+
+        active_mask = None
+        if self.is_prefix_caching:
+            active_length = (
+                self.speculation_length if is_for_speculation else self.n_active_tokens
+            )
+            active_mask = torch.full(
+                (active_length, active_length),
+                True,
+                device=attention_mask.device,
+            ).tril(diagonal=0)
+            active_mask = active_mask[None, None, :, :].expand(
+                self.batch_size, 1, active_length, active_length
+            )
+        if is_for_speculation:
+            active_mask = torch.full(
+                (self.speculation_length, self.speculation_length),
+                True,
+                device=attention_mask.device,
+            ).tril(diagonal=0)
+            active_mask = active_mask[None, None, :, :].expand(
+                self.batch_size, 1, self.speculation_length, self.speculation_length
+            )
+
+        # FlashDecoding masks
+        active_mask_2d = None
+        if self.neuron_config.flash_decoding_enabled and not is_for_context_encoding:
+            rank_id = self.rank_util.get_rank()
+            active_mask_tmp, attention_mask_tmp = mask_util(
+                pos_ids=position_ids,
+                rank_id=rank_id,
+                num_cores_per_group=self.num_cores_per_group,
+                cache_size=cache_size,
+            )
+            if is_for_speculation:
+                active_mask = active_mask_tmp[:, None, :, :].expand(
+                    self.batch_size, 1, -1, -1
+                )
+                attn_mask = attention_mask_tmp[:, None, :, :].expand(
+                    self.batch_size, 1, -1, -1
+                )
+                active_mask_2d = active_mask_tmp.sum(dim=-2, keepdims=False).to(
+                    torch.bool
+                )
+            else:
+                active_mask = turn_2d_mask_to_4d(
+                    active_mask_tmp, n_positions=1, batch_size=self.batch_size
+                )
+                attn_mask = turn_2d_mask_to_4d(
+                    attention_mask_tmp,
+                    n_positions=cache_size,
+                    batch_size=self.batch_size,
+                )
+                active_mask_2d = active_mask_tmp
+
+        # Context encoding or token generation
+        if is_for_context_encoding:
+            past_key_values = None
+        else:
+            past_key_values = self.kv_mgr.get_cache(self.n_positions)
+
+        hidden_states, updated_kv_cache = self.get_model_output(
+            input_ids=input_ids,
+            seq_ids=seq_ids,
+            attention_mask=attn_mask,
+            position_ids=position_ids,
+            past_key_values=past_key_values,
+            active_mask=active_mask,
+            inputs_embeds=inputs_embeds,
+            adapter_ids=adapter_ids,
+            prev_hidden=prev_hidden,
+            tile_q_indices=tile_q_indices,
+            tile_block_tables=tile_block_tables,
+            tile_masks=tile_masks,
+            num_queries=num_queries,
+            is_for_context_encoding=is_for_context_encoding,
+            scatter_index=slot_mapping if self.is_block_kv_layout else scatter_index,
+            kvcache_buffer=kv_cache,
+            is_for_speculation=is_for_speculation,
+            active_block_table=active_block_table,
+            kv_active_mask=active_mask_2d,
+            update_cache=True,
+            vision_embeddings=vision_embeddings,
+            vision_mask=vision_mask,
+        )
+
+        batch_size = input_ids.shape[0]
+        if not self.sliced_hidden:
+            if self.padding_side == "left":
+                index = torch.tensor(
+                    [hidden_states.shape[1] - 1], device=hidden_states.device
+                )
+                index = index.unsqueeze(1).expand(batch_size, 1, self.hidden_size)
+                hidden_states = torch.gather(hidden_states, dim=1, index=index)
+            elif self.is_chunked_prefill:
+                if is_for_context_encoding:
+                    index = neuron_cumsum(num_queries.reshape(1, -1).float()).int() - 1
+                    index = index.reshape(1, -1, 1)
+                    index = index.expand(batch_size, -1, self.hidden_size)
+                    hidden_states = torch.gather(hidden_states, dim=1, index=index)
+            else:
+                if not (
+                    position_ids.shape[-1] == self.speculation_length
+                    or position_ids.shape[-1] == 1
+                ):
+                    index = torch.max(position_ids, dim=1, keepdim=True).indices
+                    index = index.unsqueeze(1).expand(batch_size, 1, self.hidden_size)
+                    hidden_states = torch.gather(hidden_states, dim=1, index=index)
+
+        logits = self.lm_head(hidden_states)
+        logits = logits.float()
+
+        if hasattr(self.lm_head, "pad_size"):
+            if self.lm_head.gather_output:
+                rank_id = torch.tensor(0, device=logits.device, dtype=torch.int32)
+                world_size = 1
+            else:
+                rank_id = self.rank_util.get_rank()
+                world_size = torch.distributed.get_world_size(
+                    group=self.lm_head.tensor_parallel_group
+                )
+            logits = mask_padded_logits(
+                logits, rank_id, world_size, pad_size=self.lm_head.pad_size
+            )
+
+        if self.on_device_sampling:
+            res = self._sample_on_device(
+                logits, sampling_params, is_for_speculation, is_for_context_encoding
+            )
+        else:
+            res = logits
+
+        # Ensure active_block_table and attention_mask not optimized away for prefix caching
+        if self.is_prefix_caching:
+            if active_block_table is not None and len(active_block_table.shape) == 1:
+                res = res + active_block_table[0] * 0
+            if attention_mask is not None and self.prefix_size == 0:
+                res = res + attention_mask[0] * 0
+
+        outputs = [res]
+        if self.neuron_config.output_logits:
+            logits = _gather_along_dim(
+                logits,
+                partition_dim=2,
+                process_group=get_tp_group(self.config),
+            )
+            outputs += [logits]
+        outputs += updated_kv_cache
+
+        return outputs
+
+
+def parallel_state_initialized():
+    """Check if parallel state is initialized."""
+    from neuronx_distributed.parallel_layers import parallel_state
+
+    return parallel_state.model_parallel_is_initialized()
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_vision.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_vision.py
new file mode 100644
index 00000000..231fb3b7
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_vision.py
@@ -0,0 +1,271 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Isaac vision model for NxDI: SigLIP2 encoder + pixel shuffle + 2-layer MLP projector.
+
+Isaac's vision pipeline:
+  pixel_values -> SigLIP2 encoder -> pixel_shuffle (2x2, 1152->4608) -> MLP projector (4608->2048)
+
+The MLP projector is a 2-layer network: Linear(4608->18432) -> SiLU -> Linear(18432->2048).
+No bias terms, ~122M parameters.
+
+Pixel shuffle is a deterministic CPU-side operation (channel concatenation of 2x2 patch groups).
+"""
+
+import logging
+from typing import List, Tuple
+
+import torch
+from torch import nn
+
+from neuronx_distributed_inference.models.config import InferenceConfig
+from neuronx_distributed_inference.models.llama4.modeling_llama4_vision import (
+    Llama4VisionModelWrapper,
+)
+from neuronx_distributed_inference.modules.async_execution import is_ranked_io
+
+from isaac_neuron.siglip.modeling_siglip import NeuronSiglipVisionModel
+from isaac_neuron.utils import pixel_shuffle_varlen
+
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)
+
+
+class NeuronIsaacMultiModalProjector(nn.Module):
+    """Isaac's 2-layer MLP projector: Linear -> SiLU -> Linear.
+
+    Maps pixel-shuffled vision features (4608-dim) to text hidden size (2048-dim).
+    No bias terms on either linear layer.
+
+    HF weight keys:
+        model.vision_embedding.1.weight  -> projector_fc1.weight  (4608, 18432)
+        model.vision_embedding.2          -> SiLU (no weights)
+        model.vision_embedding.3.weight  -> projector_fc2.weight  (18432, 2048)
+    """
+
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        vision_hidden = config.vision_config.hidden_size  # 1152
+        pixel_shuffle_scale = getattr(config, "pixel_shuffle_scale", 2)
+        projector_input_dim = vision_hidden * (pixel_shuffle_scale**2)  # 4608
+
+        # Isaac uses intermediate_size from vision config for the projector
+        # The HF model has: Linear(4608, 18432) -> SiLU -> Linear(18432, 2048)
+        projector_intermediate = getattr(
+            config,
+            "projector_intermediate_size",
+            projector_input_dim * 4,  # 18432
+        )
+        text_hidden = config.text_config.hidden_size  # 2048
+
+        self.fc1 = nn.Linear(projector_input_dim, projector_intermediate, bias=False)
+        self.act = nn.SiLU()
+        self.fc2 = nn.Linear(projector_intermediate, text_hidden, bias=False)
+
+    def forward(self, vision_outputs: torch.Tensor) -> torch.Tensor:
+        """Forward pass: project vision features to text embedding space.
+
+        Args:
+            vision_outputs: (batch, num_patches, 4608) pixel-shuffled features
+
+        Returns:
+            (batch, num_patches, 2048) projected embeddings
+        """
+        hidden = self.fc1(vision_outputs)
+        hidden = self.act(hidden)
+        hidden = self.fc2(hidden)
+        return hidden
+
+
+class NeuronIsaacVisionModel(nn.Module):
+    """Isaac vision model: SigLIP2 encoder + pixel shuffle + MLP projector.
+
+    Full pipeline:
+        pixel_values -> SigLIP2 -> pixel_shuffle(scale=2) -> MLP projector -> vision_embeddings
+    """
+
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.config = config
+        self.vision_config = config.vision_config
+        self.pixel_shuffle_scale = getattr(config, "pixel_shuffle_scale", 2)
+
+        logger.info(f"NeuronIsaacVisionModel: vision_config={vars(self.vision_config)}")
+
+        # SigLIP2 vision encoder (reused from Gemma3-vision contrib)
+        self.vision_encoder = NeuronSiglipVisionModel(self.vision_config)
+
+        # MLP projector (2-layer with SiLU)
+        self.multi_modal_projector = NeuronIsaacMultiModalProjector(config)
+
+    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
+        """Generate vision embeddings from pixel values.
+
+        Args:
+            pixel_values: (batch, num_channels, image_size, image_size)
+
+        Returns:
+            vision_embeddings: (batch, num_vision_tokens, text_hidden_size)
+                where num_vision_tokens = (image_size / patch_size)^2 / pixel_shuffle_scale^2
+        """
+        # SigLIP2 encoder
+        encoder_output = self.vision_encoder(pixel_values).last_hidden_state
+        logger.info(f"encoder_output.shape={encoder_output.shape}")
+
+        # Pixel shuffle: merge 2x2 patches by channel concatenation
+        # (batch, num_patches, 1152) -> (batch, num_patches/4, 4608)
+        shuffled = pixel_shuffle_varlen(encoder_output, scale=self.pixel_shuffle_scale)
+        logger.info(f"pixel_shuffle output.shape={shuffled.shape}")
+
+        # MLP projector: (batch, num_patches/4, 4608) -> (batch, num_patches/4, 2048)
+        projected = self.multi_modal_projector(shuffled)
+        logger.info(f"projected_embedding.shape={projected.shape}")
+
+        return projected
+
+
+class IsaacVisionModelWrapper(Llama4VisionModelWrapper):
+    """Neuron ModelWrapper for Isaac's vision model.
+
+    Inherits from Llama4VisionModelWrapper (same as Gemma3).
+    Generates input shapes for trace and compilation.
+    """
+
+    def __init__(
+        self,
+        config: InferenceConfig,
+        model_cls,
+        tag="",
+        compiler_args: str = None,
+        priority_model_idx: int = None,
+        pipeline_execution: bool = True,
+        return_ranked_to_cpu: bool = True,
+        model_init_kwargs={},
+    ) -> None:
+        super().__init__(
+            config,
+            model_cls,
+            tag,
+            compiler_args,
+            priority_model_idx,
+            pipeline_execution,
+            return_ranked_to_cpu,
+            model_init_kwargs,
+        )
+
+    def input_generator(self) -> List[Tuple[torch.Tensor]]:
+        """Generate example inputs for vision encoder tracing.
+
+        Returns:
+            List of (pixel_values,) tuples for each bucket.
+        """
+        inputs = []
+        for bucket in self.neuron_config.buckets:
+            pixel_values = torch.ones(
+                [
+                    self.neuron_config.batch_size,
+                    self.config.vision_config.num_channels,
+                    self.config.vision_config.image_size,
+                    self.config.vision_config.image_size,
+                ],
+                dtype=self.config.neuron_config.torch_dtype,
+            )
+            inputs.append((pixel_values,))
+        return inputs
+
+    def forward(self, *args):
+        """Forward pass for vision encoder wrapper.
+
+        Handles batch size padding when input batch < compiled batch.
+        """
+        if self.model is None:
+            raise RuntimeError(
+                "Forward called before load. Run load() or load_state_dict() first."
+            )
+
+        if not self.neuron_config.on_cpu:
+            args = self.convert_int64_to_int32(*args)
+
+        pixel_values = args[0]
+        input_batch_size = pixel_values.shape[0]
+
+        if input_batch_size == self.neuron_config.batch_size:
+            return self._forward(*args)
+
+        cur_batch = 0
+        outputs = []
+
+        logging.debug(
+            f"input_batch_size={input_batch_size}, compiled_batch_size={self.neuron_config.batch_size}"
+        )
+
+        while cur_batch < input_batch_size:
+            if cur_batch + self.neuron_config.batch_size <= input_batch_size:
+                batch_args = [
+                    arg[cur_batch : cur_batch + self.neuron_config.batch_size]
+                    for arg in args
+                ]
+                batch_args = self.vllm_cte_repadding(batch_args)
+                output = self._forward(*batch_args)
+            else:
+                output = self._forward_with_pad(
+                    *[
+                        arg[cur_batch:input_batch_size]
+                        if not is_ranked_io(arg)
+                        else arg
+                        for arg in args
+                    ]
+                )
+            outputs.append(output)
+            cur_batch += self.neuron_config.batch_size
+
+        return output
+
+    def _forward_with_pad(self, *args):
+        """Forward with batch padding for undersized inputs."""
+
+        def pad_helper(tensor, pad_type="fill_0", batch_sort_indices=None):
+            if tensor is None or tensor.shape[0] == self.neuron_config.batch_size:
+                return tensor
+
+            padded_shape = list(tensor.shape)
+            padded_shape[0] = self.neuron_config.batch_size
+
+            def repeat_first_batchline(tensor, padded_shape):
+                return tensor[0].repeat(padded_shape[0], 1, 1, 1).to(tensor.dtype)
+
+            def fill_value_tensor(value):
+                return lambda tensor, padded_shape: torch.full(
+                    padded_shape, fill_value=value, dtype=tensor.dtype
+                )
+
+            PAD_TYPES = {
+                "repeat_first_batchline": repeat_first_batchline,
+                "fill_0": fill_value_tensor(0),
+                "fill_1": fill_value_tensor(1),
+                "fill_-1": fill_value_tensor(-1),
+            }
+
+            padded_tensor = PAD_TYPES[pad_type](tensor, padded_shape)
+            padded_tensor[: tensor.shape[0]] = tensor
+
+            if batch_sort_indices is not None:
+                padded_tensor = torch.index_select(padded_tensor, 0, batch_sort_indices)
+
+            return padded_tensor
+
+        pixel_values = args[0]
+        orig_batch_size = pixel_values.shape[0]
+
+        padded_args = []
+        for arg in args:
+            if is_ranked_io(arg):
+                padded_args.append(arg)
+            else:
+                padded_arg = pad_helper(
+                    arg,
+                    pad_type="repeat_first_batchline",
+                    batch_sort_indices=None,
+                )
+                padded_args.append(padded_arg)
+
+        outputs = self._forward(*padded_args)
+        return outputs[:orig_batch_size]
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/ndxi_patch.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/ndxi_patch.py
new file mode 100644
index 00000000..48f1b17f
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/ndxi_patch.py
@@ -0,0 +1,252 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""NxDI patches for Isaac model compatibility.
+
+These patches fix known issues in the NxDI framework that affect
+VLM models. Copied from gemma3-vision contrib with minimal modifications.
+"""
+
+from typing import Callable, List, Optional, Tuple, Union
+
+from neuronx_distributed_inference.utils.tensor_replacement.registry import (
+    TensorReplacementRegister,
+)
+import torch
+from transformers.modeling_outputs import CausalLMOutputWithPast
+
+
+def patched_get_last_kv_window(
+    window_size,
+    position_ids,
+    latest_k,
+    latest_v,
+    windowed_context_encoding_window_idx=-1,
+    spec_len=0,
+):
+    """Fix: Convert index tensor in torch.gather to LongTensor."""
+    batch_size, num_head, _, head_dim = latest_k.shape
+    latest_pos = torch.amax(position_ids, dim=1)
+    if windowed_context_encoding_window_idx >= 1:
+        latest_pos -= windowed_context_encoding_window_idx * window_size
+
+    window_size = window_size - 1 + spec_len - 1 if spec_len > 0 else window_size - 1
+
+    end_idx = (latest_pos + 1).clamp(min=window_size)
+    start_idx = (end_idx - window_size).clamp(min=0)
+    orig_indices = start_idx[:, None] + torch.arange(window_size)
+
+    left_shifts = (window_size - (end_idx % window_size)) % window_size
+    base = torch.arange(window_size).expand(batch_size, window_size)
+    shifted_idx = (base + left_shifts[:, None]) % window_size
+
+    gather_idx = torch.gather(orig_indices, dim=1, index=shifted_idx.long())
+    gather_idx = (
+        gather_idx[:, None, :, None]
+        .expand(batch_size, num_head, window_size, head_dim)
+        .to(device=latest_k.device)
+    )
+
+    latest_k = torch.gather(latest_k, dim=2, index=gather_idx.long())
+    latest_v = torch.gather(latest_v, dim=2, index=gather_idx.long())
+    return latest_k, latest_v
+
+
+def patched_base_image_to_text_model_forward(
+    self,
+    input_ids: torch.LongTensor = None,
+    seq_ids: Optional[torch.LongTensor] = None,
+    attention_mask: Optional[torch.Tensor] = None,
+    position_ids: Optional[torch.LongTensor] = None,
+    past_key_values: Optional[List[torch.FloatTensor]] = None,
+    inputs_embeds: Optional[torch.FloatTensor] = None,
+    sampling_params: Optional[torch.FloatTensor] = None,
+    prev_hidden: Optional[torch.FloatTensor] = None,
+    labels: Optional[torch.LongTensor] = None,
+    use_cache: Optional[bool] = None,
+    output_attentions: Optional[bool] = None,
+    output_hidden_states: Optional[bool] = None,
+    adapter_ids: Optional[torch.LongTensor] = None,
+    medusa_args=None,
+    return_dict: Optional[bool] = None,
+    llava_args: Optional[List] = [],
+    input_capture_hook: Optional[Callable] = None,
+    slot_mapping: Optional[torch.LongTensor] = None,
+    block_table: Optional[torch.LongTensor] = None,
+    full_context_lens: Optional[torch.LongTensor] = None,
+    computed_context_lens: Optional[torch.LongTensor] = None,
+    vision_embeddings: Optional[torch.FloatTensor] = None,
+    vision_mask: Optional[torch.BoolTensor] = None,
+    tensor_capture_hook: Optional[Callable] = None,
+) -> Union[Tuple, CausalLMOutputWithPast]:
+    """Patched forward that includes tensor_capture_hook argument (fixes NameError)."""
+    if attention_mask is None:
+        attention_mask = self._infer_attention_mask(position_ids)
+
+    if seq_ids is None:
+        seq_ids = torch.arange(input_ids.shape[0])
+
+    self.preprocess_inputs(
+        input_ids=input_ids,
+        seq_ids=seq_ids,
+        attention_mask=attention_mask,
+        position_ids=position_ids,
+        past_key_values=past_key_values,
+        inputs_embeds=inputs_embeds,
+        sampling_params=sampling_params,
+        prev_hidden=prev_hidden,
+        labels=labels,
+        use_cache=use_cache,
+        output_attentions=output_attentions,
+        output_hidden_states=output_hidden_states,
+        adapter_ids=adapter_ids,
+        medusa_args=medusa_args,
+        return_dict=return_dict,
+        llava_args=llava_args,
+        input_capture_hook=input_capture_hook,
+        slot_mapping=slot_mapping,
+        block_table=block_table,
+        full_context_lens=full_context_lens,
+        computed_context_lens=computed_context_lens,
+    )
+
+    if self.async_mode:
+        outputs, is_run_on_neuron = self._get_model_outputs_async(
+            input_ids=input_ids,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            seq_ids=seq_ids,
+            sampling_params=sampling_params,
+            prev_hidden=prev_hidden,
+            adapter_ids=adapter_ids,
+            vision_embeddings=vision_embeddings,
+            vision_mask=vision_mask,
+            medusa_args=medusa_args,
+            llava_args=llava_args,
+        )
+    else:
+        outputs, is_run_on_neuron = self._get_model_outputs(
+            input_ids,
+            attention_mask,
+            position_ids,
+            seq_ids,
+            sampling_params,
+            prev_hidden,
+            adapter_ids,
+            vision_embeddings,
+            vision_mask,
+            None,  # deepstack_vision_embeds (Isaac doesn't use deepstack)
+            medusa_args,
+            llava_args,
+        )
+
+    generation_model = self.get_generation_model()
+    if not generation_model.is_neuron():
+        self._copy_past_key_values(outputs)
+
+    constructed_outputs = self._get_constructed_outputs(outputs, is_run_on_neuron)
+
+    if tensor_capture_hook and constructed_outputs.captured_tensors:
+        tensor_capture_hook(self, constructed_outputs.captured_tensors)
+
+    return constructed_outputs
+
+
+def patched_hf_adapter_prepare_inputs_for_generation(
+    self,
+    input_ids,
+    past_key_values=None,
+    attention_mask=None,
+    inputs_embeds=None,
+    sampling_params=None,
+    adapter_ids=None,
+    **kwargs,
+):
+    """Patched prepare_inputs_for_generation that avoids tensor_capture_hook NameError."""
+    self.prev_kv_cache_populated = self.neuron_model.kv_cache_populated
+    if self.neuron_model.kv_cache_populated:
+        input_ids = input_ids[:, -1:]
+
+    accepted_indices = kwargs.get("accepted_indices", None)
+    current_length = kwargs.get("current_length", None)
+    medusa_mask = kwargs.get("medusa_mask", None)
+    scatter_index = kwargs.get("scatter_index", None)
+    position_ids = kwargs.get("position_ids", None)
+    input_capture_hook = kwargs.get("input_capture_hook", None)
+
+    if attention_mask is not None and position_ids is None:
+        position_ids = attention_mask.long().cumsum(-1) - 1
+        if self.input_start_offsets:
+            if len(self.input_start_offsets) > 1:
+                position_ids += torch.tensor(
+                    self.input_start_offsets,
+                    dtype=position_ids.dtype,
+                    device=position_ids.device,
+                )[:, None]
+            else:
+                position_ids += self.input_start_offsets[0]
+            for i, offset in enumerate(self.input_start_offsets):
+                position_ids[i, 0:offset] = torch.arange(offset)
+        else:
+            position_ids.masked_fill_(attention_mask == 0, 1)
+
+        if self.neuron_model.kv_cache_populated:
+            position_ids = torch.amax(position_ids, 1, keepdim=True)
+            position_ids = position_ids + 1
+
+    if inputs_embeds is not None and past_key_values is None:
+        model_inputs = {"inputs_embeds": inputs_embeds}
+    else:
+        model_inputs = {"input_ids": input_ids}
+
+    model_inputs.update(
+        {
+            "position_ids": position_ids,
+            "past_key_values": past_key_values,
+            "use_cache": kwargs.get("use_cache", False),
+            "attention_mask": attention_mask,
+            "medusa_args": (
+                accepted_indices,
+                current_length,
+                medusa_mask,
+                scatter_index,
+            ),
+            "sampling_params": sampling_params,
+            "input_capture_hook": input_capture_hook,
+            "adapter_ids": adapter_ids,
+        }
+    )
+
+    tf_args = []
+    if self.neuron_config.tensor_replacement_config:
+        if hasattr(self, "generation_step"):
+            self.generation_step += 1
+        else:
+            self.generation_step = 1
+        reg = TensorReplacementRegister.get_instance()
+        tf, masks = reg.step_args(self.generation_step)
+        tf_args = tf + masks
+
+    if tf_args:
+        model_inputs["tf_args"] = tf_args
+
+    additional_kwargs = self.neuron_model.get_required_kwargs()
+    for arg in additional_kwargs:
+        model_inputs.update({arg: kwargs.get(arg, None)})
+
+    return model_inputs
+
+
+def apply_patch() -> None:
+    """Apply NxDI patches for Isaac model compatibility."""
+    import neuronx_distributed_inference.modules.attention.utils as u
+
+    u.get_last_kv_window = patched_get_last_kv_window
+
+    import neuronx_distributed_inference.models.image_to_text_model_base as mm_base
+
+    mm_base.NeuronBaseForImageToText.forward = patched_base_image_to_text_model_forward
+
+    import neuronx_distributed_inference.utils.hf_adapter as hf_adapter
+
+    hf_adapter.HuggingFaceGenerationAdapter.prepare_inputs_for_generation = (
+        patched_hf_adapter_prepare_inputs_for_generation
+    )
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/__init__.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/__init__.py
new file mode 100644
index 00000000..36cc4b5e
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/__init__.py
@@ -0,0 +1,15 @@
+# Copyright 2025 © Amazon.com and Affiliates
+
+from .modeling_siglip import (
+    NeuronSiglipVisionModel,
+    NeuronSiglipAttention,
+)
+from .layers import (
+    OutputChannelParallelConv2d,
+)
+
+__all__ = [
+    "NeuronSiglipVisionModel",
+    "NeuronSiglipAttention",
+    "OutputChannelParallelConv2d",
+]
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/layers.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/layers.py
new file mode 100644
index 00000000..27fc092d
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/layers.py
@@ -0,0 +1,358 @@
+# Copyright 2025 © Amazon.com and Affiliates
+# Adapted from Gemma3-vision contrib for Isaac SigLIP2 vision encoder.
+import math
+from typing import Optional, Tuple, Union, Any, Callable
+
+from neuronx_distributed.parallel_layers.layers import (
+    _as_tuple2,
+    _initialize_affine_weight_neuron,
+    _initialize_parameter_cpu,
+    CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION,
+    CONV_KERNEL_INPUT_CHANNEL_DIMENSION,
+    conv2d_with_weight_grad_allreduce,
+)
+from neuronx_distributed.parallel_layers.mappings import (
+    copy_to_tensor_model_parallel_region,
+    gather_from_tensor_model_parallel_region_with_dim,
+)
+from neuronx_distributed.parallel_layers.parallel_state import (
+    get_tensor_model_parallel_size,
+)
+from neuronx_distributed.parallel_layers.utils import (
+    divide,
+    get_padding_length,
+    set_tensor_model_parallel_attributes,
+)
+import neuronx_distributed.trace.trace as nxd_tracing_utils
+import torch
+from torch.nn.parameter import Parameter
+
+
+class BaseParallelConv(torch.nn.Module):
+    def set_weight_shape(self) -> None:
+        if self.partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION:
+            if self.partition_pad:
+                self.partition_pad_size = get_padding_length(
+                    self.out_channels, self.world_size
+                )
+                self.out_channels = self.out_channels + self.partition_pad_size
+
+            self.channels_per_partition = divide(self.out_channels, self.world_size)
+            self.weight_shape = [
+                self.channels_per_partition,
+                self.in_channels,
+                *_as_tuple2(self.kernel_size),
+            ]
+        elif self.partition_dim == CONV_KERNEL_INPUT_CHANNEL_DIMENSION:
+            if self.partition_pad:
+                self.partition_pad_size = get_padding_length(
+                    self.in_channels, self.world_size
+                )
+                self.in_channels = self.in_channels + self.partition_pad_size
+
+            self.channels_per_partition = divide(self.in_channels, self.world_size)
+            self.weight_shape = [
+                self.out_channels,
+                self.channels_per_partition,
+                *_as_tuple2(self.kernel_size),
+            ]
+        else:
+            assert False, f"Unsupported partition dim: {self.partition_dim}"
+
+    def set_bias_shape(self) -> None:
+        if self.add_bias:
+            self.bias_shape = (
+                self.channels_per_partition
+                if self.partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION
+                else self.out_channels
+            )
+        else:
+            self.bias_shape = None
+
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        kernel_size: Union[int, Tuple[int, int]],
+        stride: Union[int, Tuple[int, int]],
+        padding: Union[int, Tuple[int, int]],
+        dilation: Union[int, Tuple[int, int]],
+        groups: int,
+        bias: bool,
+        padding_mode: str,
+        partition_dim: int,
+        dtype: torch.dtype,
+        device: Optional[torch.device] = None,
+        init_method: Optional[Callable[[Any], torch.Tensor]] = None,
+        keep_master_params: bool = False,
+        partition_pad: bool = False,
+    ):
+        if not all(d == 1 for d in _as_tuple2(dilation)):
+            raise NotImplementedError(
+                f"Non-1 dilation is not yet supported. Received: {dilation}"
+            )
+        if groups != 1:
+            raise NotImplementedError(
+                f"Non-1 groups is not yet supported. Received: {groups}"
+            )
+        if padding_mode != "zeros":
+            raise NotImplementedError(
+                f"Non-zeros padding is not yet supported. Received: {padding_mode}"
+            )
+
+        super().__init__()
+        self.in_channels = in_channels
+        self.out_channels = out_channels
+        self.kernel_size = kernel_size
+        self.stride = stride
+        self.padding = padding
+        self.partition_dim = partition_dim
+        self.arg_init_method = init_method
+        self.dtype = dtype
+        self.device = device
+        self.keep_master_params = keep_master_params
+        self.partition_pad = partition_pad
+        self.add_bias = bias
+        self.world_size = get_tensor_model_parallel_size()
+
+        self.set_weight_shape()
+        self.set_bias_shape()
+
+        # Get torch init device if device is not explicitly mentioned
+        init_device = self.device
+        self.weight = Parameter(
+            torch.empty(*self.weight_shape, device=init_device, dtype=self.dtype)
+        )
+        self.device = self.weight.device
+
+        if self.device.type == "cpu":
+            self.master_weight = _initialize_parameter_cpu(
+                self.weight,
+                partition_dim=partition_dim,
+                num_partitions=self.world_size,
+                init_method=self._init_weight,
+                return_master_param=self.keep_master_params,
+                param_dtype=self.dtype,
+                stride=1,
+            )
+        elif self.device.type == "meta":
+            set_tensor_model_parallel_attributes(
+                tensor=self.weight,
+                is_parallel=True,
+                dim=partition_dim,
+                stride=1,
+                num_partitions=self.world_size,
+            )
+        else:
+            assert device and device.type == "xla", (
+                "Currently only xla device type is supported"
+            )
+            _initialize_affine_weight_neuron(
+                self.weight,
+                self._init_weight,
+                partition_dim=partition_dim,
+                num_partitions=self.world_size,
+                stride=1,
+            )
+
+        if self.add_bias:
+            # Bias is added before running the all-gather collective
+            # If conv layer is sharded across output channels (partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION),
+            # then the bias must be sharded
+            # 1. We initialize the bias to an empty parameter tensor of shape (C_out,) or (C_out/TP,)
+            self.bias = Parameter(
+                torch.empty(self.bias_shape, dtype=dtype, device=device)
+            )
+
+            # 2. Parameter initialization
+            # These parallel layers are used for both training and inference. When training from scratch, weight
+            # initialization must be carefully done, especially when distributed (e.g. ensure the same seed is used on every rank)
+            # Such careful initialization is not needed when tracing (device.type == meta) or at inference
+            if self.device.type == "cpu":
+                if partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION:
+                    self.master_bias = _initialize_parameter_cpu(
+                        self.bias,
+                        CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION,
+                        num_partitions=self.world_size,
+                        init_method=self._init_bias,
+                        return_master_param=self.keep_master_params,
+                        param_dtype=self.dtype,
+                        stride=1,
+                    )
+                else:
+                    self._init_bias(self.bias)
+                    self.master_bias = self.bias if self.keep_master_params else None
+            elif self.device.type == "meta":
+                if partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION:
+                    set_tensor_model_parallel_attributes(
+                        self.bias,
+                        is_parallel=True,
+                        dim=self.partition_dim,
+                        stride=1,
+                        num_partitions=self.world_size,
+                    )
+                self.master_bias = self.bias if self.keep_master_params else None
+            else:
+                assert device and device.type == "xla", (
+                    "Currently only xla device type is supported"
+                )
+                if partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION:
+                    set_tensor_model_parallel_attributes(
+                        self.bias,
+                        is_parallel=True,
+                        dim=self.partition_dim,
+                        stride=1,
+                        num_partitions=self.world_size,
+                    )
+                self._init_bias(self.bias)
+                self.master_bias = self.bias if self.keep_master_params else None
+        else:
+            self.register_parameter("bias", None)
+
+        self._forward_impl = conv2d_with_weight_grad_allreduce
+
+    def _init_weight(self, weight):
+        if self.arg_init_method is None:
+            torch.nn.init.kaiming_uniform_(weight, a=math.sqrt(5))
+        else:
+            self.arg_init_method(weight)
+
+    def _init_bias(self, bias):
+        fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight)
+        bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0
+        torch.nn.init.uniform_(bias, -bound, bound)
+
+
+class OutputChannelParallelConv2d(BaseParallelConv):
+    """Conv2d layer with parallelism on its output channels
+
+    The definition of a Conv2d layer can be found at https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html
+
+    This layer parallelizes the Conv2d along the output channel dimension
+
+    .. note::
+        Input is expected to be four dimensional, in order [N, C, H, W]
+
+    Arguments:
+        in_channels: Number of input channels
+        out_channels: Number of output channels in the original Conv that is being parallelized. Parallelization is handled internally by this class
+        kernel_size: Size of the kernel. Can be a single number for a square kernel or a tuple of two numbers
+        stride: Stride of the convolution. Can be a single number for uniform H/W stride or a tuple of two numbers
+        padding: Padding of the convolution. Can be a single number for uniform H/W padding or a tuple of two numbers
+        bias: If true, add bias
+        gather_output: If true, call all-gather on the output to assemble the partial outputs produced by each Neuron device into the full output, and make the full output available on all Neuron devices
+        dtype: Datatype of the weights
+        device: Device on which the weights should be initialized
+        init_method: Method for initializing the weight
+        keep_master_weight: If device="cpu", whether to keep the original ("master") weight the per-worker weights are split from
+        partition_pad: Pad the output channel dimension if needed to make the output channel count divisible by the tensor model parallel size
+    """
+
+    def __init__(
+        self,
+        in_channels: int,
+        out_channels: int,
+        kernel_size: Union[int, Tuple[int, int]],
+        stride: Union[int, Tuple[int, int]] = 1,
+        padding: Union[int, Tuple[int, int]] = 0,
+        dilation: Union[int, Tuple[int, int]] = 1,
+        groups: int = 1,
+        bias: bool = True,
+        padding_mode: str = "zeros",
+        gather_output: bool = True,
+        dtype: torch.dtype = torch.float32,
+        device: Optional[torch.device] = None,
+        init_method: Optional[Callable[[Any], torch.Tensor]] = None,
+        keep_master_weight: bool = False,
+        partition_pad: bool = False,
+    ):
+        # Base class expects these all to be tuples so it can support N-dimensional convs
+        kernel_size = _as_tuple2(kernel_size)
+        stride = _as_tuple2(stride)
+        padding = _as_tuple2(padding)
+        dilation = _as_tuple2(dilation)
+
+        super().__init__(
+            in_channels,
+            out_channels,
+            kernel_size,
+            stride,
+            padding,
+            dilation,
+            groups,
+            bias,
+            padding_mode,
+            CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION,
+            dtype,
+            device,
+            init_method,
+            keep_master_weight,
+            partition_pad,
+        )
+        self.kernel_size: Tuple[int, int]
+        self.stride: Tuple[int, int]
+        self.padding: Tuple[int, int]
+        self.dilation: Tuple[int, int]
+
+        self.allreduce_weight_grad = get_tensor_model_parallel_size() > 1
+        self.gather_output = gather_output
+
+    def forward(self, in_tensor: torch.Tensor) -> torch.Tensor:
+        """Forward of OutputChannelParallelConv2d
+
+        Args:
+            in_tensor: 4D tensor in order [N, C, H ,W]
+
+        Returns:
+            - output
+        """
+
+        if self.allreduce_weight_grad:
+            input_parallel = in_tensor
+        else:
+            input_parallel = copy_to_tensor_model_parallel_region(in_tensor)
+
+        output_parallel = self._forward_impl(
+            input=input_parallel,
+            weight=self.weight,
+            bias=self.bias,
+            stride=self.stride,
+            padding=self.padding,
+            allreduce_weight_grad=self.allreduce_weight_grad,
+        )
+
+        # We intentionally did the bias add in _forward_impl to do less work overall
+        # This way, each worker only has to do 1/world_size of the bias add
+        if self.gather_output:
+            # All-gather across the partitions
+            output = gather_from_tensor_model_parallel_region_with_dim(
+                output_parallel, gather_dim=1
+            )
+            if self.partition_pad and self.partition_pad_size > 0:
+                output = torch.narrow(
+                    output, 1, 0, self.out_channels - self.partition_pad_size
+                )
+        else:
+            output = output_parallel
+
+        return output
+
+    def preshard_hook(self, model_state_dict: dict, prefix: str) -> None:
+        if not self.partition_pad or self.partition_pad_size == 0:
+            return
+        if (
+            self.out_channels
+            != model_state_dict[prefix].shape[0] + self.partition_pad_size
+        ):
+            size = model_state_dict[prefix].shape[0]
+            raise RuntimeError(
+                f"State dict {prefix} is of an unexpected size {size} expected {size - self.partition_pad_size}"
+            )
+        model_state_dict[prefix] = torch.nn.functional.pad(
+            model_state_dict[prefix], (0, 0, 0, 0, 0, 0, 0, self.partition_pad_size)
+        )
+
+
+nxd_tracing_utils.__SUPPORTED_SHARDED_MODULES = (
+    nxd_tracing_utils.__SUPPORTED_SHARDED_MODULES + (OutputChannelParallelConv2d,)
+)
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/modeling_siglip.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/modeling_siglip.py
new file mode 100644
index 00000000..7cce2da7
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/modeling_siglip.py
@@ -0,0 +1,521 @@
+# Copyright 2025 © Amazon.com and Affiliates
+# Adapted from Gemma3-vision contrib SigLIP encoder for Isaac SigLIP2.
+from typing import List, Optional, Tuple, Union
+
+import torch
+import torch.nn as nn
+from torch import Size
+from transformers.activations import ACT2FN
+from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
+from transformers.utils import torch_int
+
+from neuronx_distributed.parallel_layers import parallel_state
+from neuronx_distributed.parallel_layers.layers import (
+    ColumnParallelLinear,
+    RowParallelLinear,
+    ParallelEmbedding,
+)
+from neuronx_distributed_inference.models.config import NeuronConfig, InferenceConfig
+from neuronx_distributed_inference.modules.attention.attention_base import (
+    NeuronAttentionBase,
+)
+
+from isaac_neuron.siglip.layers import OutputChannelParallelConv2d
+
+
+class NeuronSiglipConfig(NeuronConfig):
+    def __init__(self, **kwargs):
+        super().__init__(**kwargs)
+
+
+class SiglipInferenceConfig(InferenceConfig):
+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+    def get_required_attributes(self) -> List[str]:
+        # To validate if the config.json include all the configs we need in model.
+        # Need to manually add what's required in below list
+        return [
+            "hidden_size",
+            "image_size",
+            "intermediate_size",
+            "model_type",
+            "num_attention_heads",
+            "num_hidden_layers",
+            "patch_size",
+        ]
+
+
+class NeuronSiglipAttention(NeuronAttentionBase):
+    def __init__(self, config: SiglipInferenceConfig, tensor_model_parallel_group=None):
+        super().__init__(
+            config=config,
+            hidden_size=config.hidden_size,
+            num_attention_heads=config.num_attention_heads,
+            num_key_value_heads=config.num_attention_heads,  # siglip is MHA, not GQA
+            head_dim=getattr(
+                config, "head_dim", config.hidden_size // config.num_attention_heads
+            ),
+            qkv_bias=True,
+            o_bias=True,
+            num_cores_per_group=config.num_cores_per_group,
+            tensor_model_parallel_group=tensor_model_parallel_group,
+        )
+
+
+class NeuronSiglipMLP(nn.Module):
+    def __init__(self, config):
+        super().__init__()
+        self.config = config
+        self.activation_fn = ACT2FN[config.hidden_act]
+        self.fc1 = ColumnParallelLinear(
+            config.hidden_size, config.intermediate_size, gather_output=False
+        )
+        self.fc2 = RowParallelLinear(
+            config.intermediate_size, config.hidden_size, input_is_parallel=True
+        )
+
+    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
+        hidden_states = self.fc1(hidden_states)
+        hidden_states = self.activation_fn(hidden_states)
+        hidden_states = self.fc2(hidden_states)
+        return hidden_states
+
+
+_shape_t = Union[int, List[int], Size]
+
+
+class LayerNorm(torch.nn.LayerNorm):
+    """
+    Compared to NxD's LayerNorm, always cast input to torch.double to preseve numerical accuracy
+    """
+
+    def __init__(
+        self,
+        normalized_shape: _shape_t,
+        eps: float = 1e-5,
+        elementwise_affine: bool = True,
+        bias: bool = True,
+        device=None,
+        dtype=None,
+    ):
+        self.dtype = dtype
+        super().__init__(
+            normalized_shape=normalized_shape,
+            eps=eps,
+            elementwise_affine=elementwise_affine,
+            bias=bias,
+            device=device,
+            dtype=dtype,
+        )
+
+    def forward(self, input: torch.Tensor) -> torch.Tensor:
+        # Ensure input matches the weight dtype to avoid mixed dtype errors
+        input = input.to(self.weight.dtype)
+        output = super().forward(input)
+        return output
+
+
+class NeuronSiglipEncoderLayer(nn.Module):
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.embed_dim = config.hidden_size
+        self.layer_norm1 = LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
+        self.self_attn = NeuronSiglipAttention(config)
+        self.layer_norm2 = LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
+        self.mlp = NeuronSiglipMLP(config)
+
+    def forward(
+        self,
+        hidden_states: torch.Tensor,
+        attention_mask: torch.tensor,
+    ) -> torch.FloatTensor:
+        residual = hidden_states
+
+        hidden_states = self.layer_norm1(hidden_states)
+        hidden_states = self.self_attn(
+            hidden_states=hidden_states,
+            attention_mask=attention_mask,
+        ).hidden_states
+        hidden_states = residual + hidden_states
+
+        residual = hidden_states
+        hidden_states = self.layer_norm2(hidden_states)
+        hidden_states = self.mlp(hidden_states)
+        hidden_states = residual + hidden_states
+
+        outputs = (hidden_states,)
+
+        return outputs
+
+
+class NeuronSiglipEncoder(nn.Module):
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.config = config
+        self.layers = nn.ModuleList(
+            [NeuronSiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)]
+        )
+        self.gradient_checkpointing = False
+
+    def forward(
+        self,
+        inputs_embeds,
+        attention_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        return_dict: Optional[bool] = None,
+    ) -> Union[Tuple, BaseModelOutput]:
+        # Use False defaults since InferenceConfig doesn't have HF PretrainedConfig attrs
+        output_attentions = (
+            output_attentions if output_attentions is not None else False
+        )
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else False
+        )
+        return_dict = return_dict if return_dict is not None else True
+
+        encoder_states = () if output_hidden_states else None
+        all_attentions = () if output_attentions else None
+
+        hidden_states = inputs_embeds
+        for encoder_layer in self.layers:
+            if output_hidden_states:
+                encoder_states = encoder_states + (hidden_states,)
+            if self.gradient_checkpointing and self.training:
+
+                def create_custom_forward(module):
+                    def custom_forward(*inputs):
+                        return module(*inputs, output_attentions)
+
+                    return custom_forward
+
+                layer_outputs = torch.utils.checkpoint.checkpoint(
+                    create_custom_forward(encoder_layer),
+                    hidden_states,
+                    attention_mask,
+                )
+            else:
+                layer_outputs = encoder_layer(
+                    hidden_states,
+                    attention_mask,
+                )
+
+            hidden_states = layer_outputs[0]
+
+            if output_attentions:
+                all_attentions = all_attentions + (layer_outputs[1],)
+
+        if output_hidden_states:
+            encoder_states = encoder_states + (hidden_states,)
+
+        return BaseModelOutput(
+            last_hidden_state=hidden_states,
+            hidden_states=encoder_states,
+            attentions=all_attentions,
+        )
+
+
+class NeuronSiglipMultiheadAttention(NeuronSiglipAttention):
+    """
+    Compared to NeuronSiglipAttention:
+        1. Accept three inputs (Query, Key, Value) instead of a single hidden states
+    """
+
+    def __init__(self, config: InferenceConfig):
+        super().__init__(config=config)
+        self.scale = self.head_dim**-0.5
+        self.dropout = 0.0  # No dropout during inference
+
+    def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
+        """Reshape tensor to (bsz, num_heads, seq_len, head_dim)."""
+        return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2)
+
+    def forward(
+        self,
+        query: torch.Tensor,
+        key: torch.Tensor,
+        value: torch.Tensor,
+        attention_mask: Optional[torch.Tensor] = None,
+        output_attentions: Optional[bool] = True,
+    ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
+        """Input shape: Batch x Time x Channel"""
+
+        bsz, tgt_len, embed_dim = query.size()
+
+        # get query/key/value projections via NxDI QKV proj
+        qkv_proj = self.get_qkv_proj()
+        query_states = qkv_proj.q_proj(query) * self.scale
+        key_states = self._shape(qkv_proj.k_proj(key), -1, bsz)
+        value_states = self._shape(qkv_proj.v_proj(value), -1, bsz)
+
+        proj_shape = (bsz * self.num_heads, -1, self.head_dim)
+        query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
+        key_states = key_states.view(*proj_shape)
+        value_states = value_states.view(*proj_shape)
+
+        src_len = key_states.size(1)
+        attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
+
+        if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
+            raise ValueError(
+                f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
+                f" {attn_weights.size()}"
+            )
+
+        if attention_mask is not None:
+            if attention_mask.size() != (bsz, 1, tgt_len, src_len):
+                raise ValueError(
+                    f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
+                )
+            attn_weights = (
+                attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
+                + attention_mask
+            )
+            attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
+
+        attn_weights = nn.functional.softmax(attn_weights, dim=-1)
+
+        if output_attentions:
+            # this operation is a bit akward, but it's required to
+            # make sure that attn_weights keeps its gradient.
+            # In order to do so, attn_weights have to reshaped
+            # twice and have to be reused in the following
+            attn_weights_reshaped = attn_weights.view(
+                bsz, self.num_heads, tgt_len, src_len
+            )
+            attn_weights = attn_weights_reshaped.view(
+                bsz * self.num_heads, tgt_len, src_len
+            )
+        else:
+            attn_weights_reshaped = None
+
+        attn_probs = nn.functional.dropout(
+            attn_weights, p=self.dropout, training=self.training
+        )
+
+        attn_output = torch.bmm(attn_probs, value_states)
+
+        if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
+            raise ValueError(
+                f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
+                f" {attn_output.size()}"
+            )
+
+        attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
+        attn_output = attn_output.transpose(1, 2)
+        attn_output = attn_output.reshape(bsz, tgt_len, -1)
+
+        attn_output = self.get_o_proj().o_proj(attn_output)
+
+        return attn_output, attn_weights_reshaped
+
+
+class NeuronSiglipMultiheadAttentionPoolingHead(nn.Module):
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+
+        self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size))
+        self.attention = NeuronSiglipMultiheadAttention(config)
+        self.layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
+        self.mlp = NeuronSiglipMLP(config)
+
+    def forward(self, hidden_state):
+        batch_size = hidden_state.shape[0]
+        probe = self.probe.repeat(batch_size, 1, 1)
+
+        hidden_state = self.attention(probe, hidden_state, hidden_state)[0]
+
+        residual = hidden_state
+        hidden_state = self.layernorm(hidden_state)
+        hidden_state = residual + self.mlp(hidden_state)
+
+        return hidden_state[:, 0]
+
+
+class NeuronSiglipVisionEmbeddings(nn.Module):
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.config = config
+        self.embed_dim = config.hidden_size
+        self.image_size = config.image_size
+        self.patch_size = config.patch_size
+        self.num_patches = (self.image_size // self.patch_size) ** 2
+        self.num_positions = self.num_patches
+
+        if parallel_state.model_parallel_is_initialized():
+            self.patch_embedding = OutputChannelParallelConv2d(
+                in_channels=config.num_channels,
+                out_channels=self.embed_dim,
+                kernel_size=self.patch_size,
+                stride=self.patch_size,
+                padding=0,  # padding="valid" in nn.Conv2d
+                partition_pad=True,
+            )
+
+            self.position_embedding = ParallelEmbedding(
+                self.num_positions,
+                self.embed_dim,
+                shard_across_embedding=True,
+                pad=True,
+            )
+
+        else:
+            self.patch_embedding = nn.Conv2d(
+                in_channels=config.num_channels,
+                out_channels=self.embed_dim,
+                kernel_size=self.patch_size,
+                stride=self.patch_size,
+                padding="valid",
+            )
+            self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
+
+        self.register_buffer(
+            "position_ids",
+            torch.arange(self.num_positions).expand((1, -1)),
+            persistent=False,
+        )
+
+    def interpolate_pos_encoding(
+        self, embeddings: torch.Tensor, height: int, width: int
+    ) -> torch.Tensor:
+        """
+        This method allows to interpolate the pre-trained position encodings, to be able to use the model on higher resolution
+        images. This method is also adapted to support torch.jit tracing and no class embeddings.
+
+        Adapted from:
+        - https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/vision_transformer.py#L174-L194, and
+        - https://github.com/facebookresearch/dinov2/blob/e1277af2ba9496fbadf7aec6eba56e8d882d1e35/dinov2/models/vision_transformer.py#L179-L211
+        """
+
+        num_patches = embeddings.shape[1]
+        num_positions = self.position_embedding.weight.shape[0]
+
+        # always interpolate when tracing to ensure the exported model works for dynamic input shapes
+        if (
+            not torch.jit.is_tracing()
+            and num_patches == num_positions
+            and height == width
+        ):
+            return self.position_embedding(self.position_ids)
+
+        patch_pos_embed = self.position_embedding.weight.unsqueeze(0)
+
+        dim = embeddings.shape[-1]
+
+        new_height = height // self.patch_size
+        new_width = width // self.patch_size
+
+        sqrt_num_positions = torch_int(num_positions**0.5)
+        patch_pos_embed = patch_pos_embed.reshape(
+            1, sqrt_num_positions, sqrt_num_positions, dim
+        )
+        patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2)
+
+        patch_pos_embed = nn.functional.interpolate(
+            patch_pos_embed,
+            size=(new_height, new_width),
+            mode="bicubic",
+            align_corners=False,
+        )
+
+        patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim)
+        return patch_pos_embed
+
+    def forward(
+        self, pixel_values: torch.FloatTensor, interpolate_pos_encoding=False
+    ) -> torch.Tensor:
+        _, _, height, width = pixel_values.shape
+        target_dtype = self.patch_embedding.weight.dtype
+        # Convert pixel_values to target dtype before passing to patch_embedding to avoid mixed dtype errors
+        pixel_values_converted = pixel_values.to(dtype=target_dtype)
+        patch_embeds = self.patch_embedding(
+            pixel_values_converted
+        )  # shape = [*, width, grid, grid]
+        embeddings = patch_embeds.flatten(2).transpose(1, 2)
+
+        if interpolate_pos_encoding:
+            embeddings = embeddings + self.interpolate_pos_encoding(
+                embeddings, height, width
+            )
+        else:
+            # Ensure position embeddings match the dtype of embeddings
+            pos_emb = self.position_embedding(self.position_ids)
+            embeddings = embeddings + pos_emb.to(dtype=embeddings.dtype)
+        return embeddings
+
+
+class NeuronSiglipVisionTransformer(nn.Module):
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.config = config
+        embed_dim = config.hidden_size
+
+        self.embeddings = NeuronSiglipVisionEmbeddings(config)
+        self.encoder = NeuronSiglipEncoder(config)
+        self.post_layernorm = LayerNorm(embed_dim, eps=config.layer_norm_eps)
+        self.use_head = (
+            True if not hasattr(config, "vision_use_head") else config.vision_use_head
+        )
+        if self.use_head:
+            self.head = NeuronSiglipMultiheadAttentionPoolingHead(config)
+
+    def forward(
+        self,
+        pixel_values,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        interpolate_pos_encoding: Optional[bool] = False,
+    ) -> BaseModelOutputWithPooling:
+        # InferenceConfig doesn't have HF PretrainedConfig defaults, so set them here
+        output_attentions = (
+            output_attentions if output_attentions is not None else False
+        )
+        output_hidden_states = (
+            output_hidden_states if output_hidden_states is not None else False
+        )
+
+        hidden_states = self.embeddings(
+            pixel_values, interpolate_pos_encoding=interpolate_pos_encoding
+        )
+
+        encoder_outputs = self.encoder(
+            inputs_embeds=hidden_states,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+        )
+
+        last_hidden_state = encoder_outputs.last_hidden_state
+        last_hidden_state = self.post_layernorm(last_hidden_state)
+
+        pooler_output = self.head(last_hidden_state) if self.use_head else None
+
+        return BaseModelOutputWithPooling(
+            last_hidden_state=last_hidden_state,
+            pooler_output=pooler_output,
+            hidden_states=encoder_outputs.hidden_states,
+            attentions=encoder_outputs.attentions,
+        )
+
+
+class NeuronSiglipVisionModel(nn.Module):
+    def __init__(self, config: InferenceConfig):
+        super().__init__()
+        self.vision_model = NeuronSiglipVisionTransformer(config)
+
+    def get_input_embeddings(self) -> nn.Module:
+        return self.vision_model.embeddings.patch_embedding
+
+    def forward(
+        self,
+        pixel_values,
+        output_attentions: Optional[bool] = None,
+        output_hidden_states: Optional[bool] = None,
+        interpolate_pos_encoding: bool = False,
+    ):
+        return self.vision_model(
+            pixel_values=pixel_values,
+            output_attentions=output_attentions,
+            output_hidden_states=output_hidden_states,
+            interpolate_pos_encoding=interpolate_pos_encoding,
+        )
diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/utils.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/utils.py
new file mode 100644
index 00000000..1168dd4c
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/utils.py
@@ -0,0 +1,109 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Utility functions for Isaac NxDI contrib model."""
+
+from collections import OrderedDict
+import gc
+
+import torch
+from neuronx_distributed_inference.models.config import NeuronConfig
+
+
+StateDict = OrderedDict[str, torch.FloatTensor]
+
+
+def _helper_concat_and_delete_qkv(
+    state_dict: StateDict, prefix: str, attr: str
+) -> None:
+    """Concatenate Q, K, V weights into fused Wqkv tensor and delete originals."""
+    full_state_key_q_proj = f"{prefix}.qkv_proj.q_proj.{attr}"
+    full_state_key_k_proj = f"{prefix}.qkv_proj.k_proj.{attr}"
+    full_state_key_v_proj = f"{prefix}.qkv_proj.v_proj.{attr}"
+
+    if (
+        full_state_key_q_proj in state_dict
+        and full_state_key_k_proj in state_dict
+        and full_state_key_v_proj in state_dict
+    ):
+        state_dict[f"{prefix}.qkv_proj.Wqkv.{attr}"] = torch.cat(
+            [
+                state_dict[full_state_key_q_proj],
+                state_dict[full_state_key_k_proj],
+                state_dict[full_state_key_v_proj],
+            ],
+            dim=0,
+        )
+        del state_dict[full_state_key_q_proj]
+        del state_dict[full_state_key_k_proj]
+        del state_dict[full_state_key_v_proj]
+
+
+def convert_state_dict_to_fused_qkv(
+    state_dict: StateDict,
+    num_layers: int,
+    neuron_config: NeuronConfig,
+    prefix: str,
+) -> StateDict:
+    """Convert separate Q, K, V weights to fused QKV format for all layers."""
+    for layer_num in range(num_layers):
+        layer_prefix = prefix.format(layer_num=layer_num)
+        _helper_concat_and_delete_qkv(state_dict, layer_prefix, "weight")
+        _helper_concat_and_delete_qkv(state_dict, layer_prefix, "bias")
+        is_qkv_quantized = (
+            neuron_config.quantized_mlp_kernel_enabled or neuron_config.quantized
+        ) and f"{layer_prefix}.qkv_proj.q_proj.scale" in state_dict
+        if is_qkv_quantized:
+            _helper_concat_and_delete_qkv(state_dict, layer_prefix, "scale")
+
+    gc.collect()
+    return state_dict
+
+
+def pixel_shuffle_varlen(hidden_states: torch.Tensor, scale: int = 2) -> torch.Tensor:
+    """Apply pixel shuffle (channel concatenation) to vision encoder output.
+
+    This is a deterministic CPU-side operation that merges scale x scale patches
+    by concatenating along the channel dimension.
+
+    Isaac's pixel shuffle:
+    - Input:  (batch, num_patches, hidden_dim) where num_patches = (H/p * W/p)
+    - After reshape to (batch, H/p, W/p, hidden_dim)
+    - Group scale x scale patches and concatenate channels
+    - Output: (batch, num_patches / scale^2, hidden_dim * scale^2)
+
+    For Isaac: hidden_dim=1152, scale=2 -> output hidden_dim=4608
+
+    Args:
+        hidden_states: Vision encoder output of shape (batch, num_patches, hidden_dim)
+        scale: Pixel shuffle scale factor (default: 2)
+
+    Returns:
+        Shuffled tensor of shape (batch, num_patches // scale^2, hidden_dim * scale^2)
+    """
+    batch_size, num_patches, hidden_dim = hidden_states.shape
+
+    # Compute spatial dimensions
+    h = w = int(num_patches**0.5)
+    assert h * w == num_patches, f"num_patches {num_patches} is not a perfect square"
+    assert h % scale == 0 and w % scale == 0, (
+        f"Spatial dims ({h}, {w}) not divisible by scale {scale}"
+    )
+
+    # Reshape to spatial: (batch, h, w, hidden_dim)
+    hidden_states = hidden_states.view(batch_size, h, w, hidden_dim)
+
+    # Group into scale x scale blocks
+    new_h = h // scale
+    new_w = w // scale
+    hidden_states = hidden_states.view(
+        batch_size, new_h, scale, new_w, scale, hidden_dim
+    )
+
+    # Rearrange: (batch, new_h, new_w, scale, scale, hidden_dim)
+    hidden_states = hidden_states.permute(0, 1, 3, 2, 4, 5).contiguous()
+
+    # Concatenate channels: (batch, new_h * new_w, hidden_dim * scale^2)
+    hidden_states = hidden_states.view(
+        batch_size, new_h * new_w, hidden_dim * scale * scale
+    )
+
+    return hidden_states
diff --git a/contrib/models/Isaac-0.2-2B/test/__init__.py b/contrib/models/Isaac-0.2-2B/test/__init__.py
new file mode 100644
index 00000000..fb28dfcd
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/__init__.py
@@ -0,0 +1 @@
+# Copyright 2025 © Amazon.com and Affiliates
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/__init__.py b/contrib/models/Isaac-0.2-2B/test/integration/__init__.py
new file mode 100644
index 00000000..fb28dfcd
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/__init__.py
@@ -0,0 +1 @@
+# Copyright 2025 © Amazon.com and Affiliates
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/benchmark.py b/contrib/models/Isaac-0.2-2B/test/integration/benchmark.py
new file mode 100644
index 00000000..3f0bc0f5
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/benchmark.py
@@ -0,0 +1,454 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Formal benchmark for Isaac on trn2.3xlarge.
+
+Measures TTFT, TPOT, tok/s, and HBM usage with warmup and multiple iterations.
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    python benchmark.py [--seq-len 1024] [--warmup 3] [--iterations 10]
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import argparse  # noqa: E402
+import json  # noqa: E402
+import os  # noqa: E402
+import statistics  # noqa: E402
+import time  # noqa: E402
+
+import torch  # noqa: E402
+import torchvision.transforms as T  # noqa: E402
+from PIL import Image  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, GenerationConfig  # noqa: E402
+from transformers.image_utils import load_image  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (  # noqa: E402
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (  # noqa: E402
+    prepare_sampling_params,
+)
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+IMAGE_TOKEN_ID = 151655
+IMAGE_SIZE = 256
+NUM_VISION_TOKENS = 64  # (256/16)^2 / 4
+
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def create_model_and_tokenizer(seq_len, tp=1):
+    """Create and load model at specified config."""
+    traced_path = f"{DATA_PATH}/traced_model/Isaac-0.2-2B-bench-s{seq_len}-tp{tp}"
+
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[seq_len],
+        token_generation_buckets=[seq_len],
+        async_mode=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            temperature=1.0,
+            top_p=1.0,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        fused_qkv=False,
+        sequence_parallel_enabled=False,
+        attn_kernel_enabled=True,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp,
+        world_size=tp,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+    config.image_token_index = IMAGE_TOKEN_ID
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        MODEL_PATH, padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+
+    # Compile or load
+    if not os.path.exists(traced_path):
+        print(f"  Compiling (seq_len={seq_len}, TP={tp})...")
+        t0 = time.time()
+        model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config)
+        model.compile(traced_path, debug=False)
+        tokenizer.save_pretrained(traced_path)
+        print(f"  Compiled in {time.time() - t0:.1f}s")
+        model.load(traced_path, skip_warmup=True)
+    else:
+        print(f"  Loading from {traced_path}...")
+        model = NeuronIsaacForConditionalGeneration(traced_path, config)
+        model.load(traced_path, skip_warmup=True)
+
+    return model, tokenizer
+
+
+def benchmark_text(model, tokenizer, prompt, max_new_tokens, warmup, iterations):
+    """Benchmark text-only generation with proper warmup and timing."""
+    gen_model = HuggingFaceGenerationAdapter(model)
+
+    messages = [{"role": "user", "content": prompt}]
+    input_ids = tokenizer.apply_chat_template(
+        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+    )
+    attention_mask = torch.ones_like(input_ids)
+    input_len = input_ids.shape[1]
+
+    sampling_params = prepare_sampling_params(
+        batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0]
+    )
+    gen_config = GenerationConfig(
+        do_sample=False,
+        output_scores=True,
+        return_dict_in_generate=True,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=max_new_tokens,
+    )
+
+    # Warmup
+    for _ in range(warmup):
+        gen_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=gen_config,
+            max_new_tokens=max_new_tokens,
+        )
+
+    # Timed iterations
+    latencies = []
+    token_counts = []
+    for _ in range(iterations):
+        t0 = time.time()
+        outputs = gen_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=gen_config,
+            max_new_tokens=max_new_tokens,
+        )
+        elapsed = time.time() - t0
+
+        generated = outputs.sequences[0, input_len:]
+        n_tokens = len(generated)
+        latencies.append(elapsed)
+        token_counts.append(n_tokens)
+
+    gen_text = tokenizer.decode(
+        outputs.sequences[0, input_len:], skip_special_tokens=True
+    )
+
+    avg_tokens = statistics.mean(token_counts)
+    avg_latency = statistics.mean(latencies)
+    # TTFT ≈ latency - (n_tokens - 1) * TPOT; approximate TPOT from overall
+    avg_tpot = avg_latency / avg_tokens if avg_tokens > 1 else avg_latency
+    avg_ttft = (
+        avg_latency - (avg_tokens - 1) * avg_tpot if avg_tokens > 1 else avg_latency
+    )
+    avg_tps = avg_tokens / avg_latency
+
+    return {
+        "input_tokens": input_len,
+        "avg_output_tokens": avg_tokens,
+        "avg_latency_s": avg_latency,
+        "ttft_ms": avg_ttft * 1000,
+        "tpot_ms": avg_tpot * 1000,
+        "tok_per_sec": avg_tps,
+        "latency_std_ms": statistics.stdev(latencies) * 1000
+        if len(latencies) > 1
+        else 0,
+        "text_preview": gen_text[:150],
+    }
+
+
+def benchmark_image_text(model, tokenizer, max_new_tokens, warmup, iterations):
+    """Benchmark image+text generation."""
+    gen_model = HuggingFaceGenerationAdapter(model)
+
+    # Load test image
+    try:
+        ref_img = load_image(
+            "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp"
+        )
+    except Exception:
+        ref_img = Image.new("RGB", (256, 256), color="blue")
+
+    transform = T.Compose(
+        [
+            T.Resize(
+                (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC
+            ),
+            T.ToTensor(),
+            T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
+        ]
+    )
+    pixel_values = transform(ref_img).unsqueeze(0).to(torch.bfloat16)
+
+    # Build input with image tokens
+    prompt = "Describe this image in detail."
+    messages = [{"role": "user", "content": f"<image>\n{prompt}"}]
+    text = tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+    full_ids = tokenizer.encode(text, return_tensors="pt")[0]
+
+    image_text_ids = tokenizer.encode("<image>", add_special_tokens=False)
+    image_text_tensor = torch.tensor(image_text_ids)
+    found_pos = -1
+    for idx in range(len(full_ids) - len(image_text_ids) + 1):
+        if torch.equal(full_ids[idx : idx + len(image_text_ids)], image_text_tensor):
+            found_pos = idx
+            break
+
+    if found_pos >= 0:
+        before = full_ids[:found_pos]
+        after = full_ids[found_pos + len(image_text_ids) :]
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0)
+    else:
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0)
+
+    attention_mask = torch.ones_like(input_ids)
+    vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool)
+    input_len = input_ids.shape[1]
+
+    sampling_params = prepare_sampling_params(
+        batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0]
+    )
+    gen_config = GenerationConfig(
+        do_sample=False,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=max_new_tokens,
+    )
+
+    # Warmup
+    for _ in range(warmup):
+        gen_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=gen_config,
+            max_new_tokens=max_new_tokens,
+            pixel_values=pixel_values,
+            vision_mask=vision_mask,
+        )
+
+    # Timed iterations
+    latencies = []
+    token_counts = []
+    for _ in range(iterations):
+        t0 = time.time()
+        outputs = gen_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=gen_config,
+            max_new_tokens=max_new_tokens,
+            pixel_values=pixel_values,
+            vision_mask=vision_mask,
+        )
+        elapsed = time.time() - t0
+
+        generated = outputs[0, input_len:]
+        n_tokens = len(generated)
+        latencies.append(elapsed)
+        token_counts.append(n_tokens)
+
+    gen_text = tokenizer.decode(outputs[0, input_len:], skip_special_tokens=True)
+
+    avg_tokens = statistics.mean(token_counts)
+    avg_latency = statistics.mean(latencies)
+    avg_tpot = avg_latency / avg_tokens if avg_tokens > 1 else avg_latency
+    avg_ttft = (
+        avg_latency - (avg_tokens - 1) * avg_tpot if avg_tokens > 1 else avg_latency
+    )
+    avg_tps = avg_tokens / avg_latency
+
+    return {
+        "input_tokens": input_len,
+        "vision_tokens": NUM_VISION_TOKENS,
+        "avg_output_tokens": avg_tokens,
+        "avg_latency_s": avg_latency,
+        "ttft_ms": avg_ttft * 1000,
+        "tpot_ms": avg_tpot * 1000,
+        "tok_per_sec": avg_tps,
+        "latency_std_ms": statistics.stdev(latencies) * 1000
+        if len(latencies) > 1
+        else 0,
+        "text_preview": gen_text[:150],
+    }
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--seq-len", type=int, default=1024)
+    parser.add_argument("--tp", type=int, default=1)
+    parser.add_argument("--warmup", type=int, default=3)
+    parser.add_argument("--iterations", type=int, default=10)
+    parser.add_argument("--max-new-tokens", type=int, default=128)
+    args = parser.parse_args()
+
+    print(f"{'=' * 70}")
+    print(f"ISAAC BENCHMARK — seq_len={args.seq_len}, TP={args.tp}")
+    print(
+        f"warmup={args.warmup}, iterations={args.iterations}, max_new_tokens={args.max_new_tokens}"
+    )
+    print(f"{'=' * 70}")
+
+    model, tokenizer = create_model_and_tokenizer(args.seq_len, args.tp)
+
+    all_results = {
+        "config": {
+            "seq_len": args.seq_len,
+            "tp": args.tp,
+            "batch_size": 1,
+            "warmup": args.warmup,
+            "iterations": args.iterations,
+            "max_new_tokens": args.max_new_tokens,
+            "instance": "trn2.3xlarge",
+            "lnc": 2,
+            "sdk": "2.29",
+            "model": "Isaac-0.2-2B-Preview",
+        },
+        "text_benchmarks": [],
+        "image_text_benchmark": None,
+    }
+
+    # Text benchmarks — short, medium, long prompts
+    text_prompts = [
+        ("short", "The capital of France is", 32),
+        ("medium", "Explain quantum entanglement in simple terms:", 128),
+        (
+            "long",
+            "Write a detailed essay about the history and future of artificial intelligence, "
+            "covering its origins, key milestones, current capabilities, and predictions "
+            "for the next decade:",
+            args.max_new_tokens,
+        ),
+    ]
+
+    for label, prompt, max_tok in text_prompts:
+        print(f"\n--- Text benchmark: {label} (max_new_tokens={max_tok}) ---")
+        result = benchmark_text(
+            model, tokenizer, prompt, max_tok, args.warmup, args.iterations
+        )
+        result["label"] = label
+        result["prompt"] = prompt[:80]
+        all_results["text_benchmarks"].append(result)
+        print(
+            f"  Input: {result['input_tokens']} tok, Output: {result['avg_output_tokens']:.0f} tok"
+        )
+        print(f"  TTFT: {result['ttft_ms']:.1f}ms")
+        print(f"  TPOT: {result['tpot_ms']:.2f}ms")
+        print(f"  Throughput: {result['tok_per_sec']:.1f} tok/s")
+        print(f"  Latency std: {result['latency_std_ms']:.1f}ms")
+
+    # Image+text benchmark
+    print(f"\n--- Image+text benchmark ---")
+    img_result = benchmark_image_text(
+        model, tokenizer, args.max_new_tokens, args.warmup, args.iterations
+    )
+    all_results["image_text_benchmark"] = img_result
+    print(
+        f"  Input: {img_result['input_tokens']} tok ({img_result['vision_tokens']} vision)"
+    )
+    print(f"  Output: {img_result['avg_output_tokens']:.0f} tok")
+    print(f"  TTFT: {img_result['ttft_ms']:.1f}ms (includes vision encoding)")
+    print(f"  TPOT: {img_result['tpot_ms']:.2f}ms")
+    print(f"  Throughput: {img_result['tok_per_sec']:.1f} tok/s")
+
+    # Summary table
+    print(f"\n{'=' * 70}")
+    print("BENCHMARK SUMMARY")
+    print(f"{'=' * 70}")
+    print(
+        f"{'Workload':<20} {'In':>5} {'Out':>5} {'TTFT(ms)':>10} {'TPOT(ms)':>10} {'tok/s':>8}"
+    )
+    print("-" * 60)
+    for r in all_results["text_benchmarks"]:
+        print(
+            f"{r['label']:<20} {r['input_tokens']:>5} {r['avg_output_tokens']:>5.0f} "
+            f"{r['ttft_ms']:>10.1f} {r['tpot_ms']:>10.2f} {r['tok_per_sec']:>8.1f}"
+        )
+    ir = all_results["image_text_benchmark"]
+    print(
+        f"{'image+text':<20} {ir['input_tokens']:>5} {ir['avg_output_tokens']:>5.0f} "
+        f"{ir['ttft_ms']:>10.1f} {ir['tpot_ms']:>10.2f} {ir['tok_per_sec']:>8.1f}"
+    )
+
+    # Save
+    out_path = os.path.join(
+        REFERENCE_DIR, f"benchmark_s{args.seq_len}_tp{args.tp}.json"
+    )
+    with open(out_path, "w") as f:
+        json.dump(all_results, f, indent=2, default=str)
+    print(f"\nResults saved to {out_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/kernel_sweep.py b/contrib/models/Isaac-0.2-2B/test/integration/kernel_sweep.py
new file mode 100644
index 00000000..3a00b9d2
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/kernel_sweep.py
@@ -0,0 +1,313 @@
+#!/usr/bin/env python3
+"""Kernel sweep for Isaac-0.2-2B: test TKG attention block, MLP, out_proj, and combos.
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:/mnt/models/neuronx-distributed-inference/src:$PYTHONPATH \
+        python3 kernel_sweep.py
+"""
+
+import os
+import sys
+import time
+import json
+import torch
+import traceback
+
+# Ensure the correct paths
+NXDI_ROOT = "/mnt/models/neuronx-distributed-inference"
+sys.path.insert(0, f"{NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src")
+sys.path.insert(0, f"{NXDI_ROOT}/src")
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+from transformers import AutoConfig, AutoTokenizer
+from neuronx_distributed_inference.models.config import (
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (
+    prepare_sampling_params,
+)
+from isaac_neuron.modeling_isaac import (
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+MODEL_PATH = "/mnt/models/Isaac-0.2-2B-Preview"
+COMPILED_BASE = "/mnt/models/traced_model/Isaac-0.2-2B"
+
+# Kernel configurations to test
+CONFIGS = {
+    "baseline": {
+        "desc": "No kernels (reference)",
+        "tp": 1,
+        "flags": {
+            "attn_kernel_enabled": False,
+            "mlp_kernel_enabled": False,
+            "fused_qkv": False,
+        },
+    },
+    "cte_flash_only": {
+        "desc": "CTE flash attention only (current production config)",
+        "tp": 1,
+        "flags": {
+            "attn_kernel_enabled": True,
+            "mlp_kernel_enabled": False,
+            "fused_qkv": False,
+        },
+    },
+    "mlp_tp1": {
+        "desc": "MLP kernel at TP=1 (nkilib production, NOT experimental)",
+        "tp": 1,
+        "flags": {
+            "attn_kernel_enabled": True,
+            "mlp_kernel_enabled": True,
+            "fused_qkv": False,
+        },
+    },
+    "tkg_block": {
+        "desc": "TKG attention block kernel (fuses RMSNorm+QKV+QKnorm+RoPE+Attn+Oproj)",
+        "tp": 1,
+        "flags": {
+            "attn_kernel_enabled": True,
+            "mlp_kernel_enabled": False,
+            "fused_qkv": True,
+            "qkv_kernel_enabled": True,
+            "attn_block_tkg_nki_kernel_enabled": True,
+        },
+    },
+    "tkg_block_plus_mlp": {
+        "desc": "TKG block + MLP kernel (full TKG optimization)",
+        "tp": 1,
+        "flags": {
+            "attn_kernel_enabled": True,
+            "mlp_kernel_enabled": True,
+            "fused_qkv": True,
+            "qkv_kernel_enabled": True,
+            "attn_block_tkg_nki_kernel_enabled": True,
+        },
+    },
+    "out_proj": {
+        "desc": "CTE flash + out_proj kernel",
+        "tp": 1,
+        "flags": {
+            "attn_kernel_enabled": True,
+            "mlp_kernel_enabled": False,
+            "fused_qkv": False,
+            "out_proj_kernel_enabled": True,
+        },
+    },
+    "tkg_block_mlp_outproj": {
+        "desc": "TKG block + MLP + out_proj (maximum kernel coverage)",
+        "tp": 1,
+        "flags": {
+            "attn_kernel_enabled": True,
+            "mlp_kernel_enabled": True,
+            "fused_qkv": True,
+            "qkv_kernel_enabled": True,
+            "attn_block_tkg_nki_kernel_enabled": True,
+            "out_proj_kernel_enabled": True,
+        },
+    },
+}
+
+
+def build_config(config_name, flags, tp_degree=1, seq_len=1024):
+    """Build IsaacInferenceConfig with specified kernel flags."""
+    compiled_dir = f"{COMPILED_BASE}/kernel_sweep_{config_name}"
+
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp_degree,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[seq_len],
+        token_generation_buckets=[seq_len],
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        save_sharded_checkpoint=True,
+        **flags,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp_degree,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        save_sharded_checkpoint=True,
+        fused_qkv=False,
+    )
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    inference_config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+    # Override save/compiled paths
+    inference_config.save_path = compiled_dir
+    inference_config.compiled_model_path = compiled_dir
+
+    return inference_config
+
+
+def test_config(config_name, config_info):
+    """Test a single kernel configuration: compile, load, generate, benchmark."""
+    print(f"\n{'=' * 70}")
+    print(f"Testing: {config_name}")
+    print(f"  {config_info['desc']}")
+    print(f"  Flags: {config_info['flags']}")
+    print(f"{'=' * 70}")
+
+    tp = config_info["tp"]
+    flags = config_info["flags"]
+
+    try:
+        inference_config = build_config(config_name, flags, tp_degree=tp)
+        compiled_dir = f"{COMPILED_BASE}/kernel_sweep_{config_name}"
+
+        # Compile
+        t0 = time.time()
+        print(f"  Compiling...")
+        model = NeuronIsaacForConditionalGeneration(MODEL_PATH, inference_config)
+        model.compile(compiled_dir, debug=False)
+        compile_time = time.time() - t0
+        print(f"  Compile time: {compile_time:.1f}s")
+
+        # Load
+        print(f"  Loading compiled model...")
+        model.load(compiled_dir, skip_warmup=True)
+
+        # Generate text-only
+        tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
+        tokenizer.pad_token = tokenizer.eos_token
+        prompt = "What is the capital of France?"
+        messages = [{"role": "user", "content": prompt}]
+        input_ids = tokenizer.apply_chat_template(
+            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+        )
+        attention_mask = torch.ones_like(input_ids)
+
+        generation_model = HuggingFaceGenerationAdapter(model)
+        sampling_params = prepare_sampling_params(
+            batch_size=1,
+            top_k=[1],
+            top_p=[1.0],
+            temperature=[0.0],
+        )
+        gen_kwargs = dict(
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            max_new_tokens=50,
+        )
+
+        # Warmup
+        print(f"  Warmup (3 runs)...")
+        for _ in range(3):
+            out = generation_model.generate(input_ids, **gen_kwargs)
+
+        # Benchmark (10 runs)
+        print(f"  Benchmarking (10 runs, 50 tokens each)...")
+        times = []
+        for _ in range(10):
+            t0 = time.time()
+            out = generation_model.generate(input_ids, **gen_kwargs)
+            times.append(time.time() - t0)
+
+        output_text = tokenizer.decode(
+            out[0][input_ids.shape[1] :], skip_special_tokens=True
+        )
+        avg_time = sum(times) / len(times)
+        tok_per_sec = 50 / avg_time
+        tpot_ms = (avg_time / 50) * 1000
+
+        result = {
+            "status": "SUCCESS",
+            "compile_time_s": compile_time,
+            "avg_time_s": avg_time,
+            "tok_per_sec": tok_per_sec,
+            "tpot_ms": tpot_ms,
+            "output_preview": output_text[:100],
+        }
+        print(f"  tok/s: {tok_per_sec:.1f}")
+        print(f"  TPOT:  {tpot_ms:.2f} ms")
+        print(f"  Output: {output_text[:80]}...")
+
+        # Cleanup
+        del model
+        torch.cuda.empty_cache() if torch.cuda.is_available() else None
+
+        return result
+
+    except Exception as e:
+        tb = traceback.format_exc()
+        print(f"  FAILED: {e}")
+        print(f"  {tb[-500:]}")
+        return {
+            "status": "FAILED",
+            "error": str(e),
+            "traceback": tb[-500:],
+        }
+
+
+def main():
+    # Parse args
+    configs_to_test = sys.argv[1:] if len(sys.argv) > 1 else list(CONFIGS.keys())
+
+    print(f"Isaac Kernel Sweep")
+    print(f"Configs to test: {configs_to_test}")
+    print(f"Model: {MODEL_PATH}")
+
+    results = {}
+    for name in configs_to_test:
+        if name not in CONFIGS:
+            print(f"Unknown config: {name}, skipping")
+            continue
+        results[name] = test_config(name, CONFIGS[name])
+
+    # Summary
+    print(f"\n{'=' * 80}")
+    print(f"KERNEL SWEEP SUMMARY")
+    print(f"{'=' * 80}")
+    print(f"{'Config':<25} {'Status':<10} {'tok/s':>8} {'TPOT ms':>10} {'Compile':>10}")
+    print("-" * 70)
+    for name, r in results.items():
+        if r["status"] == "SUCCESS":
+            print(
+                f"{name:<25} {'OK':<10} {r['tok_per_sec']:>8.1f} {r['tpot_ms']:>10.2f} {r['compile_time_s']:>10.1f}s"
+            )
+        else:
+            print(f"{name:<25} {'FAIL':<10} {'—':>8} {'—':>10} {'—':>10}")
+
+    # Save results
+    out_path = "/mnt/models/kernel_sweep_results.json"
+    with open(out_path, "w") as f:
+        json.dump(results, f, indent=2, default=str)
+    print(f"\nResults saved to {out_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py b/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py
new file mode 100644
index 00000000..c1359cd5
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py
@@ -0,0 +1,255 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Isaac-0.2-2B NxDI integration test script.
+
+Compiles and runs the Isaac VLM model on Neuron.
+Supports both text-only and image+text generation.
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    python run_isaac.py
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import logging  # noqa: E402
+import os  # noqa: E402
+
+import torch  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, AutoProcessor  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)  # noqa: E402
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (
+    prepare_sampling_params,
+)  # noqa: E402
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# Configure logging
+logger = logging.getLogger(__name__)
+logger.setLevel(logging.DEBUG)
+
+# Model configuration
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+
+CONFIG = {
+    "TEXT_TP_DEGREE": 1,  # TP=1 for 2B model on trn2.3xlarge
+    "VISION_TP_DEGREE": 1,
+    "WORLD_SIZE": 1,
+    "BATCH_SIZE": 1,
+    "SEQ_LENGTH": 1024,  # Start small for initial compilation test
+    "CTX_BUCKETS": [1024],
+    "TKG_BUCKETS": [1024],
+    "DTYPE": torch.bfloat16,
+    "MODEL_PATH": f"{DATA_PATH}/Isaac-0.2-2B-Preview",
+    "TRACED_MODEL_PATH": f"{DATA_PATH}/traced_model/Isaac-0.2-2B",
+    "MAX_NEW_TOKENS": 50,
+    # Optimizations
+    "FUSED_QKV": False,  # Start without QKV fusion
+    "VISION_FUSED_QKV": False,
+    "ASYNC_MODE": False,  # Disable async for debugging
+    "OUTPUT_LOGITS": True,
+    "ON_DEVICE_SAMPLING": OnDeviceSamplingConfig(
+        dynamic=True,
+        do_sample=True,
+        deterministic=True,
+        temperature=1.0,
+        top_p=1.0,
+        top_k=1,  # Greedy for validation
+        global_topk=256,
+        top_k_kernel_enabled=True,
+    ),
+}
+
+# Environment setup
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def create_neuron_configs():
+    """Create text and vision neuron configurations."""
+    text_config = NeuronConfig(
+        batch_size=CONFIG["BATCH_SIZE"],
+        seq_len=CONFIG["SEQ_LENGTH"],
+        torch_dtype=CONFIG["DTYPE"],
+        # Distributed
+        tp_degree=CONFIG["TEXT_TP_DEGREE"],
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        # Continuous batching
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        # Bucketing
+        enable_bucketing=True,
+        context_encoding_buckets=CONFIG["CTX_BUCKETS"],
+        token_generation_buckets=CONFIG["TKG_BUCKETS"],
+        # Optimizations
+        async_mode=CONFIG["ASYNC_MODE"],
+        on_device_sampling_config=CONFIG["ON_DEVICE_SAMPLING"],
+        output_logits=CONFIG["OUTPUT_LOGITS"],
+        fused_qkv=CONFIG["FUSED_QKV"],
+        sequence_parallel_enabled=False,
+        # Kernels — conservative for initial test
+        # ISA limit: text MLP intermediate=6144 > 4096 at TP=1
+        attn_kernel_enabled=False,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=CONFIG["BATCH_SIZE"],
+        seq_len=CONFIG["SEQ_LENGTH"],
+        torch_dtype=CONFIG["DTYPE"],
+        # Distributed
+        tp_degree=CONFIG["VISION_TP_DEGREE"],
+        world_size=CONFIG["WORLD_SIZE"],
+        save_sharded_checkpoint=True,
+        # Continuous batching
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        # Bucketing
+        enable_bucketing=True,
+        buckets=[1],
+        # Optimizations
+        fused_qkv=CONFIG["VISION_FUSED_QKV"],
+        # Kernels — all disabled for vision encoder
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    return text_config, vision_config
+
+
+def setup_model():
+    """Initialize model configuration and compile/load."""
+    text_config, vision_config = create_neuron_configs()
+
+    # Isaac uses trust_remote_code; load HF config directly
+    hf_config = AutoConfig.from_pretrained(CONFIG["MODEL_PATH"], trust_remote_code=True)
+
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+
+    print(
+        f"Text config: {config.text_config.num_hidden_layers} layers, "
+        f"hidden={config.text_config.hidden_size}"
+    )
+    print(
+        f"Vision config: {config.vision_config.num_hidden_layers} layers, "
+        f"hidden={config.vision_config.hidden_size}"
+    )
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        CONFIG["MODEL_PATH"], padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+
+    return config, tokenizer
+
+
+def compile_model(config, tokenizer):
+    """Compile model (text + vision) and save traced artifacts."""
+    print("\nCompiling Isaac model (text + vision)...")
+    model = NeuronIsaacForConditionalGeneration(CONFIG["MODEL_PATH"], config)
+    # debug=False to avoid profiler's CUDA introspection issue on Neuron instances
+    model.compile(CONFIG["TRACED_MODEL_PATH"], debug=False)
+    tokenizer.save_pretrained(CONFIG["TRACED_MODEL_PATH"])
+    print(f"Model compiled and saved to {CONFIG['TRACED_MODEL_PATH']}")
+    # Load compiled model for inference
+    model.load(CONFIG["TRACED_MODEL_PATH"], skip_warmup=True)
+    return model
+
+
+def load_model():
+    """Load pre-compiled model from traced checkpoint."""
+    print(f"\nLoading model from {CONFIG['TRACED_MODEL_PATH']}...")
+    model = NeuronIsaacForConditionalGeneration(CONFIG["TRACED_MODEL_PATH"])
+    model.load(CONFIG["TRACED_MODEL_PATH"], skip_warmup=True)
+    return model
+
+
+def run_text_only(model, tokenizer):
+    """Run text-only generation test."""
+    print("\n=== Text-only Generation ===")
+    prompt = "The capital of France is"
+
+    messages = [{"role": "user", "content": prompt}]
+    # Use tokenizer directly (Isaac's processor requires tensor_stream for images)
+    input_ids = tokenizer.apply_chat_template(
+        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+    )
+    attention_mask = torch.ones_like(input_ids)
+
+    print(f"Input: '{prompt}'")
+    print(f"Input IDs shape: {input_ids.shape}")
+
+    generation_model = HuggingFaceGenerationAdapter(model)
+    sampling_params = prepare_sampling_params(
+        batch_size=CONFIG["BATCH_SIZE"],
+        top_k=[1],
+        top_p=[1.0],
+        temperature=[0.0],
+    )
+
+    outputs = generation_model.generate(
+        input_ids,
+        attention_mask=attention_mask,
+        max_length=model.config.neuron_config.max_length,
+        sampling_params=sampling_params,
+        max_new_tokens=CONFIG["MAX_NEW_TOKENS"],
+    )
+
+    output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
+    for i, text in enumerate(output_text):
+        print(f"Output {i}: {text}")
+
+
+def main():
+    import sys
+
+    config, tokenizer = setup_model()
+
+    mode = sys.argv[1] if len(sys.argv) > 1 else "auto"
+
+    if mode == "compile":
+        # Force recompilation
+        import shutil
+
+        if os.path.exists(CONFIG["TRACED_MODEL_PATH"]):
+            print(f"Removing old traced model at {CONFIG['TRACED_MODEL_PATH']}...")
+            shutil.rmtree(CONFIG["TRACED_MODEL_PATH"])
+        model = compile_model(config, tokenizer)
+    elif mode == "load":
+        # Load only
+        model = load_model()
+    else:
+        # Auto: compile if not found, else load
+        if not os.path.exists(CONFIG["TRACED_MODEL_PATH"]):
+            model = compile_model(config, tokenizer)
+        else:
+            model = load_model()
+
+    run_text_only(model, tokenizer)
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_kernels.py b/contrib/models/Isaac-0.2-2B/test/integration/test_kernels.py
new file mode 100644
index 00000000..cd933cc9
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/test_kernels.py
@@ -0,0 +1,357 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Test NKI kernel enablement for Isaac at TP=1.
+
+Incrementally enables kernels and validates:
+1. Compilation succeeds
+2. Accuracy matches baseline (cosine vs CPU reference)
+3. Throughput improvement
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    python test_kernels.py
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import json  # noqa: E402
+import os  # noqa: E402
+import shutil  # noqa: E402
+import sys  # noqa: E402
+import time  # noqa: E402
+import traceback  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, GenerationConfig  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (  # noqa: E402
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (  # noqa: E402
+    prepare_sampling_params,
+)
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+# Kernel configurations to test (incremental enablement)
+KERNEL_CONFIGS = {
+    "baseline": {
+        "description": "No kernels (current default)",
+        "text_config": {
+            "fused_qkv": False,
+            "attn_kernel_enabled": False,
+            "attn_tkg_nki_kernel_enabled": False,
+            "attn_tkg_builtin_kernel_enabled": False,
+            "qkv_kernel_enabled": False,
+            "mlp_kernel_enabled": False,
+        },
+    },
+    "cte_flash_attn": {
+        "description": "CTE flash attention only",
+        "text_config": {
+            "fused_qkv": False,
+            "attn_kernel_enabled": True,
+            "attn_tkg_nki_kernel_enabled": False,
+            "attn_tkg_builtin_kernel_enabled": False,
+            "qkv_kernel_enabled": False,
+            "mlp_kernel_enabled": False,
+        },
+    },
+    "mlp_kernel": {
+        "description": "MLP kernel only",
+        "text_config": {
+            "fused_qkv": False,
+            "attn_kernel_enabled": False,
+            "attn_tkg_nki_kernel_enabled": False,
+            "attn_tkg_builtin_kernel_enabled": False,
+            "qkv_kernel_enabled": False,
+            "mlp_kernel_enabled": True,
+        },
+    },
+    "qkv_kernel": {
+        "description": "QKV kernel (requires fused_qkv)",
+        "text_config": {
+            "fused_qkv": True,
+            "attn_kernel_enabled": False,
+            "attn_tkg_nki_kernel_enabled": False,
+            "attn_tkg_builtin_kernel_enabled": False,
+            "qkv_kernel_enabled": True,
+            "qkv_nki_kernel_enabled": True,
+            "mlp_kernel_enabled": False,
+        },
+    },
+    "cte_flash_plus_mlp": {
+        "description": "CTE flash attention + MLP kernel",
+        "text_config": {
+            "fused_qkv": False,
+            "attn_kernel_enabled": True,
+            "attn_tkg_nki_kernel_enabled": False,
+            "attn_tkg_builtin_kernel_enabled": False,
+            "qkv_kernel_enabled": False,
+            "mlp_kernel_enabled": True,
+        },
+    },
+    "full_suite": {
+        "description": "All kernels: CTE flash + QKV + MLP + fused residual",
+        "text_config": {
+            "fused_qkv": True,
+            "attn_kernel_enabled": True,
+            "attn_tkg_nki_kernel_enabled": False,
+            "attn_tkg_builtin_kernel_enabled": False,
+            "qkv_kernel_enabled": True,
+            "qkv_nki_kernel_enabled": True,
+            "mlp_kernel_enabled": True,
+            "mlp_kernel_fuse_residual_add": True,
+            "qkv_kernel_fuse_residual_add": True,
+            "out_proj_kernel_enabled": True,
+        },
+    },
+}
+
+PROMPTS = [
+    "The capital of France is",
+    "Explain quantum entanglement in simple terms:",
+]
+
+
+def create_config(kernel_name, kernel_cfg):
+    """Create config with specified kernel settings."""
+    traced_path = f"{DATA_PATH}/traced_model/Isaac-0.2-2B-kernel-{kernel_name}"
+
+    text_overrides = kernel_cfg["text_config"]
+
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[1024],
+        token_generation_buckets=[1024],
+        async_mode=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            temperature=1.0,
+            top_p=1.0,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        sequence_parallel_enabled=False,
+        **text_overrides,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        world_size=1,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+    config.image_token_index = 151655
+
+    return config, traced_path
+
+
+def test_kernel_config(kernel_name, kernel_cfg, tokenizer):
+    """Test a single kernel configuration."""
+    print(f"\n{'=' * 70}")
+    print(f"Testing: {kernel_name} — {kernel_cfg['description']}")
+    print(f"{'=' * 70}")
+
+    config, traced_path = create_config(kernel_name, kernel_cfg)
+    result = {
+        "name": kernel_name,
+        "description": kernel_cfg["description"],
+        "compiled": False,
+        "accuracy_pass": False,
+        "prompts": [],
+        "compile_time": None,
+        "error": None,
+    }
+
+    # Clean and compile
+    if os.path.exists(traced_path):
+        shutil.rmtree(traced_path)
+
+    try:
+        t0 = time.time()
+        model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config)
+        model.compile(traced_path, debug=False)
+        tokenizer.save_pretrained(traced_path)
+        compile_time = time.time() - t0
+        model.load(traced_path, skip_warmup=True)
+        result["compiled"] = True
+        result["compile_time"] = compile_time
+        print(f"  Compiled in {compile_time:.1f}s")
+    except Exception as e:
+        result["error"] = str(e)
+        print(f"  COMPILATION FAILED: {e}")
+        traceback.print_exc()
+        return result
+
+    # Validate accuracy
+    generation_model = HuggingFaceGenerationAdapter(model)
+    all_passed = True
+
+    for i, prompt in enumerate(PROMPTS):
+        messages = [{"role": "user", "content": prompt}]
+        input_ids = tokenizer.apply_chat_template(
+            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+        )
+        attention_mask = torch.ones_like(input_ids)
+
+        sampling_params = prepare_sampling_params(
+            batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0]
+        )
+        gen_config = GenerationConfig(
+            do_sample=False,
+            output_scores=True,
+            return_dict_in_generate=True,
+            pad_token_id=tokenizer.eos_token_id,
+            max_new_tokens=50,
+        )
+
+        t0 = time.time()
+        outputs = generation_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=gen_config,
+            max_new_tokens=50,
+        )
+        elapsed = time.time() - t0
+
+        generated = outputs.sequences[0, input_ids.shape[1] :]
+        gen_text = tokenizer.decode(generated, skip_special_tokens=True)
+        n_tokens = len(generated)
+        tok_per_sec = n_tokens / elapsed if elapsed > 0 else 0
+
+        # Compare first-token logits
+        neuron_logits = outputs.scores[0][0].float().cpu()
+        ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt")
+        cosine = -1.0
+        if os.path.exists(ref_path):
+            ref_logits = torch.load(ref_path, map_location="cpu")
+            cosine = F.cosine_similarity(
+                neuron_logits.unsqueeze(0), ref_logits.unsqueeze(0)
+            ).item()
+
+        top1_match = neuron_logits.argmax().item() == 151667
+        passed = cosine >= 0.99 and top1_match
+        if not passed:
+            all_passed = False
+
+        prompt_result = {
+            "prompt": prompt,
+            "cosine": cosine,
+            "top1_match": top1_match,
+            "passed": passed,
+            "text": gen_text[:200],
+            "n_tokens": n_tokens,
+            "tok_per_sec": tok_per_sec,
+            "elapsed": elapsed,
+        }
+        result["prompts"].append(prompt_result)
+        print(
+            f"  Prompt {i}: cosine={cosine:.6f}, top1={'OK' if top1_match else 'MISS'}, "
+            f"{n_tokens} tok, {tok_per_sec:.1f} tok/s | {gen_text[:60]!r}"
+        )
+
+    result["accuracy_pass"] = all_passed
+
+    # Cleanup model to free NeuronCores
+    del model
+    del generation_model
+    import gc
+
+    gc.collect()
+
+    return result
+
+
+def main():
+    tokenizer = AutoTokenizer.from_pretrained(
+        MODEL_PATH, padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+
+    results = []
+    for name, cfg in KERNEL_CONFIGS.items():
+        r = test_kernel_config(name, cfg, tokenizer)
+        results.append(r)
+
+    # Summary table
+    print(f"\n{'=' * 70}")
+    print("KERNEL TEST SUMMARY")
+    print(f"{'=' * 70}")
+    print(
+        f"{'Config':<25} {'Compiled':>10} {'Accuracy':>10} {'Compile(s)':>12} {'tok/s (avg)':>12}"
+    )
+    print("-" * 70)
+    for r in results:
+        compiled = "YES" if r["compiled"] else "FAIL"
+        accuracy = "PASS" if r["accuracy_pass"] else "FAIL"
+        compile_t = f"{r['compile_time']:.1f}" if r["compile_time"] else "N/A"
+        avg_tps = "N/A"
+        if r["prompts"]:
+            tps_vals = [p["tok_per_sec"] for p in r["prompts"] if p["tok_per_sec"] > 0]
+            if tps_vals:
+                avg_tps = f"{sum(tps_vals) / len(tps_vals):.1f}"
+        print(
+            f"{r['name']:<25} {compiled:>10} {accuracy:>10} {compile_t:>12} {avg_tps:>12}"
+        )
+
+    # Save results
+    out_path = os.path.join(REFERENCE_DIR, "kernel_test_results.json")
+    with open(out_path, "w") as f:
+        json.dump(results, f, indent=2, default=str)
+    print(f"\nResults saved to {out_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_scaling.py b/contrib/models/Isaac-0.2-2B/test/integration/test_scaling.py
new file mode 100644
index 00000000..08968ab9
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/test_scaling.py
@@ -0,0 +1,362 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Test Isaac scaling: sequence length and batch size.
+
+Tests compilation and throughput at various seq_len and batch_size.
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+
+    # Test single config
+    python test_scaling.py --seq-len 2048 --batch-size 1
+
+    # Test all configs (sequential)
+    python test_scaling.py --sweep
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import argparse  # noqa: E402
+import json  # noqa: E402
+import os  # noqa: E402
+import shutil  # noqa: E402
+import subprocess  # noqa: E402
+import sys  # noqa: E402
+import time  # noqa: E402
+import traceback  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, GenerationConfig  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (  # noqa: E402
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (  # noqa: E402
+    prepare_sampling_params,
+)
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def get_hbm_usage():
+    """Get current HBM usage from neuron-ls."""
+    try:
+        result = subprocess.run(
+            ["neuron-ls", "--json-output"],
+            capture_output=True,
+            text=True,
+            timeout=10,
+        )
+        if result.returncode == 0:
+            data = json.loads(result.stdout)
+            for device in data:
+                mem = device.get("neuron_device", {}).get("memory", {})
+                used = mem.get("used_bytes", 0)
+                total = mem.get("total_bytes", 0)
+                return used / 1e9, total / 1e9  # GB
+    except Exception:
+        pass
+    return None, None
+
+
+def create_config(seq_len, batch_size, tp=1):
+    """Create configs for a given seq_len and batch_size."""
+    traced_path = f"{DATA_PATH}/traced_model/Isaac-2B-s{seq_len}-b{batch_size}-tp{tp}"
+
+    # Build bucketing: CTE uses the seq_len bucket, TKG uses same
+    cte_buckets = [seq_len]
+    tkg_buckets = [seq_len]
+
+    text_config = NeuronConfig(
+        batch_size=batch_size,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=batch_size,
+        enable_bucketing=True,
+        context_encoding_buckets=cte_buckets,
+        token_generation_buckets=tkg_buckets,
+        async_mode=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            temperature=1.0,
+            top_p=1.0,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        fused_qkv=False,
+        sequence_parallel_enabled=False,
+        # Enable CTE flash attention (verified working)
+        attn_kernel_enabled=True,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=batch_size,
+        seq_len=seq_len,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp,
+        world_size=tp,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=batch_size,
+        enable_bucketing=True,
+        buckets=[batch_size],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+    config.image_token_index = 151655
+
+    return config, traced_path
+
+
+def test_config(seq_len, batch_size, tp=1, force_recompile=True):
+    """Test a single seq_len + batch_size configuration."""
+    print(f"\n{'=' * 70}")
+    print(f"Testing: seq_len={seq_len}, batch_size={batch_size}, TP={tp}")
+    print(f"{'=' * 70}")
+
+    result = {
+        "seq_len": seq_len,
+        "batch_size": batch_size,
+        "tp": tp,
+        "compiled": False,
+        "inference_ok": False,
+        "compile_time": None,
+        "hbm_used_gb": None,
+        "hbm_total_gb": None,
+        "ttft_ms": None,
+        "tkg_tok_per_sec": None,
+        "error": None,
+    }
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        MODEL_PATH, padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+
+    config, traced_path = create_config(seq_len, batch_size, tp)
+
+    if force_recompile and os.path.exists(traced_path):
+        shutil.rmtree(traced_path)
+
+    # Compile
+    try:
+        t0 = time.time()
+        model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config)
+        model.compile(traced_path, debug=False)
+        tokenizer.save_pretrained(traced_path)
+        compile_time = time.time() - t0
+        result["compiled"] = True
+        result["compile_time"] = compile_time
+        print(f"  Compiled in {compile_time:.1f}s")
+    except Exception as e:
+        result["error"] = str(e)[:500]
+        print(f"  COMPILATION FAILED: {str(e)[:200]}")
+        traceback.print_exc()
+        return result
+
+    # Load
+    try:
+        model.load(traced_path, skip_warmup=True)
+    except Exception as e:
+        result["error"] = f"Load failed: {str(e)[:400]}"
+        print(f"  LOAD FAILED: {str(e)[:200]}")
+        return result
+
+    # HBM usage
+    hbm_used, hbm_total = get_hbm_usage()
+    result["hbm_used_gb"] = hbm_used
+    result["hbm_total_gb"] = hbm_total
+    if hbm_used:
+        print(f"  HBM: {hbm_used:.1f} / {hbm_total:.1f} GB")
+
+    # Inference test
+    generation_model = HuggingFaceGenerationAdapter(model)
+    prompt = "Explain the theory of relativity in detail, covering both special and general relativity:"
+    messages = [{"role": "user", "content": prompt}]
+    input_ids = tokenizer.apply_chat_template(
+        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+    )
+
+    # For BS > 1, replicate input
+    if batch_size > 1:
+        input_ids = input_ids.repeat(batch_size, 1)
+
+    attention_mask = torch.ones_like(input_ids)
+
+    sampling_params = prepare_sampling_params(
+        batch_size=batch_size,
+        top_k=[1] * batch_size,
+        top_p=[1.0] * batch_size,
+        temperature=[1.0] * batch_size,
+    )
+    gen_config = GenerationConfig(
+        do_sample=False,
+        output_scores=True,
+        return_dict_in_generate=True,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=50,
+    )
+
+    try:
+        # TTFT: first token time
+        t0 = time.time()
+        outputs = generation_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=gen_config,
+            max_new_tokens=50,
+        )
+        total_time = time.time() - t0
+
+        generated = outputs.sequences[0, input_ids.shape[1] :]
+        gen_text = tokenizer.decode(generated, skip_special_tokens=True)
+        n_tokens = len(generated)
+
+        # TTFT approximation (first score is first token)
+        if hasattr(outputs, "scores") and len(outputs.scores) > 0:
+            # Rough: total_time / n_tokens gives TPOT, TTFT ≈ total_time - (n_tokens-1)*TPOT
+            tpot = total_time / n_tokens if n_tokens > 1 else total_time
+            ttft = total_time - (n_tokens - 1) * tpot if n_tokens > 1 else total_time
+        else:
+            ttft = total_time
+            tpot = total_time / n_tokens if n_tokens > 0 else 0
+
+        tok_per_sec = (n_tokens * batch_size) / total_time if total_time > 0 else 0
+
+        result["inference_ok"] = True
+        result["ttft_ms"] = ttft * 1000
+        result["tkg_tok_per_sec"] = tok_per_sec
+        result["tpot_ms"] = tpot * 1000
+        result["n_tokens"] = n_tokens
+        result["text_preview"] = gen_text[:100]
+
+        print(f"  Generated: {n_tokens} tokens in {total_time:.3f}s")
+        print(f"  TTFT: ~{ttft * 1000:.1f}ms, TPOT: ~{tpot * 1000:.1f}ms")
+        print(f"  Throughput: {tok_per_sec:.1f} tok/s (total across batch)")
+        print(f"  Text: {gen_text[:80]!r}")
+
+    except Exception as e:
+        result["error"] = f"Inference failed: {str(e)[:400]}"
+        print(f"  INFERENCE FAILED: {str(e)[:200]}")
+        traceback.print_exc()
+
+    # Cleanup
+    del model
+    del generation_model
+    import gc
+
+    gc.collect()
+
+    return result
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--seq-len", type=int, default=1024)
+    parser.add_argument("--batch-size", type=int, default=1)
+    parser.add_argument("--tp", type=int, default=1)
+    parser.add_argument("--sweep", action="store_true", help="Run full sweep")
+    parser.add_argument("--no-recompile", action="store_true")
+    args = parser.parse_args()
+
+    if args.sweep:
+        # Sweep configurations: seq_len first, then batch_size
+        configs = [
+            # Seq len sweep (BS=1)
+            (1024, 1),  # baseline
+            (2048, 1),
+            (4096, 1),
+            (8192, 1),
+            # Batch size sweep (seq_len=1024)
+            (1024, 2),
+            (1024, 4),
+            (1024, 8),
+        ]
+
+        results = []
+        for sl, bs in configs:
+            r = test_config(sl, bs, tp=args.tp, force_recompile=not args.no_recompile)
+            results.append(r)
+
+        # Summary
+        print(f"\n{'=' * 80}")
+        print("SCALING TEST SUMMARY")
+        print(f"{'=' * 80}")
+        print(
+            f"{'seq_len':>8} {'BS':>4} {'Compiled':>10} {'CompileT':>10} "
+            f"{'HBM(GB)':>10} {'TTFT(ms)':>10} {'tok/s':>10} {'TPOT(ms)':>10}"
+        )
+        print("-" * 80)
+        for r in results:
+            comp = "YES" if r["compiled"] else "FAIL"
+            ct = f"{r['compile_time']:.0f}" if r["compile_time"] else "N/A"
+            hbm = f"{r['hbm_used_gb']:.1f}" if r["hbm_used_gb"] else "N/A"
+            ttft = f"{r['ttft_ms']:.1f}" if r["ttft_ms"] else "N/A"
+            tps = f"{r['tkg_tok_per_sec']:.1f}" if r["tkg_tok_per_sec"] else "N/A"
+            tpot = f"{r.get('tpot_ms', 0):.1f}" if r.get("tpot_ms") else "N/A"
+            print(
+                f"{r['seq_len']:>8} {r['batch_size']:>4} {comp:>10} {ct:>10} "
+                f"{hbm:>10} {ttft:>10} {tps:>10} {tpot:>10}"
+            )
+
+        # Save
+        out_path = os.path.join(REFERENCE_DIR, "scaling_test_results.json")
+        with open(out_path, "w") as f:
+            json.dump(results, f, indent=2, default=str)
+        print(f"\nResults saved to {out_path}")
+
+    else:
+        r = test_config(
+            args.seq_len,
+            args.batch_size,
+            tp=args.tp,
+            force_recompile=not args.no_recompile,
+        )
+        print(f"\nResult: {json.dumps(r, indent=2, default=str)}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_tp.py b/contrib/models/Isaac-0.2-2B/test/integration/test_tp.py
new file mode 100644
index 00000000..7f99e0b9
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/test_tp.py
@@ -0,0 +1,387 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Test Isaac at TP=2 and TP=4 on trn2.3xlarge (LNC=2, 4 logical cores).
+
+Compiles fresh models at each TP degree, runs text-only + image+text,
+and compares first-token logits against CPU reference.
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    # TP=2:
+    python test_tp.py --tp 2
+    # TP=4:
+    python test_tp.py --tp 4
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import argparse  # noqa: E402
+import json  # noqa: E402
+import os  # noqa: E402
+import shutil  # noqa: E402
+import sys  # noqa: E402
+import time  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+import torchvision.transforms as T  # noqa: E402
+from PIL import Image  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, GenerationConfig  # noqa: E402
+from transformers.image_utils import load_image  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (  # noqa: E402
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (  # noqa: E402
+    prepare_sampling_params,
+)
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+
+IMAGE_TOKEN_ID = 151655
+IMAGE_SIZE = 256
+NUM_VISION_TOKENS = (IMAGE_SIZE // 16) ** 2 // 4  # 64
+
+TEXT_PROMPTS = [
+    "The capital of France is",
+    "def fibonacci(n):",
+    "Explain quantum entanglement in simple terms:",
+]
+
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def create_configs(tp_degree):
+    """Create neuron configs for a given TP degree."""
+    traced_path = f"{DATA_PATH}/traced_model/Isaac-0.2-2B-tp{tp_degree}"
+
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp_degree,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[1024],
+        token_generation_buckets=[1024],
+        async_mode=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            temperature=1.0,
+            top_p=1.0,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        fused_qkv=False,
+        sequence_parallel_enabled=False,
+        attn_kernel_enabled=False,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=tp_degree,
+        world_size=tp_degree,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+    config.image_token_index = IMAGE_TOKEN_ID
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        MODEL_PATH, padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+
+    return config, tokenizer, traced_path
+
+
+def compile_and_load(config, tokenizer, traced_path, force_recompile=False):
+    """Compile (if needed) and load the model."""
+    if force_recompile and os.path.exists(traced_path):
+        print(f"  Removing old traced model at {traced_path}...")
+        shutil.rmtree(traced_path)
+
+    if not os.path.exists(traced_path):
+        print(f"  Compiling at TP={config.neuron_config.tp_degree}...")
+        t0 = time.time()
+        model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config)
+        model.compile(traced_path, debug=False)
+        tokenizer.save_pretrained(traced_path)
+        compile_time = time.time() - t0
+        print(f"  Compilation complete in {compile_time:.1f}s")
+        model.load(traced_path, skip_warmup=True)
+    else:
+        print(f"  Loading existing model from {traced_path}...")
+        model = NeuronIsaacForConditionalGeneration(traced_path, config)
+        model.load(traced_path, skip_warmup=True)
+
+    return model
+
+
+def validate_text(model, tokenizer, tp_degree):
+    """Run text-only validation and compare against CPU reference."""
+    print(f"\n  --- Text-only validation (TP={tp_degree}) ---")
+    generation_model = HuggingFaceGenerationAdapter(model)
+
+    results = []
+    for i, prompt in enumerate(TEXT_PROMPTS):
+        messages = [{"role": "user", "content": prompt}]
+        input_ids = tokenizer.apply_chat_template(
+            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+        )
+        attention_mask = torch.ones_like(input_ids)
+
+        sampling_params = prepare_sampling_params(
+            batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0]
+        )
+        gen_config = GenerationConfig(
+            do_sample=False,
+            output_scores=True,
+            return_dict_in_generate=True,
+            pad_token_id=tokenizer.eos_token_id,
+            max_new_tokens=20,
+        )
+
+        t0 = time.time()
+        outputs = generation_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=gen_config,
+            max_new_tokens=20,
+        )
+        elapsed = time.time() - t0
+
+        generated = outputs.sequences[0, input_ids.shape[1] :]
+        gen_text = tokenizer.decode(generated, skip_special_tokens=True)
+        n_tokens = len(generated)
+
+        # First-token logits comparison
+        neuron_logits = outputs.scores[0][0].float().cpu()
+        ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt")
+        cosine = -1.0
+        if os.path.exists(ref_path):
+            ref_logits = torch.load(ref_path, map_location="cpu")
+            cosine = F.cosine_similarity(
+                neuron_logits.unsqueeze(0), ref_logits.unsqueeze(0)
+            ).item()
+
+        top1_match = neuron_logits.argmax().item() == 151667  # <think>
+
+        passed = cosine >= 0.99 and top1_match
+        print(
+            f"    Prompt {i}: cosine={cosine:.6f}, top1={'match' if top1_match else 'MISS'}, "
+            f"{n_tokens} tok in {elapsed:.2f}s | {gen_text[:80]!r}"
+        )
+
+        results.append(
+            {
+                "prompt": prompt,
+                "cosine": cosine,
+                "top1_match": top1_match,
+                "passed": passed,
+                "text": gen_text[:200],
+                "n_tokens": n_tokens,
+                "elapsed": elapsed,
+            }
+        )
+
+    all_passed = all(r["passed"] for r in results)
+    return results, all_passed
+
+
+def validate_image_text(model, tokenizer, tp_degree):
+    """Run image+text validation."""
+    print(f"\n  --- Image+text validation (TP={tp_degree}) ---")
+    generation_model = HuggingFaceGenerationAdapter(model)
+
+    try:
+        ref_img = load_image(
+            "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp"
+        )
+    except Exception:
+        ref_img = Image.new("RGB", (256, 256), color="blue")
+
+    # Prepare image inputs
+    transform = T.Compose(
+        [
+            T.Resize(
+                (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC
+            ),
+            T.ToTensor(),
+            T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
+        ]
+    )
+    pixel_values = transform(ref_img).unsqueeze(0).to(torch.bfloat16)
+
+    prompt = "Describe this image in detail."
+    messages_with_image = [{"role": "user", "content": f"<image>\n{prompt}"}]
+    text_with_image = tokenizer.apply_chat_template(
+        messages_with_image, tokenize=False, add_generation_prompt=True
+    )
+    full_ids = tokenizer.encode(text_with_image, return_tensors="pt")[0]
+
+    # Find and replace <image> tokens
+    image_text_ids = tokenizer.encode("<image>", add_special_tokens=False)
+    image_text_tensor = torch.tensor(image_text_ids)
+    found_pos = -1
+    for idx in range(len(full_ids) - len(image_text_ids) + 1):
+        if torch.equal(full_ids[idx : idx + len(image_text_ids)], image_text_tensor):
+            found_pos = idx
+            break
+
+    if found_pos >= 0:
+        before = full_ids[:found_pos]
+        after = full_ids[found_pos + len(image_text_ids) :]
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0)
+    else:
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0)
+
+    attention_mask = torch.ones_like(input_ids)
+    vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool)
+
+    sampling_params = prepare_sampling_params(
+        batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0]
+    )
+    gen_config = GenerationConfig(
+        do_sample=False,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=30,
+    )
+
+    t0 = time.time()
+    outputs = generation_model.generate(
+        input_ids,
+        attention_mask=attention_mask,
+        max_length=model.config.neuron_config.max_length,
+        sampling_params=sampling_params,
+        generation_config=gen_config,
+        max_new_tokens=30,
+        pixel_values=pixel_values,
+        vision_mask=vision_mask,
+    )
+    elapsed = time.time() - t0
+
+    generated = outputs[0, input_ids.shape[1] :]
+    gen_text = tokenizer.decode(generated, skip_special_tokens=True)
+    n_tokens = len(generated)
+
+    passed = len(gen_text.strip()) > 0 and n_tokens > 0
+    print(f"    Image+text: {n_tokens} tok in {elapsed:.2f}s | {gen_text[:150]!r}")
+    print(f"    {'PASS' if passed else 'FAIL'}")
+
+    return {
+        "passed": passed,
+        "text": gen_text[:200],
+        "n_tokens": n_tokens,
+        "elapsed": elapsed,
+    }
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--tp", type=int, required=True, choices=[2, 4])
+    parser.add_argument("--force-recompile", action="store_true")
+    args = parser.parse_args()
+
+    tp = args.tp
+    print(f"{'=' * 70}")
+    print(f"TENSOR PARALLELISM TEST: TP={tp}")
+    print(f"{'=' * 70}")
+
+    config, tokenizer, traced_path = create_configs(tp)
+    print(f"  Model path: {MODEL_PATH}")
+    print(f"  Traced path: {traced_path}")
+    print(f"  Text TP={config.neuron_config.tp_degree}")
+    print(f"  Vision TP={config.vision_config.neuron_config.tp_degree}")
+
+    model = compile_and_load(
+        config, tokenizer, traced_path, force_recompile=args.force_recompile
+    )
+
+    text_results, text_passed = validate_text(model, tokenizer, tp)
+    img_result = validate_image_text(model, tokenizer, tp)
+
+    # Summary
+    all_passed = text_passed and img_result["passed"]
+    print(f"\n{'=' * 70}")
+    print(f"TP={tp} SUMMARY")
+    print(f"{'=' * 70}")
+    for r in text_results:
+        print(
+            f'  {"PASS" if r["passed"] else "FAIL"}: "{r["prompt"][:40]}" cosine={r["cosine"]:.6f}'
+        )
+    print(
+        f"  {'PASS' if img_result['passed'] else 'FAIL'}: Image+text ({img_result['n_tokens']} tokens)"
+    )
+
+    if all_passed:
+        print(f"\n  ALL TP={tp} TESTS PASSED")
+    else:
+        print(f"\n  SOME TP={tp} TESTS FAILED")
+        sys.exit(1)
+
+    # Save
+    out_path = os.path.join(REFERENCE_DIR, f"neuron_tp{tp}_validation.json")
+    with open(out_path, "w") as f:
+        json.dump(
+            {"tp_degree": tp, "text_results": text_results, "image_result": img_result},
+            f,
+            indent=2,
+            default=str,
+        )
+    print(f"  Results saved to {out_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_weight_loading.py b/contrib/models/Isaac-0.2-2B/test/integration/test_weight_loading.py
new file mode 100644
index 00000000..0626639b
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/test_weight_loading.py
@@ -0,0 +1,193 @@
+"""Test weight loading: HF -> NxDI state dict conversion for Isaac."""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import torch
+from collections import OrderedDict
+from transformers import AutoConfig, AutoModelForCausalLM
+from neuronx_distributed_inference.models.config import (
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
+from neuronx_distributed.utils import cpu_mode
+from isaac_neuron.modeling_isaac import (
+    IsaacInferenceConfig,
+    NeuronIsaacForConditionalGeneration,
+)
+
+MODEL_PATH = "/mnt/models/Isaac-0.2-2B-Preview"
+
+
+def main():
+    # 1) Load HF model and get state dict
+    print("Loading HF model...")
+    hf_model = AutoModelForCausalLM.from_pretrained(
+        MODEL_PATH, trust_remote_code=True, torch_dtype=torch.bfloat16
+    )
+    hf_state_dict = OrderedDict(hf_model.state_dict())
+    print(f"HF state dict keys: {len(hf_state_dict)}")
+    for k in sorted(hf_state_dict.keys())[:15]:
+        print(f"  {k}: {hf_state_dict[k].shape}")
+    print("  ...")
+    del hf_model
+    torch.cuda.empty_cache() if torch.cuda.is_available() else None
+
+    # 2) Create NxDI config
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    text_nc = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        cp_degree=1,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[1024],
+        token_generation_buckets=[1024],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+    )
+    vision_nc = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        world_size=1,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_nc,
+        vision_neuron_config=vision_nc,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+
+    # 3) Run state dict conversion
+    print("\nRunning convert_hf_to_neuron_state_dict...")
+    neuron_sd = NeuronIsaacForConditionalGeneration.convert_hf_to_neuron_state_dict(
+        hf_state_dict, config
+    )
+    print(f"Neuron state dict keys: {len(neuron_sd)}")
+
+    # 4) Compute expected NxDI parameter names analytically
+    print("\nComputing expected NxDI parameter names...")
+
+    # Text model expected keys (28 decoder layers, Qwen3 architecture)
+    num_text_layers = config.text_config.num_hidden_layers  # 28
+    expected_text = set()
+    expected_text.add("embed_tokens.weight")
+    expected_text.add("lm_head.weight")
+    expected_text.add("norm.weight")
+    for i in range(num_text_layers):
+        pfx = f"layers.{i}"
+        expected_text.add(f"{pfx}.input_layernorm.weight")
+        expected_text.add(f"{pfx}.post_attention_layernorm.weight")
+        expected_text.add(f"{pfx}.mlp.gate_proj.weight")
+        expected_text.add(f"{pfx}.mlp.up_proj.weight")
+        expected_text.add(f"{pfx}.mlp.down_proj.weight")
+        # NxDI attention: qkv_proj.{q,k,v}_proj.weight, o_proj.o_proj.weight
+        expected_text.add(f"{pfx}.self_attn.qkv_proj.q_proj.weight")
+        expected_text.add(f"{pfx}.self_attn.qkv_proj.k_proj.weight")
+        expected_text.add(f"{pfx}.self_attn.qkv_proj.v_proj.weight")
+        expected_text.add(f"{pfx}.self_attn.o_proj.o_proj.weight")
+        expected_text.add(f"{pfx}.self_attn.q_layernorm.weight")
+        expected_text.add(f"{pfx}.self_attn.k_layernorm.weight")
+
+    # Vision encoder expected keys (SigLIP2, 27 layers)
+    num_vision_layers = config.vision_config.num_hidden_layers  # 27
+    expected_vision = set()
+    # SigLIP patch embedding
+    expected_vision.add(
+        "vision_encoder.vision_encoder.vision_model.embeddings.patch_embedding.weight"
+    )
+    expected_vision.add(
+        "vision_encoder.vision_encoder.vision_model.embeddings.patch_embedding.bias"
+    )
+    expected_vision.add(
+        "vision_encoder.vision_encoder.vision_model.embeddings.position_embedding.weight"
+    )
+    # SigLIP encoder layers
+    for i in range(num_vision_layers):
+        vpfx = f"vision_encoder.vision_encoder.vision_model.encoder.layers.{i}"
+        expected_vision.add(f"{vpfx}.layer_norm1.weight")
+        expected_vision.add(f"{vpfx}.layer_norm1.bias")
+        expected_vision.add(f"{vpfx}.layer_norm2.weight")
+        expected_vision.add(f"{vpfx}.layer_norm2.bias")
+        # NxDI vision attention: qkv_proj.{q,k,v}_proj.{weight,bias}, o_proj.o_proj.{weight,bias}
+        expected_vision.add(f"{vpfx}.self_attn.qkv_proj.q_proj.weight")
+        expected_vision.add(f"{vpfx}.self_attn.qkv_proj.q_proj.bias")
+        expected_vision.add(f"{vpfx}.self_attn.qkv_proj.k_proj.weight")
+        expected_vision.add(f"{vpfx}.self_attn.qkv_proj.k_proj.bias")
+        expected_vision.add(f"{vpfx}.self_attn.qkv_proj.v_proj.weight")
+        expected_vision.add(f"{vpfx}.self_attn.qkv_proj.v_proj.bias")
+        expected_vision.add(f"{vpfx}.self_attn.o_proj.o_proj.weight")
+        expected_vision.add(f"{vpfx}.self_attn.o_proj.o_proj.bias")
+        # MLP
+        expected_vision.add(f"{vpfx}.mlp.fc1.weight")
+        expected_vision.add(f"{vpfx}.mlp.fc1.bias")
+        expected_vision.add(f"{vpfx}.mlp.fc2.weight")
+        expected_vision.add(f"{vpfx}.mlp.fc2.bias")
+    # SigLIP post layer norm
+    expected_vision.add(
+        "vision_encoder.vision_encoder.vision_model.post_layernorm.weight"
+    )
+    expected_vision.add(
+        "vision_encoder.vision_encoder.vision_model.post_layernorm.bias"
+    )
+    # MLP projector
+    expected_vision.add("vision_encoder.multi_modal_projector.fc1.weight")
+    expected_vision.add("vision_encoder.multi_modal_projector.fc2.weight")
+
+    expected_keys = expected_text | expected_vision
+    neuron_keys = set(neuron_sd.keys())
+
+    # Filter runtime keys
+    skip_patterns = ("rank_util", "sampler", "lm_head.bias")
+    neuron_filtered = {k for k in neuron_keys if not any(p in k for p in skip_patterns)}
+
+    missing = expected_keys - neuron_filtered
+    unexpected = neuron_filtered - expected_keys
+
+    print(f"\n=== RESULTS ===")
+    print(f"Expected keys: {len(expected_keys)}")
+    print(f"Neuron state dict keys (filtered): {len(neuron_filtered)}")
+    print(f"Missing (in model, not in weights): {len(missing)}")
+    print(f"Unexpected (in weights, not in model): {len(unexpected)}")
+
+    if missing:
+        print("\nMISSING keys:")
+        for k in sorted(missing):
+            print(f"  {k}")
+
+    if unexpected:
+        print("\nUNEXPECTED keys:")
+        for k in sorted(unexpected):
+            print(f"  {k}")
+
+    if not missing and not unexpected:
+        print("\n*** ALL WEIGHTS MATCH PERFECTLY ***")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_image_text.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_image_text.py
new file mode 100644
index 00000000..16d96c4d
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_image_text.py
@@ -0,0 +1,453 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Validate Isaac image+text inference on Neuron.
+
+Tests the full VLM pipeline:
+  pixel_values -> SigLIP2 encoder -> pixel_shuffle -> MLP projector -> text decoder
+
+Since the compiled model uses image_size=256, we use 256x256 images.
+The CPU reference was captured with tensor_stream (different preprocessing),
+so we validate:
+1. E2E generates non-garbage text (qualitative)
+2. Top-1 token is <think> (consistent with model behavior)
+3. Vision encoder produces reasonable embeddings (not NaN/Inf)
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    python validate_image_text.py
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import json  # noqa: E402
+import os  # noqa: E402
+import sys  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+import torchvision.transforms as T  # noqa: E402
+from PIL import Image  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, GenerationConfig  # noqa: E402
+from transformers.image_utils import load_image  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (  # noqa: E402
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (  # noqa: E402
+    prepare_sampling_params,
+)
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B"
+
+# Isaac uses <|image_pad|> = 151655 as placeholder for vision embeddings
+IMAGE_TOKEN_ID = 151655
+IMAGE_SIZE = 256  # Compiled model's vision image_size
+PATCH_SIZE = 16
+PIXEL_SHUFFLE_SCALE = 2
+NUM_VISION_TOKENS = (IMAGE_SIZE // PATCH_SIZE) ** 2 // (PIXEL_SHUFFLE_SCALE**2)  # 64
+
+# SigLIP2 normalization
+IMAGE_MEAN = [0.5, 0.5, 0.5]
+IMAGE_STD = [0.5, 0.5, 0.5]
+
+# Environment
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def create_neuron_configs():
+    """Create text and vision neuron configurations (must match compilation)."""
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[1024],
+        token_generation_buckets=[1024],
+        async_mode=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            temperature=1.0,
+            top_p=1.0,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        fused_qkv=False,
+        sequence_parallel_enabled=False,
+        attn_kernel_enabled=False,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        world_size=1,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    return text_config, vision_config
+
+
+def load_compiled_model():
+    """Load the pre-compiled Isaac model."""
+    text_config, vision_config = create_neuron_configs()
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+
+    # Set image_token_index (Isaac config doesn't have it by default)
+    config.image_token_index = IMAGE_TOKEN_ID
+
+    print(f"Loading compiled model from {TRACED_MODEL_PATH}...")
+    model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config)
+    model.load(TRACED_MODEL_PATH, skip_warmup=True)
+    print("Model loaded successfully.")
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        MODEL_PATH, padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+
+    return model, tokenizer
+
+
+def preprocess_image(image: Image.Image) -> torch.Tensor:
+    """Preprocess image to pixel_values tensor [1, 3, H, W].
+
+    Matches SigLIP2 normalization: rescale to [0,1], normalize with mean/std=0.5.
+    """
+    transform = T.Compose(
+        [
+            T.Resize(
+                (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC
+            ),
+            T.ToTensor(),  # [C, H, W] in [0, 1]
+            T.Normalize(mean=IMAGE_MEAN, std=IMAGE_STD),  # -> [-1, 1]
+        ]
+    )
+    pixel_values = transform(image).unsqueeze(0)  # [1, 3, 256, 256]
+    return pixel_values
+
+
+def prepare_image_text_inputs(prompt: str, image: Image.Image, tokenizer):
+    """Prepare input_ids, attention_mask, pixel_values, and vision_mask.
+
+    Isaac's processor uses -256 as image token placeholder in tensor_stream.
+    For NxDI, we:
+    1. Tokenize with chat template
+    2. Insert IMAGE_TOKEN_ID (151655) for vision token positions
+    3. Create boolean vision_mask
+
+    Returns:
+        input_ids: [1, seq_len] with IMAGE_TOKEN_ID at vision positions
+        attention_mask: [1, seq_len] all ones
+        pixel_values: [1, 3, 256, 256] normalized
+        vision_mask: [1, seq_len, 1] bool
+    """
+    # Build input_ids with image token placeholders
+    # Format: <|im_start|>user\n[64 image tokens]\n{prompt}<|im_end|>\n<|im_start|>assistant\n
+    messages = [{"role": "user", "content": prompt}]
+    text = tokenizer.apply_chat_template(
+        messages, tokenize=False, add_generation_prompt=True
+    )
+
+    # Tokenize the text (without image tokens)
+    # The template produces: <|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n
+    text_ids = tokenizer.encode(text, return_tensors="pt")  # [1, text_len]
+    text_ids = text_ids[0]  # [text_len]
+
+    # Find where to insert image tokens
+    # Isaac inserts image tokens after "user\n" — between the user header and the prompt content
+    # The chat template is: <|im_start|>user\n<image>\n{prompt}<|im_end|>\n<|im_start|>assistant\n
+    # But since we used the prompt directly (without <image>), we need to insert manually
+
+    # Re-create with <image> placeholder in the message
+    messages_with_image = [{"role": "user", "content": f"<image>\n{prompt}"}]
+    text_with_image = tokenizer.apply_chat_template(
+        messages_with_image, tokenize=False, add_generation_prompt=True
+    )
+    # Tokenize fully
+    full_ids = tokenizer.encode(text_with_image, return_tensors="pt")[0]  # [seq_len]
+
+    # Now find where "<image>" tokens are and replace with IMAGE_TOKEN_ID blocks
+    # The tokenizer encodes "<image>" as multiple tokens: [27, 1805, 29] = '<', 'image', '>'
+    # We need to replace those 3 tokens with NUM_VISION_TOKENS copies of IMAGE_TOKEN_ID
+
+    # Find the "<image>" token sequence
+    image_text_ids = tokenizer.encode(
+        "<image>", add_special_tokens=False
+    )  # [27, 1805, 29]
+    image_text_tensor = torch.tensor(image_text_ids)
+
+    # Find position of <image> in full_ids
+    found_pos = -1
+    for i in range(len(full_ids) - len(image_text_ids) + 1):
+        if torch.equal(full_ids[i : i + len(image_text_ids)], image_text_tensor):
+            found_pos = i
+            break
+
+    if found_pos >= 0:
+        # Replace <image> tokens with IMAGE_TOKEN_ID * NUM_VISION_TOKENS
+        before = full_ids[:found_pos]
+        after = full_ids[found_pos + len(image_text_ids) :]
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0)
+    else:
+        # Fallback: prepend image tokens after user header
+        print(
+            "WARNING: Could not find <image> in tokenized text, prepending image tokens"
+        )
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        # Insert after position 2 (after <|im_start|>user\n)
+        input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0)
+
+    attention_mask = torch.ones_like(input_ids)
+    pixel_values = preprocess_image(image)
+    vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool)
+
+    return input_ids, attention_mask, pixel_values, vision_mask
+
+
+def run_validation():
+    """Run image+text validation."""
+    model, tokenizer = load_compiled_model()
+    generation_model = HuggingFaceGenerationAdapter(model)
+
+    print(f"\n{'=' * 70}")
+    print("IMAGE+TEXT INFERENCE VALIDATION ON NEURON")
+    print(f"{'=' * 70}")
+    print(f"  Image size: {IMAGE_SIZE}x{IMAGE_SIZE}")
+    print(f"  Vision tokens: {NUM_VISION_TOKENS}")
+    print(f"  Image token ID: {IMAGE_TOKEN_ID}")
+
+    # Test images
+    test_cases = []
+
+    # Test 1: Solid color image (sanity check)
+    img_red = Image.new("RGB", (256, 256), color="red")
+    test_cases.append(("Describe this image in detail.", img_red, "red_square"))
+
+    # Test 2: Reference image (resized to 256x256)
+    try:
+        img_ref = load_image(
+            "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp"
+        )
+        test_cases.append(
+            ("Describe this image in detail.", img_ref, "reference_image")
+        )
+        test_cases.append(
+            ("What text or signs do you see in this image?", img_ref, "reference_ocr")
+        )
+    except Exception as e:
+        print(f"  WARNING: Could not load reference image: {e}")
+
+    results = []
+    all_passed = True
+
+    for i, (prompt, image, label) in enumerate(test_cases):
+        print(f'\n--- Test {i}: [{label}] "{prompt}" ---')
+        print(f"  Image: {image.size} -> will be resized to {IMAGE_SIZE}x{IMAGE_SIZE}")
+
+        try:
+            input_ids, attention_mask, pixel_values, vision_mask = (
+                prepare_image_text_inputs(prompt, image, tokenizer)
+            )
+        except Exception as e:
+            print(f"  ERROR in input preparation: {e}")
+            import traceback
+
+            traceback.print_exc()
+            all_passed = False
+            continue
+
+        seq_len = input_ids.shape[1]
+        n_image_tokens = vision_mask.sum().item()
+        print(f"  input_ids: {input_ids.shape}, seq_len={seq_len}")
+        print(f"  pixel_values: {pixel_values.shape}, dtype={pixel_values.dtype}")
+        print(f"  vision_mask: {n_image_tokens} image tokens")
+        print(
+            f"  pixel_values range: [{pixel_values.min():.4f}, {pixel_values.max():.4f}]"
+        )
+
+        # Verify seq_len fits in bucket
+        if seq_len > 1024:
+            print(f"  SKIP: seq_len {seq_len} > max bucket 1024")
+            continue
+
+        sampling_params = prepare_sampling_params(
+            batch_size=1,
+            top_k=[1],
+            top_p=[1.0],
+            temperature=[1.0],
+        )
+
+        generation_config = GenerationConfig(
+            do_sample=False,
+            output_scores=True,
+            return_dict_in_generate=True,
+            pad_token_id=tokenizer.eos_token_id,
+            max_new_tokens=30,  # Generate enough to see meaningful output
+        )
+
+        try:
+            outputs = generation_model.generate(
+                input_ids,
+                attention_mask=attention_mask,
+                max_length=model.config.neuron_config.max_length,
+                sampling_params=sampling_params,
+                generation_config=generation_config,
+                max_new_tokens=30,
+                pixel_values=pixel_values.to(torch.bfloat16),
+                vision_mask=vision_mask,
+            )
+        except Exception as e:
+            print(f"  ERROR in generate: {e}")
+            import traceback
+
+            traceback.print_exc()
+            all_passed = False
+            results.append({"label": label, "passed": False, "error": str(e)})
+            continue
+
+        # Extract generated tokens
+        if hasattr(outputs, "sequences"):
+            generated = outputs.sequences[0, input_ids.shape[1] :]
+            gen_text = tokenizer.decode(generated, skip_special_tokens=True)
+        else:
+            generated = outputs[0, input_ids.shape[1] :]
+            gen_text = tokenizer.decode(generated, skip_special_tokens=True)
+
+        print(f"  Generated: {gen_text[:200]!r}")
+
+        # Extract first-token logits
+        first_logits = None
+        if (
+            hasattr(outputs, "scores")
+            and outputs.scores is not None
+            and len(outputs.scores) > 0
+        ):
+            first_logits = outputs.scores[0][0].float().cpu()
+            top5 = torch.topk(first_logits, 5)
+            top5_tokens = [tokenizer.decode([tid]) for tid in top5.indices.tolist()]
+            print(f"  Top-5 tokens: {list(zip(top5_tokens, top5.values.tolist()))}")
+            top1 = first_logits.argmax().item()
+            print(f"  Top-1: {top1} ({tokenizer.decode([top1])!r})")
+
+        # Validation checks
+        passed = True
+        failures = []
+
+        # Check 1: Generated text is not empty
+        if len(gen_text.strip()) == 0:
+            passed = False
+            failures.append("Empty generated text")
+
+        # Check 2: No NaN in logits
+        if first_logits is not None and torch.isnan(first_logits).any():
+            passed = False
+            failures.append("NaN in logits")
+
+        # Check 3: No Inf in logits
+        if first_logits is not None and torch.isinf(first_logits).any():
+            passed = False
+            failures.append("Inf in logits")
+
+        # Check 4: Top-1 should be <think> (consistent with model behavior)
+        if first_logits is not None:
+            top1 = first_logits.argmax().item()
+            if top1 != 151667:
+                # Not necessarily a failure for image inputs
+                print(
+                    f"  NOTE: Top-1 is {top1}, not <think> (151667) — may be normal for image input"
+                )
+
+        result = {
+            "label": label,
+            "prompt": prompt,
+            "passed": passed,
+            "generated_text": gen_text[:200],
+            "top1": first_logits.argmax().item() if first_logits is not None else None,
+            "failures": failures,
+        }
+        results.append(result)
+        if not passed:
+            all_passed = False
+
+        status = "PASS" if passed else "FAIL"
+        print(f"  [{status}]")
+        for f in failures:
+            print(f"    FAILURE: {f}")
+
+    # Summary
+    print(f"\n{'=' * 70}")
+    print("SUMMARY")
+    print(f"{'=' * 70}")
+    passed_count = sum(1 for r in results if r["passed"])
+    total = len(results)
+    print(f"  Passed: {passed_count}/{total}")
+
+    if all_passed:
+        print("\n  ALL IMAGE+TEXT TESTS PASSED")
+    else:
+        print("\n  SOME TESTS FAILED — see details above")
+        sys.exit(1)
+
+    # Save results
+    out_path = os.path.join(REFERENCE_DIR, "neuron_image_text_validation.json")
+    with open(out_path, "w") as f:
+        json.dump(results, f, indent=2, default=str)
+    print(f"\n  Results saved to {out_path}")
+
+
+if __name__ == "__main__":
+    run_validation()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_text_logits.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_text_logits.py
new file mode 100644
index 00000000..24451bac
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_text_logits.py
@@ -0,0 +1,369 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Validate Isaac text-only logits on Neuron against CPU reference.
+
+Loads the compiled Isaac model, runs all 5 text reference prompts,
+and compares first-token logit distributions against saved CPU reference .pt files.
+
+Metrics:
+- Top-1 token match
+- Top-5 / Top-10 overlap
+- Cosine similarity of full logit vectors
+- Max absolute error
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    python validate_text_logits.py
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import json  # noqa: E402
+import os  # noqa: E402
+import sys  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, GenerationConfig  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (  # noqa: E402
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (  # noqa: E402
+    prepare_sampling_params,
+)
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+# Configuration
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B"
+
+# Same prompts as capture_reference.py
+TEXT_PROMPTS = [
+    "The capital of France is",
+    "def fibonacci(n):",
+    "Explain quantum entanglement in simple terms:",
+    "The meaning of life is",
+    "List three primary colors:",
+]
+
+# Thresholds
+COSINE_SIM_THRESHOLD = 0.99  # BF16 quantization on Neuron vs FP32 CPU
+TOP1_MUST_MATCH = True
+TOP5_MIN_OVERLAP = 3  # At least 3 of 5 should match
+TOP10_MIN_OVERLAP = 5  # At least 5 of 10 should match
+
+# Environment
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def create_neuron_configs():
+    """Create text and vision neuron configurations (must match compilation)."""
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[1024],
+        token_generation_buckets=[1024],
+        async_mode=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            temperature=1.0,
+            top_p=1.0,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        fused_qkv=False,
+        sequence_parallel_enabled=False,
+        attn_kernel_enabled=False,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        world_size=1,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    return text_config, vision_config
+
+
+def load_compiled_model():
+    """Load the pre-compiled Isaac model from traced checkpoint."""
+    text_config, vision_config = create_neuron_configs()
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+
+    print(f"Loading compiled model from {TRACED_MODEL_PATH}...")
+    model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config)
+    model.load(TRACED_MODEL_PATH, skip_warmup=True)
+    print("Model loaded successfully.")
+
+    tokenizer = AutoTokenizer.from_pretrained(
+        MODEL_PATH, padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+
+    return model, tokenizer
+
+
+def compare_logits(neuron_logits, ref_logits, prompt_name):
+    """Compare Neuron vs CPU reference logit vectors.
+
+    Args:
+        neuron_logits: [vocab_size] float tensor from Neuron
+        ref_logits: [vocab_size] float tensor from CPU reference
+        prompt_name: string for logging
+
+    Returns:
+        dict with all comparison metrics, and bool pass/fail
+    """
+    neuron_f = neuron_logits.float()
+    ref_f = ref_logits.float()
+
+    # Top-1 match
+    neuron_top1 = neuron_f.argmax().item()
+    ref_top1 = ref_f.argmax().item()
+    top1_match = neuron_top1 == ref_top1
+
+    # Top-5 overlap
+    neuron_top5 = set(torch.topk(neuron_f, 5).indices.tolist())
+    ref_top5 = set(torch.topk(ref_f, 5).indices.tolist())
+    top5_overlap = len(neuron_top5 & ref_top5)
+
+    # Top-10 overlap
+    neuron_top10 = set(torch.topk(neuron_f, 10).indices.tolist())
+    ref_top10 = set(torch.topk(ref_f, 10).indices.tolist())
+    top10_overlap = len(neuron_top10 & ref_top10)
+
+    # Cosine similarity
+    cosine_sim = F.cosine_similarity(neuron_f.unsqueeze(0), ref_f.unsqueeze(0)).item()
+
+    # Max absolute error
+    max_abs_err = (neuron_f - ref_f).abs().max().item()
+
+    # Mean absolute error
+    mean_abs_err = (neuron_f - ref_f).abs().mean().item()
+
+    # Pass/fail
+    passed = True
+    failures = []
+    if TOP1_MUST_MATCH and not top1_match:
+        passed = False
+        failures.append(f"Top-1 mismatch: Neuron={neuron_top1}, CPU={ref_top1}")
+    if top5_overlap < TOP5_MIN_OVERLAP:
+        passed = False
+        failures.append(f"Top-5 overlap {top5_overlap} < {TOP5_MIN_OVERLAP}")
+    if top10_overlap < TOP10_MIN_OVERLAP:
+        passed = False
+        failures.append(f"Top-10 overlap {top10_overlap} < {TOP10_MIN_OVERLAP}")
+    if cosine_sim < COSINE_SIM_THRESHOLD:
+        passed = False
+        failures.append(f"Cosine sim {cosine_sim:.6f} < {COSINE_SIM_THRESHOLD}")
+
+    result = {
+        "prompt": prompt_name,
+        "passed": passed,
+        "top1_match": top1_match,
+        "neuron_top1": neuron_top1,
+        "ref_top1": ref_top1,
+        "top5_overlap": top5_overlap,
+        "top10_overlap": top10_overlap,
+        "cosine_sim": cosine_sim,
+        "max_abs_err": max_abs_err,
+        "mean_abs_err": mean_abs_err,
+        "failures": failures,
+        "neuron_top10_ids": sorted(neuron_top10),
+        "ref_top10_ids": sorted(ref_top10),
+    }
+
+    return result, passed
+
+
+def run_validation():
+    """Main validation loop."""
+    model, tokenizer = load_compiled_model()
+    generation_model = HuggingFaceGenerationAdapter(model)
+
+    # Load reference results metadata
+    with open(os.path.join(REFERENCE_DIR, "reference_results.json")) as f:
+        ref_metadata = json.load(f)
+
+    print(f"\n{'=' * 70}")
+    print("TEXT-ONLY LOGIT VALIDATION: Neuron vs CPU Reference")
+    print(f"{'=' * 70}")
+    print(f"  Reference dir: {REFERENCE_DIR}")
+    print(
+        f"  Thresholds: cosine>{COSINE_SIM_THRESHOLD}, top1_must_match={TOP1_MUST_MATCH}"
+    )
+    print(f"  Prompts: {len(TEXT_PROMPTS)}")
+
+    results = []
+    all_passed = True
+
+    for i, prompt in enumerate(TEXT_PROMPTS):
+        print(f'\n--- Prompt {i}: "{prompt}" ---')
+
+        # Load CPU reference logits
+        ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt")
+        if not os.path.exists(ref_path):
+            print(f"  SKIP: Reference file not found: {ref_path}")
+            continue
+        ref_logits = torch.load(ref_path, map_location="cpu")  # [151936] float32
+        print(
+            f"  CPU ref: top-1={ref_logits.argmax().item()}, shape={ref_logits.shape}"
+        )
+
+        # Tokenize with chat template (matching capture_reference.py)
+        messages = [{"role": "user", "content": prompt}]
+        input_ids = tokenizer.apply_chat_template(
+            messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+        )
+        attention_mask = torch.ones_like(input_ids)
+        seq_len = input_ids.shape[1]
+        print(f"  Input seq_len: {seq_len}")
+
+        # Generate with logit collection
+        # We only need 1 new token to get the first-token logits (CTE pass)
+        sampling_params = prepare_sampling_params(
+            batch_size=1,
+            top_k=[1],
+            top_p=[1.0],
+            temperature=[1.0],  # temperature=1.0 so scores == raw logits
+        )
+
+        generation_config = GenerationConfig(
+            do_sample=False,
+            output_scores=True,
+            return_dict_in_generate=True,
+            pad_token_id=tokenizer.eos_token_id,
+            max_new_tokens=1,  # Only need first token
+        )
+
+        outputs = generation_model.generate(
+            input_ids,
+            attention_mask=attention_mask,
+            max_length=model.config.neuron_config.max_length,
+            sampling_params=sampling_params,
+            generation_config=generation_config,
+            max_new_tokens=1,
+        )
+
+        # Extract first-token logits from scores
+        # outputs.scores is a tuple of tensors, one per generated token
+        # outputs.scores[0] shape: [batch_size, vocab_size]
+        if (
+            hasattr(outputs, "scores")
+            and outputs.scores is not None
+            and len(outputs.scores) > 0
+        ):
+            neuron_logits = outputs.scores[0][0].float().cpu()  # [vocab_size]
+            print(
+                f"  Neuron: top-1={neuron_logits.argmax().item()}, shape={neuron_logits.shape}"
+            )
+        else:
+            print(
+                "  ERROR: No scores in output. Check output_logits=True in NeuronConfig."
+            )
+            print(f"  Output type: {type(outputs)}")
+            if hasattr(outputs, "__dict__"):
+                print(f"  Output attrs: {list(outputs.__dict__.keys())}")
+            all_passed = False
+            continue
+
+        # Compare
+        result, passed = compare_logits(neuron_logits, ref_logits, prompt)
+        results.append(result)
+        if not passed:
+            all_passed = False
+
+        # Print result
+        status = "PASS" if passed else "FAIL"
+        print(
+            f"  [{status}] cosine={result['cosine_sim']:.6f}, "
+            f"top1={'match' if result['top1_match'] else 'MISMATCH'}, "
+            f"top5={result['top5_overlap']}/5, top10={result['top10_overlap']}/10, "
+            f"max_abs_err={result['max_abs_err']:.4f}"
+        )
+        if not passed:
+            for f in result["failures"]:
+                print(f"    FAILURE: {f}")
+
+    # Summary
+    print(f"\n{'=' * 70}")
+    print("SUMMARY")
+    print(f"{'=' * 70}")
+    passed_count = sum(1 for r in results if r["passed"])
+    total = len(results)
+    print(f"  Passed: {passed_count}/{total}")
+
+    if results:
+        avg_cosine = sum(r["cosine_sim"] for r in results) / len(results)
+        avg_top5 = sum(r["top5_overlap"] for r in results) / len(results)
+        avg_top10 = sum(r["top10_overlap"] for r in results) / len(results)
+        print(f"  Avg cosine sim: {avg_cosine:.6f}")
+        print(f"  Avg top-5 overlap: {avg_top5:.1f}/5")
+        print(f"  Avg top-10 overlap: {avg_top10:.1f}/10")
+
+    if all_passed:
+        print("\n  ALL TEXT PROMPTS PASSED")
+    else:
+        print("\n  SOME PROMPTS FAILED — see details above")
+        sys.exit(1)
+
+    # Save results
+    out_path = os.path.join(REFERENCE_DIR, "neuron_text_validation.json")
+    with open(out_path, "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"\n  Results saved to {out_path}")
+
+
+if __name__ == "__main__":
+    run_validation()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_tkg.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_tkg.py
new file mode 100644
index 00000000..20ed9469
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_tkg.py
@@ -0,0 +1,710 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Validate Isaac TKG (token generation) on Neuron.
+
+Tests the full CTE+TKG generation loop with:
+1. Multi-token text-only generation (50+ tokens, 5 prompts)
+2. Multi-token image+text generation
+3. Per-step logit extraction at max_new_tokens=32
+4. Edge cases: state reset, consecutive generates, vision clearing
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    python validate_tkg.py
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import json  # noqa: E402
+import os  # noqa: E402
+import sys  # noqa: E402
+import time  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+import torchvision.transforms as T  # noqa: E402
+from PIL import Image  # noqa: E402
+from transformers import AutoConfig, AutoTokenizer, GenerationConfig  # noqa: E402
+from transformers.image_utils import load_image  # noqa: E402
+
+from neuronx_distributed_inference.models.config import (  # noqa: E402
+    NeuronConfig,
+    OnDeviceSamplingConfig,
+)
+from neuronx_distributed_inference.utils.hf_adapter import (  # noqa: E402
+    load_pretrained_config,
+    HuggingFaceGenerationAdapter,
+)
+from neuronx_distributed_inference.modules.generation.sampling import (  # noqa: E402
+    prepare_sampling_params,
+)
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B"
+
+IMAGE_TOKEN_ID = 151655  # <|image_pad|>
+IMAGE_SIZE = 256
+IMAGE_MEAN = [0.5, 0.5, 0.5]
+IMAGE_STD = [0.5, 0.5, 0.5]
+NUM_VISION_TOKENS = (IMAGE_SIZE // 16) ** 2 // 4  # 64
+
+TEXT_PROMPTS = [
+    "The capital of France is",
+    "def fibonacci(n):",
+    "Explain quantum entanglement in simple terms:",
+    "The meaning of life is",
+    "List three primary colors:",
+]
+
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def create_neuron_configs():
+    """Create text and vision neuron configurations (must match compilation)."""
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[1024],
+        token_generation_buckets=[1024],
+        async_mode=False,
+        on_device_sampling_config=OnDeviceSamplingConfig(
+            dynamic=True,
+            do_sample=True,
+            deterministic=True,
+            temperature=1.0,
+            top_p=1.0,
+            top_k=1,
+            global_topk=256,
+            top_k_kernel_enabled=True,
+        ),
+        output_logits=True,
+        fused_qkv=False,
+        sequence_parallel_enabled=False,
+        attn_kernel_enabled=False,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        world_size=1,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+    return text_config, vision_config
+
+
+def load_compiled_model():
+    text_config, vision_config = create_neuron_configs()
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+    config.image_token_index = IMAGE_TOKEN_ID
+    model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config)
+    model.load(TRACED_MODEL_PATH, skip_warmup=True)
+    tokenizer = AutoTokenizer.from_pretrained(
+        MODEL_PATH, padding_side="right", trust_remote_code=True
+    )
+    tokenizer.pad_token = tokenizer.eos_token
+    return model, tokenizer
+
+
+def preprocess_image(image: Image.Image) -> torch.Tensor:
+    transform = T.Compose(
+        [
+            T.Resize(
+                (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC
+            ),
+            T.ToTensor(),
+            T.Normalize(mean=IMAGE_MEAN, std=IMAGE_STD),
+        ]
+    )
+    return transform(image).unsqueeze(0)
+
+
+def prepare_image_text_inputs(prompt, image, tokenizer):
+    """Prepare input_ids with image token placeholders."""
+    messages_with_image = [{"role": "user", "content": f"<image>\n{prompt}"}]
+    text_with_image = tokenizer.apply_chat_template(
+        messages_with_image, tokenize=False, add_generation_prompt=True
+    )
+    full_ids = tokenizer.encode(text_with_image, return_tensors="pt")[0]
+
+    # Find <image> tokens and replace with IMAGE_TOKEN_ID placeholders
+    image_text_ids = tokenizer.encode("<image>", add_special_tokens=False)
+    image_text_tensor = torch.tensor(image_text_ids)
+
+    found_pos = -1
+    for i in range(len(full_ids) - len(image_text_ids) + 1):
+        if torch.equal(full_ids[i : i + len(image_text_ids)], image_text_tensor):
+            found_pos = i
+            break
+
+    if found_pos >= 0:
+        before = full_ids[:found_pos]
+        after = full_ids[found_pos + len(image_text_ids) :]
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0)
+    else:
+        image_tokens = torch.full(
+            (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long
+        )
+        input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0)
+
+    attention_mask = torch.ones_like(input_ids)
+    pixel_values = preprocess_image(image).to(torch.bfloat16)
+    vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool)
+    return input_ids, attention_mask, pixel_values, vision_mask
+
+
+def generate_text(
+    model,
+    tokenizer,
+    prompt,
+    max_new_tokens=50,
+    collect_logits=False,
+    pixel_values=None,
+    vision_mask=None,
+):
+    """Run generation and optionally collect per-step logits."""
+    generation_model = HuggingFaceGenerationAdapter(model)
+
+    messages = [{"role": "user", "content": prompt}]
+    input_ids = tokenizer.apply_chat_template(
+        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
+    )
+    attention_mask = torch.ones_like(input_ids)
+
+    sampling_params = prepare_sampling_params(
+        batch_size=1,
+        top_k=[1],
+        top_p=[1.0],
+        temperature=[1.0],
+    )
+
+    gen_config = GenerationConfig(
+        do_sample=False,
+        output_scores=collect_logits,
+        return_dict_in_generate=collect_logits,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=max_new_tokens,
+    )
+
+    kwargs = dict(
+        attention_mask=attention_mask,
+        max_length=model.config.neuron_config.max_length,
+        sampling_params=sampling_params,
+        generation_config=gen_config,
+        max_new_tokens=max_new_tokens,
+    )
+    if pixel_values is not None:
+        kwargs["pixel_values"] = pixel_values
+    if vision_mask is not None:
+        kwargs["vision_mask"] = vision_mask
+
+    start = time.time()
+    outputs = generation_model.generate(input_ids, **kwargs)
+    elapsed = time.time() - start
+
+    if collect_logits and hasattr(outputs, "sequences"):
+        generated_ids = outputs.sequences[0, input_ids.shape[1] :]
+        scores = outputs.scores if outputs.scores else []
+    else:
+        if hasattr(outputs, "sequences"):
+            generated_ids = outputs.sequences[0, input_ids.shape[1] :]
+        else:
+            generated_ids = outputs[0, input_ids.shape[1] :]
+        scores = []
+
+    gen_text = tokenizer.decode(generated_ids, skip_special_tokens=False)
+    gen_text_clean = tokenizer.decode(generated_ids, skip_special_tokens=True)
+
+    return {
+        "input_ids": input_ids,
+        "generated_ids": generated_ids,
+        "text_raw": gen_text,
+        "text_clean": gen_text_clean,
+        "scores": scores,
+        "elapsed": elapsed,
+        "num_tokens": len(generated_ids),
+        "tokens_per_sec": len(generated_ids) / elapsed if elapsed > 0 else 0,
+    }
+
+
+# ===========================================================================
+# Test functions
+# ===========================================================================
+
+
+def test_multi_token_text(model, tokenizer):
+    """Test 1: Multi-token text-only generation for all 5 prompts."""
+    print(f"\n{'=' * 70}")
+    print("TEST 1: Multi-token text-only generation (50 tokens)")
+    print(f"{'=' * 70}")
+
+    results = []
+    all_passed = True
+
+    for i, prompt in enumerate(TEXT_PROMPTS):
+        print(f'\n--- Prompt {i}: "{prompt}" ---')
+        result = generate_text(model, tokenizer, prompt, max_new_tokens=50)
+
+        # Validation
+        passed = True
+        failures = []
+
+        # Non-empty
+        if len(result["text_clean"].strip()) == 0:
+            passed = False
+            failures.append("Empty output")
+
+        # Generated expected number of tokens (or hit EOS)
+        if result["num_tokens"] == 0:
+            passed = False
+            failures.append("Zero tokens generated")
+
+        # Should start with <think> (Isaac thinking model)
+        first_token = (
+            result["generated_ids"][0].item() if result["num_tokens"] > 0 else -1
+        )
+        if first_token != 151667:
+            failures.append(
+                f"First token {first_token} != <think> (151667) — may be normal"
+            )
+
+        # Check for repetition (degenerate TKG)
+        if result["num_tokens"] >= 10:
+            last_10 = result["generated_ids"][-10:].tolist()
+            if len(set(last_10)) <= 2:
+                passed = False
+                failures.append(f"Degenerate repetition in last 10 tokens: {last_10}")
+
+        result["passed"] = passed
+        result["failures"] = failures
+        results.append(result)
+        if not passed:
+            all_passed = False
+
+        status = "PASS" if passed else "FAIL"
+        print(
+            f"  [{status}] {result['num_tokens']} tokens in {result['elapsed']:.2f}s ({result['tokens_per_sec']:.1f} tok/s)"
+        )
+        print(f"  Output: {result['text_clean'][:200]!r}")
+        for f in failures:
+            print(f"    NOTE: {f}")
+
+    return results, all_passed
+
+
+def test_logit_collection(model, tokenizer):
+    """Test 2: Collect per-step logits at max_new_tokens=32."""
+    print(f"\n{'=' * 70}")
+    print("TEST 2: Per-step logit collection (32 tokens)")
+    print(f"{'=' * 70}")
+
+    results = []
+    all_passed = True
+
+    for i, prompt in enumerate(TEXT_PROMPTS[:3]):  # First 3 prompts
+        print(f'\n--- Prompt {i}: "{prompt}" ---')
+        result = generate_text(
+            model, tokenizer, prompt, max_new_tokens=32, collect_logits=True
+        )
+
+        passed = True
+        failures = []
+
+        # Check we got scores
+        n_scores = len(result["scores"])
+        print(
+            f"  Generated {result['num_tokens']} tokens, collected {n_scores} score tensors"
+        )
+
+        if n_scores == 0:
+            passed = False
+            failures.append("No scores collected (output_logits may not be working)")
+        else:
+            # Check each score tensor
+            for step_idx, score in enumerate(result["scores"]):
+                s = score[0].float()  # [vocab_size]
+                if torch.isnan(s).any():
+                    passed = False
+                    failures.append(f"NaN at step {step_idx}")
+                    break
+                if torch.isinf(s).any():
+                    passed = False
+                    failures.append(f"Inf at step {step_idx}")
+                    break
+
+            # Compare first-token logits against saved reference
+            ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt")
+            if os.path.exists(ref_path) and n_scores > 0:
+                ref_logits = torch.load(ref_path, map_location="cpu")
+                neuron_first = result["scores"][0][0].float().cpu()
+                cosine = F.cosine_similarity(
+                    neuron_first.unsqueeze(0), ref_logits.unsqueeze(0)
+                ).item()
+                print(f"  First-token cosine vs CPU ref: {cosine:.6f}")
+                if cosine < 0.99:
+                    passed = False
+                    failures.append(f"First-token cosine {cosine:.6f} < 0.99")
+
+            # Check that later tokens also have reasonable logits
+            if n_scores >= 5:
+                for step in [0, n_scores // 2, n_scores - 1]:
+                    s = result["scores"][step][0].float()
+                    top1 = s.argmax().item()
+                    top1_val = s.max().item()
+                    print(
+                        f"  Step {step}: top-1={top1} ({tokenizer.decode([top1])!r}), logit={top1_val:.2f}"
+                    )
+
+        result["passed"] = passed
+        result["failures"] = failures
+        result["n_scores"] = n_scores
+        results.append(result)
+        if not passed:
+            all_passed = False
+
+        status = "PASS" if passed else "FAIL"
+        print(f"  [{status}]")
+        for f in failures:
+            print(f"    FAILURE: {f}")
+
+    return results, all_passed
+
+
+def test_state_reset(model, tokenizer):
+    """Test 3: Verify state resets between consecutive generate() calls."""
+    print(f"\n{'=' * 70}")
+    print("TEST 3: State reset between consecutive generates")
+    print(f"{'=' * 70}")
+
+    passed = True
+    failures = []
+
+    # Run same prompt twice — should get identical output
+    print("\n  Running same prompt twice...")
+    r1 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20)
+    r2 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20)
+
+    ids1 = r1["generated_ids"].tolist()
+    ids2 = r2["generated_ids"].tolist()
+    match = ids1 == ids2
+    print(f"  Run 1: {r1['text_clean'][:100]!r}")
+    print(f"  Run 2: {r2['text_clean'][:100]!r}")
+    print(f"  Token sequences match: {match}")
+    if not match:
+        # Check how many match
+        min_len = min(len(ids1), len(ids2))
+        matching = sum(1 for a, b in zip(ids1[:min_len], ids2[:min_len]) if a == b)
+        print(f"  Matching: {matching}/{min_len} tokens")
+        if matching < min_len * 0.9:
+            failures.append(
+                f"Same prompt gave different outputs: {matching}/{min_len} match"
+            )
+            passed = False
+
+    # Run different prompts — verify no cross-contamination
+    print("\n  Running different prompts...")
+    r3 = generate_text(model, tokenizer, "def fibonacci(n):", max_new_tokens=20)
+    r4 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20)
+
+    ids4 = r4["generated_ids"].tolist()
+    match_after = ids4 == ids2
+    print(f"  After different prompt, re-running 'France': {r4['text_clean'][:100]!r}")
+    print(f"  Matches original: {match_after}")
+    if not match_after:
+        min_len = min(len(ids4), len(ids2))
+        matching = sum(1 for a, b in zip(ids4[:min_len], ids2[:min_len]) if a == b)
+        if matching < min_len * 0.9:
+            failures.append(
+                f"State contamination: re-run after different prompt gives different output ({matching}/{min_len})"
+            )
+            passed = False
+
+    status = "PASS" if passed else "FAIL"
+    print(f"\n  [{status}]")
+    for f in failures:
+        print(f"    FAILURE: {f}")
+
+    return {"passed": passed, "failures": failures}
+
+
+def test_image_text_generation(model, tokenizer):
+    """Test 4: Multi-token image+text generation."""
+    print(f"\n{'=' * 70}")
+    print("TEST 4: Image+text multi-token generation")
+    print(f"{'=' * 70}")
+
+    passed = True
+    failures = []
+
+    try:
+        ref_img = load_image(
+            "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp"
+        )
+    except Exception as e:
+        print(f"  WARNING: Could not load reference image: {e}")
+        ref_img = Image.new("RGB", (256, 256), color="blue")
+
+    prompt = "Describe this image in detail."
+    input_ids, attention_mask, pixel_values, vision_mask = prepare_image_text_inputs(
+        prompt, ref_img, tokenizer
+    )
+
+    print(f"  Input: {input_ids.shape}, vision tokens: {vision_mask.sum().item()}")
+
+    generation_model = HuggingFaceGenerationAdapter(model)
+    sampling_params = prepare_sampling_params(
+        batch_size=1,
+        top_k=[1],
+        top_p=[1.0],
+        temperature=[1.0],
+    )
+    gen_config = GenerationConfig(
+        do_sample=False,
+        output_scores=False,
+        return_dict_in_generate=False,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=50,
+    )
+
+    start = time.time()
+    outputs = generation_model.generate(
+        input_ids,
+        attention_mask=attention_mask,
+        max_length=model.config.neuron_config.max_length,
+        sampling_params=sampling_params,
+        generation_config=gen_config,
+        max_new_tokens=50,
+        pixel_values=pixel_values,
+        vision_mask=vision_mask,
+    )
+    elapsed = time.time() - start
+
+    generated_ids = outputs[0, input_ids.shape[1] :]
+    gen_text = tokenizer.decode(generated_ids, skip_special_tokens=True)
+    n_tokens = len(generated_ids)
+
+    print(
+        f"  Generated {n_tokens} tokens in {elapsed:.2f}s ({n_tokens / elapsed:.1f} tok/s)"
+    )
+    print(f"  Output: {gen_text[:300]!r}")
+
+    if len(gen_text.strip()) == 0:
+        passed = False
+        failures.append("Empty image+text output")
+
+    if n_tokens == 0:
+        passed = False
+        failures.append("Zero tokens generated")
+
+    # Check for degenerate repetition
+    if n_tokens >= 10:
+        last_10 = generated_ids[-10:].tolist()
+        if len(set(last_10)) <= 2:
+            passed = False
+            failures.append(f"Degenerate repetition: {last_10}")
+
+    status = "PASS" if passed else "FAIL"
+    print(f"  [{status}]")
+    for f in failures:
+        print(f"    FAILURE: {f}")
+
+    return {
+        "passed": passed,
+        "failures": failures,
+        "text": gen_text[:300],
+        "n_tokens": n_tokens,
+    }
+
+
+def test_vision_state_reset(model, tokenizer):
+    """Test 5: Vision state resets between image and text-only prompts."""
+    print(f"\n{'=' * 70}")
+    print("TEST 5: Vision state reset (image -> text -> image)")
+    print(f"{'=' * 70}")
+
+    passed = True
+    failures = []
+
+    # 1. Run text-only
+    r1 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20)
+    print(f"  Text-only: {r1['text_clean'][:100]!r}")
+
+    # 2. Run image+text
+    img = Image.new("RGB", (256, 256), color="red")
+    input_ids, attention_mask, pixel_values, vision_mask = prepare_image_text_inputs(
+        "Describe this image.", img, tokenizer
+    )
+
+    generation_model = HuggingFaceGenerationAdapter(model)
+    sampling_params = prepare_sampling_params(
+        batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0]
+    )
+    gen_config = GenerationConfig(
+        do_sample=False,
+        pad_token_id=tokenizer.eos_token_id,
+        max_new_tokens=20,
+    )
+    outputs = generation_model.generate(
+        input_ids,
+        attention_mask=attention_mask,
+        max_length=model.config.neuron_config.max_length,
+        sampling_params=sampling_params,
+        generation_config=gen_config,
+        max_new_tokens=20,
+        pixel_values=pixel_values,
+        vision_mask=vision_mask,
+    )
+    img_text = tokenizer.decode(
+        outputs[0, input_ids.shape[1] :], skip_special_tokens=True
+    )
+    print(f"  Image+text: {img_text[:100]!r}")
+
+    # 3. Run text-only again — should match run 1
+    r3 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20)
+    print(f"  Text-only (after image): {r3['text_clean'][:100]!r}")
+
+    ids1 = r1["generated_ids"].tolist()
+    ids3 = r3["generated_ids"].tolist()
+    match = ids1 == ids3
+    print(f"  Text outputs match (pre/post image): {match}")
+
+    if not match:
+        min_len = min(len(ids1), len(ids3))
+        matching = sum(1 for a, b in zip(ids1[:min_len], ids3[:min_len]) if a == b)
+        if matching < min_len * 0.9:
+            passed = False
+            failures.append(
+                f"Vision state leaked: text output changed after image prompt ({matching}/{min_len})"
+            )
+
+    status = "PASS" if passed else "FAIL"
+    print(f"  [{status}]")
+    for f in failures:
+        print(f"    FAILURE: {f}")
+
+    return {"passed": passed, "failures": failures}
+
+
+# ===========================================================================
+# Main
+# ===========================================================================
+
+
+def main():
+    print(f"{'=' * 70}")
+    print("TKG VALIDATION: Isaac on Neuron")
+    print(f"{'=' * 70}")
+
+    model, tokenizer = load_compiled_model()
+
+    # Run all tests
+    test_results = {}
+
+    r1, p1 = test_multi_token_text(model, tokenizer)
+    test_results["multi_token_text"] = {
+        "results": [
+            {
+                "prompt": TEXT_PROMPTS[i],
+                "passed": r["passed"],
+                "n_tokens": r["num_tokens"],
+                "text": r["text_clean"][:200],
+                "tok_per_sec": r["tokens_per_sec"],
+            }
+            for i, r in enumerate(r1)
+        ],
+        "all_passed": p1,
+    }
+
+    r2, p2 = test_logit_collection(model, tokenizer)
+    test_results["logit_collection"] = {
+        "results": [
+            {
+                "prompt": TEXT_PROMPTS[i],
+                "passed": r["passed"],
+                "n_scores": r.get("n_scores", 0),
+            }
+            for i, r in enumerate(r2)
+        ],
+        "all_passed": p2,
+    }
+
+    r3 = test_state_reset(model, tokenizer)
+    test_results["state_reset"] = r3
+
+    r4 = test_image_text_generation(model, tokenizer)
+    test_results["image_text_generation"] = r4
+
+    r5 = test_vision_state_reset(model, tokenizer)
+    test_results["vision_state_reset"] = r5
+
+    # Overall summary
+    all_tests = [p1, p2, r3["passed"], r4["passed"], r5["passed"]]
+    all_passed = all(all_tests)
+
+    print(f"\n{'=' * 70}")
+    print("OVERALL SUMMARY")
+    print(f"{'=' * 70}")
+    test_names = [
+        "Multi-token text",
+        "Logit collection",
+        "State reset",
+        "Image+text generation",
+        "Vision state reset",
+    ]
+    for name, p in zip(test_names, all_tests):
+        print(f"  {'PASS' if p else 'FAIL'}: {name}")
+
+    if all_passed:
+        print(f"\n  ALL TKG TESTS PASSED")
+    else:
+        print(f"\n  SOME TESTS FAILED")
+        sys.exit(1)
+
+    # Save results
+    out_path = os.path.join(REFERENCE_DIR, "neuron_tkg_validation.json")
+    with open(out_path, "w") as f:
+        json.dump(test_results, f, indent=2, default=str)
+    print(f"  Results saved to {out_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_vision_encoder.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_vision_encoder.py
new file mode 100644
index 00000000..8cd31c06
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_vision_encoder.py
@@ -0,0 +1,250 @@
+# Copyright 2025 © Amazon.com and Affiliates
+"""Validate Isaac vision encoder on Neuron vs CPU reference.
+
+Approach: Since the HF Isaac model uses a different vision input format
+(packed_seq_patches via tensor_stream) than the NxDI model (standard pixel_values
+through Conv2d), we can't directly compare vision encoder outputs.
+
+Instead, we validate the Neuron vision encoder by:
+1. Running the NxDI vision encoder on a test image
+2. Checking that output embeddings are numerically reasonable (no NaN/Inf)
+3. Checking that different images produce different embeddings (not degenerate)
+4. Running a manual Conv2d + encoder comparison using reshaped weights
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+    export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH
+    python validate_vision_encoder.py
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import json  # noqa: E402
+import os  # noqa: E402
+import sys  # noqa: E402
+
+import torch  # noqa: E402
+import torch.nn.functional as F  # noqa: E402
+import torchvision.transforms as T  # noqa: E402
+from PIL import Image  # noqa: E402
+from transformers import AutoConfig  # noqa: E402
+from transformers.image_utils import load_image  # noqa: E402
+
+from neuronx_distributed_inference.models.config import NeuronConfig  # noqa: E402
+from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config  # noqa: E402
+
+from isaac_neuron.modeling_isaac import (  # noqa: E402
+    NeuronIsaacForConditionalGeneration,
+    IsaacInferenceConfig,
+)
+
+# ---------------------------------------------------------------------------
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B"
+REFERENCE_DIR = f"{DATA_PATH}/reference_outputs"
+
+IMAGE_SIZE = 256
+IMAGE_MEAN = [0.5, 0.5, 0.5]
+IMAGE_STD = [0.5, 0.5, 0.5]
+
+os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0"
+torch.manual_seed(42)
+
+
+def preprocess_image(image: Image.Image) -> torch.Tensor:
+    transform = T.Compose(
+        [
+            T.Resize(
+                (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC
+            ),
+            T.ToTensor(),
+            T.Normalize(mean=IMAGE_MEAN, std=IMAGE_STD),
+        ]
+    )
+    return transform(image).unsqueeze(0)
+
+
+def load_neuron_model():
+    """Load the compiled Neuron model and return the full model object."""
+    text_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        cp_degree=1,
+        save_sharded_checkpoint=True,
+        skip_sharding=False,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        context_encoding_buckets=[1024],
+        token_generation_buckets=[1024],
+        async_mode=False,
+        output_logits=True,
+        fused_qkv=False,
+        sequence_parallel_enabled=False,
+        attn_kernel_enabled=False,
+        attn_tkg_nki_kernel_enabled=False,
+        attn_tkg_builtin_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+    vision_config = NeuronConfig(
+        batch_size=1,
+        seq_len=1024,
+        torch_dtype=torch.bfloat16,
+        tp_degree=1,
+        world_size=1,
+        save_sharded_checkpoint=True,
+        is_continuous_batching=True,
+        ctx_batch_size=1,
+        enable_bucketing=True,
+        buckets=[1],
+        fused_qkv=False,
+        attn_kernel_enabled=False,
+        qkv_kernel_enabled=False,
+        mlp_kernel_enabled=False,
+    )
+
+    hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True)
+    config = IsaacInferenceConfig(
+        text_neuron_config=text_config,
+        vision_neuron_config=vision_config,
+        load_config=load_pretrained_config(hf_config=hf_config),
+    )
+
+    model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config)
+    model.load(TRACED_MODEL_PATH, skip_warmup=True)
+    return model
+
+
+def main():
+    print(f"{'=' * 70}")
+    print("VISION ENCODER VALIDATION: Neuron")
+    print(f"{'=' * 70}")
+
+    # Prepare test images
+    images = {
+        "red": Image.new("RGB", (256, 256), color="red"),
+        "blue": Image.new("RGB", (256, 256), color="blue"),
+        "black": Image.new("RGB", (256, 256), color="black"),
+    }
+    try:
+        images["reference"] = load_image(
+            "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp"
+        )
+    except Exception as e:
+        print(f"  WARNING: Could not load reference image: {e}")
+
+    # Load model
+    print("\nLoading compiled Neuron model...")
+    model = load_neuron_model()
+    print("  Model loaded.")
+
+    # Run vision encoder on each image
+    embeddings = {}
+    all_passed = True
+    results = []
+
+    for label, img in images.items():
+        print(f"\n--- {label} ({img.size}) ---")
+        pixel_values = preprocess_image(img).to(torch.bfloat16)
+        print(f"  pixel_values: {pixel_values.shape}")
+
+        with torch.no_grad():
+            output = model.vision_encoder_model(pixel_values)
+
+        output_f = output.float().cpu()
+        embeddings[label] = output_f
+
+        # Check 1: Shape
+        expected_tokens = (IMAGE_SIZE // 16) ** 2 // 4  # 64
+        expected_dim = 2048  # text hidden size
+        shape_ok = output_f.shape == torch.Size([1, expected_tokens, expected_dim])
+        print(
+            f"  Output shape: {output_f.shape} (expected [1, {expected_tokens}, {expected_dim}]): {'OK' if shape_ok else 'FAIL'}"
+        )
+
+        # Check 2: No NaN
+        has_nan = torch.isnan(output_f).any().item()
+        print(f"  NaN check: {'FAIL' if has_nan else 'OK'}")
+
+        # Check 3: No Inf
+        has_inf = torch.isinf(output_f).any().item()
+        print(f"  Inf check: {'FAIL' if has_inf else 'OK'}")
+
+        # Check 4: Non-zero variance (not degenerate)
+        variance = output_f.var().item()
+        variance_ok = variance > 1e-6
+        print(
+            f"  Variance: {variance:.6f} {'OK' if variance_ok else 'FAIL (degenerate)'}"
+        )
+
+        # Check 5: Reasonable value range
+        val_min = output_f.min().item()
+        val_max = output_f.max().item()
+        val_mean = output_f.mean().item()
+        range_ok = abs(val_min) < 100 and abs(val_max) < 100
+        print(
+            f"  Range: [{val_min:.4f}, {val_max:.4f}], mean={val_mean:.4f} {'OK' if range_ok else 'SUSPICIOUS'}"
+        )
+
+        passed = shape_ok and not has_nan and not has_inf and variance_ok and range_ok
+        if not passed:
+            all_passed = False
+        results.append(
+            {
+                "label": label,
+                "passed": passed,
+                "shape": list(output_f.shape),
+                "has_nan": has_nan,
+                "has_inf": has_inf,
+                "variance": variance,
+                "range": [val_min, val_max],
+                "mean": val_mean,
+            }
+        )
+
+    # Cross-image comparison: different images should produce different embeddings
+    print(f"\n--- Cross-image comparison ---")
+    labels = list(embeddings.keys())
+    for i in range(len(labels)):
+        for j in range(i + 1, len(labels)):
+            a, b = labels[i], labels[j]
+            cos = F.cosine_similarity(
+                embeddings[a].reshape(1, -1), embeddings[b].reshape(1, -1)
+            ).item()
+            different = cos < 0.999  # Different images should have cosine < 0.999
+            print(
+                f"  {a} vs {b}: cosine={cos:.6f} {'OK (different)' if different else 'WARNING (too similar)'}"
+            )
+            if not different:
+                print(f"    WARNING: Very similar embeddings for different images!")
+
+    # Summary
+    print(f"\n{'=' * 70}")
+    print("SUMMARY")
+    print(f"{'=' * 70}")
+    for r in results:
+        status = "PASS" if r["passed"] else "FAIL"
+        print(
+            f"  [{status}] {r['label']}: shape={r['shape']}, var={r['variance']:.6f}, range=[{r['range'][0]:.3f}, {r['range'][1]:.3f}]"
+        )
+
+    if all_passed:
+        print(f"\n  ALL VISION ENCODER CHECKS PASSED")
+    else:
+        print(f"\n  SOME CHECKS FAILED")
+        sys.exit(1)
+
+    out_path = os.path.join(REFERENCE_DIR, "neuron_vision_encoder_validation.json")
+    with open(out_path, "w") as f:
+        json.dump(results, f, indent=2)
+    print(f"  Results saved to {out_path}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/vllm/README.md b/contrib/models/Isaac-0.2-2B/vllm/README.md
new file mode 100644
index 00000000..e6541f52
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/vllm/README.md
@@ -0,0 +1,162 @@
+# Running Isaac-0.2-2B with vLLM on AWS Neuron
+
+## Setup
+
+### 1. Download Model Weights
+
+```bash
+huggingface-cli download PerceptronAI/Isaac-0.2-2B-Preview --local-dir /mnt/models/Isaac-0.2-2B-Preview
+```
+
+### 2. Activate vLLM Environment
+
+Use the DLAMI venv that includes vLLM 0.16.0 + vllm-neuron 0.5.0:
+
+```bash
+source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate
+```
+
+### 3. Apply vLLM Patches
+
+Isaac is a contrib model and requires patching vllm-neuron to register the model:
+
+```bash
+NXDI_ROOT="/mnt/models/neuronx-distributed-inference"
+PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \
+    python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py
+```
+
+This patches 3 files in the installed vllm-neuron package:
+1. `constants.py` — Registers `IsaacForConditionalGeneration` as a multimodal model
+2. `neuronx_distributed_model_loader.py` — Adds Isaac wrapper class with `load_weights()` and custom `execute_model()` override
+3. `neuronx_distributed_model_runner.py` — Adds multimodal data routing for `"isaac"` model type
+
+### 3.5. Patch modular_isaac.py (Required)
+
+Isaac's HuggingFace `modular_isaac.py` imports the proprietary `perceptron.tensorstream` package, which
+is unavailable on Neuron instances. This must be patched before vLLM can load the model config:
+
+```bash
+NXDI_ROOT="/mnt/models/neuronx-distributed-inference"
+python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py \
+    /mnt/models/Isaac-0.2-2B-Preview/modular_isaac.py
+```
+
+**Important**: If HuggingFace has already cached the model code, also patch the cached copy:
+
+```bash
+python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py \
+    ~/.cache/huggingface/modules/transformers_modules/Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py
+```
+
+### 4. Compile Model (if not already compiled)
+
+The model must be compiled via NxDI before vLLM can serve it:
+
+```bash
+source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate
+PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \
+    python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py compile
+```
+
+## Running
+
+### Offline Inference
+
+```bash
+source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate
+NXDI_ROOT="/mnt/models/neuronx-distributed-inference"
+PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \
+    python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py
+```
+
+### Online Serving
+
+1. Start the server:
+
+```bash
+source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate
+NXDI_ROOT="/mnt/models/neuronx-distributed-inference"
+PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \
+    bash ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh
+```
+
+2. Query the server:
+
+```bash
+python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py --base-url http://localhost:8080
+```
+
+Or use curl:
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Isaac-0.2-2B-Preview",
+    "messages": [{"role": "user", "content": "What is quantum computing?"}],
+    "max_tokens": 100,
+    "temperature": 0
+  }'
+```
+
+## Configuration
+
+Key vLLM parameters for Isaac:
+
+| Parameter | Value | Notes |
+|-----------|-------|-------|
+| `tensor-parallel-size` | 1 | 2B model fits on single core |
+| `max-model-len` | 1024 | Adjust based on compiled buckets |
+| `max-num-seqs` | 1 | VLM framework limitation |
+| `trust-remote-code` | Required | Isaac uses custom model code |
+| `attn_kernel_enabled` | true | CTE flash attention (+2%) |
+
+## Tested Results
+
+| Mode | Status | Throughput | Notes |
+|------|--------|------------|-------|
+| Text-only (offline) | **Working** | ~78 tok/s | Correct output verified |
+| Image+text (offline) | Not working | N/A | pixel_values format mismatch |
+| Online API server | Not tested | N/A | Text-only expected to work |
+
+**Example output** (text-only):
+```
+Prompt: "What is the capital of France?"
+Output: "<think>\n\n</think>\n\nThe capital of France is Paris."
+```
+
+## Known Limitations
+
+1. **Image+text is not supported via vLLM**: vLLM-neuron delivers `pixel_values` in pre-flattened
+   patch format `[num_patches, patch_dim]`, but Isaac's NxDI model expects raw image tensors
+   `[B, 3, 256, 256]`. Fixing this requires adapting vLLM's multimodal preprocessing or adding
+   a reshape layer in the wrapper.
+
+2. **On-device sampling mismatch**: Isaac's NxDI model returns logits (not on-device sampled tokens).
+   The `execute_model()` override in the wrapper handles this by extracting
+   `output.logits[:, -1, :]` and applying `torch.argmax()`. This means sampling parameters
+   like `temperature` and `top_p` are NOT respected — generation is always greedy.
+
+3. **`modular_isaac.py` must be patched**: The proprietary `perceptron.tensorstream` import must be
+   removed before vLLM can load the model. See step 3.5 above.
+
+4. **Single sequence only**: `max-num-seqs=1` is required due to the NxDI VLM framework limitation
+   (shared with all VLM contrib models).
+
+## Architecture
+
+The vLLM integration uses a 3-file patch approach:
+
+```
+vllm-neuron (installed package)
+├── worker/constants.py               + "IsaacForConditionalGeneration" in NEURON_MULTI_MODAL_MODELS
+├── worker/neuronx_distributed_model_loader.py  + NeuronIsaacForConditionalGeneration class
+│                                                  + get_neuron_model() dispatch
+└── worker/neuronx_distributed_model_runner.py  + "isaac" multimodal routing
+```
+
+The `NeuronIsaacForConditionalGeneration` wrapper:
+- Loads the compiled NxDI Isaac model via `load_weights()`
+- Overrides `execute_model()` to handle the logits→token ID conversion
+- Uses `vision_token_id = 151655` (`<|image_pad|>`) for vision mask construction
diff --git a/contrib/models/Isaac-0.2-2B/vllm/add_execute_model.py b/contrib/models/Isaac-0.2-2B/vllm/add_execute_model.py
new file mode 100644
index 00000000..d003c0e2
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/vllm/add_execute_model.py
@@ -0,0 +1,88 @@
+#!/usr/bin/env python3
+"""Add execute_model and forward overrides to NeuronIsaacForConditionalGeneration in model_loader.py."""
+
+import sys
+
+path = (
+    sys.argv[1]
+    if len(sys.argv) > 1
+    else ("/vllm/vllm_neuron/worker/neuronx_distributed_model_loader.py")
+)
+
+with open(path, "r") as f:
+    content = f.read()
+
+# We need to add execute_model and forward methods to NeuronIsaacForConditionalGeneration
+# The class currently only has load_weights.
+
+# The old code ends with:
+OLD_END = """        self.vision_token_id = tokenizer(
+            "<|image_pad|>", add_special_tokens=False
+        ).input_ids[0]
+        return success, compiled_model_path
+
+
+def _get_model_configs"""
+
+# The new code adds execute_model and forward after load_weights
+NEW_END = '''        self.vision_token_id = tokenizer(
+            "<|image_pad|>", add_special_tokens=False
+        ).input_ids[0]
+        return success, compiled_model_path
+
+    def execute_model(self, model_input, **kwargs):
+        """Execute model forward pass for Isaac VLM.
+
+        Unlike Llama4, Isaac uses vision_token_id (set during load_weights)
+        instead of model.config.image_token_index for vision mask creation.
+        """
+        vision_mask = (
+            model_input.input_tokens == self.vision_token_id
+        ).unsqueeze(-1)
+
+        pixel_values = None
+        if (
+            model_input.multi_modal_kwargs is not None
+            and model_input.multi_modal_kwargs.get("pixel_values") is not None
+        ):
+            pixel_values = model_input.multi_modal_kwargs["pixel_values"]
+
+        # Call the base NeuronMultiModalCausalLM.forward directly
+        # (skip Llama4's forward which assumes Llama4-specific pixel_values format)
+        hidden_states = NeuronMultiModalCausalLM.forward(
+            self,
+            input_ids=model_input.input_tokens,
+            positions=model_input.position_ids,
+            input_block_ids=model_input.input_block_ids,
+            sampling_params=model_input.sampling_params,
+            pixel_values=pixel_values,
+            vision_mask=vision_mask,
+        )
+        return hidden_states
+
+
+def _get_model_configs'''
+
+if OLD_END in content:
+    content = content.replace(OLD_END, NEW_END)
+    with open(path, "w") as f:
+        f.write(content)
+    print(
+        f"SUCCESS: Added execute_model override to NeuronIsaacForConditionalGeneration in {path}"
+    )
+else:
+    print(f"ERROR: Could not find the expected code block in {path}")
+    # Show what's around the class
+    import re
+
+    match = re.search(
+        r"class NeuronIsaacForConditionalGeneration.*?(?=\nclass |\ndef _get_model_configs)",
+        content,
+        re.DOTALL,
+    )
+    if match:
+        print(f"Found class at positions {match.start()}-{match.end()}")
+        print("Last 200 chars of class:")
+        print(match.group()[-200:])
+    else:
+        print("Could not find the class at all")
diff --git a/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py b/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py
new file mode 100644
index 00000000..9b0f932b
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py
@@ -0,0 +1,346 @@
+#!/usr/bin/env python3
+# Copyright 2025 (c) Amazon.com and Affiliates
+"""Patch vllm-neuron 0.5.0 to support Isaac-0.2-2B VLM.
+
+Applies the 4-layer registration:
+1. constants.py — Add to NEURON_MULTI_MODAL_MODELS
+2. model_loader.py — Add NeuronIsaacForConditionalGeneration wrapper class
+3. model_loader.py — Add architecture dispatch in get_neuron_model() + fix Sampler import
+4. model_runner.py — Add multimodal data routing
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate
+    python patch_vllm_isaac.py
+"""
+
+import importlib
+import os
+import sys
+
+
+def find_vllm_neuron_path():
+    """Find the installed vllm_neuron package path."""
+    try:
+        spec = importlib.util.find_spec("vllm_neuron")
+        if spec and spec.origin:
+            return os.path.dirname(spec.origin)
+    except (ModuleNotFoundError, AttributeError):
+        pass
+
+    # Fallback: search common locations
+    for base in sys.path:
+        candidate = os.path.join(base, "vllm_neuron")
+        if os.path.isdir(candidate):
+            return candidate
+
+    raise FileNotFoundError(
+        "Cannot find vllm_neuron package. Is vllm-neuron installed?"
+    )
+
+
+def patch_constants(worker_dir):
+    """Layer 1: Add Isaac to NEURON_MULTI_MODAL_MODELS."""
+    path = os.path.join(worker_dir, "constants.py")
+    with open(path, "r") as f:
+        content = f.read()
+
+    if "IsaacForConditionalGeneration" in content:
+        print("[constants.py] Already patched — skipping")
+        return
+
+    # Add Isaac to the NEURON_MULTI_MODAL_MODELS list
+    # Try various insertion points
+    for marker in [
+        '"Qwen3VLForConditionalGeneration",',
+        '"Qwen2VLForConditionalGeneration",',
+        '"Llama4ForConditionalGeneration",',
+        '"LlavaForConditionalGeneration",',
+    ]:
+        if marker in content:
+            content = content.replace(
+                marker,
+                marker + '\n    "IsaacForConditionalGeneration",',
+            )
+            break
+
+    if "IsaacForConditionalGeneration" not in content:
+        print("[constants.py] WARNING: Could not find insertion point")
+        return
+
+    with open(path, "w") as f:
+        f.write(content)
+    print(
+        "[constants.py] Added IsaacForConditionalGeneration to NEURON_MULTI_MODAL_MODELS"
+    )
+
+
+def patch_model_loader(worker_dir):
+    """Layer 2+3: Fix Sampler import, add Isaac wrapper class, add architecture dispatch."""
+    path = os.path.join(worker_dir, "neuronx_distributed_model_loader.py")
+    with open(path, "r") as f:
+        content = f.read()
+
+    # Fix Sampler import (shared issue with Gemma3)
+    if "from vllm.v1.sample import sampler as Sampler" in content:
+        content = content.replace(
+            "from vllm.v1.sample import sampler as Sampler",
+            "from vllm.v1.sample.sampler import Sampler",
+        )
+        print("[model_loader.py] Fixed Sampler import")
+
+    if "NeuronIsaacForConditionalGeneration" in content:
+        print("[model_loader.py] Already patched — skipping")
+        with open(path, "w") as f:
+            f.write(content)
+        return
+
+    # --- Add Isaac wrapper class before get_neuron_model or _get_model_configs ---
+    isaac_class = '''
+
+class NeuronIsaacForConditionalGeneration(NeuronLlama4ForCausalLM):
+    """Isaac VLM using dynamically loaded NeuronIsaacForConditionalGeneration from contrib."""
+
+    def load_weights(self, model_name_or_path: str, architecture: str, **kwargs):
+        import importlib
+
+        neuronx_module = importlib.import_module("isaac_neuron.modeling_isaac")
+        neuronx_model_cls = getattr(neuronx_module, "NeuronIsaacForConditionalGeneration")
+
+        default_neuron_config = kwargs["neuron_config"]
+        override_neuron_config = _validate_image_to_text_override_neuron_config(
+            kwargs["override_neuron_config"]
+        )
+
+        vision_neuron_config = copy.deepcopy(default_neuron_config)
+        vision_neuron_config.update(
+            override_neuron_config.get("vision_neuron_config", {})
+        )
+        vision_neuron_config = neuronx_model_cls.get_neuron_config_cls()(
+            **vision_neuron_config
+        )
+
+        text_neuron_config = copy.deepcopy(default_neuron_config)
+        text_neuron_config.update(override_neuron_config.get("text_neuron_config", {}))
+        text_neuron_config = neuronx_model_cls.get_neuron_config_cls()(
+            **text_neuron_config
+        )
+
+        from transformers import AutoConfig
+        hf_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True)
+
+        config = neuronx_model_cls.get_config_cls()(
+            text_neuron_config=text_neuron_config,
+            vision_neuron_config=vision_neuron_config,
+            load_config=load_pretrained_config(hf_config=hf_config),
+        )
+
+        success, compiled_model_path, _ = self._load_weights_common(
+            model_name_or_path, neuronx_model_cls, config=config, **kwargs
+        )
+
+        if not success:
+            if not os.path.exists(model_name_or_path):
+                model_name_or_path = self._save_pretrained_model(model_name_or_path)
+
+            self._compile_and_load_model(
+                model_name_or_path, neuronx_model_cls, config, compiled_model_path
+            )
+
+        # Load tokenizer to get vision token ID
+        from transformers import AutoTokenizer
+        tokenizer = AutoTokenizer.from_pretrained(
+            model_name_or_path, trust_remote_code=True
+        )
+        self.vision_token_id = tokenizer(
+            "<|image_pad|>", add_special_tokens=False
+        ).input_ids[0]
+        return success, compiled_model_path
+
+    def execute_model(self, model_input, **kwargs):
+        """Execute model forward pass for Isaac VLM.
+
+        Uses vision_token_id for vision mask (not model.config.image_token_index),
+        calls base forward directly, and handles logits->token_id conversion since
+        the Isaac compiled model returns logits (not on-device sampled tokens).
+        """
+        vision_mask = (
+            model_input.input_tokens == self.vision_token_id
+        ).unsqueeze(-1)
+
+        pixel_values = None
+        if (
+            model_input.multi_modal_kwargs is not None
+            and model_input.multi_modal_kwargs.get("pixel_values") is not None
+        ):
+            pixel_values = model_input.multi_modal_kwargs["pixel_values"]
+
+        # Call base forward with Isaac-specific args
+        with self._reordered(
+            model_input.input_block_ids,
+            input_ids=model_input.input_tokens,
+            positions=model_input.position_ids,
+            sampling_params=model_input.sampling_params,
+            pixel_values=pixel_values,
+            vision_mask=vision_mask,
+        ) as (sorted_ids, inputs, restore):
+            output = self.model(
+                inputs["input_ids"].to(torch.int32),
+                attention_mask=None,
+                position_ids=inputs["positions"].to(torch.int32),
+                seq_ids=sorted_ids.flatten().to(torch.int32),
+                pixel_values=inputs.get("pixel_values"),
+                vision_mask=inputs.get("vision_mask"),
+                sampling_params=inputs["sampling_params"],
+            )
+
+            # Isaac model returns logits (not on-device sampled tokens)
+            # Extract last-token logits and argmax to get token IDs
+            if hasattr(output, "hidden_states") and isinstance(output.hidden_states, torch.Tensor) and output.hidden_states.numel() > 0:
+                result = output.hidden_states
+            else:
+                logits = output.logits[:, -1, :]  # [batch, vocab]
+                result = torch.argmax(logits, dim=-1)  # [batch] - token IDs
+
+            return restore(result)
+
+'''
+
+    # Insert class before _get_model_configs or get_neuron_model
+    for marker in ["def _get_model_configs(", "def get_neuron_model("]:
+        if marker in content:
+            idx = content.index(marker)
+            content = content[:idx] + isaac_class + "\n" + content[idx:]
+            print("[model_loader.py] Added NeuronIsaacForConditionalGeneration class")
+            break
+    else:
+        print("[model_loader.py] WARNING: Could not find insertion point for class")
+
+    # --- Add architecture dispatch in get_neuron_model() ---
+    # This function is in model_loader.py and dispatches based on architecture string
+    dispatch_markers = [
+        'elif architecture == "Qwen3VLForConditionalGeneration":',
+        'elif architecture == "Qwen2VLForConditionalGeneration":',
+        'elif architecture == "Llama4ForConditionalGeneration":',
+    ]
+
+    for marker in dispatch_markers:
+        if marker in content:
+            # Find the line after this elif + its body
+            idx = content.index(marker)
+            # Find next elif or else
+            search_start = idx + len(marker)
+            next_elif = content.find("\n    elif ", search_start)
+            next_else = content.find("\n    else:", search_start)
+
+            # Pick the closest one
+            candidates = [c for c in [next_elif, next_else] if c > 0]
+            if candidates:
+                insert_point = min(candidates)
+                insert_text = (
+                    '\n    elif architecture == "IsaacForConditionalGeneration":'
+                    "\n        model = NeuronIsaacForConditionalGeneration(model_config.hf_config)"
+                )
+                content = content[:insert_point] + insert_text + content[insert_point:]
+                print(
+                    "[model_loader.py] Added Isaac architecture dispatch in get_neuron_model()"
+                )
+                break
+    else:
+        print("[model_loader.py] WARNING: Could not find dispatch insertion point")
+
+    with open(path, "w") as f:
+        f.write(content)
+
+
+def patch_model_runner(worker_dir):
+    """Layer 4: Add multimodal data routing for Isaac model_type."""
+    path = os.path.join(worker_dir, "neuronx_distributed_model_runner.py")
+    with open(path, "r") as f:
+        content = f.read()
+
+    if '"isaac"' in content or "'isaac'" in content:
+        print("[model_runner.py] Already patched — skipping")
+        return
+
+    changed = False
+
+    # Add multimodal data routing for Isaac
+    # Isaac uses pass-through (no special multimodal preprocessing needed, like Llama4)
+    # Look for existing qwen3_vl routing and add after it
+    routing_markers = [
+        'elif self.model.model.config.model_type == "qwen3_vl":',
+        'elif self.model.model.config.model_type == "qwen2_vl":',
+        'elif self.model.model.config.model_type == "llava":',
+    ]
+
+    for marker in routing_markers:
+        if marker in content:
+            # Find the line(s) after this elif
+            idx = content.index(marker)
+            search_start = idx + len(marker)
+            # Find next elif or else
+            next_elif = content.find("\n        elif ", search_start)
+            next_else = content.find("\n        else:", search_start)
+
+            candidates = [c for c in [next_elif, next_else] if c > 0]
+            if candidates:
+                insert_point = min(candidates)
+                insert_text = (
+                    '\n        elif self.model.model.config.model_type == "isaac":'
+                    "\n            pass  # Isaac does not require special multimodal preprocessing"
+                )
+                content = content[:insert_point] + insert_text + content[insert_point:]
+                print("[model_runner.py] Added Isaac multimodal data routing")
+                changed = True
+                break
+
+    if not changed:
+        # Try alternative: check if there's a list-style routing
+        for list_marker in [
+            "in ['llama4'",
+            'in ["llama4"',
+            "in ['llama4', 'gemma3'",
+            'in ["llama4", "gemma3"',
+        ]:
+            if list_marker in content:
+                content = content.replace(
+                    list_marker,
+                    list_marker.rstrip("'\"") + "', 'isaac'"
+                    if "'" in list_marker
+                    else list_marker.rstrip("'\"") + '", "isaac"',
+                )
+                print("[model_runner.py] Added Isaac to multimodal list routing")
+                changed = True
+                break
+
+    if not changed:
+        print(
+            "[model_runner.py] WARNING: Could not add multimodal routing — may need manual patch"
+        )
+
+    with open(path, "w") as f:
+        f.write(content)
+
+
+def main():
+    vllm_neuron_path = find_vllm_neuron_path()
+    worker_dir = os.path.join(vllm_neuron_path, "worker")
+    print(f"Found vllm_neuron at: {vllm_neuron_path}")
+    print(f"Worker directory: {worker_dir}")
+    print()
+
+    patch_constants(worker_dir)
+    patch_model_loader(worker_dir)
+    patch_model_runner(worker_dir)
+
+    print()
+    print("All patches applied. To use Isaac with vLLM:")
+    print("  export VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference'")
+    print("  export NEURON_COMPILED_ARTIFACTS='/mnt/models/traced_model/Isaac-0.2-2B'")
+    print(
+        "  PYTHONPATH='.../Isaac-0.2-2B/src:$PYTHONPATH' python -m vllm.entrypoints.openai.api_server ..."
+    )
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py b/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py
new file mode 100644
index 00000000..0182161a
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py
@@ -0,0 +1,129 @@
+#!/usr/bin/env python3
+# Copyright 2025 (c) Amazon.com and Affiliates
+"""Offline inference for Isaac-0.2-2B via vLLM on Neuron.
+
+Usage:
+    source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate
+    export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
+    export NEURON_COMPILED_ARTIFACTS="/mnt/models/traced_model/Isaac-0.2-2B"
+    PYTHONPATH="/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:/mnt/models/neuronx-distributed-inference/src:$PYTHONPATH" \
+        python run_offline_inference.py
+"""
+
+from isaac_neuron.ndxi_patch import apply_patch
+
+apply_patch()
+
+import os  # noqa: E402
+from pathlib import Path  # noqa: E402
+
+from vllm import LLM, SamplingParams  # noqa: E402
+
+HOME_DIR = Path.home()
+DATA_PATH = os.getenv("DATA_HOME", "/mnt/models")
+MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview"
+COMPILED_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B"
+
+os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference"
+os.environ["NEURON_COMPILED_ARTIFACTS"] = COMPILED_PATH
+
+
+def main(max_seq_len: int = 1024) -> None:
+    llm = LLM(
+        model=MODEL_PATH,
+        max_num_seqs=1,
+        max_model_len=max_seq_len,
+        tensor_parallel_size=1,
+        limit_mm_per_prompt={"image": 1},
+        allowed_local_media_path=HOME_DIR.as_posix(),
+        enable_prefix_caching=False,
+        enable_chunked_prefill=False,
+        trust_remote_code=True,
+        additional_config={
+            "override_neuron_config": {
+                "text_neuron_config": {
+                    "attn_kernel_enabled": True,
+                    "enable_bucketing": True,
+                    "context_encoding_buckets": [max_seq_len],
+                    "token_generation_buckets": [max_seq_len],
+                    "is_continuous_batching": True,
+                    "async_mode": False,
+                },
+                "vision_neuron_config": {
+                    "enable_bucketing": True,
+                    "buckets": [1],
+                    "is_continuous_batching": True,
+                },
+            },
+        },
+    )
+
+    sampling_params = SamplingParams(top_k=1, max_tokens=100)
+
+    # Test 1: Text-only
+    print("=" * 60)
+    print("Test 1: Text-only")
+    print("=" * 60)
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "What is the capital of France? Explain briefly.",
+                },
+            ],
+        }
+    ]
+    for output in llm.chat(conversation, sampling_params):
+        print(f"Generated: {output.outputs[0].text!r}")
+
+    # Test 2: Text-only (longer)
+    print("\n" + "=" * 60)
+    print("Test 2: Text-only (longer)")
+    print("=" * 60)
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "text",
+                    "text": "Explain quantum entanglement in simple terms.",
+                },
+            ],
+        }
+    ]
+    for output in llm.chat(conversation, sampling_params):
+        print(f"Generated: {output.outputs[0].text!r}")
+
+    # Test 3: Image+text (requires a test image)
+    print("\n" + "=" * 60)
+    print("Test 3: Image+text")
+    print("=" * 60)
+    test_image = Path(__file__).resolve().parent / "data" / "test_image.jpg"
+    if test_image.exists():
+        image_url = f"file://{test_image.as_posix()}"
+    else:
+        # Use a publicly accessible image URL
+        image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
+
+    conversation = [
+        {
+            "role": "user",
+            "content": [
+                {"type": "image_url", "image_url": {"url": image_url}},
+                {"type": "text", "text": "Describe this image in detail."},
+            ],
+        }
+    ]
+    try:
+        for output in llm.chat(conversation, sampling_params):
+            print(f"Generated: {output.outputs[0].text!r}")
+    except Exception as e:
+        print(f"Image+text failed (may need local image): {e}")
+
+    print("\nAll tests completed.")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py b/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py
new file mode 100644
index 00000000..5b8f9eb1
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py
@@ -0,0 +1,104 @@
+#!/usr/bin/env python3
+# Copyright 2025 (c) Amazon.com and Affiliates
+"""Online inference client for Isaac vLLM server.
+
+Sends requests to a running vLLM OpenAI-compatible API server.
+
+Usage:
+    # Start server first (see start-vllm-server.sh)
+    python run_online_inference.py [--base-url http://localhost:8080]
+"""
+
+import argparse
+import json
+import time
+
+import requests
+
+
+def chat_completion(base_url, messages, max_tokens=100, temperature=0):
+    """Send a chat completion request to the vLLM server."""
+    url = f"{base_url}/v1/chat/completions"
+    payload = {
+        "model": "Isaac-0.2-2B-Preview",
+        "messages": messages,
+        "max_tokens": max_tokens,
+        "temperature": temperature,
+    }
+    t0 = time.time()
+    response = requests.post(url, json=payload, timeout=120)
+    elapsed = time.time() - t0
+    response.raise_for_status()
+    result = response.json()
+    return result, elapsed
+
+
+def main():
+    parser = argparse.ArgumentParser()
+    parser.add_argument("--base-url", default="http://localhost:8080")
+    args = parser.parse_args()
+
+    # Test 1: Text-only
+    print("=" * 60)
+    print("Test 1: Text-only")
+    print("=" * 60)
+    messages = [
+        {"role": "user", "content": "What is the capital of France? Explain briefly."}
+    ]
+    result, elapsed = chat_completion(args.base_url, messages)
+    text = result["choices"][0]["message"]["content"]
+    usage = result.get("usage", {})
+    print(f"Response: {text[:200]}")
+    print(f"Latency: {elapsed:.2f}s")
+    print(f"Usage: {usage}")
+
+    # Test 2: Text-only (longer)
+    print("\n" + "=" * 60)
+    print("Test 2: Text-only (longer)")
+    print("=" * 60)
+    messages = [
+        {
+            "role": "user",
+            "content": "Explain quantum entanglement in simple terms.",
+        }
+    ]
+    result, elapsed = chat_completion(args.base_url, messages)
+    text = result["choices"][0]["message"]["content"]
+    usage = result.get("usage", {})
+    print(f"Response: {text[:200]}")
+    print(f"Latency: {elapsed:.2f}s")
+    print(f"Usage: {usage}")
+
+    # Test 3: Image+text
+    print("\n" + "=" * 60)
+    print("Test 3: Image+text")
+    print("=" * 60)
+    messages = [
+        {
+            "role": "user",
+            "content": [
+                {
+                    "type": "image_url",
+                    "image_url": {
+                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png"
+                    },
+                },
+                {"type": "text", "text": "Describe this image."},
+            ],
+        }
+    ]
+    try:
+        result, elapsed = chat_completion(args.base_url, messages)
+        text = result["choices"][0]["message"]["content"]
+        usage = result.get("usage", {})
+        print(f"Response: {text[:200]}")
+        print(f"Latency: {elapsed:.2f}s")
+        print(f"Usage: {usage}")
+    except Exception as e:
+        print(f"Image+text failed: {e}")
+
+    print("\nAll online tests completed.")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh b/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh
new file mode 100644
index 00000000..92e3a517
--- /dev/null
+++ b/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh
@@ -0,0 +1,32 @@
+#!/bin/bash
+# Copyright 2025 (c) Amazon.com and Affiliates
+# Start vLLM server for Isaac-0.2-2B on Neuron
+#
+# Prerequisites:
+#   1. Apply vLLM patches: python patch_vllm_isaac.py
+#   2. Model compiled at NEURON_COMPILED_ARTIFACTS path
+#
+# Usage:
+#   source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate
+#   bash start-vllm-server.sh
+
+export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference"
+export NEURON_COMPILED_ARTIFACTS="/mnt/models/traced_model/Isaac-0.2-2B"
+export VLLM_RPC_TIMEOUT=100000
+
+NXDI_ROOT="/mnt/models/neuronx-distributed-inference"
+ISAAC_SRC="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src"
+export PYTHONPATH="${ISAAC_SRC}:${NXDI_ROOT}/src:${PYTHONPATH}"
+
+python -m vllm.entrypoints.openai.api_server \
+    --port=8080 \
+    --model="/mnt/models/Isaac-0.2-2B-Preview" \
+    --max-num-seqs=1 \
+    --max-model-len=1024 \
+    --limit-mm-per-prompt='{"image": 1}' \
+    --allowed-local-media-path="/mnt/models" \
+    --tensor-parallel-size=1 \
+    --trust-remote-code \
+    --no-enable-chunked-prefill \
+    --no-enable-prefix-caching \
+    --additional-config='{"override_neuron_config":{"text_neuron_config":{"attn_kernel_enabled":true,"enable_bucketing":true,"context_encoding_buckets":[1024],"token_generation_buckets":[1024],"is_continuous_batching":true,"async_mode":false},"vision_neuron_config":{"enable_bucketing":true,"buckets":[1],"is_continuous_batching":true}}}'