diff --git a/contrib/models/Isaac-0.2-2B/README.md b/contrib/models/Isaac-0.2-2B/README.md new file mode 100644 index 00000000..c9ee7e67 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/README.md @@ -0,0 +1,215 @@ +# Contrib Model: PerceptronAI Isaac-0.2-2B-Preview VLM + +NeuronX Distributed Inference implementation for the PerceptronAI Isaac-0.2-2B-Preview Vision-Language Model. Isaac combines a Qwen3 text backbone with a SigLIP2 vision encoder and 2-layer MLP projector with pixel shuffle. + +## Model Information + +- **HuggingFace ID:** [`PerceptronAI/Isaac-0.2-2B-Preview`](https://huggingface.co/PerceptronAI/Isaac-0.2-2B-Preview) +- **Model Type:** VLM with SigLIP2 vision encoder, pixel shuffle, MLP projector, and Qwen3 text decoder +- **License:** CC-BY-NC-4.0 (non-commercial) +- **Requires:** `trust_remote_code=True` + +## Architecture Details + +### Text Backbone (Qwen3) + +| Spec | Isaac 2B | +|---|---:| +| **Layers** | 28 | +| **Hidden Size** | 2048 | +| **Head Dim** | 128 | +| **Attention Heads** | 16 | +| **KV Heads** | 8 | +| **Intermediate Size** | 6144 | +| **Vocabulary Size** | 151,936 | +| **Max Position Embeddings** | 40,960 | +| **Position Encoding** | RoPE (mRoPE-capable) | +| **Normalization** | RMSNorm | +| **Activation** | SiLU | +| **Total Parameters** | 2.57B | + +### SigLIP2 Vision Encoder + +| Spec | Value | +|---|---:| +| **Layers** | 27 | +| **Hidden Size** | 1152 | +| **Head Dim** | 72 | +| **Attention Heads** | 16 | +| **KV Heads** | 16 | +| **Intermediate Size** | 4304 | +| **Activation** | GELU (approximate) | +| **Image Size** | 256×256 | +| **Patch Size** | 16 | +| **Pixel Shuffle Scale** | 2 | +| **Vision Tokens per Image** | 64 | + +### MLP Projector + +| Spec | Value | +|---|---:| +| **Layer 1** | Linear(4608 → 18432, no bias) + SiLU | +| **Layer 2** | Linear(18432 → 2048, no bias) | +| **Parameters** | ~122M | + +## Validation Results + +**Validated:** 2026-04-30 +**Configuration:** trn2.3xlarge, TP=1, batch_size=1, seq_len=1024, bfloat16 + +### Accuracy + +| Test | Status | Result | +|------|--------|--------| +| Text logit cosine (5 prompts) | PASS | avg 0.99998 vs CPU ref | +| Top-1 token match | PASS | 100% match (8/8 prompts) | +| Image+text generation | PASS | Coherent descriptions | +| TP=2 accuracy | PASS | cosine 0.99997 | +| TP=4 accuracy | PASS | cosine 0.99997 | + +### Performance (trn2.3xlarge, TP=1, BS=1) + +| Metric | seq_len=1024 | seq_len=4096 | +|--------|-------------|-------------| +| **TKG Throughput** | 110.7 tok/s | 94.0 tok/s | +| **TPOT** | 9.0 ms | 10.6 ms | +| **TTFT** | 9.0 ms | 10.6 ms | +| **Image+text tok/s** | 108.7 tok/s | 93.1 tok/s | +| **Projected DP=4** | ~443 tok/s | ~376 tok/s | + +**Compilation time:** ~196s (one-time, seq_len=1024) + +### GPU Comparison (L40S, vLLM 0.20.0, CUDA graphs enabled) + +| Metric | L40S GPU | trn2 Neuron (TP=1) | trn2 Neuron (DP=4) | +|--------|----------|---------------------|---------------------| +| **TPOT (short input)** | 5.75 ms | 9.0 ms | — | +| **Throughput (short input)** | 174 tok/s | 111 tok/s | ~443 tok/s | +| **TPOT (long input)** | 6.09 ms | 9.0 ms | — | +| **Throughput (long input)** | 164 tok/s | 111 tok/s | ~443 tok/s | + +- **Per-core:** L40S is ~1.5x faster than a single NeuronCore +- **Per-device (DP=4):** trn2.3xlarge is ~2.5x faster than L40S +- GPU benchmark: L40S with vLLM 0.20.0, batch_size=1, CUDA graphs enabled (default) +- Neuron benchmark: trn2.3xlarge, TP=1, batch_size=1, bfloat16, CTE flash attention + +## Usage + +```python +import torch +from transformers import AutoConfig, AutoTokenizer +from neuronx_distributed_inference.models.config import NeuronConfig, OnDeviceSamplingConfig +from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config + +from isaac_neuron.modeling_isaac import ( + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +model_path = "/path/to/Isaac-0.2-2B-Preview" +compiled_path = "/path/to/compiled/model" + +# Configure +text_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, do_sample=True, deterministic=True, + top_k=1, global_topk=256, top_k_kernel_enabled=True, + ), + attn_kernel_enabled=True, # CTE flash attention + fused_qkv=False, + mlp_kernel_enabled=False, +) + +vision_config = NeuronConfig( + batch_size=1, seq_len=1024, torch_dtype=torch.bfloat16, + tp_degree=1, is_continuous_batching=True, ctx_batch_size=1, + enable_bucketing=True, buckets=[1], +) + +hf_config = AutoConfig.from_pretrained(model_path, trust_remote_code=True) +config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), +) +config.image_token_index = 151655 # <|image_pad|> + +# Compile and load +model = NeuronIsaacForConditionalGeneration(model_path, config) +model.compile(compiled_path, debug=False) +model.load(compiled_path) + +# Generate (see integration tests for full examples) +``` + +## Compatibility Matrix + +| Instance/Version | SDK 2.29 | SDK 2.28 and earlier | +|------------------|----------|----------------------| +| trn2.3xlarge (TP=1) | Tested | Not tested | +| trn2.3xlarge (TP=2) | Tested | Not tested | +| trn2.3xlarge (TP=4) | Tested | Not tested | +| trn1 | Not tested | Not tested | +| inf2 | Not tested | Not tested | + +## Known Limitations + +- **Batch size:** Only BS=1 supported (NxDI VLM framework limitation, shared with all VLM contribs) +- **MLP NKI kernel:** Not compatible at TP=1 (intermediate=6144 exceeds SBUF capacity). Use default kernels. +- **QKV NKI kernel:** Not compatible (Q/K layernorm incompatible with fused QKV kernel) +- **Image size:** Fixed at 256×256 (64 vision tokens per image) +- **License:** CC-BY-NC-4.0 — non-commercial use only + +## Testing + +Run integration tests: + +```bash +# Set up environment +source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate +export PYTHONPATH=/path/to/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + +# Run validation +cd contrib/models/Isaac-0.2-2B +python test/integration/run_isaac.py +``` + +## Module Structure + +``` +contrib/models/Isaac-0.2-2B/ +├── README.md +├── src/ +│ └── isaac_neuron/ +│ ├── __init__.py +│ ├── modeling_isaac.py # VLM orchestrator + config + state dict mapping +│ ├── modeling_isaac_text.py # Text model (NeuronBaseModel + Qwen3 layers) +│ ├── modeling_isaac_vision.py # Vision wrapper + MLP projector + pixel shuffle +│ ├── ndxi_patch.py # SDK 2.29 compatibility patches +│ ├── utils.py # QKV fusion + pixel shuffle utilities +│ └── siglip/ +│ ├── modeling_siglip.py # SigLIP2 vision encoder +│ └── layers.py # OutputChannelParallelConv2d +└── test/ + └── integration/ + ├── run_isaac.py # Main compilation + generation test + ├── benchmark.py # Formal benchmark script + ├── test_tp.py # TP=2/4 validation + ├── validate_text_logits.py # Text logit validation vs CPU + ├── validate_tkg.py # TKG multi-token validation + ├── validate_image_text.py # Image+text E2E validation + └── validate_vision_encoder.py # Vision encoder sanity checks +``` + +## Example Checkpoint + +* [`PerceptronAI/Isaac-0.2-2B-Preview`](https://huggingface.co/PerceptronAI/Isaac-0.2-2B-Preview) diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/benchmark_gpu.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/benchmark_gpu.py new file mode 100644 index 00000000..aa6b938f --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/benchmark_gpu.py @@ -0,0 +1,341 @@ +#!/usr/bin/env python3 +# Copyright 2025 (c) Amazon.com and Affiliates +"""GPU benchmark for Isaac-0.2-2B-Preview using vLLM. + +Measures TTFT, TPOT, tok/s across multiple workloads to match Neuron benchmark. +Follows GPU Benchmark Standard (steering/gpu-benchmark-standard.md). + +Usage: + pip install vllm transformers torch pillow + python benchmark_gpu.py [--model PerceptronAI/Isaac-0.2-2B-Preview] [--warmup 5] [--iterations 10] +""" + +import argparse +import json +import os +import statistics +import time +from pathlib import Path + +import torch +from transformers import AutoTokenizer + + +# ── Workload definitions matching Neuron benchmark ────────────────────── + +WORKLOADS = { + "short-short": {"input_tokens": 128, "output_tokens": 128}, + "short-long": {"input_tokens": 128, "output_tokens": 512}, + "long-short": {"input_tokens": 2048, "output_tokens": 128}, + "long-long": {"input_tokens": 2048, "output_tokens": 512}, +} + +FILLER_TEXT = ( + "The quick brown fox jumps over the lazy dog. " + "A journey of a thousand miles begins with a single step. " + "To be or not to be, that is the question. " + "All that glitters is not gold. " + "The only thing we have to fear is fear itself. " +) + + +def build_prompt(tokenizer, target_tokens: int) -> str: + """Build a synthetic prompt of approximately target_tokens length.""" + repeated = FILLER_TEXT * (target_tokens // 10 + 10) + token_ids = tokenizer.encode(repeated)[:target_tokens] + return tokenizer.decode(token_ids, skip_special_tokens=True) + + +def percentiles(values, pcts=(50, 95, 99)): + """Calculate percentiles.""" + if not values: + return {f"p{p}": None for p in pcts} + s = sorted(values) + n = len(s) + return {f"p{p}": s[min(int(p / 100 * n), n - 1)] for p in pcts} + + +def benchmark_vllm_offline(model_path, workloads, warmup, iterations, dtype): + """Run benchmark using vLLM offline (Python API).""" + from vllm import LLM, SamplingParams + + print(f"Loading model: {model_path}") + print(f"dtype: {dtype}") + + llm = LLM( + model=model_path, + dtype=dtype, + trust_remote_code=True, + max_model_len=4096, + gpu_memory_utilization=0.90, + ) + tokenizer = llm.get_tokenizer() + + results = {} + + for wl_name, wl_config in workloads.items(): + input_tokens = wl_config["input_tokens"] + output_tokens = wl_config["output_tokens"] + print(f"\n{'=' * 60}") + print(f"Workload: {wl_name} (input={input_tokens}, output={output_tokens})") + print(f"{'=' * 60}") + + prompt = build_prompt(tokenizer, input_tokens) + actual_input = len(tokenizer.encode(prompt)) + print(f" Actual input tokens: {actual_input}") + + sampling_params = SamplingParams( + temperature=0, # Greedy for reproducibility + max_tokens=output_tokens, + ) + + # Warmup + print(f" Warming up ({warmup} runs)...") + for _ in range(warmup): + llm.generate([prompt], sampling_params) + + # Timed iterations + print(f" Benchmarking ({iterations} runs)...") + ttfts = [] + tpots = [] + throughputs = [] + e2e_latencies = [] + output_lengths = [] + + for i in range(iterations): + t_start = time.perf_counter() + outputs = llm.generate([prompt], sampling_params) + t_end = time.perf_counter() + + output = outputs[0] + n_output_tokens = len(output.outputs[0].token_ids) + e2e = t_end - t_start + + # Extract TTFT from metrics if available + metrics = output.metrics + if ( + metrics + and hasattr(metrics, "first_token_time") + and metrics.first_token_time + ): + ttft = metrics.first_token_time - metrics.arrival_time + else: + # Approximate: E2E - decode time + ttft = e2e / (n_output_tokens + 1) if n_output_tokens > 0 else e2e + + # TPOT = decode time / (output tokens - 1) + decode_time = e2e - ttft + tpot = decode_time / max(n_output_tokens - 1, 1) + tps = n_output_tokens / e2e if e2e > 0 else 0 + + ttfts.append(ttft * 1000) # to ms + tpots.append(tpot * 1000) # to ms + throughputs.append(tps) + e2e_latencies.append(e2e * 1000) # to ms + output_lengths.append(n_output_tokens) + + results[wl_name] = { + "input_tokens": actual_input, + "target_output_tokens": output_tokens, + "avg_output_tokens": statistics.mean(output_lengths), + "ttft_ms": percentiles(ttfts), + "tpot_ms": percentiles(tpots), + "throughput_tok_s": percentiles(throughputs), + "e2e_latency_ms": percentiles(e2e_latencies), + "raw_ttfts": ttfts, + "raw_tpots": tpots, + "raw_throughputs": throughputs, + "raw_e2e": e2e_latencies, + } + + print(f" TTFT (P50): {percentiles(ttfts)['p50']:.1f} ms") + print(f" TPOT (P50): {percentiles(tpots)['p50']:.2f} ms") + print(f" Throughput (P50): {percentiles(throughputs)['p50']:.1f} tok/s") + print(f" E2E (P50): {percentiles(e2e_latencies)['p50']:.1f} ms") + print(f" Avg output tokens: {statistics.mean(output_lengths):.0f}") + + return results + + +def benchmark_image_text(model_path, warmup, iterations, dtype): + """Benchmark image+text workload.""" + from vllm import LLM, SamplingParams + + print(f"\n{'=' * 60}") + print("Image+Text Benchmark") + print(f"{'=' * 60}") + + llm = LLM( + model=model_path, + dtype=dtype, + trust_remote_code=True, + max_model_len=4096, + gpu_memory_utilization=0.90, + limit_mm_per_prompt={"image": 1}, + ) + + sampling_params = SamplingParams(temperature=0, max_tokens=128) + + # Use a simple test prompt with image URL + image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png" + + messages = [ + { + "role": "user", + "content": [ + {"type": "image_url", "image_url": {"url": image_url}}, + {"type": "text", "text": "Describe this image in detail."}, + ], + } + ] + + # Warmup + print(f" Warming up ({warmup} runs)...") + for _ in range(warmup): + try: + llm.chat(messages, sampling_params) + except Exception as e: + print(f" Warmup error (may be expected): {e}") + return None + + # Timed iterations + print(f" Benchmarking ({iterations} runs)...") + e2e_latencies = [] + output_lengths = [] + + for i in range(iterations): + t_start = time.perf_counter() + outputs = list(llm.chat(messages, sampling_params)) + t_end = time.perf_counter() + + output = outputs[0] + n_tokens = len(output.outputs[0].token_ids) + e2e = (t_end - t_start) * 1000 + + e2e_latencies.append(e2e) + output_lengths.append(n_tokens) + + avg_tokens = statistics.mean(output_lengths) + avg_e2e = statistics.mean(e2e_latencies) + avg_tps = avg_tokens / (avg_e2e / 1000) if avg_e2e > 0 else 0 + + result = { + "avg_output_tokens": avg_tokens, + "e2e_latency_ms": percentiles(e2e_latencies), + "throughput_tok_s": avg_tps, + "text_preview": outputs[0].outputs[0].text[:150] if outputs else "", + } + + print(f" Output tokens: {avg_tokens:.0f}") + print(f" E2E (P50): {percentiles(e2e_latencies)['p50']:.1f} ms") + print(f" Throughput: {avg_tps:.1f} tok/s") + + return result + + +def get_gpu_info(): + """Get GPU information.""" + info = {} + if torch.cuda.is_available(): + info["gpu_name"] = torch.cuda.get_device_name(0) + info["gpu_count"] = torch.cuda.device_count() + props = torch.cuda.get_device_properties(0) + info["gpu_memory_gb"] = ( + getattr(props, "total_memory", getattr(props, "total_mem", 0)) / 1e9 + ) + return info + + +def main(): + parser = argparse.ArgumentParser(description="GPU benchmark for Isaac-0.2-2B") + parser.add_argument( + "--model", + default="PerceptronAI/Isaac-0.2-2B-Preview", + help="HuggingFace model ID or local path", + ) + parser.add_argument("--warmup", type=int, default=5) + parser.add_argument("--iterations", type=int, default=10) + parser.add_argument( + "--dtype", default="bfloat16", choices=["bfloat16", "float16", "auto"] + ) + parser.add_argument( + "--workloads", + nargs="+", + default=["short-short", "short-long", "long-short", "long-long"], + choices=list(WORKLOADS.keys()), + ) + parser.add_argument( + "--skip-image", action="store_true", help="Skip image+text benchmark" + ) + parser.add_argument("--output", default="gpu_benchmark_results.json") + args = parser.parse_args() + + gpu_info = get_gpu_info() + print(f"GPU: {gpu_info.get('gpu_name', 'unknown')}") + print(f"GPU Memory: {gpu_info.get('gpu_memory_gb', 0):.1f} GB") + print(f"Model: {args.model}") + print(f"dtype: {args.dtype}") + print(f"Workloads: {args.workloads}") + print(f"Warmup: {args.warmup}, Iterations: {args.iterations}") + + # Select workloads + selected = {k: WORKLOADS[k] for k in args.workloads} + + # Run text benchmarks + text_results = benchmark_vllm_offline( + args.model, selected, args.warmup, args.iterations, args.dtype + ) + + # Run image+text benchmark + image_result = None + if not args.skip_image: + image_result = benchmark_image_text( + args.model, args.warmup, args.iterations, args.dtype + ) + + # Compile all results + all_results = { + "metadata": { + "model": args.model, + "dtype": args.dtype, + "warmup": args.warmup, + "iterations": args.iterations, + "gpu": gpu_info, + "framework": "vLLM", + "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"), + }, + "text_benchmarks": text_results, + "image_text_benchmark": image_result, + } + + # Summary table + print(f"\n{'=' * 80}") + print("GPU BENCHMARK SUMMARY") + print(f"{'=' * 80}") + print( + f"{'Workload':<15} {'In':>5} {'Out':>5} {'TTFT P50':>10} {'TPOT P50':>10} " + f"{'tok/s P50':>10} {'E2E P50':>10}" + ) + print("-" * 70) + for wl_name, r in text_results.items(): + print( + f"{wl_name:<15} {r['input_tokens']:>5} {r['avg_output_tokens']:>5.0f} " + f"{r['ttft_ms']['p50']:>10.1f} {r['tpot_ms']['p50']:>10.2f} " + f"{r['throughput_tok_s']['p50']:>10.1f} {r['e2e_latency_ms']['p50']:>10.1f}" + ) + if image_result: + print( + f"{'image+text':<15} {'N/A':>5} {image_result['avg_output_tokens']:>5.0f} " + f"{'N/A':>10} {'N/A':>10} " + f"{image_result['throughput_tok_s']:>10.1f} " + f"{image_result['e2e_latency_ms']['p50']:>10.1f}" + ) + + # Save + with open(args.output, "w") as f: + json.dump(all_results, f, indent=2, default=str) + print(f"\nResults saved to {args.output}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/fix_indent.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/fix_indent.py new file mode 100644 index 00000000..442b9183 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/fix_indent.py @@ -0,0 +1,46 @@ +#!/usr/bin/env python3 +"""Remove leftover indented stubs from modular_isaac.py.""" + +import sys + +paths = ( + sys.argv[1:] + if len(sys.argv) > 1 + else [ + "/mnt/models/Isaac-0.2-2B-Preview/modular_isaac.py", + "/home/ubuntu/.cache/huggingface/modules/transformers_modules/" + "Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py", + ] +) + +INDENTED_STUBS = ( + "\n\n" + " class Event: pass\n" + " class Stream: pass\n" + " class TensorStream: pass\n" + " class TextType: pass\n" + " class VisionType: pass\n" + " def create_stream(*a, **kw): return None\n" + " def group_streams(*a, **kw): return None\n" + " def compute_mrope_pos_tensor(*a, **kw): return None\n" + " def modality_mask(*a, **kw): return None\n" + " def reconstruct_tensor_stream_from_compact_dict(*a, **kw): return None\n" + " def tensor_stream_token_view(*a, **kw): return None\n" + " def ts_slice(*a, **kw): return None" +) + +for path in paths: + try: + with open(path, "r") as f: + content = f.read() + except FileNotFoundError: + print(f"SKIP: {path}") + continue + + if INDENTED_STUBS in content: + content = content.replace(INDENTED_STUBS, "") + with open(path, "w") as f: + f.write(content) + print(f"FIXED: removed indented stubs from {path}") + else: + print(f"OK: no indented stubs found in {path}") diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/gpu_benchmark_results.json b/contrib/models/Isaac-0.2-2B/gpu_benchmark/gpu_benchmark_results.json new file mode 100644 index 00000000..5654fb81 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/gpu_benchmark_results.json @@ -0,0 +1,310 @@ +{ + "metadata": { + "model": "/home/ubuntu/Isaac-0.2-2B-Preview", + "dtype": "bfloat16", + "warmup": 5, + "iterations": 10, + "gpu": { + "gpu_name": "NVIDIA L40S", + "gpu_count": 1, + "gpu_memory_gb": 47.665709056 + }, + "framework": "vLLM", + "timestamp": "2026-05-01 02:03:04" + }, + "text_benchmarks": { + "short-short": { + "input_tokens": 128, + "target_output_tokens": 128, + "avg_output_tokens": 128, + "ttft_ms": { + "p50": 5.725635930232787, + "p95": 5.727955108527073, + "p99": 5.727955108527073 + }, + "tpot_ms": { + "p50": 5.770719677714935, + "p95": 5.773057117255633, + "p99": 5.773057117255633 + }, + "throughput_tok_s": { + "p50": 173.30585065401496, + "p95": 173.38756231629444, + "p99": 173.38756231629444 + }, + "e2e_latency_ms": { + "p50": 738.6070350000296, + "p95": 738.9062089999925, + "p99": 738.9062089999925 + }, + "raw_ttfts": [ + 5.727955108527073, + 5.7257079457365085, + 5.725635930232787, + 5.724229116279367, + 5.724252604651145, + 5.722717643411124, + 5.726226426356489, + 5.726308906976418, + 5.725303279069792, + 5.725415837209167 + ], + "raw_tpots": [ + 5.773057117255633, + 5.770792260269867, + 5.770719677714935, + 5.769301786486291, + 5.769325459805879, + 5.76777841225688, + 5.771314823414414, + 5.771397953488043, + 5.770384407251444, + 5.770497851675381 + ], + "raw_throughputs": [ + 173.22902208824354, + 173.29700910685713, + 173.2991887898751, + 173.34177962821394, + 173.34106835349465, + 173.38756231629444, + 173.28131794586687, + 173.27882203606558, + 173.3092578069886, + 173.30585065401496 + ], + "raw_e2e": [ + 738.9062089999925, + 738.6163250000095, + 738.6070350000296, + 738.4255560000383, + 738.4285859999977, + 738.230576000035, + 738.6832089999871, + 738.6938489999579, + 738.5641230000033, + 738.5786429999825 + ] + }, + "short-long": { + "input_tokens": 128, + "target_output_tokens": 512, + "avg_output_tokens": 512, + "ttft_ms": { + "p50": 5.739147309941538, + "p95": 5.742040413255263, + "p99": 5.742040413255263 + }, + "tpot_ms": { + "p50": 5.750378517984477, + "p95": 5.753277282948521, + "p99": 5.753277282948521 + }, + "throughput_tok_s": { + "p50": 173.90271333810932, + "p95": 173.90904504703894, + "p99": 173.90904504703894 + }, + "e2e_latency_ms": { + "p50": 2944.1825700000095, + "p95": 2945.6667319999497, + "p99": 2945.6667319999497 + }, + "raw_ttfts": [ + 5.739037317738782, + 5.739147309941538, + 5.739028274853798, + 5.740372360623822, + 5.739132317738796, + 5.742040413255263, + 5.7412690487330265, + 5.739214384015546, + 5.738923366471569, + 5.739001226120805 + ], + "raw_tpots": [ + 5.750268310532791, + 5.750378517984477, + 5.750259249951359, + 5.751605966026217, + 5.750363496442785, + 5.753277282948521, + 5.752504408906673, + 5.7504457233189035, + 5.75015413626897, + 5.750232148285424 + ], + "raw_throughputs": [ + 173.90559200169955, + 173.9022590572562, + 173.905866021654, + 173.8651466422899, + 173.90271333810932, + 173.81463912327226, + 173.8379918776767, + 173.900226665327, + 173.90904504703894, + 173.90668566485508 + ], + "raw_e2e": [ + 2944.126143999995, + 2944.1825700000095, + 2944.1215049999983, + 2944.811021000021, + 2944.174879000002, + 2945.6667319999497, + 2945.2710220000426, + 2944.2169789999753, + 2944.067686999915, + 2944.107628999973 + ] + }, + "long-short": { + "input_tokens": 2048, + "target_output_tokens": 128, + "avg_output_tokens": 128, + "ttft_ms": { + "p50": 6.052418666666749, + "p95": 6.054631875969768, + "p99": 6.054631875969768 + }, + "tpot_ms": { + "p50": 6.100075506561763, + "p95": 6.102306142709688, + "p99": 6.102306142709688 + }, + "throughput_tok_s": { + "p50": 163.94669556515132, + "p95": 164.06837099200277, + "p99": 164.06837099200277 + }, + "e2e_latency_ms": { + "p50": 780.7620080000106, + "p95": 781.0475120001001, + "p99": 781.0475120001001 + }, + "raw_ttfts": [ + 6.054631875969768, + 6.052931550387207, + 6.052418666666749, + 6.0518397131778165, + 6.053834852713224, + 6.052260209301938, + 6.053383906977044, + 6.051427062015693, + 6.047771767441205, + 6.0498680697673635 + ], + "raw_tpots": [ + 6.102306142709688, + 6.100592428736713, + 6.100075506561763, + 6.099491994383941, + 6.10150284367947, + 6.099915801501166, + 6.101048347189462, + 6.099076094000069, + 6.095392017578537, + 6.097504826222225 + ], + "raw_throughputs": [ + 163.88247582047683, + 163.9285119541835, + 163.94240330402738, + 163.9580869689748, + 163.90405192021308, + 163.94669556515132, + 163.91626192283178, + 163.96926738880532, + 164.06837099200277, + 164.01152067662508 + ], + "raw_e2e": [ + 781.0475120001001, + 780.8281699999498, + 780.7620080000106, + 780.6873229999383, + 780.9446960000059, + 780.74156699995, + 780.8865240000387, + 780.6340910000245, + 780.1625579999154, + 780.4329809999899 + ] + }, + "long-long": { + "input_tokens": 2048, + "target_output_tokens": 512, + "avg_output_tokens": 512, + "ttft_ms": { + "p50": 6.079098984405498, + "p95": 6.080241068226007, + "p99": 6.080241068226007 + }, + "tpot_ms": { + "p50": 6.090995459913141, + "p95": 6.092139778731342, + "p99": 6.092139778731342 + }, + "throughput_tok_s": { + "p50": 164.1837129551535, + "p95": 164.20382107301052, + "p99": 164.20382107301052 + }, + "e2e_latency_ms": { + "p50": 3118.57777900002, + "p95": 3119.163667999942, + "p99": 3119.163667999942 + }, + "raw_ttfts": [ + 6.0786242748539125, + 6.078865341130544, + 6.079913033138502, + 6.0791121345028465, + 6.078488198830433, + 6.079595109161773, + 6.078705553606205, + 6.080241068226007, + 6.079098984405498, + 6.078120933723229 + ], + "raw_tpots": [ + 6.090519821380045, + 6.090761359410643, + 6.091811101696503, + 6.091008635744536, + 6.090383479062978, + 6.091492555559351, + 6.090601259190561, + 6.092139778731342, + 6.090995459913141, + 6.090015495237364 + ], + "raw_throughputs": [ + 164.19022415811258, + 164.1837129551535, + 164.15542078009074, + 164.17704759822297, + 164.19389980114536, + 164.16400505967496, + 164.18802876035227, + 164.14656443094012, + 164.1774027403524, + 164.20382107301052 + ], + "raw_e2e": [ + 3118.3342530000573, + 3118.457919999969, + 3118.9953860000514, + 3118.58452499996, + 3118.264446000012, + 3118.8322909999897, + 3118.375948999983, + 3119.163667999942, + 3118.57777900002, + 3118.076039000016 + ] + } + }, + "image_text_benchmark": null +} \ No newline at end of file diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py new file mode 100644 index 00000000..01ac91b4 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py @@ -0,0 +1,97 @@ +#!/usr/bin/env python3 +"""Remove perceptron.tensorstream import entirely from modular_isaac.py. +Replaces the try/except import block with direct stub definitions.""" + +import sys + +paths = ( + sys.argv[1:] + if len(sys.argv) > 1 + else [ + "/mnt/models/Isaac-0.2-2B-Preview/modular_isaac.py", + "/home/ubuntu/.cache/huggingface/modules/transformers_modules/" + "Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py", + ] +) + +# Replacement: just the stubs, no try/except, no import +REPLACEMENT = """# perceptron.tensorstream stubs (not available outside Perceptron environment) +class Event: pass +class Stream: pass +class TensorStream: pass +class TextType: pass +class VisionType: pass +def create_stream(*a, **kw): return None +def group_streams(*a, **kw): return None +def compute_mrope_pos_tensor(*a, **kw): return None +def modality_mask(*a, **kw): return None +def reconstruct_tensor_stream_from_compact_dict(*a, **kw): return None +def tensor_stream_token_view(*a, **kw): return None +def ts_slice(*a, **kw): return None""" + +for path in paths: + try: + with open(path, "r") as f: + lines = f.readlines() + except FileNotFoundError: + print(f"SKIP: {path} not found") + continue + + # Find the try block that imports from perceptron + try_start = None + except_end = None + in_except = False + + for i, line in enumerate(lines): + if ( + line.strip() == "try:" + and i + 1 < len(lines) + and "perceptron" in lines[i + 1] + ): + try_start = i + if try_start is not None and line.strip().startswith( + "except ModuleNotFoundError" + ): + in_except = True + if in_except and try_start is not None: + # Find end of except block (next non-indented, non-blank line after except body) + if i > try_start + 5: # we're past the except line itself + # Check if this line is NOT indented (new top-level statement) + stripped = line.strip() + if ( + stripped + and not line.startswith(" ") + and not line.startswith("\t") + and "def " not in lines[i - 1] + if i > 0 + else True + ): + # But also check it's not a continuation of the except body + pass + + # Simpler approach: find by content markers + content = "".join(lines) + + # Pattern 1: Original unpatched try/except + import re + + # Match everything from "try:\n from perceptron" to the end of the except block + pattern = r"try:\n from perceptron\.tensorstream\.tensorstream import \(.*?\n(?:.*?\n)*?except ModuleNotFoundError.*?\n(?: .*\n)*" + match = re.search(pattern, content) + if match: + old_block = match.group(0) + # Remove trailing newlines from old_block to be precise + content = content.replace(old_block, REPLACEMENT + "\n\n") + with open(path, "w") as f: + f.write(content) + print(f"SUCCESS: Replaced try/import block in {path}") + else: + # Check if already replaced + if "# perceptron.tensorstream stubs" in content: + print(f"ALREADY PATCHED: {path}") + else: + print(f"WARN: Could not find try/import block in {path}") + # Show perceptron references + for i, line in enumerate(lines): + if "perceptron" in line.lower(): + print(f" Line {i + 1}: {line.rstrip()}") diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/patch_gpu_modular.py b/contrib/models/Isaac-0.2-2B/gpu_benchmark/patch_gpu_modular.py new file mode 100644 index 00000000..43b1457f --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/patch_gpu_modular.py @@ -0,0 +1,86 @@ +#!/usr/bin/env python3 +"""Patch modular_isaac.py on GPU to handle missing imports.""" + +import sys + +path = ( + sys.argv[1] + if len(sys.argv) > 1 + else ( + "/home/ubuntu/.cache/huggingface/modules/transformers_modules/" + "Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py" + ) +) + +with open(path, "r") as f: + content = f.read() + +fixes = 0 + +# Fix 1: DefaultFastImageProcessorKwargs +old1 = ( + "from transformers.image_processing_utils_fast import (\n" + " BaseImageProcessorFast,\n" + " DefaultFastImageProcessorKwargs,\n" + " SizeDict,\n" + " group_images_by_shape,\n" + " reorder_images,\n" + ")" +) +new1 = ( + "from transformers.image_processing_utils_fast import (\n" + " BaseImageProcessorFast,\n" + " SizeDict,\n" + " group_images_by_shape,\n" + " reorder_images,\n" + ")\n" + "try:\n" + " from transformers.image_processing_utils_fast import DefaultFastImageProcessorKwargs\n" + "except ImportError:\n" + " from typing import TypedDict\n" + " class DefaultFastImageProcessorKwargs(TypedDict, total=False):\n" + " pass" +) +if old1 in content: + content = content.replace(old1, new1) + fixes += 1 + print("Fix 1 applied: DefaultFastImageProcessorKwargs") +else: + print("Fix 1: not found (may already be patched)") + +# Fix 2: perceptron soft-fail +old2 = ( + "except ModuleNotFoundError as exc: # pragma: no cover - import guard\n" + " raise ModuleNotFoundError(\n" + ' "perceptron.tensorstream is required for the Isaac HuggingFace integration. "\n' + ' "Ensure the TensorStream package is installed and on PYTHONPATH."\n' + " ) from exc" +) +new2 = ( + "except ModuleNotFoundError: # pragma: no cover - import guard\n" + " import warnings as _warnings\n" + ' _warnings.warn("perceptron.tensorstream not available; TensorStream features disabled")\n' + "\n" + " class Event: pass\n" + " class Stream: pass\n" + " class TensorStream: pass\n" + " class TextType: pass\n" + " class VisionType: pass\n" + " def create_stream(*a, **kw): return None\n" + " def group_streams(*a, **kw): return None\n" + " def compute_mrope_pos_tensor(*a, **kw): return None\n" + " def modality_mask(*a, **kw): return None\n" + " def reconstruct_tensor_stream_from_compact_dict(*a, **kw): return None\n" + " def tensor_stream_token_view(*a, **kw): return None\n" + " def ts_slice(*a, **kw): return None" +) +if old2 in content: + content = content.replace(old2, new2) + fixes += 1 + print("Fix 2 applied: perceptron soft-fail") +else: + print("Fix 2: not found (may already be patched)") + +with open(path, "w") as f: + f.write(content) +print(f"Done: {fixes} fixes applied to {path}") diff --git a/contrib/models/Isaac-0.2-2B/gpu_benchmark/setup_gpu.sh b/contrib/models/Isaac-0.2-2B/gpu_benchmark/setup_gpu.sh new file mode 100644 index 00000000..47549197 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/gpu_benchmark/setup_gpu.sh @@ -0,0 +1,32 @@ +#!/bin/bash +# Setup script for GPU benchmark of Isaac-0.2-2B +# Run on a fresh GPU DLAMI (g6e.xlarge with L40S) +# +# Usage: +# bash setup_gpu.sh + +set -e + +echo "=== Isaac GPU Benchmark Setup ===" + +# Use the PyTorch 2.7 virtual environment from DLAMI +echo "Setting up Python environment..." +source /opt/dlami/nvme/pytorch-2.7/bin/activate 2>/dev/null || { + echo "DLAMI venv not found, using system Python..." + python3 -m venv ~/gpu_bench_env + source ~/gpu_bench_env/bin/activate +} + +# Install vLLM and dependencies +echo "Installing vLLM..." +pip install -U vllm transformers torch pillow requests 2>&1 | tail -5 + +# Download model (Isaac requires trust_remote_code) +echo "Downloading Isaac-0.2-2B-Preview..." +pip install -U "huggingface_hub[cli]" 2>&1 | tail -3 +huggingface-cli download PerceptronAI/Isaac-0.2-2B-Preview --local-dir ~/Isaac-0.2-2B-Preview + +echo "" +echo "=== Setup complete ===" +echo "To run benchmark:" +echo " python benchmark_gpu.py --model ~/Isaac-0.2-2B-Preview" diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/__init__.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/__init__.py new file mode 100644 index 00000000..667cd6a4 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/__init__.py @@ -0,0 +1,23 @@ +# Copyright 2025 © Amazon.com and Affiliates + +from .modeling_isaac import ( + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) +from .modeling_isaac_vision import ( + NeuronIsaacVisionModel, + NeuronIsaacMultiModalProjector, + IsaacVisionModelWrapper, +) +from .modeling_isaac_text import ( + NeuronIsaacTextModel, +) + +__all__ = [ + "NeuronIsaacForConditionalGeneration", + "IsaacInferenceConfig", + "NeuronIsaacVisionModel", + "NeuronIsaacMultiModalProjector", + "IsaacVisionModelWrapper", + "NeuronIsaacTextModel", +] diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac.py new file mode 100644 index 00000000..826acb54 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac.py @@ -0,0 +1,624 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Isaac NxDI orchestrator: VLM model combining vision encoder and Qwen3 text decoder. + +Isaac-0.2-2B-Preview architecture: +- Vision: SigLIP2 (27 layers) -> pixel shuffle (2x2) -> 2-layer MLP projector +- Text: Qwen3 (28 layers, 2048 hidden, GQA 16/8) +- mRoPE: interleaved, section=(2,1,1) weighting -> ~[32,16,16] +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import copy # noqa: E402 +import logging # noqa: E402 +from typing import Any, Callable, Dict, List, Optional, Tuple, Type, Union # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +import torch.nn.utils.rnn as rnn_utils # noqa: E402 +from transformers.modeling_outputs import CausalLMOutputWithPast # noqa: E402 + +import neuronx_distributed_inference.modules.autobucketing as autobucketing # noqa: E402 +from neuronx_distributed_inference.models.config import InferenceConfig, NeuronConfig # noqa: E402 +from neuronx_distributed_inference.models.image_to_text_model_base import ( # noqa: E402 + ImageToTextInferenceConfig, + NeuronBaseForImageToText, +) +from neuronx_distributed_inference.models.image_to_text_model_wrapper import ( # noqa: E402 + ImageToTextModelWrapper, + IMAGE_TO_TEXT_MODEL_WRAPPER_INPUT_KEYS, +) +from neuronx_distributed_inference.models.llama4.utils.encoder_utils import ( # noqa: E402 + pad_vision_embeddings, +) +from neuronx_distributed_inference.models.model_wrapper import ( # noqa: E402 + CONTEXT_ENCODING_MODEL_TAG, + TOKEN_GENERATION_MODEL_TAG, + VISION_ENCODER_MODEL_TAG, +) +from neuronx_distributed_inference.modules.flashdecode.utils import ( # noqa: E402 + calculate_num_cores_per_group, +) + +from isaac_neuron.modeling_isaac_text import NeuronIsaacTextModel # noqa: E402 +from isaac_neuron.modeling_isaac_vision import ( # noqa: E402 + NeuronIsaacVisionModel, + IsaacVisionModelWrapper, +) +from isaac_neuron.utils import convert_state_dict_to_fused_qkv, StateDict # noqa: E402 + +logger = logging.getLogger("Neuron") + + +class IsaacInferenceConfig(ImageToTextInferenceConfig): + """Isaac-specific inference configuration. + + Extends ImageToTextInferenceConfig with: + - pixel_shuffle_scale from model config + - projector_intermediate_size from model config + - Isaac-specific required attributes + """ + + def __init__( + self, + text_neuron_config, + vision_neuron_config, + fused_spec_config=None, + load_config=None, + metadata: Optional[Dict] = None, + **kwargs, + ): + super().__init__( + text_neuron_config=text_neuron_config, + vision_neuron_config=vision_neuron_config, + fused_spec_config=fused_spec_config, + load_config=load_config, + metadata=metadata, + **kwargs, + ) + + # Isaac uses hidden_act for the text model MLP (SiLU) + if not hasattr(self.text_config, "hidden_act"): + self.text_config.hidden_act = "silu" + + # Isaac's SigLIP2 encoder does NOT use a pooling head + # (no head weights in the checkpoint; features go to pixel shuffle + MLP projector) + if not hasattr(self.vision_config, "vision_use_head"): + self.vision_config.vision_use_head = False + + # Extract Isaac-specific config values + # pixel_shuffle_scale is in the vision_config or top-level config + if not hasattr(self, "pixel_shuffle_scale"): + self.pixel_shuffle_scale = getattr( + self.vision_config, "pixel_shuffle_scale", 2 + ) + + # Projector intermediate size + if not hasattr(self, "projector_intermediate_size"): + vision_hidden = self.vision_config.hidden_size # 1152 + self.projector_intermediate_size = ( + vision_hidden * (self.pixel_shuffle_scale**2) * 4 + ) # 18432 + + # Validation + if self.text_config.neuron_config.is_block_kv_layout: + raise ValueError("Isaac does not yet support block_kv_layout.") + if self.text_config.neuron_config.is_prefix_caching: + raise ValueError("Isaac does not yet support prefix_caching.") + if self.text_config.neuron_config.is_chunked_prefill: + raise ValueError("Isaac does not yet support chunked_prefill.") + if self.text_config.neuron_config.is_medusa: + raise ValueError("Isaac does not yet support medusa.") + if self.text_config.neuron_config.enable_fused_speculation: + raise ValueError("Isaac does not yet support fused speculation.") + + if self.neuron_config.flash_decoding_enabled: + num_attn_heads = self.text_config.num_attention_heads + num_kv_heads = self.text_config.num_key_value_heads + num_attn_heads = ( + num_attn_heads // self.neuron_config.tp_degree + 1 + ) * self.neuron_config.tp_degree + self.text_config.num_cores_per_group = calculate_num_cores_per_group( + num_attn_heads, num_kv_heads, self.neuron_config.tp_degree + ) + + def get_required_attributes(self) -> List[str]: + return [ + "text_config", + "vision_config", + "text_config.hidden_size", + "text_config.num_attention_heads", + "text_config.num_hidden_layers", + "text_config.num_key_value_heads", + "text_config.head_dim", + "text_config.rope_theta", + "text_config.rms_norm_eps", + "vision_config.hidden_size", + "vision_config.image_size", + "vision_config.num_attention_heads", + "vision_config.num_hidden_layers", + "vision_config.patch_size", + ] + + @classmethod + def get_neuron_config_cls(cls) -> Type[NeuronConfig]: + return NeuronConfig + + +class NeuronIsaacForConditionalGeneration(NeuronBaseForImageToText): + """Isaac VLM orchestrator for NxDI. + + Combines: + - NeuronIsaacVisionModel (SigLIP2 + pixel shuffle + MLP projector) + - NeuronIsaacTextModel (Qwen3 decoder) + - ImageToTextModelWrapper (text model tracing wrapper) + - IsaacVisionModelWrapper (vision model tracing wrapper) + """ + + # Model classes + text_model_cls = NeuronIsaacTextModel + vision_model_cls = NeuronIsaacVisionModel + + # Model wrappers + text_model_wrapper = ImageToTextModelWrapper + vision_model_wrapper = IsaacVisionModelWrapper + + def __init__(self, *args, **kwargs): + super().__init__( + self.text_model_cls, + self.vision_model_cls, + self.text_model_wrapper, + self.vision_model_wrapper, + *args, + **kwargs, + ) + + @classmethod + def get_config_cls(cls): + return IsaacInferenceConfig + + def enable_vision_encoder( + self, enable_wlt_optimization: bool = True, **model_init_kwargs + ): + """Enable and configure the vision encoder for compilation.""" + self.compile_tag = VISION_ENCODER_MODEL_TAG + + new_config = copy.deepcopy(self.config) + if new_config.vision_config.neuron_config.enable_bucketing: + if ( + new_config.vision_config.neuron_config.buckets + == [new_config.vision_config.neuron_config.seq_len] + or new_config.vision_config.neuron_config.buckets is None + ): + if new_config.vision_config.neuron_config.seq_len > 1024: + new_config.vision_config.neuron_config.buckets = ( + autobucketing.generate_buckets( + 1024, new_config.vision_config.neuron_config.seq_len + ) + ) + else: + new_config.vision_config.neuron_config.buckets = [ + new_config.vision_config.neuron_config.seq_len + ] + + new_config.neuron_config = copy.deepcopy(new_config.vision_config.neuron_config) + + self.vision_encoder_model = self.vision_model_wrapper( + config=new_config, + model_cls=self.vision_model_cls, + tag=VISION_ENCODER_MODEL_TAG, + compiler_args=self.get_compiler_args(), + model_init_kwargs=model_init_kwargs, + priority_model_idx=(0 if enable_wlt_optimization else None), + pipeline_execution=True, + return_ranked_to_cpu=True, + ) + self.vision_models.append(self.vision_encoder_model) + + @staticmethod + def update_state_dict_for_tied_weights(state_dict: StateDict) -> None: + """Isaac ties embed_tokens and lm_head weights.""" + try: + state_dict["lm_head.weight"] = state_dict["embed_tokens.weight"].clone() + except KeyError: + state_dict["embed_tokens.weight"] = state_dict["lm_head.weight"].clone() + + @staticmethod + def convert_hf_to_neuron_state_dict( + state_dict: StateDict, inference_config: InferenceConfig + ) -> StateDict: + """Convert HuggingFace Isaac state dict to NxDI format. + + NOTE: The base class ApplicationBase.get_state_dict strips the leading + "model." prefix BEFORE calling this method. So incoming keys are: + - text_model.embed_tokens.weight (was model.text_model.embed_tokens.weight) + - text_model.layers.{i}.* (was model.text_model.layers.{i}.*) + - text_model.norm.weight (was model.text_model.norm.weight) + - lm_head.weight (unchanged) + - vision_embedding.0.* (was model.vision_embedding.0.*) + - vision_embedding.1.weight (was model.vision_embedding.1.weight) + - vision_embedding.3.weight (was model.vision_embedding.3.weight) + - rotary_emb.* (was model.rotary_emb.*) + + Key mappings applied here: + - text_model.* -> * (strip text_model prefix) + - vision_embedding.0.* -> vision_encoder.vision_encoder.vision_model.* + - vision_embedding.1.weight -> vision_encoder.multi_modal_projector.fc1.weight + - vision_embedding.3.weight -> vision_encoder.multi_modal_projector.fc2.weight + - rotary_emb.* -> skipped + + Also renames attention keys for NxDI format: + - .self_attn.q_proj. -> .self_attn.qkv_proj.q_proj. + - .self_attn.k_proj. -> .self_attn.qkv_proj.k_proj. + - .self_attn.v_proj. -> .self_attn.qkv_proj.v_proj. + - .self_attn.o_proj. -> .self_attn.o_proj.o_proj. + - .self_attn.q_norm. -> .self_attn.q_layernorm. + - .self_attn.k_norm. -> .self_attn.k_layernorm. + """ + neuron_config = inference_config.neuron_config + + attention_keys = { + ".self_attn.q_proj.": ".self_attn.qkv_proj.q_proj.", + ".self_attn.k_proj.": ".self_attn.qkv_proj.k_proj.", + ".self_attn.v_proj.": ".self_attn.qkv_proj.v_proj.", + ".self_attn.o_proj.": ".self_attn.o_proj.o_proj.", + ".self_attn.out_proj.": ".self_attn.o_proj.o_proj.", # for siglip + ".self_attn.q_norm.": ".self_attn.q_layernorm.", + ".self_attn.k_norm.": ".self_attn.k_layernorm.", + } + + new_state_dict = {} + for key, weights in state_dict.items(): + new_key = key + + # Text model weights: text_model.* -> * + # (base class already stripped leading "model." prefix) + if new_key.startswith("text_model."): + new_key = new_key.replace("text_model.", "", 1) + # Rename attention keys + for attn_key, replacement in attention_keys.items(): + if attn_key in new_key: + new_key = new_key.replace(attn_key, replacement) + break + + # LM head: lm_head.weight -> lm_head.weight (no change) + # (already handled by tied weights) + + # Vision encoder: vision_embedding.0.* -> vision_encoder.vision_model.* + # NeuronIsaacVisionModel.vision_encoder = NeuronSiglipVisionModel + # NeuronSiglipVisionModel.vision_model = NeuronSiglipVisionTransformer + elif new_key.startswith("vision_embedding.0."): + new_key = new_key.replace( + "vision_embedding.0.", + "vision_encoder.vision_model.", + 1, + ) + # Rename attention keys for vision encoder + for attn_key, replacement in attention_keys.items(): + if attn_key in new_key: + new_key = new_key.replace(attn_key, replacement) + break + + # MLP projector fc1: vision_embedding.1.weight + elif new_key == "vision_embedding.1.weight": + new_key = "multi_modal_projector.fc1.weight" + + # MLP projector fc2: vision_embedding.3.weight + elif new_key == "vision_embedding.3.weight": + new_key = "multi_modal_projector.fc2.weight" + + # Skip rotary_emb (handled by NxDI internally) + elif new_key.startswith("rotary_emb"): + continue + + new_state_dict[new_key] = weights + + # Reshape patch_embedding weight from HF 2D [out_ch, in_ch*kH*kW] to Conv2d 4D + patch_key = "vision_encoder.vision_model.embeddings.patch_embedding.weight" + if patch_key in new_state_dict: + w = new_state_dict[patch_key] + if w.dim() == 2: + patch_size = inference_config.vision_config.patch_size + num_channels = inference_config.vision_config.num_channels + out_channels = w.shape[0] + new_state_dict[patch_key] = w.reshape( + out_channels, num_channels, patch_size, patch_size + ) + + # Add lm_head.bias if needed for LNC > 1 + if ( + "lm_head.bias" not in new_state_dict + and inference_config.neuron_config.lm_head_pad + ): + new_state_dict["lm_head.bias"] = torch.zeros( + new_state_dict["embed_tokens.weight"].shape[0], + dtype=torch.float32, + ) + + # Fuse QKV for text model + if inference_config.text_config.neuron_config.fused_qkv: + new_state_dict = convert_state_dict_to_fused_qkv( + state_dict=new_state_dict, + num_layers=inference_config.text_config.num_hidden_layers, + neuron_config=inference_config.text_config.neuron_config, + prefix="layers.{layer_num}.self_attn", + ) + + # Fuse QKV for vision model + if inference_config.vision_config.neuron_config.fused_qkv: + new_state_dict = convert_state_dict_to_fused_qkv( + state_dict=new_state_dict, + num_layers=inference_config.vision_config.num_hidden_layers, + neuron_config=inference_config.vision_config.neuron_config, + prefix="vision_encoder.vision_model.encoder.layers.{layer_num}.self_attn", + ) + + # Add rank utilities + if neuron_config.vocab_parallel: + new_state_dict["embed_tokens.rank_util.rank"] = torch.arange( + 0, neuron_config.local_ranks_size + ) + + tp_degree = neuron_config.tp_degree + for i in range(inference_config.text_config.num_hidden_layers): + new_state_dict[f"layers.{i}.self_attn.rank_util.rank"] = torch.arange( + 0, tp_degree, dtype=torch.int32 + ) + + new_state_dict["rank_util.rank"] = torch.arange(0, tp_degree, dtype=torch.int32) + + return new_state_dict + + @staticmethod + def _convert_input_dict_to_ordered_tuple(input_dict: Dict[str, Any]): + """Convert input dictionary to ordered tuple for model wrapper.""" + args = [] + for key in IMAGE_TO_TEXT_MODEL_WRAPPER_INPUT_KEYS: + if key in input_dict and input_dict[key] is not None: + arg = input_dict[key] + else: + arg = torch.empty(0) + args.append(arg) + return tuple(args) + + def _select_buckets_for_padding_length(self, position_ids): + """Select appropriate buckets based on whether prefill or decode.""" + neuron_config = self.config.neuron_config + context_encoding_buckets = ( + neuron_config.context_encoding_buckets + if neuron_config.context_encoding_buckets is not None + else neuron_config.buckets + ) + token_generation_buckets = ( + neuron_config.token_generation_buckets + if neuron_config.token_generation_buckets is not None + else neuron_config.buckets + ) + + if self._is_prefill(position_ids): + return context_encoding_buckets + return token_generation_buckets + + @staticmethod + def get_padding_length(buckets, position_ids): + """Find the smallest bucket that fits the input.""" + max_position_id = torch.max(position_ids).item() + for val in buckets: + if val > max_position_id: + return val + raise ValueError("No bucket found for provided input_ids!") + + @staticmethod + def get_required_kwargs() -> List[str]: + """Additional kwargs for HuggingFaceGenerationAdapter.""" + return [ + "pixel_values", + "vision_mask", + ] + + @staticmethod + def generate_positions_from_mask(mask: torch.Tensor) -> torch.Tensor: + """Generate position indices from a boolean vision mask.""" + if mask.dim() == 1: + return torch.nonzero(mask).squeeze() + else: + rows, cols = torch.nonzero(mask, as_tuple=True) + row_counts = torch.bincount(rows, minlength=mask.shape[0]) + cols_per_row = torch.split(cols, row_counts.tolist()) + return rnn_utils.pad_sequence( + cols_per_row, batch_first=True, padding_value=0 + ) + + @staticmethod + def pad_positions( + positions: torch.LongTensor, target_size: int, fill_value: float + ) -> torch.LongTensor: + """Pad positions tensor to target size.""" + positions_2d = positions.unsqueeze(0) if positions.dim() == 1 else positions + padding_size = target_size - positions_2d.shape[1] + assert padding_size >= 0, ( + "Text model sequence length is not enough to handle all vision embeddings" + ) + positions_padded = F.pad(positions_2d, (0, padding_size), value=fill_value) + return positions_padded.unsqueeze(-1) + + @staticmethod + def _create_position_ids( + attention_mask_2d: torch.LongTensor, is_prefill: bool + ) -> torch.LongTensor: + """Create position IDs from attention mask.""" + position_ids = attention_mask_2d.long().cumsum(-1) - 1 + position_ids.masked_fill_(attention_mask_2d == 0, 1) + if is_prefill: + return position_ids + else: + return torch.amax(position_ids, dim=1, keepdim=True) + 1 + + def forward( + self, + input_ids: torch.LongTensor = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + seq_ids: Optional[torch.LongTensor] = None, + sampling_params: Optional[torch.FloatTensor] = None, + pixel_values: Optional[torch.FloatTensor] = None, + vision_mask: Optional[torch.FloatTensor] = None, + image_sizes: Optional[torch.FloatTensor] = None, + adapter_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + use_cache: Optional[bool] = None, + medusa_args=None, + input_capture_hook: Optional[Callable] = None, + tensor_capture_hook: Optional[Callable] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, CausalLMOutputWithPast]: + """Forward pass combining vision encoder and text decoder.""" + is_prefill = input_ids.shape[-1] > 1 + include_images = ( + pixel_values is not None + and vision_mask is not None + and pixel_values.sum() != 0 + ) + + if position_ids is None: + position_ids = self._create_position_ids( + attention_mask_2d=attention_mask, is_prefill=is_prefill + ) + + buckets = self._select_buckets_for_padding_length(position_ids=position_ids) + pad_target_size = self.get_padding_length( + buckets=buckets, position_ids=position_ids + ) + pad_fill_value = pad_target_size - 1 + + if is_prefill and include_images: + assert vision_mask.dtype == torch.bool, ( + f"vision_mask must be bool, got {vision_mask.dtype}" + ) + + # Run vision encoder + vision_embeddings = self.vision_encoder_model( + pixel_values.to(self.vision_config.neuron_config.torch_dtype), + ).to(self.text_config.neuron_config.torch_dtype) + + # Flatten vision embeddings for multi-image support + batch_sz = 1 if vision_mask.dim() == 1 else vision_mask.shape[0] + num_images, seq_len, embedding_dim = vision_embeddings.shape + img_per_sample = num_images // batch_sz + vision_embeddings = vision_embeddings.view( + batch_sz, img_per_sample * seq_len, embedding_dim + ) + + # Pad to bucket size + vision_embeddings = pad_vision_embeddings( + vision_embeddings=vision_embeddings, pad_limit=pad_target_size + ) + + # Create scatter positions from vision mask + vision_mask = self.generate_positions_from_mask(mask=vision_mask.squeeze()) + vision_mask = self.pad_positions( + positions=vision_mask, + target_size=pad_target_size, + fill_value=pad_fill_value, + ) + else: + # Text-only or token generation -> dummy vision inputs + vision_embeddings, vision_mask = ( + self.context_encoding_model.get_dummy_vision_inputs( + config=self.text_config, + input_ids=input_ids, + n_active_tokens=pad_target_size, + fill_value=pad_fill_value, + ) + ) + + return super().forward( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + seq_ids=seq_ids, + sampling_params=sampling_params, + vision_embeddings=vision_embeddings, + vision_mask=vision_mask, + ) + + def enable_token_generation(self): + self.compile_tag = TOKEN_GENERATION_MODEL_TAG + super().enable_token_generation() + + def enable_context_encoding(self): + self.compile_tag = CONTEXT_ENCODING_MODEL_TAG + super().enable_context_encoding() + + def get_compiler_args(self) -> str: + """Get compiler arguments based on compilation phase.""" + logical_nc_config = self.text_config.neuron_config.logical_nc_config + + if self.compile_tag == CONTEXT_ENCODING_MODEL_TAG: + optimization_level = "-O1" + elif self.compile_tag == TOKEN_GENERATION_MODEL_TAG: + optimization_level = "-O2" + elif self.compile_tag == VISION_ENCODER_MODEL_TAG: + return ( + f"-O1 --model-type=transformer " + f"--tensorizer-options='--enable-ccop-compute-overlap' " + f"--auto-cast=none --lnc={logical_nc_config}" + ) + else: + raise ValueError( + f"get_compiler_args() Invalid compile tag: {self.compile_tag}" + ) + + args = ( + f"--auto-cast=none --model-type=transformer " + f"--tensorizer-options='--enable-ccop-compute-overlap " + f"--cc-pipeline-tiling-factor=1 --vectorize-strided-dma " + f"--enable-scalar-dge-vectorization' " + f"--lnc={logical_nc_config} {optimization_level} " + ) + return args + + def _get_constructed_outputs(self, outputs, is_run_on_neuron): + """Process model outputs into the expected format.""" + if ( + self.on_device_sampling + and self.text_config.neuron_config.output_logits + and not ( + self.text_config.neuron_config.enable_fused_speculation + or self.text_config.neuron_config.is_medusa + ) + ): + logits_or_next_tokens = outputs[:2] + constructed_outputs = self._construct_output_with_tokens_and_logits( + next_tokens=logits_or_next_tokens[0], + logits=logits_or_next_tokens[1], + ) + else: + if is_run_on_neuron: + logits_or_next_tokens = ( + outputs[0] if isinstance(outputs, (list, tuple)) else outputs + ) + else: + logits_or_next_tokens, *_ = outputs + constructed_outputs = self._construct_output(logits_or_next_tokens) + + if logging.root.isEnabledFor(logging.DEBUG): + logging.debug("---output---") + logging.debug( + f"{'tokens' if self.on_device_sampling else 'logits'} = %s", + logits_or_next_tokens, + ) + + return constructed_outputs + + @staticmethod + def load_hf_model(model_path, **kwargs): + """Load the HuggingFace Isaac model for weight extraction.""" + from transformers import AutoModelForCausalLM + + model = AutoModelForCausalLM.from_pretrained( + model_path, trust_remote_code=True, **kwargs + ).eval() + return model diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_text.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_text.py new file mode 100644 index 00000000..52f861ef --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_text.py @@ -0,0 +1,576 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Isaac text model for NxDI: Qwen3 decoder layers adapted for VLM. + +Isaac's text backbone is a standard Qwen3 model (28 layers, 2048 hidden, GQA 16/8 heads). +This module wraps Qwen3 decoder layers in the NeuronBaseModel VLM pattern, supporting: +- Vision embedding injection via scatter_by_index_put +- Standard NxDI KV cache management +- On-device sampling +""" + +import logging +from typing import Optional, Tuple + +import torch +import torch.nn as nn + +from neuronx_distributed.parallel_layers.layers import ( + ColumnParallelLinear, + ParallelEmbedding, +) +from neuronx_distributed.parallel_layers.mappings import _gather_along_dim +from neuronx_distributed.utils import cpu_mode +from neuronx_distributed_inference.models.config import InferenceConfig +from neuronx_distributed_inference.models.model_base import NeuronBaseModel +from neuronx_distributed_inference.models.llama.modeling_llama import NeuronLlamaMLP +from neuronx_distributed_inference.modules.attention.attention_base import ( + NeuronAttentionBase, + QKNormPlacement, +) +from neuronx_distributed_inference.modules.attention.utils import RotaryEmbedding +from neuronx_distributed_inference.modules.custom_calls import CustomRMSNorm +from neuronx_distributed_inference.modules.flashdecode.utils import ( + get_cache_size, + mask_util, + turn_2d_mask_to_4d, +) +from neuronx_distributed_inference.modules.generation.sampling import ( + Sampler, + mask_padded_logits, +) +from neuronx_distributed_inference.modules.kvcache.kv_cache_manager import ( + KVCacheManager, +) +from neuronx_distributed_inference.modules.kvcache.block_kv_cache_manager import ( + generate_tokengen_slot_mapping, +) +from neuronx_distributed_inference.modules.custom_calls import neuron_cumsum +from neuronx_distributed_inference.utils.distributed import get_tp_group + +# Use HF Qwen3RMSNorm for CPU, CustomRMSNorm for Neuron +from transformers.models.qwen3.modeling_qwen3 import Qwen3RMSNorm + +logger = logging.getLogger("Neuron") + + +def get_rmsnorm_cls(): + """Return appropriate RMSNorm class based on execution mode.""" + return Qwen3RMSNorm if cpu_mode() else CustomRMSNorm + + +class NeuronIsaacAttention(NeuronAttentionBase): + """Isaac attention: standard Qwen3 GQA with QK normalization. + + Qwen3 applies QK norm BEFORE RoPE (pre-rope), same as NxDI built-in Qwen3. + Config: 16 attention heads, 8 KV heads, head_dim=128, rope_theta=1M + """ + + def __init__(self, config: InferenceConfig): + head_dim = getattr( + config, "head_dim", config.hidden_size // config.num_attention_heads + ) + rotary_emb = RotaryEmbedding( + dim=head_dim, + max_position_embeddings=config.max_position_embeddings, + base=config.rope_theta, + ) + + super().__init__( + config=config, + hidden_size=config.hidden_size, + num_attention_heads=config.num_attention_heads, + num_key_value_heads=config.num_key_value_heads, + head_dim=head_dim, + rotary_emb=rotary_emb, + num_cores_per_group=getattr(config, "num_cores_per_group", 1), + rms_norm_eps=config.rms_norm_eps, + qk_norm_placement=QKNormPlacement.PRE_ROPE, + q_layernorm=get_rmsnorm_cls()( + hidden_size=head_dim, eps=config.rms_norm_eps + ), + k_layernorm=get_rmsnorm_cls()( + hidden_size=head_dim, eps=config.rms_norm_eps + ), + ) + + +class NeuronIsaacDecoderLayer(nn.Module): + """Isaac decoder layer: Qwen3 architecture (RMSNorm -> Attn -> RMSNorm -> MLP). + + Identical to NeuronQwen3DecoderLayer from NxDI built-in, but adapted + for the VLM text model pattern. + """ + + def __init__(self, config: InferenceConfig): + super().__init__() + self.config = config + self.neuron_config = config.neuron_config + self.hidden_size = config.hidden_size + + self.self_attn = NeuronIsaacAttention(config) + self.mlp = NeuronLlamaMLP(config) # Qwen3 MLP is compatible with LlamaMLP + + self.input_layernorm = get_rmsnorm_cls()( + config.hidden_size, eps=config.rms_norm_eps + ) + self.post_attention_layernorm = get_rmsnorm_cls()( + config.hidden_size, eps=config.rms_norm_eps + ) + + # Kernel enablement flags + self.qkv_kernel_enabled = config.neuron_config.qkv_kernel_enabled + self.mlp_kernel_enabled = config.neuron_config.mlp_kernel_enabled + self.quantized_mlp_kernel_enabled = ( + config.neuron_config.quantized_mlp_kernel_enabled + ) + self.rmsnorm_quantize_kernel_enabled = ( + config.neuron_config.rmsnorm_quantize_kernel_enabled + ) + self.sequence_parallel_enabled = config.neuron_config.sequence_parallel_enabled + + # Fused rmsnorm only when sequence parallelism is disabled + self.qkv_kernel_fused_rmsnorm = not self.sequence_parallel_enabled + self.mlp_kernel_fused_rmsnorm = not self.sequence_parallel_enabled + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_value: Optional[Tuple[torch.Tensor]] = None, + adapter_ids=None, + **kwargs, + ) -> Tuple[torch.FloatTensor, ...]: + residual = hidden_states + + # QKV kernel fusion with RMSNorm + if self.qkv_kernel_enabled and self.qkv_kernel_fused_rmsnorm: + qkv_fused_rmsnorm = self.input_layernorm + else: + hidden_states = self.input_layernorm(hidden_states) + qkv_fused_rmsnorm = None + + # Self Attention + attn_output = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_value=past_key_value, + adapter_ids=adapter_ids, + rmsnorm=qkv_fused_rmsnorm, + **kwargs, + ) + hidden_states = attn_output.hidden_states + + # First residual + hidden_states = residual + hidden_states + residual = hidden_states + + # MLP kernel fusion with RMSNorm + if self.mlp_kernel_enabled and self.mlp_kernel_fused_rmsnorm: + mlp_fused_rmsnorm = self.post_attention_layernorm + else: + hidden_states = self.post_attention_layernorm(hidden_states) + mlp_fused_rmsnorm = None + + hidden_states, _ = self.mlp( + hidden_states, + rmsnorm=mlp_fused_rmsnorm, + adapter_ids=adapter_ids, + ) + + # Second residual + hidden_states = residual + hidden_states + + return ( + hidden_states, + attn_output.present_key_value, + attn_output.cos_cache, + attn_output.sin_cache, + None, # residual (not used for Qwen3) + ) + + +class NeuronIsaacTextModel(NeuronBaseModel): + """Isaac text model for VLM: Qwen3 decoder with vision embedding injection. + + Follows the same pattern as NeuronGemma3TextModel: + - Inherits from NeuronBaseModel + - Uses scatter_by_index_put for vision token injection + - Manages KV cache and on-device sampling + """ + + def scatter_by_index_put(self, h_image, encoded_patches_proj, positions): + """Scatter vision embeddings into the input embedding sequence. + + Args: + h_image: (B, max_positions, hidden_dim) - text input embeddings + encoded_patches_proj: (num_patches, patch_size, hidden_dim) - vision embeddings + positions: (B, num_positions, 1) - scatter positions + + Returns: + Updated h_image with vision embeddings scattered in. + """ + B, max_positions, embedding_dim = h_image.shape + h_image_new = h_image.clone() + encoded_patches_flat = encoded_patches_proj.view(-1, embedding_dim) + positions = positions.view(-1) + + num_updates_per_batch = positions.shape[0] // B + batch_idx = torch.arange(B, device=h_image.device, dtype=positions.dtype) + batch_idx = batch_idx.repeat_interleave(num_updates_per_batch) + + h_image_new.index_put_( + (batch_idx.long(), positions.long()), + encoded_patches_flat, + accumulate=False, + ) + return h_image_new + + def encode_vision_to_input( + self, inputs_embeds, vision_embeddings, vision_mask + ) -> torch.Tensor: + """Inject vision embeddings into text input embeddings.""" + return self.scatter_by_index_put(inputs_embeds, vision_embeddings, vision_mask) + + def setup_attr_for_model(self, config: InferenceConfig): + """Set up model attributes needed for inference.""" + self.on_device_sampling = ( + config.neuron_config.on_device_sampling_config is not None + ) + self.tp_degree = config.neuron_config.tp_degree + self.hidden_size = config.hidden_size + self.num_attention_heads = config.num_attention_heads + self.num_key_value_heads = config.num_key_value_heads + self.max_batch_size = config.neuron_config.max_batch_size + self.buckets = config.neuron_config.buckets + self.is_chunked_prefill = config.neuron_config.is_chunked_prefill + + def init_model(self, config: InferenceConfig): + """Initialize the Qwen3 text model components.""" + self.padding_idx = config.pad_token_id + self.vocab_size = config.vocab_size + + # Embedding layer + if parallel_state_initialized(): + self.embed_tokens = ParallelEmbedding( + config.vocab_size, + config.hidden_size, + self.padding_idx, + dtype=config.neuron_config.torch_dtype, + shard_across_embedding=True, + pad=True, + sequence_parallel_enabled=False, + tensor_model_parallel_group=get_tp_group(config), + ) + + lm_head_pad = config.neuron_config.lm_head_pad + lnc = config.neuron_config.logical_nc_config + lm_head_pad_alignment_size = ( + config.neuron_config.lm_head_pad_alignment_size * lnc + ) + self.lm_head = ColumnParallelLinear( + config.hidden_size, + config.vocab_size, + gather_output=not self.on_device_sampling, + bias=lm_head_pad, + pad=True, + pad_alignment_size_per_rank=lm_head_pad_alignment_size + if lm_head_pad + else 1, + keep_padded_output=lm_head_pad, + dtype=config.neuron_config.torch_dtype, + tensor_model_parallel_group=get_tp_group(config), + ) + else: + from transformers.models.qwen3.modeling_qwen3 import ( + Qwen3RMSNorm as HFQwen3RMSNorm, + ) + + self.embed_tokens = nn.Embedding( + config.vocab_size, + config.hidden_size, + self.padding_idx, + ) + self.lm_head = nn.Linear( + config.hidden_size, + config.vocab_size, + bias=False, + ) + + # Decoder layers + self.layers = nn.ModuleList( + [NeuronIsaacDecoderLayer(config) for _ in range(config.num_hidden_layers)] + ) + + # Final norm + self.norm = get_rmsnorm_cls()(config.hidden_size, eps=config.rms_norm_eps) + + def init_inference_optimization(self, config: InferenceConfig): + """Initialize KV cache and sampling for inference.""" + super().init_inference_optimization(config) + + if self.on_device_sampling: + self.sampler = Sampler(config.neuron_config) + + self.kv_mgr = KVCacheManager( + config, + num_kv_head=self.num_key_value_heads, + global_rank=self.rank_util, + ) + + def forward( + self, + input_ids, + attention_mask, + position_ids, + seq_ids, + sampling_params, + prev_hidden=None, + adapter_ids=None, + accepted_indices=None, + current_length=None, + medusa_mask=None, + scatter_index=None, + slot_mapping=None, + active_block_table=None, + num_queries=None, + computed_context_lens=None, + tile_q_indices=None, + tile_block_tables=None, + tile_masks=None, + inputs_embeds: Optional[torch.FloatTensor] = None, + kv_cache: Optional[torch.Tensor] = None, + active_mask=None, + rotary_position_id=None, + vision_embeddings=None, + vision_mask=None, + ): + """Forward pass for Isaac text model with vision support. + + This follows NeuronBaseModel.forward() pattern with vision embedding injection. + The 25 positional arguments match ImageToTextModelWrapper's expected interface. + """ + # Handle optional empty tensors + prev_hidden = self.set_none_if_empty(prev_hidden) + adapter_ids = self.set_none_if_empty(adapter_ids) + accepted_indices = self.set_none_if_empty(accepted_indices) + current_length = self.set_none_if_empty(current_length) + medusa_mask = self.set_none_if_empty(medusa_mask) + scatter_index = self.set_none_if_empty(scatter_index) + slot_mapping = self.set_none_if_empty(slot_mapping) + active_block_table = self.set_none_if_empty(active_block_table) + num_queries = self.set_none_if_empty(num_queries) + computed_context_lens = self.set_none_if_empty(computed_context_lens) + tile_q_indices = self.set_none_if_empty(tile_q_indices) + tile_block_tables = self.set_none_if_empty(tile_block_tables) + tile_masks = self.set_none_if_empty(tile_masks) + inputs_embeds = self.set_none_if_empty(inputs_embeds) + kv_cache = self.set_none_if_empty(kv_cache) + active_mask = self.set_none_if_empty(active_mask) + rotary_position_id = self.set_none_if_empty(rotary_position_id) + vision_embeddings = self.set_none_if_empty(vision_embeddings) + vision_mask = self.set_none_if_empty(vision_mask) + + is_for_token_gen = attention_mask.dim() == 4 + is_for_context_encoding = self._is_context_encoding(input_ids) + is_for_speculation = self._is_for_speculation(input_ids) + + # For non-speculative prefix caching, generate the slot mapping + if ( + not is_for_context_encoding + and not self.neuron_config.enable_fused_speculation + and not self.neuron_config.enable_eagle_speculation + and self.is_prefix_caching + and active_block_table is not None + ): + block_size = torch.tensor( + self.neuron_config.pa_block_size, + device=position_ids.device, + dtype=torch.int32, + ) + slot_mapping = generate_tokengen_slot_mapping( + position_ids, slot_mapping, active_block_table, block_size + ) + + cache_size = ( + get_cache_size( + self.n_positions, self.num_cores_per_group, is_for_context_encoding + ) + if self.neuron_config.flash_decoding_enabled + else self.n_positions + ) + + # Prepare attention mask + if self.is_chunked_prefill: + attn_mask = self.create_attn_mask( + attention_mask, + is_for_context_encoding, + is_for_speculation, + query_lens=num_queries, + key_lens=num_queries + computed_context_lens, + ) + else: + attn_mask = self.create_attn_mask( + attention_mask, + is_for_context_encoding, + is_for_speculation, + position_ids=position_ids, + ) + + active_mask = None + if self.is_prefix_caching: + active_length = ( + self.speculation_length if is_for_speculation else self.n_active_tokens + ) + active_mask = torch.full( + (active_length, active_length), + True, + device=attention_mask.device, + ).tril(diagonal=0) + active_mask = active_mask[None, None, :, :].expand( + self.batch_size, 1, active_length, active_length + ) + if is_for_speculation: + active_mask = torch.full( + (self.speculation_length, self.speculation_length), + True, + device=attention_mask.device, + ).tril(diagonal=0) + active_mask = active_mask[None, None, :, :].expand( + self.batch_size, 1, self.speculation_length, self.speculation_length + ) + + # FlashDecoding masks + active_mask_2d = None + if self.neuron_config.flash_decoding_enabled and not is_for_context_encoding: + rank_id = self.rank_util.get_rank() + active_mask_tmp, attention_mask_tmp = mask_util( + pos_ids=position_ids, + rank_id=rank_id, + num_cores_per_group=self.num_cores_per_group, + cache_size=cache_size, + ) + if is_for_speculation: + active_mask = active_mask_tmp[:, None, :, :].expand( + self.batch_size, 1, -1, -1 + ) + attn_mask = attention_mask_tmp[:, None, :, :].expand( + self.batch_size, 1, -1, -1 + ) + active_mask_2d = active_mask_tmp.sum(dim=-2, keepdims=False).to( + torch.bool + ) + else: + active_mask = turn_2d_mask_to_4d( + active_mask_tmp, n_positions=1, batch_size=self.batch_size + ) + attn_mask = turn_2d_mask_to_4d( + attention_mask_tmp, + n_positions=cache_size, + batch_size=self.batch_size, + ) + active_mask_2d = active_mask_tmp + + # Context encoding or token generation + if is_for_context_encoding: + past_key_values = None + else: + past_key_values = self.kv_mgr.get_cache(self.n_positions) + + hidden_states, updated_kv_cache = self.get_model_output( + input_ids=input_ids, + seq_ids=seq_ids, + attention_mask=attn_mask, + position_ids=position_ids, + past_key_values=past_key_values, + active_mask=active_mask, + inputs_embeds=inputs_embeds, + adapter_ids=adapter_ids, + prev_hidden=prev_hidden, + tile_q_indices=tile_q_indices, + tile_block_tables=tile_block_tables, + tile_masks=tile_masks, + num_queries=num_queries, + is_for_context_encoding=is_for_context_encoding, + scatter_index=slot_mapping if self.is_block_kv_layout else scatter_index, + kvcache_buffer=kv_cache, + is_for_speculation=is_for_speculation, + active_block_table=active_block_table, + kv_active_mask=active_mask_2d, + update_cache=True, + vision_embeddings=vision_embeddings, + vision_mask=vision_mask, + ) + + batch_size = input_ids.shape[0] + if not self.sliced_hidden: + if self.padding_side == "left": + index = torch.tensor( + [hidden_states.shape[1] - 1], device=hidden_states.device + ) + index = index.unsqueeze(1).expand(batch_size, 1, self.hidden_size) + hidden_states = torch.gather(hidden_states, dim=1, index=index) + elif self.is_chunked_prefill: + if is_for_context_encoding: + index = neuron_cumsum(num_queries.reshape(1, -1).float()).int() - 1 + index = index.reshape(1, -1, 1) + index = index.expand(batch_size, -1, self.hidden_size) + hidden_states = torch.gather(hidden_states, dim=1, index=index) + else: + if not ( + position_ids.shape[-1] == self.speculation_length + or position_ids.shape[-1] == 1 + ): + index = torch.max(position_ids, dim=1, keepdim=True).indices + index = index.unsqueeze(1).expand(batch_size, 1, self.hidden_size) + hidden_states = torch.gather(hidden_states, dim=1, index=index) + + logits = self.lm_head(hidden_states) + logits = logits.float() + + if hasattr(self.lm_head, "pad_size"): + if self.lm_head.gather_output: + rank_id = torch.tensor(0, device=logits.device, dtype=torch.int32) + world_size = 1 + else: + rank_id = self.rank_util.get_rank() + world_size = torch.distributed.get_world_size( + group=self.lm_head.tensor_parallel_group + ) + logits = mask_padded_logits( + logits, rank_id, world_size, pad_size=self.lm_head.pad_size + ) + + if self.on_device_sampling: + res = self._sample_on_device( + logits, sampling_params, is_for_speculation, is_for_context_encoding + ) + else: + res = logits + + # Ensure active_block_table and attention_mask not optimized away for prefix caching + if self.is_prefix_caching: + if active_block_table is not None and len(active_block_table.shape) == 1: + res = res + active_block_table[0] * 0 + if attention_mask is not None and self.prefix_size == 0: + res = res + attention_mask[0] * 0 + + outputs = [res] + if self.neuron_config.output_logits: + logits = _gather_along_dim( + logits, + partition_dim=2, + process_group=get_tp_group(self.config), + ) + outputs += [logits] + outputs += updated_kv_cache + + return outputs + + +def parallel_state_initialized(): + """Check if parallel state is initialized.""" + from neuronx_distributed.parallel_layers import parallel_state + + return parallel_state.model_parallel_is_initialized() diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_vision.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_vision.py new file mode 100644 index 00000000..231fb3b7 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/modeling_isaac_vision.py @@ -0,0 +1,271 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Isaac vision model for NxDI: SigLIP2 encoder + pixel shuffle + 2-layer MLP projector. + +Isaac's vision pipeline: + pixel_values -> SigLIP2 encoder -> pixel_shuffle (2x2, 1152->4608) -> MLP projector (4608->2048) + +The MLP projector is a 2-layer network: Linear(4608->18432) -> SiLU -> Linear(18432->2048). +No bias terms, ~122M parameters. + +Pixel shuffle is a deterministic CPU-side operation (channel concatenation of 2x2 patch groups). +""" + +import logging +from typing import List, Tuple + +import torch +from torch import nn + +from neuronx_distributed_inference.models.config import InferenceConfig +from neuronx_distributed_inference.models.llama4.modeling_llama4_vision import ( + Llama4VisionModelWrapper, +) +from neuronx_distributed_inference.modules.async_execution import is_ranked_io + +from isaac_neuron.siglip.modeling_siglip import NeuronSiglipVisionModel +from isaac_neuron.utils import pixel_shuffle_varlen + +logger = logging.getLogger(__name__) +logger.setLevel(logging.DEBUG) + + +class NeuronIsaacMultiModalProjector(nn.Module): + """Isaac's 2-layer MLP projector: Linear -> SiLU -> Linear. + + Maps pixel-shuffled vision features (4608-dim) to text hidden size (2048-dim). + No bias terms on either linear layer. + + HF weight keys: + model.vision_embedding.1.weight -> projector_fc1.weight (4608, 18432) + model.vision_embedding.2 -> SiLU (no weights) + model.vision_embedding.3.weight -> projector_fc2.weight (18432, 2048) + """ + + def __init__(self, config: InferenceConfig): + super().__init__() + vision_hidden = config.vision_config.hidden_size # 1152 + pixel_shuffle_scale = getattr(config, "pixel_shuffle_scale", 2) + projector_input_dim = vision_hidden * (pixel_shuffle_scale**2) # 4608 + + # Isaac uses intermediate_size from vision config for the projector + # The HF model has: Linear(4608, 18432) -> SiLU -> Linear(18432, 2048) + projector_intermediate = getattr( + config, + "projector_intermediate_size", + projector_input_dim * 4, # 18432 + ) + text_hidden = config.text_config.hidden_size # 2048 + + self.fc1 = nn.Linear(projector_input_dim, projector_intermediate, bias=False) + self.act = nn.SiLU() + self.fc2 = nn.Linear(projector_intermediate, text_hidden, bias=False) + + def forward(self, vision_outputs: torch.Tensor) -> torch.Tensor: + """Forward pass: project vision features to text embedding space. + + Args: + vision_outputs: (batch, num_patches, 4608) pixel-shuffled features + + Returns: + (batch, num_patches, 2048) projected embeddings + """ + hidden = self.fc1(vision_outputs) + hidden = self.act(hidden) + hidden = self.fc2(hidden) + return hidden + + +class NeuronIsaacVisionModel(nn.Module): + """Isaac vision model: SigLIP2 encoder + pixel shuffle + MLP projector. + + Full pipeline: + pixel_values -> SigLIP2 -> pixel_shuffle(scale=2) -> MLP projector -> vision_embeddings + """ + + def __init__(self, config: InferenceConfig): + super().__init__() + self.config = config + self.vision_config = config.vision_config + self.pixel_shuffle_scale = getattr(config, "pixel_shuffle_scale", 2) + + logger.info(f"NeuronIsaacVisionModel: vision_config={vars(self.vision_config)}") + + # SigLIP2 vision encoder (reused from Gemma3-vision contrib) + self.vision_encoder = NeuronSiglipVisionModel(self.vision_config) + + # MLP projector (2-layer with SiLU) + self.multi_modal_projector = NeuronIsaacMultiModalProjector(config) + + def forward(self, pixel_values: torch.Tensor) -> torch.Tensor: + """Generate vision embeddings from pixel values. + + Args: + pixel_values: (batch, num_channels, image_size, image_size) + + Returns: + vision_embeddings: (batch, num_vision_tokens, text_hidden_size) + where num_vision_tokens = (image_size / patch_size)^2 / pixel_shuffle_scale^2 + """ + # SigLIP2 encoder + encoder_output = self.vision_encoder(pixel_values).last_hidden_state + logger.info(f"encoder_output.shape={encoder_output.shape}") + + # Pixel shuffle: merge 2x2 patches by channel concatenation + # (batch, num_patches, 1152) -> (batch, num_patches/4, 4608) + shuffled = pixel_shuffle_varlen(encoder_output, scale=self.pixel_shuffle_scale) + logger.info(f"pixel_shuffle output.shape={shuffled.shape}") + + # MLP projector: (batch, num_patches/4, 4608) -> (batch, num_patches/4, 2048) + projected = self.multi_modal_projector(shuffled) + logger.info(f"projected_embedding.shape={projected.shape}") + + return projected + + +class IsaacVisionModelWrapper(Llama4VisionModelWrapper): + """Neuron ModelWrapper for Isaac's vision model. + + Inherits from Llama4VisionModelWrapper (same as Gemma3). + Generates input shapes for trace and compilation. + """ + + def __init__( + self, + config: InferenceConfig, + model_cls, + tag="", + compiler_args: str = None, + priority_model_idx: int = None, + pipeline_execution: bool = True, + return_ranked_to_cpu: bool = True, + model_init_kwargs={}, + ) -> None: + super().__init__( + config, + model_cls, + tag, + compiler_args, + priority_model_idx, + pipeline_execution, + return_ranked_to_cpu, + model_init_kwargs, + ) + + def input_generator(self) -> List[Tuple[torch.Tensor]]: + """Generate example inputs for vision encoder tracing. + + Returns: + List of (pixel_values,) tuples for each bucket. + """ + inputs = [] + for bucket in self.neuron_config.buckets: + pixel_values = torch.ones( + [ + self.neuron_config.batch_size, + self.config.vision_config.num_channels, + self.config.vision_config.image_size, + self.config.vision_config.image_size, + ], + dtype=self.config.neuron_config.torch_dtype, + ) + inputs.append((pixel_values,)) + return inputs + + def forward(self, *args): + """Forward pass for vision encoder wrapper. + + Handles batch size padding when input batch < compiled batch. + """ + if self.model is None: + raise RuntimeError( + "Forward called before load. Run load() or load_state_dict() first." + ) + + if not self.neuron_config.on_cpu: + args = self.convert_int64_to_int32(*args) + + pixel_values = args[0] + input_batch_size = pixel_values.shape[0] + + if input_batch_size == self.neuron_config.batch_size: + return self._forward(*args) + + cur_batch = 0 + outputs = [] + + logging.debug( + f"input_batch_size={input_batch_size}, compiled_batch_size={self.neuron_config.batch_size}" + ) + + while cur_batch < input_batch_size: + if cur_batch + self.neuron_config.batch_size <= input_batch_size: + batch_args = [ + arg[cur_batch : cur_batch + self.neuron_config.batch_size] + for arg in args + ] + batch_args = self.vllm_cte_repadding(batch_args) + output = self._forward(*batch_args) + else: + output = self._forward_with_pad( + *[ + arg[cur_batch:input_batch_size] + if not is_ranked_io(arg) + else arg + for arg in args + ] + ) + outputs.append(output) + cur_batch += self.neuron_config.batch_size + + return output + + def _forward_with_pad(self, *args): + """Forward with batch padding for undersized inputs.""" + + def pad_helper(tensor, pad_type="fill_0", batch_sort_indices=None): + if tensor is None or tensor.shape[0] == self.neuron_config.batch_size: + return tensor + + padded_shape = list(tensor.shape) + padded_shape[0] = self.neuron_config.batch_size + + def repeat_first_batchline(tensor, padded_shape): + return tensor[0].repeat(padded_shape[0], 1, 1, 1).to(tensor.dtype) + + def fill_value_tensor(value): + return lambda tensor, padded_shape: torch.full( + padded_shape, fill_value=value, dtype=tensor.dtype + ) + + PAD_TYPES = { + "repeat_first_batchline": repeat_first_batchline, + "fill_0": fill_value_tensor(0), + "fill_1": fill_value_tensor(1), + "fill_-1": fill_value_tensor(-1), + } + + padded_tensor = PAD_TYPES[pad_type](tensor, padded_shape) + padded_tensor[: tensor.shape[0]] = tensor + + if batch_sort_indices is not None: + padded_tensor = torch.index_select(padded_tensor, 0, batch_sort_indices) + + return padded_tensor + + pixel_values = args[0] + orig_batch_size = pixel_values.shape[0] + + padded_args = [] + for arg in args: + if is_ranked_io(arg): + padded_args.append(arg) + else: + padded_arg = pad_helper( + arg, + pad_type="repeat_first_batchline", + batch_sort_indices=None, + ) + padded_args.append(padded_arg) + + outputs = self._forward(*padded_args) + return outputs[:orig_batch_size] diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/ndxi_patch.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/ndxi_patch.py new file mode 100644 index 00000000..48f1b17f --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/ndxi_patch.py @@ -0,0 +1,252 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""NxDI patches for Isaac model compatibility. + +These patches fix known issues in the NxDI framework that affect +VLM models. Copied from gemma3-vision contrib with minimal modifications. +""" + +from typing import Callable, List, Optional, Tuple, Union + +from neuronx_distributed_inference.utils.tensor_replacement.registry import ( + TensorReplacementRegister, +) +import torch +from transformers.modeling_outputs import CausalLMOutputWithPast + + +def patched_get_last_kv_window( + window_size, + position_ids, + latest_k, + latest_v, + windowed_context_encoding_window_idx=-1, + spec_len=0, +): + """Fix: Convert index tensor in torch.gather to LongTensor.""" + batch_size, num_head, _, head_dim = latest_k.shape + latest_pos = torch.amax(position_ids, dim=1) + if windowed_context_encoding_window_idx >= 1: + latest_pos -= windowed_context_encoding_window_idx * window_size + + window_size = window_size - 1 + spec_len - 1 if spec_len > 0 else window_size - 1 + + end_idx = (latest_pos + 1).clamp(min=window_size) + start_idx = (end_idx - window_size).clamp(min=0) + orig_indices = start_idx[:, None] + torch.arange(window_size) + + left_shifts = (window_size - (end_idx % window_size)) % window_size + base = torch.arange(window_size).expand(batch_size, window_size) + shifted_idx = (base + left_shifts[:, None]) % window_size + + gather_idx = torch.gather(orig_indices, dim=1, index=shifted_idx.long()) + gather_idx = ( + gather_idx[:, None, :, None] + .expand(batch_size, num_head, window_size, head_dim) + .to(device=latest_k.device) + ) + + latest_k = torch.gather(latest_k, dim=2, index=gather_idx.long()) + latest_v = torch.gather(latest_v, dim=2, index=gather_idx.long()) + return latest_k, latest_v + + +def patched_base_image_to_text_model_forward( + self, + input_ids: torch.LongTensor = None, + seq_ids: Optional[torch.LongTensor] = None, + attention_mask: Optional[torch.Tensor] = None, + position_ids: Optional[torch.LongTensor] = None, + past_key_values: Optional[List[torch.FloatTensor]] = None, + inputs_embeds: Optional[torch.FloatTensor] = None, + sampling_params: Optional[torch.FloatTensor] = None, + prev_hidden: Optional[torch.FloatTensor] = None, + labels: Optional[torch.LongTensor] = None, + use_cache: Optional[bool] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + adapter_ids: Optional[torch.LongTensor] = None, + medusa_args=None, + return_dict: Optional[bool] = None, + llava_args: Optional[List] = [], + input_capture_hook: Optional[Callable] = None, + slot_mapping: Optional[torch.LongTensor] = None, + block_table: Optional[torch.LongTensor] = None, + full_context_lens: Optional[torch.LongTensor] = None, + computed_context_lens: Optional[torch.LongTensor] = None, + vision_embeddings: Optional[torch.FloatTensor] = None, + vision_mask: Optional[torch.BoolTensor] = None, + tensor_capture_hook: Optional[Callable] = None, +) -> Union[Tuple, CausalLMOutputWithPast]: + """Patched forward that includes tensor_capture_hook argument (fixes NameError).""" + if attention_mask is None: + attention_mask = self._infer_attention_mask(position_ids) + + if seq_ids is None: + seq_ids = torch.arange(input_ids.shape[0]) + + self.preprocess_inputs( + input_ids=input_ids, + seq_ids=seq_ids, + attention_mask=attention_mask, + position_ids=position_ids, + past_key_values=past_key_values, + inputs_embeds=inputs_embeds, + sampling_params=sampling_params, + prev_hidden=prev_hidden, + labels=labels, + use_cache=use_cache, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + adapter_ids=adapter_ids, + medusa_args=medusa_args, + return_dict=return_dict, + llava_args=llava_args, + input_capture_hook=input_capture_hook, + slot_mapping=slot_mapping, + block_table=block_table, + full_context_lens=full_context_lens, + computed_context_lens=computed_context_lens, + ) + + if self.async_mode: + outputs, is_run_on_neuron = self._get_model_outputs_async( + input_ids=input_ids, + attention_mask=attention_mask, + position_ids=position_ids, + seq_ids=seq_ids, + sampling_params=sampling_params, + prev_hidden=prev_hidden, + adapter_ids=adapter_ids, + vision_embeddings=vision_embeddings, + vision_mask=vision_mask, + medusa_args=medusa_args, + llava_args=llava_args, + ) + else: + outputs, is_run_on_neuron = self._get_model_outputs( + input_ids, + attention_mask, + position_ids, + seq_ids, + sampling_params, + prev_hidden, + adapter_ids, + vision_embeddings, + vision_mask, + None, # deepstack_vision_embeds (Isaac doesn't use deepstack) + medusa_args, + llava_args, + ) + + generation_model = self.get_generation_model() + if not generation_model.is_neuron(): + self._copy_past_key_values(outputs) + + constructed_outputs = self._get_constructed_outputs(outputs, is_run_on_neuron) + + if tensor_capture_hook and constructed_outputs.captured_tensors: + tensor_capture_hook(self, constructed_outputs.captured_tensors) + + return constructed_outputs + + +def patched_hf_adapter_prepare_inputs_for_generation( + self, + input_ids, + past_key_values=None, + attention_mask=None, + inputs_embeds=None, + sampling_params=None, + adapter_ids=None, + **kwargs, +): + """Patched prepare_inputs_for_generation that avoids tensor_capture_hook NameError.""" + self.prev_kv_cache_populated = self.neuron_model.kv_cache_populated + if self.neuron_model.kv_cache_populated: + input_ids = input_ids[:, -1:] + + accepted_indices = kwargs.get("accepted_indices", None) + current_length = kwargs.get("current_length", None) + medusa_mask = kwargs.get("medusa_mask", None) + scatter_index = kwargs.get("scatter_index", None) + position_ids = kwargs.get("position_ids", None) + input_capture_hook = kwargs.get("input_capture_hook", None) + + if attention_mask is not None and position_ids is None: + position_ids = attention_mask.long().cumsum(-1) - 1 + if self.input_start_offsets: + if len(self.input_start_offsets) > 1: + position_ids += torch.tensor( + self.input_start_offsets, + dtype=position_ids.dtype, + device=position_ids.device, + )[:, None] + else: + position_ids += self.input_start_offsets[0] + for i, offset in enumerate(self.input_start_offsets): + position_ids[i, 0:offset] = torch.arange(offset) + else: + position_ids.masked_fill_(attention_mask == 0, 1) + + if self.neuron_model.kv_cache_populated: + position_ids = torch.amax(position_ids, 1, keepdim=True) + position_ids = position_ids + 1 + + if inputs_embeds is not None and past_key_values is None: + model_inputs = {"inputs_embeds": inputs_embeds} + else: + model_inputs = {"input_ids": input_ids} + + model_inputs.update( + { + "position_ids": position_ids, + "past_key_values": past_key_values, + "use_cache": kwargs.get("use_cache", False), + "attention_mask": attention_mask, + "medusa_args": ( + accepted_indices, + current_length, + medusa_mask, + scatter_index, + ), + "sampling_params": sampling_params, + "input_capture_hook": input_capture_hook, + "adapter_ids": adapter_ids, + } + ) + + tf_args = [] + if self.neuron_config.tensor_replacement_config: + if hasattr(self, "generation_step"): + self.generation_step += 1 + else: + self.generation_step = 1 + reg = TensorReplacementRegister.get_instance() + tf, masks = reg.step_args(self.generation_step) + tf_args = tf + masks + + if tf_args: + model_inputs["tf_args"] = tf_args + + additional_kwargs = self.neuron_model.get_required_kwargs() + for arg in additional_kwargs: + model_inputs.update({arg: kwargs.get(arg, None)}) + + return model_inputs + + +def apply_patch() -> None: + """Apply NxDI patches for Isaac model compatibility.""" + import neuronx_distributed_inference.modules.attention.utils as u + + u.get_last_kv_window = patched_get_last_kv_window + + import neuronx_distributed_inference.models.image_to_text_model_base as mm_base + + mm_base.NeuronBaseForImageToText.forward = patched_base_image_to_text_model_forward + + import neuronx_distributed_inference.utils.hf_adapter as hf_adapter + + hf_adapter.HuggingFaceGenerationAdapter.prepare_inputs_for_generation = ( + patched_hf_adapter_prepare_inputs_for_generation + ) diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/__init__.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/__init__.py new file mode 100644 index 00000000..36cc4b5e --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/__init__.py @@ -0,0 +1,15 @@ +# Copyright 2025 © Amazon.com and Affiliates + +from .modeling_siglip import ( + NeuronSiglipVisionModel, + NeuronSiglipAttention, +) +from .layers import ( + OutputChannelParallelConv2d, +) + +__all__ = [ + "NeuronSiglipVisionModel", + "NeuronSiglipAttention", + "OutputChannelParallelConv2d", +] diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/layers.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/layers.py new file mode 100644 index 00000000..27fc092d --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/layers.py @@ -0,0 +1,358 @@ +# Copyright 2025 © Amazon.com and Affiliates +# Adapted from Gemma3-vision contrib for Isaac SigLIP2 vision encoder. +import math +from typing import Optional, Tuple, Union, Any, Callable + +from neuronx_distributed.parallel_layers.layers import ( + _as_tuple2, + _initialize_affine_weight_neuron, + _initialize_parameter_cpu, + CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION, + CONV_KERNEL_INPUT_CHANNEL_DIMENSION, + conv2d_with_weight_grad_allreduce, +) +from neuronx_distributed.parallel_layers.mappings import ( + copy_to_tensor_model_parallel_region, + gather_from_tensor_model_parallel_region_with_dim, +) +from neuronx_distributed.parallel_layers.parallel_state import ( + get_tensor_model_parallel_size, +) +from neuronx_distributed.parallel_layers.utils import ( + divide, + get_padding_length, + set_tensor_model_parallel_attributes, +) +import neuronx_distributed.trace.trace as nxd_tracing_utils +import torch +from torch.nn.parameter import Parameter + + +class BaseParallelConv(torch.nn.Module): + def set_weight_shape(self) -> None: + if self.partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION: + if self.partition_pad: + self.partition_pad_size = get_padding_length( + self.out_channels, self.world_size + ) + self.out_channels = self.out_channels + self.partition_pad_size + + self.channels_per_partition = divide(self.out_channels, self.world_size) + self.weight_shape = [ + self.channels_per_partition, + self.in_channels, + *_as_tuple2(self.kernel_size), + ] + elif self.partition_dim == CONV_KERNEL_INPUT_CHANNEL_DIMENSION: + if self.partition_pad: + self.partition_pad_size = get_padding_length( + self.in_channels, self.world_size + ) + self.in_channels = self.in_channels + self.partition_pad_size + + self.channels_per_partition = divide(self.in_channels, self.world_size) + self.weight_shape = [ + self.out_channels, + self.channels_per_partition, + *_as_tuple2(self.kernel_size), + ] + else: + assert False, f"Unsupported partition dim: {self.partition_dim}" + + def set_bias_shape(self) -> None: + if self.add_bias: + self.bias_shape = ( + self.channels_per_partition + if self.partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION + else self.out_channels + ) + else: + self.bias_shape = None + + def __init__( + self, + in_channels: int, + out_channels: int, + kernel_size: Union[int, Tuple[int, int]], + stride: Union[int, Tuple[int, int]], + padding: Union[int, Tuple[int, int]], + dilation: Union[int, Tuple[int, int]], + groups: int, + bias: bool, + padding_mode: str, + partition_dim: int, + dtype: torch.dtype, + device: Optional[torch.device] = None, + init_method: Optional[Callable[[Any], torch.Tensor]] = None, + keep_master_params: bool = False, + partition_pad: bool = False, + ): + if not all(d == 1 for d in _as_tuple2(dilation)): + raise NotImplementedError( + f"Non-1 dilation is not yet supported. Received: {dilation}" + ) + if groups != 1: + raise NotImplementedError( + f"Non-1 groups is not yet supported. Received: {groups}" + ) + if padding_mode != "zeros": + raise NotImplementedError( + f"Non-zeros padding is not yet supported. Received: {padding_mode}" + ) + + super().__init__() + self.in_channels = in_channels + self.out_channels = out_channels + self.kernel_size = kernel_size + self.stride = stride + self.padding = padding + self.partition_dim = partition_dim + self.arg_init_method = init_method + self.dtype = dtype + self.device = device + self.keep_master_params = keep_master_params + self.partition_pad = partition_pad + self.add_bias = bias + self.world_size = get_tensor_model_parallel_size() + + self.set_weight_shape() + self.set_bias_shape() + + # Get torch init device if device is not explicitly mentioned + init_device = self.device + self.weight = Parameter( + torch.empty(*self.weight_shape, device=init_device, dtype=self.dtype) + ) + self.device = self.weight.device + + if self.device.type == "cpu": + self.master_weight = _initialize_parameter_cpu( + self.weight, + partition_dim=partition_dim, + num_partitions=self.world_size, + init_method=self._init_weight, + return_master_param=self.keep_master_params, + param_dtype=self.dtype, + stride=1, + ) + elif self.device.type == "meta": + set_tensor_model_parallel_attributes( + tensor=self.weight, + is_parallel=True, + dim=partition_dim, + stride=1, + num_partitions=self.world_size, + ) + else: + assert device and device.type == "xla", ( + "Currently only xla device type is supported" + ) + _initialize_affine_weight_neuron( + self.weight, + self._init_weight, + partition_dim=partition_dim, + num_partitions=self.world_size, + stride=1, + ) + + if self.add_bias: + # Bias is added before running the all-gather collective + # If conv layer is sharded across output channels (partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION), + # then the bias must be sharded + # 1. We initialize the bias to an empty parameter tensor of shape (C_out,) or (C_out/TP,) + self.bias = Parameter( + torch.empty(self.bias_shape, dtype=dtype, device=device) + ) + + # 2. Parameter initialization + # These parallel layers are used for both training and inference. When training from scratch, weight + # initialization must be carefully done, especially when distributed (e.g. ensure the same seed is used on every rank) + # Such careful initialization is not needed when tracing (device.type == meta) or at inference + if self.device.type == "cpu": + if partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION: + self.master_bias = _initialize_parameter_cpu( + self.bias, + CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION, + num_partitions=self.world_size, + init_method=self._init_bias, + return_master_param=self.keep_master_params, + param_dtype=self.dtype, + stride=1, + ) + else: + self._init_bias(self.bias) + self.master_bias = self.bias if self.keep_master_params else None + elif self.device.type == "meta": + if partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION: + set_tensor_model_parallel_attributes( + self.bias, + is_parallel=True, + dim=self.partition_dim, + stride=1, + num_partitions=self.world_size, + ) + self.master_bias = self.bias if self.keep_master_params else None + else: + assert device and device.type == "xla", ( + "Currently only xla device type is supported" + ) + if partition_dim == CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION: + set_tensor_model_parallel_attributes( + self.bias, + is_parallel=True, + dim=self.partition_dim, + stride=1, + num_partitions=self.world_size, + ) + self._init_bias(self.bias) + self.master_bias = self.bias if self.keep_master_params else None + else: + self.register_parameter("bias", None) + + self._forward_impl = conv2d_with_weight_grad_allreduce + + def _init_weight(self, weight): + if self.arg_init_method is None: + torch.nn.init.kaiming_uniform_(weight, a=math.sqrt(5)) + else: + self.arg_init_method(weight) + + def _init_bias(self, bias): + fan_in, _ = torch.nn.init._calculate_fan_in_and_fan_out(self.weight) + bound = 1 / math.sqrt(fan_in) if fan_in > 0 else 0 + torch.nn.init.uniform_(bias, -bound, bound) + + +class OutputChannelParallelConv2d(BaseParallelConv): + """Conv2d layer with parallelism on its output channels + + The definition of a Conv2d layer can be found at https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html + + This layer parallelizes the Conv2d along the output channel dimension + + .. note:: + Input is expected to be four dimensional, in order [N, C, H, W] + + Arguments: + in_channels: Number of input channels + out_channels: Number of output channels in the original Conv that is being parallelized. Parallelization is handled internally by this class + kernel_size: Size of the kernel. Can be a single number for a square kernel or a tuple of two numbers + stride: Stride of the convolution. Can be a single number for uniform H/W stride or a tuple of two numbers + padding: Padding of the convolution. Can be a single number for uniform H/W padding or a tuple of two numbers + bias: If true, add bias + gather_output: If true, call all-gather on the output to assemble the partial outputs produced by each Neuron device into the full output, and make the full output available on all Neuron devices + dtype: Datatype of the weights + device: Device on which the weights should be initialized + init_method: Method for initializing the weight + keep_master_weight: If device="cpu", whether to keep the original ("master") weight the per-worker weights are split from + partition_pad: Pad the output channel dimension if needed to make the output channel count divisible by the tensor model parallel size + """ + + def __init__( + self, + in_channels: int, + out_channels: int, + kernel_size: Union[int, Tuple[int, int]], + stride: Union[int, Tuple[int, int]] = 1, + padding: Union[int, Tuple[int, int]] = 0, + dilation: Union[int, Tuple[int, int]] = 1, + groups: int = 1, + bias: bool = True, + padding_mode: str = "zeros", + gather_output: bool = True, + dtype: torch.dtype = torch.float32, + device: Optional[torch.device] = None, + init_method: Optional[Callable[[Any], torch.Tensor]] = None, + keep_master_weight: bool = False, + partition_pad: bool = False, + ): + # Base class expects these all to be tuples so it can support N-dimensional convs + kernel_size = _as_tuple2(kernel_size) + stride = _as_tuple2(stride) + padding = _as_tuple2(padding) + dilation = _as_tuple2(dilation) + + super().__init__( + in_channels, + out_channels, + kernel_size, + stride, + padding, + dilation, + groups, + bias, + padding_mode, + CONV_KERNEL_OUTPUT_CHANNEL_DIMENSION, + dtype, + device, + init_method, + keep_master_weight, + partition_pad, + ) + self.kernel_size: Tuple[int, int] + self.stride: Tuple[int, int] + self.padding: Tuple[int, int] + self.dilation: Tuple[int, int] + + self.allreduce_weight_grad = get_tensor_model_parallel_size() > 1 + self.gather_output = gather_output + + def forward(self, in_tensor: torch.Tensor) -> torch.Tensor: + """Forward of OutputChannelParallelConv2d + + Args: + in_tensor: 4D tensor in order [N, C, H ,W] + + Returns: + - output + """ + + if self.allreduce_weight_grad: + input_parallel = in_tensor + else: + input_parallel = copy_to_tensor_model_parallel_region(in_tensor) + + output_parallel = self._forward_impl( + input=input_parallel, + weight=self.weight, + bias=self.bias, + stride=self.stride, + padding=self.padding, + allreduce_weight_grad=self.allreduce_weight_grad, + ) + + # We intentionally did the bias add in _forward_impl to do less work overall + # This way, each worker only has to do 1/world_size of the bias add + if self.gather_output: + # All-gather across the partitions + output = gather_from_tensor_model_parallel_region_with_dim( + output_parallel, gather_dim=1 + ) + if self.partition_pad and self.partition_pad_size > 0: + output = torch.narrow( + output, 1, 0, self.out_channels - self.partition_pad_size + ) + else: + output = output_parallel + + return output + + def preshard_hook(self, model_state_dict: dict, prefix: str) -> None: + if not self.partition_pad or self.partition_pad_size == 0: + return + if ( + self.out_channels + != model_state_dict[prefix].shape[0] + self.partition_pad_size + ): + size = model_state_dict[prefix].shape[0] + raise RuntimeError( + f"State dict {prefix} is of an unexpected size {size} expected {size - self.partition_pad_size}" + ) + model_state_dict[prefix] = torch.nn.functional.pad( + model_state_dict[prefix], (0, 0, 0, 0, 0, 0, 0, self.partition_pad_size) + ) + + +nxd_tracing_utils.__SUPPORTED_SHARDED_MODULES = ( + nxd_tracing_utils.__SUPPORTED_SHARDED_MODULES + (OutputChannelParallelConv2d,) +) diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/modeling_siglip.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/modeling_siglip.py new file mode 100644 index 00000000..7cce2da7 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/siglip/modeling_siglip.py @@ -0,0 +1,521 @@ +# Copyright 2025 © Amazon.com and Affiliates +# Adapted from Gemma3-vision contrib SigLIP encoder for Isaac SigLIP2. +from typing import List, Optional, Tuple, Union + +import torch +import torch.nn as nn +from torch import Size +from transformers.activations import ACT2FN +from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling +from transformers.utils import torch_int + +from neuronx_distributed.parallel_layers import parallel_state +from neuronx_distributed.parallel_layers.layers import ( + ColumnParallelLinear, + RowParallelLinear, + ParallelEmbedding, +) +from neuronx_distributed_inference.models.config import NeuronConfig, InferenceConfig +from neuronx_distributed_inference.modules.attention.attention_base import ( + NeuronAttentionBase, +) + +from isaac_neuron.siglip.layers import OutputChannelParallelConv2d + + +class NeuronSiglipConfig(NeuronConfig): + def __init__(self, **kwargs): + super().__init__(**kwargs) + + +class SiglipInferenceConfig(InferenceConfig): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + def get_required_attributes(self) -> List[str]: + # To validate if the config.json include all the configs we need in model. + # Need to manually add what's required in below list + return [ + "hidden_size", + "image_size", + "intermediate_size", + "model_type", + "num_attention_heads", + "num_hidden_layers", + "patch_size", + ] + + +class NeuronSiglipAttention(NeuronAttentionBase): + def __init__(self, config: SiglipInferenceConfig, tensor_model_parallel_group=None): + super().__init__( + config=config, + hidden_size=config.hidden_size, + num_attention_heads=config.num_attention_heads, + num_key_value_heads=config.num_attention_heads, # siglip is MHA, not GQA + head_dim=getattr( + config, "head_dim", config.hidden_size // config.num_attention_heads + ), + qkv_bias=True, + o_bias=True, + num_cores_per_group=config.num_cores_per_group, + tensor_model_parallel_group=tensor_model_parallel_group, + ) + + +class NeuronSiglipMLP(nn.Module): + def __init__(self, config): + super().__init__() + self.config = config + self.activation_fn = ACT2FN[config.hidden_act] + self.fc1 = ColumnParallelLinear( + config.hidden_size, config.intermediate_size, gather_output=False + ) + self.fc2 = RowParallelLinear( + config.intermediate_size, config.hidden_size, input_is_parallel=True + ) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.fc1(hidden_states) + hidden_states = self.activation_fn(hidden_states) + hidden_states = self.fc2(hidden_states) + return hidden_states + + +_shape_t = Union[int, List[int], Size] + + +class LayerNorm(torch.nn.LayerNorm): + """ + Compared to NxD's LayerNorm, always cast input to torch.double to preseve numerical accuracy + """ + + def __init__( + self, + normalized_shape: _shape_t, + eps: float = 1e-5, + elementwise_affine: bool = True, + bias: bool = True, + device=None, + dtype=None, + ): + self.dtype = dtype + super().__init__( + normalized_shape=normalized_shape, + eps=eps, + elementwise_affine=elementwise_affine, + bias=bias, + device=device, + dtype=dtype, + ) + + def forward(self, input: torch.Tensor) -> torch.Tensor: + # Ensure input matches the weight dtype to avoid mixed dtype errors + input = input.to(self.weight.dtype) + output = super().forward(input) + return output + + +class NeuronSiglipEncoderLayer(nn.Module): + def __init__(self, config: InferenceConfig): + super().__init__() + self.embed_dim = config.hidden_size + self.layer_norm1 = LayerNorm(self.embed_dim, eps=config.layer_norm_eps) + self.self_attn = NeuronSiglipAttention(config) + self.layer_norm2 = LayerNorm(self.embed_dim, eps=config.layer_norm_eps) + self.mlp = NeuronSiglipMLP(config) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: torch.tensor, + ) -> torch.FloatTensor: + residual = hidden_states + + hidden_states = self.layer_norm1(hidden_states) + hidden_states = self.self_attn( + hidden_states=hidden_states, + attention_mask=attention_mask, + ).hidden_states + hidden_states = residual + hidden_states + + residual = hidden_states + hidden_states = self.layer_norm2(hidden_states) + hidden_states = self.mlp(hidden_states) + hidden_states = residual + hidden_states + + outputs = (hidden_states,) + + return outputs + + +class NeuronSiglipEncoder(nn.Module): + def __init__(self, config: InferenceConfig): + super().__init__() + self.config = config + self.layers = nn.ModuleList( + [NeuronSiglipEncoderLayer(config) for _ in range(config.num_hidden_layers)] + ) + self.gradient_checkpointing = False + + def forward( + self, + inputs_embeds, + attention_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + return_dict: Optional[bool] = None, + ) -> Union[Tuple, BaseModelOutput]: + # Use False defaults since InferenceConfig doesn't have HF PretrainedConfig attrs + output_attentions = ( + output_attentions if output_attentions is not None else False + ) + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else False + ) + return_dict = return_dict if return_dict is not None else True + + encoder_states = () if output_hidden_states else None + all_attentions = () if output_attentions else None + + hidden_states = inputs_embeds + for encoder_layer in self.layers: + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_outputs = torch.utils.checkpoint.checkpoint( + create_custom_forward(encoder_layer), + hidden_states, + attention_mask, + ) + else: + layer_outputs = encoder_layer( + hidden_states, + attention_mask, + ) + + hidden_states = layer_outputs[0] + + if output_attentions: + all_attentions = all_attentions + (layer_outputs[1],) + + if output_hidden_states: + encoder_states = encoder_states + (hidden_states,) + + return BaseModelOutput( + last_hidden_state=hidden_states, + hidden_states=encoder_states, + attentions=all_attentions, + ) + + +class NeuronSiglipMultiheadAttention(NeuronSiglipAttention): + """ + Compared to NeuronSiglipAttention: + 1. Accept three inputs (Query, Key, Value) instead of a single hidden states + """ + + def __init__(self, config: InferenceConfig): + super().__init__(config=config) + self.scale = self.head_dim**-0.5 + self.dropout = 0.0 # No dropout during inference + + def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int): + """Reshape tensor to (bsz, num_heads, seq_len, head_dim).""" + return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2) + + def forward( + self, + query: torch.Tensor, + key: torch.Tensor, + value: torch.Tensor, + attention_mask: Optional[torch.Tensor] = None, + output_attentions: Optional[bool] = True, + ) -> Tuple[torch.Tensor, Optional[torch.Tensor]]: + """Input shape: Batch x Time x Channel""" + + bsz, tgt_len, embed_dim = query.size() + + # get query/key/value projections via NxDI QKV proj + qkv_proj = self.get_qkv_proj() + query_states = qkv_proj.q_proj(query) * self.scale + key_states = self._shape(qkv_proj.k_proj(key), -1, bsz) + value_states = self._shape(qkv_proj.v_proj(value), -1, bsz) + + proj_shape = (bsz * self.num_heads, -1, self.head_dim) + query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape) + key_states = key_states.view(*proj_shape) + value_states = value_states.view(*proj_shape) + + src_len = key_states.size(1) + attn_weights = torch.bmm(query_states, key_states.transpose(1, 2)) + + if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len): + raise ValueError( + f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is" + f" {attn_weights.size()}" + ) + + if attention_mask is not None: + if attention_mask.size() != (bsz, 1, tgt_len, src_len): + raise ValueError( + f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}" + ) + attn_weights = ( + attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + + attention_mask + ) + attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len) + + attn_weights = nn.functional.softmax(attn_weights, dim=-1) + + if output_attentions: + # this operation is a bit akward, but it's required to + # make sure that attn_weights keeps its gradient. + # In order to do so, attn_weights have to reshaped + # twice and have to be reused in the following + attn_weights_reshaped = attn_weights.view( + bsz, self.num_heads, tgt_len, src_len + ) + attn_weights = attn_weights_reshaped.view( + bsz * self.num_heads, tgt_len, src_len + ) + else: + attn_weights_reshaped = None + + attn_probs = nn.functional.dropout( + attn_weights, p=self.dropout, training=self.training + ) + + attn_output = torch.bmm(attn_probs, value_states) + + if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim): + raise ValueError( + f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is" + f" {attn_output.size()}" + ) + + attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim) + attn_output = attn_output.transpose(1, 2) + attn_output = attn_output.reshape(bsz, tgt_len, -1) + + attn_output = self.get_o_proj().o_proj(attn_output) + + return attn_output, attn_weights_reshaped + + +class NeuronSiglipMultiheadAttentionPoolingHead(nn.Module): + def __init__(self, config: InferenceConfig): + super().__init__() + + self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size)) + self.attention = NeuronSiglipMultiheadAttention(config) + self.layernorm = LayerNorm(config.hidden_size, eps=config.layer_norm_eps) + self.mlp = NeuronSiglipMLP(config) + + def forward(self, hidden_state): + batch_size = hidden_state.shape[0] + probe = self.probe.repeat(batch_size, 1, 1) + + hidden_state = self.attention(probe, hidden_state, hidden_state)[0] + + residual = hidden_state + hidden_state = self.layernorm(hidden_state) + hidden_state = residual + self.mlp(hidden_state) + + return hidden_state[:, 0] + + +class NeuronSiglipVisionEmbeddings(nn.Module): + def __init__(self, config: InferenceConfig): + super().__init__() + self.config = config + self.embed_dim = config.hidden_size + self.image_size = config.image_size + self.patch_size = config.patch_size + self.num_patches = (self.image_size // self.patch_size) ** 2 + self.num_positions = self.num_patches + + if parallel_state.model_parallel_is_initialized(): + self.patch_embedding = OutputChannelParallelConv2d( + in_channels=config.num_channels, + out_channels=self.embed_dim, + kernel_size=self.patch_size, + stride=self.patch_size, + padding=0, # padding="valid" in nn.Conv2d + partition_pad=True, + ) + + self.position_embedding = ParallelEmbedding( + self.num_positions, + self.embed_dim, + shard_across_embedding=True, + pad=True, + ) + + else: + self.patch_embedding = nn.Conv2d( + in_channels=config.num_channels, + out_channels=self.embed_dim, + kernel_size=self.patch_size, + stride=self.patch_size, + padding="valid", + ) + self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim) + + self.register_buffer( + "position_ids", + torch.arange(self.num_positions).expand((1, -1)), + persistent=False, + ) + + def interpolate_pos_encoding( + self, embeddings: torch.Tensor, height: int, width: int + ) -> torch.Tensor: + """ + This method allows to interpolate the pre-trained position encodings, to be able to use the model on higher resolution + images. This method is also adapted to support torch.jit tracing and no class embeddings. + + Adapted from: + - https://github.com/facebookresearch/dino/blob/de9ee3df6cf39fac952ab558447af1fa1365362a/vision_transformer.py#L174-L194, and + - https://github.com/facebookresearch/dinov2/blob/e1277af2ba9496fbadf7aec6eba56e8d882d1e35/dinov2/models/vision_transformer.py#L179-L211 + """ + + num_patches = embeddings.shape[1] + num_positions = self.position_embedding.weight.shape[0] + + # always interpolate when tracing to ensure the exported model works for dynamic input shapes + if ( + not torch.jit.is_tracing() + and num_patches == num_positions + and height == width + ): + return self.position_embedding(self.position_ids) + + patch_pos_embed = self.position_embedding.weight.unsqueeze(0) + + dim = embeddings.shape[-1] + + new_height = height // self.patch_size + new_width = width // self.patch_size + + sqrt_num_positions = torch_int(num_positions**0.5) + patch_pos_embed = patch_pos_embed.reshape( + 1, sqrt_num_positions, sqrt_num_positions, dim + ) + patch_pos_embed = patch_pos_embed.permute(0, 3, 1, 2) + + patch_pos_embed = nn.functional.interpolate( + patch_pos_embed, + size=(new_height, new_width), + mode="bicubic", + align_corners=False, + ) + + patch_pos_embed = patch_pos_embed.permute(0, 2, 3, 1).view(1, -1, dim) + return patch_pos_embed + + def forward( + self, pixel_values: torch.FloatTensor, interpolate_pos_encoding=False + ) -> torch.Tensor: + _, _, height, width = pixel_values.shape + target_dtype = self.patch_embedding.weight.dtype + # Convert pixel_values to target dtype before passing to patch_embedding to avoid mixed dtype errors + pixel_values_converted = pixel_values.to(dtype=target_dtype) + patch_embeds = self.patch_embedding( + pixel_values_converted + ) # shape = [*, width, grid, grid] + embeddings = patch_embeds.flatten(2).transpose(1, 2) + + if interpolate_pos_encoding: + embeddings = embeddings + self.interpolate_pos_encoding( + embeddings, height, width + ) + else: + # Ensure position embeddings match the dtype of embeddings + pos_emb = self.position_embedding(self.position_ids) + embeddings = embeddings + pos_emb.to(dtype=embeddings.dtype) + return embeddings + + +class NeuronSiglipVisionTransformer(nn.Module): + def __init__(self, config: InferenceConfig): + super().__init__() + self.config = config + embed_dim = config.hidden_size + + self.embeddings = NeuronSiglipVisionEmbeddings(config) + self.encoder = NeuronSiglipEncoder(config) + self.post_layernorm = LayerNorm(embed_dim, eps=config.layer_norm_eps) + self.use_head = ( + True if not hasattr(config, "vision_use_head") else config.vision_use_head + ) + if self.use_head: + self.head = NeuronSiglipMultiheadAttentionPoolingHead(config) + + def forward( + self, + pixel_values, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + interpolate_pos_encoding: Optional[bool] = False, + ) -> BaseModelOutputWithPooling: + # InferenceConfig doesn't have HF PretrainedConfig defaults, so set them here + output_attentions = ( + output_attentions if output_attentions is not None else False + ) + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else False + ) + + hidden_states = self.embeddings( + pixel_values, interpolate_pos_encoding=interpolate_pos_encoding + ) + + encoder_outputs = self.encoder( + inputs_embeds=hidden_states, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + ) + + last_hidden_state = encoder_outputs.last_hidden_state + last_hidden_state = self.post_layernorm(last_hidden_state) + + pooler_output = self.head(last_hidden_state) if self.use_head else None + + return BaseModelOutputWithPooling( + last_hidden_state=last_hidden_state, + pooler_output=pooler_output, + hidden_states=encoder_outputs.hidden_states, + attentions=encoder_outputs.attentions, + ) + + +class NeuronSiglipVisionModel(nn.Module): + def __init__(self, config: InferenceConfig): + super().__init__() + self.vision_model = NeuronSiglipVisionTransformer(config) + + def get_input_embeddings(self) -> nn.Module: + return self.vision_model.embeddings.patch_embedding + + def forward( + self, + pixel_values, + output_attentions: Optional[bool] = None, + output_hidden_states: Optional[bool] = None, + interpolate_pos_encoding: bool = False, + ): + return self.vision_model( + pixel_values=pixel_values, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + interpolate_pos_encoding=interpolate_pos_encoding, + ) diff --git a/contrib/models/Isaac-0.2-2B/src/isaac_neuron/utils.py b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/utils.py new file mode 100644 index 00000000..1168dd4c --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/src/isaac_neuron/utils.py @@ -0,0 +1,109 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Utility functions for Isaac NxDI contrib model.""" + +from collections import OrderedDict +import gc + +import torch +from neuronx_distributed_inference.models.config import NeuronConfig + + +StateDict = OrderedDict[str, torch.FloatTensor] + + +def _helper_concat_and_delete_qkv( + state_dict: StateDict, prefix: str, attr: str +) -> None: + """Concatenate Q, K, V weights into fused Wqkv tensor and delete originals.""" + full_state_key_q_proj = f"{prefix}.qkv_proj.q_proj.{attr}" + full_state_key_k_proj = f"{prefix}.qkv_proj.k_proj.{attr}" + full_state_key_v_proj = f"{prefix}.qkv_proj.v_proj.{attr}" + + if ( + full_state_key_q_proj in state_dict + and full_state_key_k_proj in state_dict + and full_state_key_v_proj in state_dict + ): + state_dict[f"{prefix}.qkv_proj.Wqkv.{attr}"] = torch.cat( + [ + state_dict[full_state_key_q_proj], + state_dict[full_state_key_k_proj], + state_dict[full_state_key_v_proj], + ], + dim=0, + ) + del state_dict[full_state_key_q_proj] + del state_dict[full_state_key_k_proj] + del state_dict[full_state_key_v_proj] + + +def convert_state_dict_to_fused_qkv( + state_dict: StateDict, + num_layers: int, + neuron_config: NeuronConfig, + prefix: str, +) -> StateDict: + """Convert separate Q, K, V weights to fused QKV format for all layers.""" + for layer_num in range(num_layers): + layer_prefix = prefix.format(layer_num=layer_num) + _helper_concat_and_delete_qkv(state_dict, layer_prefix, "weight") + _helper_concat_and_delete_qkv(state_dict, layer_prefix, "bias") + is_qkv_quantized = ( + neuron_config.quantized_mlp_kernel_enabled or neuron_config.quantized + ) and f"{layer_prefix}.qkv_proj.q_proj.scale" in state_dict + if is_qkv_quantized: + _helper_concat_and_delete_qkv(state_dict, layer_prefix, "scale") + + gc.collect() + return state_dict + + +def pixel_shuffle_varlen(hidden_states: torch.Tensor, scale: int = 2) -> torch.Tensor: + """Apply pixel shuffle (channel concatenation) to vision encoder output. + + This is a deterministic CPU-side operation that merges scale x scale patches + by concatenating along the channel dimension. + + Isaac's pixel shuffle: + - Input: (batch, num_patches, hidden_dim) where num_patches = (H/p * W/p) + - After reshape to (batch, H/p, W/p, hidden_dim) + - Group scale x scale patches and concatenate channels + - Output: (batch, num_patches / scale^2, hidden_dim * scale^2) + + For Isaac: hidden_dim=1152, scale=2 -> output hidden_dim=4608 + + Args: + hidden_states: Vision encoder output of shape (batch, num_patches, hidden_dim) + scale: Pixel shuffle scale factor (default: 2) + + Returns: + Shuffled tensor of shape (batch, num_patches // scale^2, hidden_dim * scale^2) + """ + batch_size, num_patches, hidden_dim = hidden_states.shape + + # Compute spatial dimensions + h = w = int(num_patches**0.5) + assert h * w == num_patches, f"num_patches {num_patches} is not a perfect square" + assert h % scale == 0 and w % scale == 0, ( + f"Spatial dims ({h}, {w}) not divisible by scale {scale}" + ) + + # Reshape to spatial: (batch, h, w, hidden_dim) + hidden_states = hidden_states.view(batch_size, h, w, hidden_dim) + + # Group into scale x scale blocks + new_h = h // scale + new_w = w // scale + hidden_states = hidden_states.view( + batch_size, new_h, scale, new_w, scale, hidden_dim + ) + + # Rearrange: (batch, new_h, new_w, scale, scale, hidden_dim) + hidden_states = hidden_states.permute(0, 1, 3, 2, 4, 5).contiguous() + + # Concatenate channels: (batch, new_h * new_w, hidden_dim * scale^2) + hidden_states = hidden_states.view( + batch_size, new_h * new_w, hidden_dim * scale * scale + ) + + return hidden_states diff --git a/contrib/models/Isaac-0.2-2B/test/__init__.py b/contrib/models/Isaac-0.2-2B/test/__init__.py new file mode 100644 index 00000000..fb28dfcd --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/__init__.py @@ -0,0 +1 @@ +# Copyright 2025 © Amazon.com and Affiliates diff --git a/contrib/models/Isaac-0.2-2B/test/integration/__init__.py b/contrib/models/Isaac-0.2-2B/test/integration/__init__.py new file mode 100644 index 00000000..fb28dfcd --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/__init__.py @@ -0,0 +1 @@ +# Copyright 2025 © Amazon.com and Affiliates diff --git a/contrib/models/Isaac-0.2-2B/test/integration/benchmark.py b/contrib/models/Isaac-0.2-2B/test/integration/benchmark.py new file mode 100644 index 00000000..3f0bc0f5 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/benchmark.py @@ -0,0 +1,454 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Formal benchmark for Isaac on trn2.3xlarge. + +Measures TTFT, TPOT, tok/s, and HBM usage with warmup and multiple iterations. + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + python benchmark.py [--seq-len 1024] [--warmup 3] [--iterations 10] +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import argparse # noqa: E402 +import json # noqa: E402 +import os # noqa: E402 +import statistics # noqa: E402 +import time # noqa: E402 + +import torch # noqa: E402 +import torchvision.transforms as T # noqa: E402 +from PIL import Image # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, GenerationConfig # noqa: E402 +from transformers.image_utils import load_image # noqa: E402 + +from neuronx_distributed_inference.models.config import ( # noqa: E402 + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( # noqa: E402 + prepare_sampling_params, +) + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" +IMAGE_TOKEN_ID = 151655 +IMAGE_SIZE = 256 +NUM_VISION_TOKENS = 64 # (256/16)^2 / 4 + +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def create_model_and_tokenizer(seq_len, tp=1): + """Create and load model at specified config.""" + traced_path = f"{DATA_PATH}/traced_model/Isaac-0.2-2B-bench-s{seq_len}-tp{tp}" + + text_config = NeuronConfig( + batch_size=1, + seq_len=seq_len, + torch_dtype=torch.bfloat16, + tp_degree=tp, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[seq_len], + token_generation_buckets=[seq_len], + async_mode=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + fused_qkv=False, + sequence_parallel_enabled=False, + attn_kernel_enabled=True, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + vision_config = NeuronConfig( + batch_size=1, + seq_len=seq_len, + torch_dtype=torch.bfloat16, + tp_degree=tp, + world_size=tp, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + config.image_token_index = IMAGE_TOKEN_ID + + tokenizer = AutoTokenizer.from_pretrained( + MODEL_PATH, padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + + # Compile or load + if not os.path.exists(traced_path): + print(f" Compiling (seq_len={seq_len}, TP={tp})...") + t0 = time.time() + model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config) + model.compile(traced_path, debug=False) + tokenizer.save_pretrained(traced_path) + print(f" Compiled in {time.time() - t0:.1f}s") + model.load(traced_path, skip_warmup=True) + else: + print(f" Loading from {traced_path}...") + model = NeuronIsaacForConditionalGeneration(traced_path, config) + model.load(traced_path, skip_warmup=True) + + return model, tokenizer + + +def benchmark_text(model, tokenizer, prompt, max_new_tokens, warmup, iterations): + """Benchmark text-only generation with proper warmup and timing.""" + gen_model = HuggingFaceGenerationAdapter(model) + + messages = [{"role": "user", "content": prompt}] + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + attention_mask = torch.ones_like(input_ids) + input_len = input_ids.shape[1] + + sampling_params = prepare_sampling_params( + batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0] + ) + gen_config = GenerationConfig( + do_sample=False, + output_scores=True, + return_dict_in_generate=True, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=max_new_tokens, + ) + + # Warmup + for _ in range(warmup): + gen_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=max_new_tokens, + ) + + # Timed iterations + latencies = [] + token_counts = [] + for _ in range(iterations): + t0 = time.time() + outputs = gen_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=max_new_tokens, + ) + elapsed = time.time() - t0 + + generated = outputs.sequences[0, input_len:] + n_tokens = len(generated) + latencies.append(elapsed) + token_counts.append(n_tokens) + + gen_text = tokenizer.decode( + outputs.sequences[0, input_len:], skip_special_tokens=True + ) + + avg_tokens = statistics.mean(token_counts) + avg_latency = statistics.mean(latencies) + # TTFT ≈ latency - (n_tokens - 1) * TPOT; approximate TPOT from overall + avg_tpot = avg_latency / avg_tokens if avg_tokens > 1 else avg_latency + avg_ttft = ( + avg_latency - (avg_tokens - 1) * avg_tpot if avg_tokens > 1 else avg_latency + ) + avg_tps = avg_tokens / avg_latency + + return { + "input_tokens": input_len, + "avg_output_tokens": avg_tokens, + "avg_latency_s": avg_latency, + "ttft_ms": avg_ttft * 1000, + "tpot_ms": avg_tpot * 1000, + "tok_per_sec": avg_tps, + "latency_std_ms": statistics.stdev(latencies) * 1000 + if len(latencies) > 1 + else 0, + "text_preview": gen_text[:150], + } + + +def benchmark_image_text(model, tokenizer, max_new_tokens, warmup, iterations): + """Benchmark image+text generation.""" + gen_model = HuggingFaceGenerationAdapter(model) + + # Load test image + try: + ref_img = load_image( + "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp" + ) + except Exception: + ref_img = Image.new("RGB", (256, 256), color="blue") + + transform = T.Compose( + [ + T.Resize( + (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC + ), + T.ToTensor(), + T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), + ] + ) + pixel_values = transform(ref_img).unsqueeze(0).to(torch.bfloat16) + + # Build input with image tokens + prompt = "Describe this image in detail." + messages = [{"role": "user", "content": f"\n{prompt}"}] + text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + full_ids = tokenizer.encode(text, return_tensors="pt")[0] + + image_text_ids = tokenizer.encode("", add_special_tokens=False) + image_text_tensor = torch.tensor(image_text_ids) + found_pos = -1 + for idx in range(len(full_ids) - len(image_text_ids) + 1): + if torch.equal(full_ids[idx : idx + len(image_text_ids)], image_text_tensor): + found_pos = idx + break + + if found_pos >= 0: + before = full_ids[:found_pos] + after = full_ids[found_pos + len(image_text_ids) :] + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0) + else: + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0) + + attention_mask = torch.ones_like(input_ids) + vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool) + input_len = input_ids.shape[1] + + sampling_params = prepare_sampling_params( + batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0] + ) + gen_config = GenerationConfig( + do_sample=False, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=max_new_tokens, + ) + + # Warmup + for _ in range(warmup): + gen_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=max_new_tokens, + pixel_values=pixel_values, + vision_mask=vision_mask, + ) + + # Timed iterations + latencies = [] + token_counts = [] + for _ in range(iterations): + t0 = time.time() + outputs = gen_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=max_new_tokens, + pixel_values=pixel_values, + vision_mask=vision_mask, + ) + elapsed = time.time() - t0 + + generated = outputs[0, input_len:] + n_tokens = len(generated) + latencies.append(elapsed) + token_counts.append(n_tokens) + + gen_text = tokenizer.decode(outputs[0, input_len:], skip_special_tokens=True) + + avg_tokens = statistics.mean(token_counts) + avg_latency = statistics.mean(latencies) + avg_tpot = avg_latency / avg_tokens if avg_tokens > 1 else avg_latency + avg_ttft = ( + avg_latency - (avg_tokens - 1) * avg_tpot if avg_tokens > 1 else avg_latency + ) + avg_tps = avg_tokens / avg_latency + + return { + "input_tokens": input_len, + "vision_tokens": NUM_VISION_TOKENS, + "avg_output_tokens": avg_tokens, + "avg_latency_s": avg_latency, + "ttft_ms": avg_ttft * 1000, + "tpot_ms": avg_tpot * 1000, + "tok_per_sec": avg_tps, + "latency_std_ms": statistics.stdev(latencies) * 1000 + if len(latencies) > 1 + else 0, + "text_preview": gen_text[:150], + } + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--seq-len", type=int, default=1024) + parser.add_argument("--tp", type=int, default=1) + parser.add_argument("--warmup", type=int, default=3) + parser.add_argument("--iterations", type=int, default=10) + parser.add_argument("--max-new-tokens", type=int, default=128) + args = parser.parse_args() + + print(f"{'=' * 70}") + print(f"ISAAC BENCHMARK — seq_len={args.seq_len}, TP={args.tp}") + print( + f"warmup={args.warmup}, iterations={args.iterations}, max_new_tokens={args.max_new_tokens}" + ) + print(f"{'=' * 70}") + + model, tokenizer = create_model_and_tokenizer(args.seq_len, args.tp) + + all_results = { + "config": { + "seq_len": args.seq_len, + "tp": args.tp, + "batch_size": 1, + "warmup": args.warmup, + "iterations": args.iterations, + "max_new_tokens": args.max_new_tokens, + "instance": "trn2.3xlarge", + "lnc": 2, + "sdk": "2.29", + "model": "Isaac-0.2-2B-Preview", + }, + "text_benchmarks": [], + "image_text_benchmark": None, + } + + # Text benchmarks — short, medium, long prompts + text_prompts = [ + ("short", "The capital of France is", 32), + ("medium", "Explain quantum entanglement in simple terms:", 128), + ( + "long", + "Write a detailed essay about the history and future of artificial intelligence, " + "covering its origins, key milestones, current capabilities, and predictions " + "for the next decade:", + args.max_new_tokens, + ), + ] + + for label, prompt, max_tok in text_prompts: + print(f"\n--- Text benchmark: {label} (max_new_tokens={max_tok}) ---") + result = benchmark_text( + model, tokenizer, prompt, max_tok, args.warmup, args.iterations + ) + result["label"] = label + result["prompt"] = prompt[:80] + all_results["text_benchmarks"].append(result) + print( + f" Input: {result['input_tokens']} tok, Output: {result['avg_output_tokens']:.0f} tok" + ) + print(f" TTFT: {result['ttft_ms']:.1f}ms") + print(f" TPOT: {result['tpot_ms']:.2f}ms") + print(f" Throughput: {result['tok_per_sec']:.1f} tok/s") + print(f" Latency std: {result['latency_std_ms']:.1f}ms") + + # Image+text benchmark + print(f"\n--- Image+text benchmark ---") + img_result = benchmark_image_text( + model, tokenizer, args.max_new_tokens, args.warmup, args.iterations + ) + all_results["image_text_benchmark"] = img_result + print( + f" Input: {img_result['input_tokens']} tok ({img_result['vision_tokens']} vision)" + ) + print(f" Output: {img_result['avg_output_tokens']:.0f} tok") + print(f" TTFT: {img_result['ttft_ms']:.1f}ms (includes vision encoding)") + print(f" TPOT: {img_result['tpot_ms']:.2f}ms") + print(f" Throughput: {img_result['tok_per_sec']:.1f} tok/s") + + # Summary table + print(f"\n{'=' * 70}") + print("BENCHMARK SUMMARY") + print(f"{'=' * 70}") + print( + f"{'Workload':<20} {'In':>5} {'Out':>5} {'TTFT(ms)':>10} {'TPOT(ms)':>10} {'tok/s':>8}" + ) + print("-" * 60) + for r in all_results["text_benchmarks"]: + print( + f"{r['label']:<20} {r['input_tokens']:>5} {r['avg_output_tokens']:>5.0f} " + f"{r['ttft_ms']:>10.1f} {r['tpot_ms']:>10.2f} {r['tok_per_sec']:>8.1f}" + ) + ir = all_results["image_text_benchmark"] + print( + f"{'image+text':<20} {ir['input_tokens']:>5} {ir['avg_output_tokens']:>5.0f} " + f"{ir['ttft_ms']:>10.1f} {ir['tpot_ms']:>10.2f} {ir['tok_per_sec']:>8.1f}" + ) + + # Save + out_path = os.path.join( + REFERENCE_DIR, f"benchmark_s{args.seq_len}_tp{args.tp}.json" + ) + with open(out_path, "w") as f: + json.dump(all_results, f, indent=2, default=str) + print(f"\nResults saved to {out_path}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/kernel_sweep.py b/contrib/models/Isaac-0.2-2B/test/integration/kernel_sweep.py new file mode 100644 index 00000000..3a00b9d2 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/kernel_sweep.py @@ -0,0 +1,313 @@ +#!/usr/bin/env python3 +"""Kernel sweep for Isaac-0.2-2B: test TKG attention block, MLP, out_proj, and combos. + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:/mnt/models/neuronx-distributed-inference/src:$PYTHONPATH \ + python3 kernel_sweep.py +""" + +import os +import sys +import time +import json +import torch +import traceback + +# Ensure the correct paths +NXDI_ROOT = "/mnt/models/neuronx-distributed-inference" +sys.path.insert(0, f"{NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src") +sys.path.insert(0, f"{NXDI_ROOT}/src") + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +from transformers import AutoConfig, AutoTokenizer +from neuronx_distributed_inference.models.config import ( + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( + prepare_sampling_params, +) +from isaac_neuron.modeling_isaac import ( + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +MODEL_PATH = "/mnt/models/Isaac-0.2-2B-Preview" +COMPILED_BASE = "/mnt/models/traced_model/Isaac-0.2-2B" + +# Kernel configurations to test +CONFIGS = { + "baseline": { + "desc": "No kernels (reference)", + "tp": 1, + "flags": { + "attn_kernel_enabled": False, + "mlp_kernel_enabled": False, + "fused_qkv": False, + }, + }, + "cte_flash_only": { + "desc": "CTE flash attention only (current production config)", + "tp": 1, + "flags": { + "attn_kernel_enabled": True, + "mlp_kernel_enabled": False, + "fused_qkv": False, + }, + }, + "mlp_tp1": { + "desc": "MLP kernel at TP=1 (nkilib production, NOT experimental)", + "tp": 1, + "flags": { + "attn_kernel_enabled": True, + "mlp_kernel_enabled": True, + "fused_qkv": False, + }, + }, + "tkg_block": { + "desc": "TKG attention block kernel (fuses RMSNorm+QKV+QKnorm+RoPE+Attn+Oproj)", + "tp": 1, + "flags": { + "attn_kernel_enabled": True, + "mlp_kernel_enabled": False, + "fused_qkv": True, + "qkv_kernel_enabled": True, + "attn_block_tkg_nki_kernel_enabled": True, + }, + }, + "tkg_block_plus_mlp": { + "desc": "TKG block + MLP kernel (full TKG optimization)", + "tp": 1, + "flags": { + "attn_kernel_enabled": True, + "mlp_kernel_enabled": True, + "fused_qkv": True, + "qkv_kernel_enabled": True, + "attn_block_tkg_nki_kernel_enabled": True, + }, + }, + "out_proj": { + "desc": "CTE flash + out_proj kernel", + "tp": 1, + "flags": { + "attn_kernel_enabled": True, + "mlp_kernel_enabled": False, + "fused_qkv": False, + "out_proj_kernel_enabled": True, + }, + }, + "tkg_block_mlp_outproj": { + "desc": "TKG block + MLP + out_proj (maximum kernel coverage)", + "tp": 1, + "flags": { + "attn_kernel_enabled": True, + "mlp_kernel_enabled": True, + "fused_qkv": True, + "qkv_kernel_enabled": True, + "attn_block_tkg_nki_kernel_enabled": True, + "out_proj_kernel_enabled": True, + }, + }, +} + + +def build_config(config_name, flags, tp_degree=1, seq_len=1024): + """Build IsaacInferenceConfig with specified kernel flags.""" + compiled_dir = f"{COMPILED_BASE}/kernel_sweep_{config_name}" + + text_config = NeuronConfig( + batch_size=1, + seq_len=seq_len, + torch_dtype=torch.bfloat16, + tp_degree=tp_degree, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[seq_len], + token_generation_buckets=[seq_len], + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + save_sharded_checkpoint=True, + **flags, + ) + + vision_config = NeuronConfig( + batch_size=1, + seq_len=seq_len, + torch_dtype=torch.bfloat16, + tp_degree=tp_degree, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + save_sharded_checkpoint=True, + fused_qkv=False, + ) + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + + inference_config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + # Override save/compiled paths + inference_config.save_path = compiled_dir + inference_config.compiled_model_path = compiled_dir + + return inference_config + + +def test_config(config_name, config_info): + """Test a single kernel configuration: compile, load, generate, benchmark.""" + print(f"\n{'=' * 70}") + print(f"Testing: {config_name}") + print(f" {config_info['desc']}") + print(f" Flags: {config_info['flags']}") + print(f"{'=' * 70}") + + tp = config_info["tp"] + flags = config_info["flags"] + + try: + inference_config = build_config(config_name, flags, tp_degree=tp) + compiled_dir = f"{COMPILED_BASE}/kernel_sweep_{config_name}" + + # Compile + t0 = time.time() + print(f" Compiling...") + model = NeuronIsaacForConditionalGeneration(MODEL_PATH, inference_config) + model.compile(compiled_dir, debug=False) + compile_time = time.time() - t0 + print(f" Compile time: {compile_time:.1f}s") + + # Load + print(f" Loading compiled model...") + model.load(compiled_dir, skip_warmup=True) + + # Generate text-only + tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True) + tokenizer.pad_token = tokenizer.eos_token + prompt = "What is the capital of France?" + messages = [{"role": "user", "content": prompt}] + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + attention_mask = torch.ones_like(input_ids) + + generation_model = HuggingFaceGenerationAdapter(model) + sampling_params = prepare_sampling_params( + batch_size=1, + top_k=[1], + top_p=[1.0], + temperature=[0.0], + ) + gen_kwargs = dict( + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + max_new_tokens=50, + ) + + # Warmup + print(f" Warmup (3 runs)...") + for _ in range(3): + out = generation_model.generate(input_ids, **gen_kwargs) + + # Benchmark (10 runs) + print(f" Benchmarking (10 runs, 50 tokens each)...") + times = [] + for _ in range(10): + t0 = time.time() + out = generation_model.generate(input_ids, **gen_kwargs) + times.append(time.time() - t0) + + output_text = tokenizer.decode( + out[0][input_ids.shape[1] :], skip_special_tokens=True + ) + avg_time = sum(times) / len(times) + tok_per_sec = 50 / avg_time + tpot_ms = (avg_time / 50) * 1000 + + result = { + "status": "SUCCESS", + "compile_time_s": compile_time, + "avg_time_s": avg_time, + "tok_per_sec": tok_per_sec, + "tpot_ms": tpot_ms, + "output_preview": output_text[:100], + } + print(f" tok/s: {tok_per_sec:.1f}") + print(f" TPOT: {tpot_ms:.2f} ms") + print(f" Output: {output_text[:80]}...") + + # Cleanup + del model + torch.cuda.empty_cache() if torch.cuda.is_available() else None + + return result + + except Exception as e: + tb = traceback.format_exc() + print(f" FAILED: {e}") + print(f" {tb[-500:]}") + return { + "status": "FAILED", + "error": str(e), + "traceback": tb[-500:], + } + + +def main(): + # Parse args + configs_to_test = sys.argv[1:] if len(sys.argv) > 1 else list(CONFIGS.keys()) + + print(f"Isaac Kernel Sweep") + print(f"Configs to test: {configs_to_test}") + print(f"Model: {MODEL_PATH}") + + results = {} + for name in configs_to_test: + if name not in CONFIGS: + print(f"Unknown config: {name}, skipping") + continue + results[name] = test_config(name, CONFIGS[name]) + + # Summary + print(f"\n{'=' * 80}") + print(f"KERNEL SWEEP SUMMARY") + print(f"{'=' * 80}") + print(f"{'Config':<25} {'Status':<10} {'tok/s':>8} {'TPOT ms':>10} {'Compile':>10}") + print("-" * 70) + for name, r in results.items(): + if r["status"] == "SUCCESS": + print( + f"{name:<25} {'OK':<10} {r['tok_per_sec']:>8.1f} {r['tpot_ms']:>10.2f} {r['compile_time_s']:>10.1f}s" + ) + else: + print(f"{name:<25} {'FAIL':<10} {'—':>8} {'—':>10} {'—':>10}") + + # Save results + out_path = "/mnt/models/kernel_sweep_results.json" + with open(out_path, "w") as f: + json.dump(results, f, indent=2, default=str) + print(f"\nResults saved to {out_path}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py b/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py new file mode 100644 index 00000000..c1359cd5 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py @@ -0,0 +1,255 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Isaac-0.2-2B NxDI integration test script. + +Compiles and runs the Isaac VLM model on Neuron. +Supports both text-only and image+text generation. + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + python run_isaac.py +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import logging # noqa: E402 +import os # noqa: E402 + +import torch # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, AutoProcessor # noqa: E402 + +from neuronx_distributed_inference.models.config import ( + NeuronConfig, + OnDeviceSamplingConfig, +) # noqa: E402 +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( + prepare_sampling_params, +) # noqa: E402 + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# Configure logging +logger = logging.getLogger(__name__) +logger.setLevel(logging.DEBUG) + +# Model configuration +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") + +CONFIG = { + "TEXT_TP_DEGREE": 1, # TP=1 for 2B model on trn2.3xlarge + "VISION_TP_DEGREE": 1, + "WORLD_SIZE": 1, + "BATCH_SIZE": 1, + "SEQ_LENGTH": 1024, # Start small for initial compilation test + "CTX_BUCKETS": [1024], + "TKG_BUCKETS": [1024], + "DTYPE": torch.bfloat16, + "MODEL_PATH": f"{DATA_PATH}/Isaac-0.2-2B-Preview", + "TRACED_MODEL_PATH": f"{DATA_PATH}/traced_model/Isaac-0.2-2B", + "MAX_NEW_TOKENS": 50, + # Optimizations + "FUSED_QKV": False, # Start without QKV fusion + "VISION_FUSED_QKV": False, + "ASYNC_MODE": False, # Disable async for debugging + "OUTPUT_LOGITS": True, + "ON_DEVICE_SAMPLING": OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, # Greedy for validation + global_topk=256, + top_k_kernel_enabled=True, + ), +} + +# Environment setup +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def create_neuron_configs(): + """Create text and vision neuron configurations.""" + text_config = NeuronConfig( + batch_size=CONFIG["BATCH_SIZE"], + seq_len=CONFIG["SEQ_LENGTH"], + torch_dtype=CONFIG["DTYPE"], + # Distributed + tp_degree=CONFIG["TEXT_TP_DEGREE"], + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + # Continuous batching + is_continuous_batching=True, + ctx_batch_size=1, + # Bucketing + enable_bucketing=True, + context_encoding_buckets=CONFIG["CTX_BUCKETS"], + token_generation_buckets=CONFIG["TKG_BUCKETS"], + # Optimizations + async_mode=CONFIG["ASYNC_MODE"], + on_device_sampling_config=CONFIG["ON_DEVICE_SAMPLING"], + output_logits=CONFIG["OUTPUT_LOGITS"], + fused_qkv=CONFIG["FUSED_QKV"], + sequence_parallel_enabled=False, + # Kernels — conservative for initial test + # ISA limit: text MLP intermediate=6144 > 4096 at TP=1 + attn_kernel_enabled=False, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + vision_config = NeuronConfig( + batch_size=CONFIG["BATCH_SIZE"], + seq_len=CONFIG["SEQ_LENGTH"], + torch_dtype=CONFIG["DTYPE"], + # Distributed + tp_degree=CONFIG["VISION_TP_DEGREE"], + world_size=CONFIG["WORLD_SIZE"], + save_sharded_checkpoint=True, + # Continuous batching + is_continuous_batching=True, + ctx_batch_size=1, + # Bucketing + enable_bucketing=True, + buckets=[1], + # Optimizations + fused_qkv=CONFIG["VISION_FUSED_QKV"], + # Kernels — all disabled for vision encoder + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + return text_config, vision_config + + +def setup_model(): + """Initialize model configuration and compile/load.""" + text_config, vision_config = create_neuron_configs() + + # Isaac uses trust_remote_code; load HF config directly + hf_config = AutoConfig.from_pretrained(CONFIG["MODEL_PATH"], trust_remote_code=True) + + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + + print( + f"Text config: {config.text_config.num_hidden_layers} layers, " + f"hidden={config.text_config.hidden_size}" + ) + print( + f"Vision config: {config.vision_config.num_hidden_layers} layers, " + f"hidden={config.vision_config.hidden_size}" + ) + + tokenizer = AutoTokenizer.from_pretrained( + CONFIG["MODEL_PATH"], padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + + return config, tokenizer + + +def compile_model(config, tokenizer): + """Compile model (text + vision) and save traced artifacts.""" + print("\nCompiling Isaac model (text + vision)...") + model = NeuronIsaacForConditionalGeneration(CONFIG["MODEL_PATH"], config) + # debug=False to avoid profiler's CUDA introspection issue on Neuron instances + model.compile(CONFIG["TRACED_MODEL_PATH"], debug=False) + tokenizer.save_pretrained(CONFIG["TRACED_MODEL_PATH"]) + print(f"Model compiled and saved to {CONFIG['TRACED_MODEL_PATH']}") + # Load compiled model for inference + model.load(CONFIG["TRACED_MODEL_PATH"], skip_warmup=True) + return model + + +def load_model(): + """Load pre-compiled model from traced checkpoint.""" + print(f"\nLoading model from {CONFIG['TRACED_MODEL_PATH']}...") + model = NeuronIsaacForConditionalGeneration(CONFIG["TRACED_MODEL_PATH"]) + model.load(CONFIG["TRACED_MODEL_PATH"], skip_warmup=True) + return model + + +def run_text_only(model, tokenizer): + """Run text-only generation test.""" + print("\n=== Text-only Generation ===") + prompt = "The capital of France is" + + messages = [{"role": "user", "content": prompt}] + # Use tokenizer directly (Isaac's processor requires tensor_stream for images) + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + attention_mask = torch.ones_like(input_ids) + + print(f"Input: '{prompt}'") + print(f"Input IDs shape: {input_ids.shape}") + + generation_model = HuggingFaceGenerationAdapter(model) + sampling_params = prepare_sampling_params( + batch_size=CONFIG["BATCH_SIZE"], + top_k=[1], + top_p=[1.0], + temperature=[0.0], + ) + + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + max_new_tokens=CONFIG["MAX_NEW_TOKENS"], + ) + + output_text = tokenizer.batch_decode(outputs, skip_special_tokens=True) + for i, text in enumerate(output_text): + print(f"Output {i}: {text}") + + +def main(): + import sys + + config, tokenizer = setup_model() + + mode = sys.argv[1] if len(sys.argv) > 1 else "auto" + + if mode == "compile": + # Force recompilation + import shutil + + if os.path.exists(CONFIG["TRACED_MODEL_PATH"]): + print(f"Removing old traced model at {CONFIG['TRACED_MODEL_PATH']}...") + shutil.rmtree(CONFIG["TRACED_MODEL_PATH"]) + model = compile_model(config, tokenizer) + elif mode == "load": + # Load only + model = load_model() + else: + # Auto: compile if not found, else load + if not os.path.exists(CONFIG["TRACED_MODEL_PATH"]): + model = compile_model(config, tokenizer) + else: + model = load_model() + + run_text_only(model, tokenizer) + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_kernels.py b/contrib/models/Isaac-0.2-2B/test/integration/test_kernels.py new file mode 100644 index 00000000..cd933cc9 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/test_kernels.py @@ -0,0 +1,357 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Test NKI kernel enablement for Isaac at TP=1. + +Incrementally enables kernels and validates: +1. Compilation succeeds +2. Accuracy matches baseline (cosine vs CPU reference) +3. Throughput improvement + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + python test_kernels.py +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import json # noqa: E402 +import os # noqa: E402 +import shutil # noqa: E402 +import sys # noqa: E402 +import time # noqa: E402 +import traceback # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, GenerationConfig # noqa: E402 + +from neuronx_distributed_inference.models.config import ( # noqa: E402 + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( # noqa: E402 + prepare_sampling_params, +) + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" + +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + +# Kernel configurations to test (incremental enablement) +KERNEL_CONFIGS = { + "baseline": { + "description": "No kernels (current default)", + "text_config": { + "fused_qkv": False, + "attn_kernel_enabled": False, + "attn_tkg_nki_kernel_enabled": False, + "attn_tkg_builtin_kernel_enabled": False, + "qkv_kernel_enabled": False, + "mlp_kernel_enabled": False, + }, + }, + "cte_flash_attn": { + "description": "CTE flash attention only", + "text_config": { + "fused_qkv": False, + "attn_kernel_enabled": True, + "attn_tkg_nki_kernel_enabled": False, + "attn_tkg_builtin_kernel_enabled": False, + "qkv_kernel_enabled": False, + "mlp_kernel_enabled": False, + }, + }, + "mlp_kernel": { + "description": "MLP kernel only", + "text_config": { + "fused_qkv": False, + "attn_kernel_enabled": False, + "attn_tkg_nki_kernel_enabled": False, + "attn_tkg_builtin_kernel_enabled": False, + "qkv_kernel_enabled": False, + "mlp_kernel_enabled": True, + }, + }, + "qkv_kernel": { + "description": "QKV kernel (requires fused_qkv)", + "text_config": { + "fused_qkv": True, + "attn_kernel_enabled": False, + "attn_tkg_nki_kernel_enabled": False, + "attn_tkg_builtin_kernel_enabled": False, + "qkv_kernel_enabled": True, + "qkv_nki_kernel_enabled": True, + "mlp_kernel_enabled": False, + }, + }, + "cte_flash_plus_mlp": { + "description": "CTE flash attention + MLP kernel", + "text_config": { + "fused_qkv": False, + "attn_kernel_enabled": True, + "attn_tkg_nki_kernel_enabled": False, + "attn_tkg_builtin_kernel_enabled": False, + "qkv_kernel_enabled": False, + "mlp_kernel_enabled": True, + }, + }, + "full_suite": { + "description": "All kernels: CTE flash + QKV + MLP + fused residual", + "text_config": { + "fused_qkv": True, + "attn_kernel_enabled": True, + "attn_tkg_nki_kernel_enabled": False, + "attn_tkg_builtin_kernel_enabled": False, + "qkv_kernel_enabled": True, + "qkv_nki_kernel_enabled": True, + "mlp_kernel_enabled": True, + "mlp_kernel_fuse_residual_add": True, + "qkv_kernel_fuse_residual_add": True, + "out_proj_kernel_enabled": True, + }, + }, +} + +PROMPTS = [ + "The capital of France is", + "Explain quantum entanglement in simple terms:", +] + + +def create_config(kernel_name, kernel_cfg): + """Create config with specified kernel settings.""" + traced_path = f"{DATA_PATH}/traced_model/Isaac-0.2-2B-kernel-{kernel_name}" + + text_overrides = kernel_cfg["text_config"] + + text_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + async_mode=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + sequence_parallel_enabled=False, + **text_overrides, + ) + + vision_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + world_size=1, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + config.image_token_index = 151655 + + return config, traced_path + + +def test_kernel_config(kernel_name, kernel_cfg, tokenizer): + """Test a single kernel configuration.""" + print(f"\n{'=' * 70}") + print(f"Testing: {kernel_name} — {kernel_cfg['description']}") + print(f"{'=' * 70}") + + config, traced_path = create_config(kernel_name, kernel_cfg) + result = { + "name": kernel_name, + "description": kernel_cfg["description"], + "compiled": False, + "accuracy_pass": False, + "prompts": [], + "compile_time": None, + "error": None, + } + + # Clean and compile + if os.path.exists(traced_path): + shutil.rmtree(traced_path) + + try: + t0 = time.time() + model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config) + model.compile(traced_path, debug=False) + tokenizer.save_pretrained(traced_path) + compile_time = time.time() - t0 + model.load(traced_path, skip_warmup=True) + result["compiled"] = True + result["compile_time"] = compile_time + print(f" Compiled in {compile_time:.1f}s") + except Exception as e: + result["error"] = str(e) + print(f" COMPILATION FAILED: {e}") + traceback.print_exc() + return result + + # Validate accuracy + generation_model = HuggingFaceGenerationAdapter(model) + all_passed = True + + for i, prompt in enumerate(PROMPTS): + messages = [{"role": "user", "content": prompt}] + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + attention_mask = torch.ones_like(input_ids) + + sampling_params = prepare_sampling_params( + batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0] + ) + gen_config = GenerationConfig( + do_sample=False, + output_scores=True, + return_dict_in_generate=True, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=50, + ) + + t0 = time.time() + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=50, + ) + elapsed = time.time() - t0 + + generated = outputs.sequences[0, input_ids.shape[1] :] + gen_text = tokenizer.decode(generated, skip_special_tokens=True) + n_tokens = len(generated) + tok_per_sec = n_tokens / elapsed if elapsed > 0 else 0 + + # Compare first-token logits + neuron_logits = outputs.scores[0][0].float().cpu() + ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt") + cosine = -1.0 + if os.path.exists(ref_path): + ref_logits = torch.load(ref_path, map_location="cpu") + cosine = F.cosine_similarity( + neuron_logits.unsqueeze(0), ref_logits.unsqueeze(0) + ).item() + + top1_match = neuron_logits.argmax().item() == 151667 + passed = cosine >= 0.99 and top1_match + if not passed: + all_passed = False + + prompt_result = { + "prompt": prompt, + "cosine": cosine, + "top1_match": top1_match, + "passed": passed, + "text": gen_text[:200], + "n_tokens": n_tokens, + "tok_per_sec": tok_per_sec, + "elapsed": elapsed, + } + result["prompts"].append(prompt_result) + print( + f" Prompt {i}: cosine={cosine:.6f}, top1={'OK' if top1_match else 'MISS'}, " + f"{n_tokens} tok, {tok_per_sec:.1f} tok/s | {gen_text[:60]!r}" + ) + + result["accuracy_pass"] = all_passed + + # Cleanup model to free NeuronCores + del model + del generation_model + import gc + + gc.collect() + + return result + + +def main(): + tokenizer = AutoTokenizer.from_pretrained( + MODEL_PATH, padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + + results = [] + for name, cfg in KERNEL_CONFIGS.items(): + r = test_kernel_config(name, cfg, tokenizer) + results.append(r) + + # Summary table + print(f"\n{'=' * 70}") + print("KERNEL TEST SUMMARY") + print(f"{'=' * 70}") + print( + f"{'Config':<25} {'Compiled':>10} {'Accuracy':>10} {'Compile(s)':>12} {'tok/s (avg)':>12}" + ) + print("-" * 70) + for r in results: + compiled = "YES" if r["compiled"] else "FAIL" + accuracy = "PASS" if r["accuracy_pass"] else "FAIL" + compile_t = f"{r['compile_time']:.1f}" if r["compile_time"] else "N/A" + avg_tps = "N/A" + if r["prompts"]: + tps_vals = [p["tok_per_sec"] for p in r["prompts"] if p["tok_per_sec"] > 0] + if tps_vals: + avg_tps = f"{sum(tps_vals) / len(tps_vals):.1f}" + print( + f"{r['name']:<25} {compiled:>10} {accuracy:>10} {compile_t:>12} {avg_tps:>12}" + ) + + # Save results + out_path = os.path.join(REFERENCE_DIR, "kernel_test_results.json") + with open(out_path, "w") as f: + json.dump(results, f, indent=2, default=str) + print(f"\nResults saved to {out_path}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_scaling.py b/contrib/models/Isaac-0.2-2B/test/integration/test_scaling.py new file mode 100644 index 00000000..08968ab9 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/test_scaling.py @@ -0,0 +1,362 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Test Isaac scaling: sequence length and batch size. + +Tests compilation and throughput at various seq_len and batch_size. + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + + # Test single config + python test_scaling.py --seq-len 2048 --batch-size 1 + + # Test all configs (sequential) + python test_scaling.py --sweep +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import argparse # noqa: E402 +import json # noqa: E402 +import os # noqa: E402 +import shutil # noqa: E402 +import subprocess # noqa: E402 +import sys # noqa: E402 +import time # noqa: E402 +import traceback # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, GenerationConfig # noqa: E402 + +from neuronx_distributed_inference.models.config import ( # noqa: E402 + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( # noqa: E402 + prepare_sampling_params, +) + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" + +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def get_hbm_usage(): + """Get current HBM usage from neuron-ls.""" + try: + result = subprocess.run( + ["neuron-ls", "--json-output"], + capture_output=True, + text=True, + timeout=10, + ) + if result.returncode == 0: + data = json.loads(result.stdout) + for device in data: + mem = device.get("neuron_device", {}).get("memory", {}) + used = mem.get("used_bytes", 0) + total = mem.get("total_bytes", 0) + return used / 1e9, total / 1e9 # GB + except Exception: + pass + return None, None + + +def create_config(seq_len, batch_size, tp=1): + """Create configs for a given seq_len and batch_size.""" + traced_path = f"{DATA_PATH}/traced_model/Isaac-2B-s{seq_len}-b{batch_size}-tp{tp}" + + # Build bucketing: CTE uses the seq_len bucket, TKG uses same + cte_buckets = [seq_len] + tkg_buckets = [seq_len] + + text_config = NeuronConfig( + batch_size=batch_size, + seq_len=seq_len, + torch_dtype=torch.bfloat16, + tp_degree=tp, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=batch_size, + enable_bucketing=True, + context_encoding_buckets=cte_buckets, + token_generation_buckets=tkg_buckets, + async_mode=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + fused_qkv=False, + sequence_parallel_enabled=False, + # Enable CTE flash attention (verified working) + attn_kernel_enabled=True, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + vision_config = NeuronConfig( + batch_size=batch_size, + seq_len=seq_len, + torch_dtype=torch.bfloat16, + tp_degree=tp, + world_size=tp, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=batch_size, + enable_bucketing=True, + buckets=[batch_size], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + config.image_token_index = 151655 + + return config, traced_path + + +def test_config(seq_len, batch_size, tp=1, force_recompile=True): + """Test a single seq_len + batch_size configuration.""" + print(f"\n{'=' * 70}") + print(f"Testing: seq_len={seq_len}, batch_size={batch_size}, TP={tp}") + print(f"{'=' * 70}") + + result = { + "seq_len": seq_len, + "batch_size": batch_size, + "tp": tp, + "compiled": False, + "inference_ok": False, + "compile_time": None, + "hbm_used_gb": None, + "hbm_total_gb": None, + "ttft_ms": None, + "tkg_tok_per_sec": None, + "error": None, + } + + tokenizer = AutoTokenizer.from_pretrained( + MODEL_PATH, padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + + config, traced_path = create_config(seq_len, batch_size, tp) + + if force_recompile and os.path.exists(traced_path): + shutil.rmtree(traced_path) + + # Compile + try: + t0 = time.time() + model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config) + model.compile(traced_path, debug=False) + tokenizer.save_pretrained(traced_path) + compile_time = time.time() - t0 + result["compiled"] = True + result["compile_time"] = compile_time + print(f" Compiled in {compile_time:.1f}s") + except Exception as e: + result["error"] = str(e)[:500] + print(f" COMPILATION FAILED: {str(e)[:200]}") + traceback.print_exc() + return result + + # Load + try: + model.load(traced_path, skip_warmup=True) + except Exception as e: + result["error"] = f"Load failed: {str(e)[:400]}" + print(f" LOAD FAILED: {str(e)[:200]}") + return result + + # HBM usage + hbm_used, hbm_total = get_hbm_usage() + result["hbm_used_gb"] = hbm_used + result["hbm_total_gb"] = hbm_total + if hbm_used: + print(f" HBM: {hbm_used:.1f} / {hbm_total:.1f} GB") + + # Inference test + generation_model = HuggingFaceGenerationAdapter(model) + prompt = "Explain the theory of relativity in detail, covering both special and general relativity:" + messages = [{"role": "user", "content": prompt}] + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + + # For BS > 1, replicate input + if batch_size > 1: + input_ids = input_ids.repeat(batch_size, 1) + + attention_mask = torch.ones_like(input_ids) + + sampling_params = prepare_sampling_params( + batch_size=batch_size, + top_k=[1] * batch_size, + top_p=[1.0] * batch_size, + temperature=[1.0] * batch_size, + ) + gen_config = GenerationConfig( + do_sample=False, + output_scores=True, + return_dict_in_generate=True, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=50, + ) + + try: + # TTFT: first token time + t0 = time.time() + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=50, + ) + total_time = time.time() - t0 + + generated = outputs.sequences[0, input_ids.shape[1] :] + gen_text = tokenizer.decode(generated, skip_special_tokens=True) + n_tokens = len(generated) + + # TTFT approximation (first score is first token) + if hasattr(outputs, "scores") and len(outputs.scores) > 0: + # Rough: total_time / n_tokens gives TPOT, TTFT ≈ total_time - (n_tokens-1)*TPOT + tpot = total_time / n_tokens if n_tokens > 1 else total_time + ttft = total_time - (n_tokens - 1) * tpot if n_tokens > 1 else total_time + else: + ttft = total_time + tpot = total_time / n_tokens if n_tokens > 0 else 0 + + tok_per_sec = (n_tokens * batch_size) / total_time if total_time > 0 else 0 + + result["inference_ok"] = True + result["ttft_ms"] = ttft * 1000 + result["tkg_tok_per_sec"] = tok_per_sec + result["tpot_ms"] = tpot * 1000 + result["n_tokens"] = n_tokens + result["text_preview"] = gen_text[:100] + + print(f" Generated: {n_tokens} tokens in {total_time:.3f}s") + print(f" TTFT: ~{ttft * 1000:.1f}ms, TPOT: ~{tpot * 1000:.1f}ms") + print(f" Throughput: {tok_per_sec:.1f} tok/s (total across batch)") + print(f" Text: {gen_text[:80]!r}") + + except Exception as e: + result["error"] = f"Inference failed: {str(e)[:400]}" + print(f" INFERENCE FAILED: {str(e)[:200]}") + traceback.print_exc() + + # Cleanup + del model + del generation_model + import gc + + gc.collect() + + return result + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--seq-len", type=int, default=1024) + parser.add_argument("--batch-size", type=int, default=1) + parser.add_argument("--tp", type=int, default=1) + parser.add_argument("--sweep", action="store_true", help="Run full sweep") + parser.add_argument("--no-recompile", action="store_true") + args = parser.parse_args() + + if args.sweep: + # Sweep configurations: seq_len first, then batch_size + configs = [ + # Seq len sweep (BS=1) + (1024, 1), # baseline + (2048, 1), + (4096, 1), + (8192, 1), + # Batch size sweep (seq_len=1024) + (1024, 2), + (1024, 4), + (1024, 8), + ] + + results = [] + for sl, bs in configs: + r = test_config(sl, bs, tp=args.tp, force_recompile=not args.no_recompile) + results.append(r) + + # Summary + print(f"\n{'=' * 80}") + print("SCALING TEST SUMMARY") + print(f"{'=' * 80}") + print( + f"{'seq_len':>8} {'BS':>4} {'Compiled':>10} {'CompileT':>10} " + f"{'HBM(GB)':>10} {'TTFT(ms)':>10} {'tok/s':>10} {'TPOT(ms)':>10}" + ) + print("-" * 80) + for r in results: + comp = "YES" if r["compiled"] else "FAIL" + ct = f"{r['compile_time']:.0f}" if r["compile_time"] else "N/A" + hbm = f"{r['hbm_used_gb']:.1f}" if r["hbm_used_gb"] else "N/A" + ttft = f"{r['ttft_ms']:.1f}" if r["ttft_ms"] else "N/A" + tps = f"{r['tkg_tok_per_sec']:.1f}" if r["tkg_tok_per_sec"] else "N/A" + tpot = f"{r.get('tpot_ms', 0):.1f}" if r.get("tpot_ms") else "N/A" + print( + f"{r['seq_len']:>8} {r['batch_size']:>4} {comp:>10} {ct:>10} " + f"{hbm:>10} {ttft:>10} {tps:>10} {tpot:>10}" + ) + + # Save + out_path = os.path.join(REFERENCE_DIR, "scaling_test_results.json") + with open(out_path, "w") as f: + json.dump(results, f, indent=2, default=str) + print(f"\nResults saved to {out_path}") + + else: + r = test_config( + args.seq_len, + args.batch_size, + tp=args.tp, + force_recompile=not args.no_recompile, + ) + print(f"\nResult: {json.dumps(r, indent=2, default=str)}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_tp.py b/contrib/models/Isaac-0.2-2B/test/integration/test_tp.py new file mode 100644 index 00000000..7f99e0b9 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/test_tp.py @@ -0,0 +1,387 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Test Isaac at TP=2 and TP=4 on trn2.3xlarge (LNC=2, 4 logical cores). + +Compiles fresh models at each TP degree, runs text-only + image+text, +and compares first-token logits against CPU reference. + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + # TP=2: + python test_tp.py --tp 2 + # TP=4: + python test_tp.py --tp 4 +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import argparse # noqa: E402 +import json # noqa: E402 +import os # noqa: E402 +import shutil # noqa: E402 +import sys # noqa: E402 +import time # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +import torchvision.transforms as T # noqa: E402 +from PIL import Image # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, GenerationConfig # noqa: E402 +from transformers.image_utils import load_image # noqa: E402 + +from neuronx_distributed_inference.models.config import ( # noqa: E402 + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( # noqa: E402 + prepare_sampling_params, +) + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" + +IMAGE_TOKEN_ID = 151655 +IMAGE_SIZE = 256 +NUM_VISION_TOKENS = (IMAGE_SIZE // 16) ** 2 // 4 # 64 + +TEXT_PROMPTS = [ + "The capital of France is", + "def fibonacci(n):", + "Explain quantum entanglement in simple terms:", +] + +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def create_configs(tp_degree): + """Create neuron configs for a given TP degree.""" + traced_path = f"{DATA_PATH}/traced_model/Isaac-0.2-2B-tp{tp_degree}" + + text_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=tp_degree, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + async_mode=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + fused_qkv=False, + sequence_parallel_enabled=False, + attn_kernel_enabled=False, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + vision_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=tp_degree, + world_size=tp_degree, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + config.image_token_index = IMAGE_TOKEN_ID + + tokenizer = AutoTokenizer.from_pretrained( + MODEL_PATH, padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + + return config, tokenizer, traced_path + + +def compile_and_load(config, tokenizer, traced_path, force_recompile=False): + """Compile (if needed) and load the model.""" + if force_recompile and os.path.exists(traced_path): + print(f" Removing old traced model at {traced_path}...") + shutil.rmtree(traced_path) + + if not os.path.exists(traced_path): + print(f" Compiling at TP={config.neuron_config.tp_degree}...") + t0 = time.time() + model = NeuronIsaacForConditionalGeneration(MODEL_PATH, config) + model.compile(traced_path, debug=False) + tokenizer.save_pretrained(traced_path) + compile_time = time.time() - t0 + print(f" Compilation complete in {compile_time:.1f}s") + model.load(traced_path, skip_warmup=True) + else: + print(f" Loading existing model from {traced_path}...") + model = NeuronIsaacForConditionalGeneration(traced_path, config) + model.load(traced_path, skip_warmup=True) + + return model + + +def validate_text(model, tokenizer, tp_degree): + """Run text-only validation and compare against CPU reference.""" + print(f"\n --- Text-only validation (TP={tp_degree}) ---") + generation_model = HuggingFaceGenerationAdapter(model) + + results = [] + for i, prompt in enumerate(TEXT_PROMPTS): + messages = [{"role": "user", "content": prompt}] + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + attention_mask = torch.ones_like(input_ids) + + sampling_params = prepare_sampling_params( + batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0] + ) + gen_config = GenerationConfig( + do_sample=False, + output_scores=True, + return_dict_in_generate=True, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=20, + ) + + t0 = time.time() + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=20, + ) + elapsed = time.time() - t0 + + generated = outputs.sequences[0, input_ids.shape[1] :] + gen_text = tokenizer.decode(generated, skip_special_tokens=True) + n_tokens = len(generated) + + # First-token logits comparison + neuron_logits = outputs.scores[0][0].float().cpu() + ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt") + cosine = -1.0 + if os.path.exists(ref_path): + ref_logits = torch.load(ref_path, map_location="cpu") + cosine = F.cosine_similarity( + neuron_logits.unsqueeze(0), ref_logits.unsqueeze(0) + ).item() + + top1_match = neuron_logits.argmax().item() == 151667 # + + passed = cosine >= 0.99 and top1_match + print( + f" Prompt {i}: cosine={cosine:.6f}, top1={'match' if top1_match else 'MISS'}, " + f"{n_tokens} tok in {elapsed:.2f}s | {gen_text[:80]!r}" + ) + + results.append( + { + "prompt": prompt, + "cosine": cosine, + "top1_match": top1_match, + "passed": passed, + "text": gen_text[:200], + "n_tokens": n_tokens, + "elapsed": elapsed, + } + ) + + all_passed = all(r["passed"] for r in results) + return results, all_passed + + +def validate_image_text(model, tokenizer, tp_degree): + """Run image+text validation.""" + print(f"\n --- Image+text validation (TP={tp_degree}) ---") + generation_model = HuggingFaceGenerationAdapter(model) + + try: + ref_img = load_image( + "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp" + ) + except Exception: + ref_img = Image.new("RGB", (256, 256), color="blue") + + # Prepare image inputs + transform = T.Compose( + [ + T.Resize( + (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC + ), + T.ToTensor(), + T.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]), + ] + ) + pixel_values = transform(ref_img).unsqueeze(0).to(torch.bfloat16) + + prompt = "Describe this image in detail." + messages_with_image = [{"role": "user", "content": f"\n{prompt}"}] + text_with_image = tokenizer.apply_chat_template( + messages_with_image, tokenize=False, add_generation_prompt=True + ) + full_ids = tokenizer.encode(text_with_image, return_tensors="pt")[0] + + # Find and replace tokens + image_text_ids = tokenizer.encode("", add_special_tokens=False) + image_text_tensor = torch.tensor(image_text_ids) + found_pos = -1 + for idx in range(len(full_ids) - len(image_text_ids) + 1): + if torch.equal(full_ids[idx : idx + len(image_text_ids)], image_text_tensor): + found_pos = idx + break + + if found_pos >= 0: + before = full_ids[:found_pos] + after = full_ids[found_pos + len(image_text_ids) :] + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0) + else: + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0) + + attention_mask = torch.ones_like(input_ids) + vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool) + + sampling_params = prepare_sampling_params( + batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0] + ) + gen_config = GenerationConfig( + do_sample=False, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=30, + ) + + t0 = time.time() + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=30, + pixel_values=pixel_values, + vision_mask=vision_mask, + ) + elapsed = time.time() - t0 + + generated = outputs[0, input_ids.shape[1] :] + gen_text = tokenizer.decode(generated, skip_special_tokens=True) + n_tokens = len(generated) + + passed = len(gen_text.strip()) > 0 and n_tokens > 0 + print(f" Image+text: {n_tokens} tok in {elapsed:.2f}s | {gen_text[:150]!r}") + print(f" {'PASS' if passed else 'FAIL'}") + + return { + "passed": passed, + "text": gen_text[:200], + "n_tokens": n_tokens, + "elapsed": elapsed, + } + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--tp", type=int, required=True, choices=[2, 4]) + parser.add_argument("--force-recompile", action="store_true") + args = parser.parse_args() + + tp = args.tp + print(f"{'=' * 70}") + print(f"TENSOR PARALLELISM TEST: TP={tp}") + print(f"{'=' * 70}") + + config, tokenizer, traced_path = create_configs(tp) + print(f" Model path: {MODEL_PATH}") + print(f" Traced path: {traced_path}") + print(f" Text TP={config.neuron_config.tp_degree}") + print(f" Vision TP={config.vision_config.neuron_config.tp_degree}") + + model = compile_and_load( + config, tokenizer, traced_path, force_recompile=args.force_recompile + ) + + text_results, text_passed = validate_text(model, tokenizer, tp) + img_result = validate_image_text(model, tokenizer, tp) + + # Summary + all_passed = text_passed and img_result["passed"] + print(f"\n{'=' * 70}") + print(f"TP={tp} SUMMARY") + print(f"{'=' * 70}") + for r in text_results: + print( + f' {"PASS" if r["passed"] else "FAIL"}: "{r["prompt"][:40]}" cosine={r["cosine"]:.6f}' + ) + print( + f" {'PASS' if img_result['passed'] else 'FAIL'}: Image+text ({img_result['n_tokens']} tokens)" + ) + + if all_passed: + print(f"\n ALL TP={tp} TESTS PASSED") + else: + print(f"\n SOME TP={tp} TESTS FAILED") + sys.exit(1) + + # Save + out_path = os.path.join(REFERENCE_DIR, f"neuron_tp{tp}_validation.json") + with open(out_path, "w") as f: + json.dump( + {"tp_degree": tp, "text_results": text_results, "image_result": img_result}, + f, + indent=2, + default=str, + ) + print(f" Results saved to {out_path}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/test_weight_loading.py b/contrib/models/Isaac-0.2-2B/test/integration/test_weight_loading.py new file mode 100644 index 00000000..0626639b --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/test_weight_loading.py @@ -0,0 +1,193 @@ +"""Test weight loading: HF -> NxDI state dict conversion for Isaac.""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import torch +from collections import OrderedDict +from transformers import AutoConfig, AutoModelForCausalLM +from neuronx_distributed_inference.models.config import ( + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config +from neuronx_distributed.utils import cpu_mode +from isaac_neuron.modeling_isaac import ( + IsaacInferenceConfig, + NeuronIsaacForConditionalGeneration, +) + +MODEL_PATH = "/mnt/models/Isaac-0.2-2B-Preview" + + +def main(): + # 1) Load HF model and get state dict + print("Loading HF model...") + hf_model = AutoModelForCausalLM.from_pretrained( + MODEL_PATH, trust_remote_code=True, torch_dtype=torch.bfloat16 + ) + hf_state_dict = OrderedDict(hf_model.state_dict()) + print(f"HF state dict keys: {len(hf_state_dict)}") + for k in sorted(hf_state_dict.keys())[:15]: + print(f" {k}: {hf_state_dict[k].shape}") + print(" ...") + del hf_model + torch.cuda.empty_cache() if torch.cuda.is_available() else None + + # 2) Create NxDI config + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + text_nc = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + cp_degree=1, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + ) + vision_nc = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + world_size=1, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + config = IsaacInferenceConfig( + text_neuron_config=text_nc, + vision_neuron_config=vision_nc, + load_config=load_pretrained_config(hf_config=hf_config), + ) + + # 3) Run state dict conversion + print("\nRunning convert_hf_to_neuron_state_dict...") + neuron_sd = NeuronIsaacForConditionalGeneration.convert_hf_to_neuron_state_dict( + hf_state_dict, config + ) + print(f"Neuron state dict keys: {len(neuron_sd)}") + + # 4) Compute expected NxDI parameter names analytically + print("\nComputing expected NxDI parameter names...") + + # Text model expected keys (28 decoder layers, Qwen3 architecture) + num_text_layers = config.text_config.num_hidden_layers # 28 + expected_text = set() + expected_text.add("embed_tokens.weight") + expected_text.add("lm_head.weight") + expected_text.add("norm.weight") + for i in range(num_text_layers): + pfx = f"layers.{i}" + expected_text.add(f"{pfx}.input_layernorm.weight") + expected_text.add(f"{pfx}.post_attention_layernorm.weight") + expected_text.add(f"{pfx}.mlp.gate_proj.weight") + expected_text.add(f"{pfx}.mlp.up_proj.weight") + expected_text.add(f"{pfx}.mlp.down_proj.weight") + # NxDI attention: qkv_proj.{q,k,v}_proj.weight, o_proj.o_proj.weight + expected_text.add(f"{pfx}.self_attn.qkv_proj.q_proj.weight") + expected_text.add(f"{pfx}.self_attn.qkv_proj.k_proj.weight") + expected_text.add(f"{pfx}.self_attn.qkv_proj.v_proj.weight") + expected_text.add(f"{pfx}.self_attn.o_proj.o_proj.weight") + expected_text.add(f"{pfx}.self_attn.q_layernorm.weight") + expected_text.add(f"{pfx}.self_attn.k_layernorm.weight") + + # Vision encoder expected keys (SigLIP2, 27 layers) + num_vision_layers = config.vision_config.num_hidden_layers # 27 + expected_vision = set() + # SigLIP patch embedding + expected_vision.add( + "vision_encoder.vision_encoder.vision_model.embeddings.patch_embedding.weight" + ) + expected_vision.add( + "vision_encoder.vision_encoder.vision_model.embeddings.patch_embedding.bias" + ) + expected_vision.add( + "vision_encoder.vision_encoder.vision_model.embeddings.position_embedding.weight" + ) + # SigLIP encoder layers + for i in range(num_vision_layers): + vpfx = f"vision_encoder.vision_encoder.vision_model.encoder.layers.{i}" + expected_vision.add(f"{vpfx}.layer_norm1.weight") + expected_vision.add(f"{vpfx}.layer_norm1.bias") + expected_vision.add(f"{vpfx}.layer_norm2.weight") + expected_vision.add(f"{vpfx}.layer_norm2.bias") + # NxDI vision attention: qkv_proj.{q,k,v}_proj.{weight,bias}, o_proj.o_proj.{weight,bias} + expected_vision.add(f"{vpfx}.self_attn.qkv_proj.q_proj.weight") + expected_vision.add(f"{vpfx}.self_attn.qkv_proj.q_proj.bias") + expected_vision.add(f"{vpfx}.self_attn.qkv_proj.k_proj.weight") + expected_vision.add(f"{vpfx}.self_attn.qkv_proj.k_proj.bias") + expected_vision.add(f"{vpfx}.self_attn.qkv_proj.v_proj.weight") + expected_vision.add(f"{vpfx}.self_attn.qkv_proj.v_proj.bias") + expected_vision.add(f"{vpfx}.self_attn.o_proj.o_proj.weight") + expected_vision.add(f"{vpfx}.self_attn.o_proj.o_proj.bias") + # MLP + expected_vision.add(f"{vpfx}.mlp.fc1.weight") + expected_vision.add(f"{vpfx}.mlp.fc1.bias") + expected_vision.add(f"{vpfx}.mlp.fc2.weight") + expected_vision.add(f"{vpfx}.mlp.fc2.bias") + # SigLIP post layer norm + expected_vision.add( + "vision_encoder.vision_encoder.vision_model.post_layernorm.weight" + ) + expected_vision.add( + "vision_encoder.vision_encoder.vision_model.post_layernorm.bias" + ) + # MLP projector + expected_vision.add("vision_encoder.multi_modal_projector.fc1.weight") + expected_vision.add("vision_encoder.multi_modal_projector.fc2.weight") + + expected_keys = expected_text | expected_vision + neuron_keys = set(neuron_sd.keys()) + + # Filter runtime keys + skip_patterns = ("rank_util", "sampler", "lm_head.bias") + neuron_filtered = {k for k in neuron_keys if not any(p in k for p in skip_patterns)} + + missing = expected_keys - neuron_filtered + unexpected = neuron_filtered - expected_keys + + print(f"\n=== RESULTS ===") + print(f"Expected keys: {len(expected_keys)}") + print(f"Neuron state dict keys (filtered): {len(neuron_filtered)}") + print(f"Missing (in model, not in weights): {len(missing)}") + print(f"Unexpected (in weights, not in model): {len(unexpected)}") + + if missing: + print("\nMISSING keys:") + for k in sorted(missing): + print(f" {k}") + + if unexpected: + print("\nUNEXPECTED keys:") + for k in sorted(unexpected): + print(f" {k}") + + if not missing and not unexpected: + print("\n*** ALL WEIGHTS MATCH PERFECTLY ***") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_image_text.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_image_text.py new file mode 100644 index 00000000..16d96c4d --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_image_text.py @@ -0,0 +1,453 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Validate Isaac image+text inference on Neuron. + +Tests the full VLM pipeline: + pixel_values -> SigLIP2 encoder -> pixel_shuffle -> MLP projector -> text decoder + +Since the compiled model uses image_size=256, we use 256x256 images. +The CPU reference was captured with tensor_stream (different preprocessing), +so we validate: +1. E2E generates non-garbage text (qualitative) +2. Top-1 token is (consistent with model behavior) +3. Vision encoder produces reasonable embeddings (not NaN/Inf) + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + python validate_image_text.py +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import json # noqa: E402 +import os # noqa: E402 +import sys # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +import torchvision.transforms as T # noqa: E402 +from PIL import Image # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, GenerationConfig # noqa: E402 +from transformers.image_utils import load_image # noqa: E402 + +from neuronx_distributed_inference.models.config import ( # noqa: E402 + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( # noqa: E402 + prepare_sampling_params, +) + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +# Configuration +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" +TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B" + +# Isaac uses <|image_pad|> = 151655 as placeholder for vision embeddings +IMAGE_TOKEN_ID = 151655 +IMAGE_SIZE = 256 # Compiled model's vision image_size +PATCH_SIZE = 16 +PIXEL_SHUFFLE_SCALE = 2 +NUM_VISION_TOKENS = (IMAGE_SIZE // PATCH_SIZE) ** 2 // (PIXEL_SHUFFLE_SCALE**2) # 64 + +# SigLIP2 normalization +IMAGE_MEAN = [0.5, 0.5, 0.5] +IMAGE_STD = [0.5, 0.5, 0.5] + +# Environment +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def create_neuron_configs(): + """Create text and vision neuron configurations (must match compilation).""" + text_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + async_mode=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + fused_qkv=False, + sequence_parallel_enabled=False, + attn_kernel_enabled=False, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + vision_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + world_size=1, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + return text_config, vision_config + + +def load_compiled_model(): + """Load the pre-compiled Isaac model.""" + text_config, vision_config = create_neuron_configs() + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + + # Set image_token_index (Isaac config doesn't have it by default) + config.image_token_index = IMAGE_TOKEN_ID + + print(f"Loading compiled model from {TRACED_MODEL_PATH}...") + model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config) + model.load(TRACED_MODEL_PATH, skip_warmup=True) + print("Model loaded successfully.") + + tokenizer = AutoTokenizer.from_pretrained( + MODEL_PATH, padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + + return model, tokenizer + + +def preprocess_image(image: Image.Image) -> torch.Tensor: + """Preprocess image to pixel_values tensor [1, 3, H, W]. + + Matches SigLIP2 normalization: rescale to [0,1], normalize with mean/std=0.5. + """ + transform = T.Compose( + [ + T.Resize( + (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC + ), + T.ToTensor(), # [C, H, W] in [0, 1] + T.Normalize(mean=IMAGE_MEAN, std=IMAGE_STD), # -> [-1, 1] + ] + ) + pixel_values = transform(image).unsqueeze(0) # [1, 3, 256, 256] + return pixel_values + + +def prepare_image_text_inputs(prompt: str, image: Image.Image, tokenizer): + """Prepare input_ids, attention_mask, pixel_values, and vision_mask. + + Isaac's processor uses -256 as image token placeholder in tensor_stream. + For NxDI, we: + 1. Tokenize with chat template + 2. Insert IMAGE_TOKEN_ID (151655) for vision token positions + 3. Create boolean vision_mask + + Returns: + input_ids: [1, seq_len] with IMAGE_TOKEN_ID at vision positions + attention_mask: [1, seq_len] all ones + pixel_values: [1, 3, 256, 256] normalized + vision_mask: [1, seq_len, 1] bool + """ + # Build input_ids with image token placeholders + # Format: <|im_start|>user\n[64 image tokens]\n{prompt}<|im_end|>\n<|im_start|>assistant\n + messages = [{"role": "user", "content": prompt}] + text = tokenizer.apply_chat_template( + messages, tokenize=False, add_generation_prompt=True + ) + + # Tokenize the text (without image tokens) + # The template produces: <|im_start|>user\n{prompt}<|im_end|>\n<|im_start|>assistant\n + text_ids = tokenizer.encode(text, return_tensors="pt") # [1, text_len] + text_ids = text_ids[0] # [text_len] + + # Find where to insert image tokens + # Isaac inserts image tokens after "user\n" — between the user header and the prompt content + # The chat template is: <|im_start|>user\n\n{prompt}<|im_end|>\n<|im_start|>assistant\n + # But since we used the prompt directly (without ), we need to insert manually + + # Re-create with placeholder in the message + messages_with_image = [{"role": "user", "content": f"\n{prompt}"}] + text_with_image = tokenizer.apply_chat_template( + messages_with_image, tokenize=False, add_generation_prompt=True + ) + # Tokenize fully + full_ids = tokenizer.encode(text_with_image, return_tensors="pt")[0] # [seq_len] + + # Now find where "" tokens are and replace with IMAGE_TOKEN_ID blocks + # The tokenizer encodes "" as multiple tokens: [27, 1805, 29] = '<', 'image', '>' + # We need to replace those 3 tokens with NUM_VISION_TOKENS copies of IMAGE_TOKEN_ID + + # Find the "" token sequence + image_text_ids = tokenizer.encode( + "", add_special_tokens=False + ) # [27, 1805, 29] + image_text_tensor = torch.tensor(image_text_ids) + + # Find position of in full_ids + found_pos = -1 + for i in range(len(full_ids) - len(image_text_ids) + 1): + if torch.equal(full_ids[i : i + len(image_text_ids)], image_text_tensor): + found_pos = i + break + + if found_pos >= 0: + # Replace tokens with IMAGE_TOKEN_ID * NUM_VISION_TOKENS + before = full_ids[:found_pos] + after = full_ids[found_pos + len(image_text_ids) :] + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0) + else: + # Fallback: prepend image tokens after user header + print( + "WARNING: Could not find in tokenized text, prepending image tokens" + ) + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + # Insert after position 2 (after <|im_start|>user\n) + input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0) + + attention_mask = torch.ones_like(input_ids) + pixel_values = preprocess_image(image) + vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool) + + return input_ids, attention_mask, pixel_values, vision_mask + + +def run_validation(): + """Run image+text validation.""" + model, tokenizer = load_compiled_model() + generation_model = HuggingFaceGenerationAdapter(model) + + print(f"\n{'=' * 70}") + print("IMAGE+TEXT INFERENCE VALIDATION ON NEURON") + print(f"{'=' * 70}") + print(f" Image size: {IMAGE_SIZE}x{IMAGE_SIZE}") + print(f" Vision tokens: {NUM_VISION_TOKENS}") + print(f" Image token ID: {IMAGE_TOKEN_ID}") + + # Test images + test_cases = [] + + # Test 1: Solid color image (sanity check) + img_red = Image.new("RGB", (256, 256), color="red") + test_cases.append(("Describe this image in detail.", img_red, "red_square")) + + # Test 2: Reference image (resized to 256x256) + try: + img_ref = load_image( + "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp" + ) + test_cases.append( + ("Describe this image in detail.", img_ref, "reference_image") + ) + test_cases.append( + ("What text or signs do you see in this image?", img_ref, "reference_ocr") + ) + except Exception as e: + print(f" WARNING: Could not load reference image: {e}") + + results = [] + all_passed = True + + for i, (prompt, image, label) in enumerate(test_cases): + print(f'\n--- Test {i}: [{label}] "{prompt}" ---') + print(f" Image: {image.size} -> will be resized to {IMAGE_SIZE}x{IMAGE_SIZE}") + + try: + input_ids, attention_mask, pixel_values, vision_mask = ( + prepare_image_text_inputs(prompt, image, tokenizer) + ) + except Exception as e: + print(f" ERROR in input preparation: {e}") + import traceback + + traceback.print_exc() + all_passed = False + continue + + seq_len = input_ids.shape[1] + n_image_tokens = vision_mask.sum().item() + print(f" input_ids: {input_ids.shape}, seq_len={seq_len}") + print(f" pixel_values: {pixel_values.shape}, dtype={pixel_values.dtype}") + print(f" vision_mask: {n_image_tokens} image tokens") + print( + f" pixel_values range: [{pixel_values.min():.4f}, {pixel_values.max():.4f}]" + ) + + # Verify seq_len fits in bucket + if seq_len > 1024: + print(f" SKIP: seq_len {seq_len} > max bucket 1024") + continue + + sampling_params = prepare_sampling_params( + batch_size=1, + top_k=[1], + top_p=[1.0], + temperature=[1.0], + ) + + generation_config = GenerationConfig( + do_sample=False, + output_scores=True, + return_dict_in_generate=True, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=30, # Generate enough to see meaningful output + ) + + try: + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=generation_config, + max_new_tokens=30, + pixel_values=pixel_values.to(torch.bfloat16), + vision_mask=vision_mask, + ) + except Exception as e: + print(f" ERROR in generate: {e}") + import traceback + + traceback.print_exc() + all_passed = False + results.append({"label": label, "passed": False, "error": str(e)}) + continue + + # Extract generated tokens + if hasattr(outputs, "sequences"): + generated = outputs.sequences[0, input_ids.shape[1] :] + gen_text = tokenizer.decode(generated, skip_special_tokens=True) + else: + generated = outputs[0, input_ids.shape[1] :] + gen_text = tokenizer.decode(generated, skip_special_tokens=True) + + print(f" Generated: {gen_text[:200]!r}") + + # Extract first-token logits + first_logits = None + if ( + hasattr(outputs, "scores") + and outputs.scores is not None + and len(outputs.scores) > 0 + ): + first_logits = outputs.scores[0][0].float().cpu() + top5 = torch.topk(first_logits, 5) + top5_tokens = [tokenizer.decode([tid]) for tid in top5.indices.tolist()] + print(f" Top-5 tokens: {list(zip(top5_tokens, top5.values.tolist()))}") + top1 = first_logits.argmax().item() + print(f" Top-1: {top1} ({tokenizer.decode([top1])!r})") + + # Validation checks + passed = True + failures = [] + + # Check 1: Generated text is not empty + if len(gen_text.strip()) == 0: + passed = False + failures.append("Empty generated text") + + # Check 2: No NaN in logits + if first_logits is not None and torch.isnan(first_logits).any(): + passed = False + failures.append("NaN in logits") + + # Check 3: No Inf in logits + if first_logits is not None and torch.isinf(first_logits).any(): + passed = False + failures.append("Inf in logits") + + # Check 4: Top-1 should be (consistent with model behavior) + if first_logits is not None: + top1 = first_logits.argmax().item() + if top1 != 151667: + # Not necessarily a failure for image inputs + print( + f" NOTE: Top-1 is {top1}, not (151667) — may be normal for image input" + ) + + result = { + "label": label, + "prompt": prompt, + "passed": passed, + "generated_text": gen_text[:200], + "top1": first_logits.argmax().item() if first_logits is not None else None, + "failures": failures, + } + results.append(result) + if not passed: + all_passed = False + + status = "PASS" if passed else "FAIL" + print(f" [{status}]") + for f in failures: + print(f" FAILURE: {f}") + + # Summary + print(f"\n{'=' * 70}") + print("SUMMARY") + print(f"{'=' * 70}") + passed_count = sum(1 for r in results if r["passed"]) + total = len(results) + print(f" Passed: {passed_count}/{total}") + + if all_passed: + print("\n ALL IMAGE+TEXT TESTS PASSED") + else: + print("\n SOME TESTS FAILED — see details above") + sys.exit(1) + + # Save results + out_path = os.path.join(REFERENCE_DIR, "neuron_image_text_validation.json") + with open(out_path, "w") as f: + json.dump(results, f, indent=2, default=str) + print(f"\n Results saved to {out_path}") + + +if __name__ == "__main__": + run_validation() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_text_logits.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_text_logits.py new file mode 100644 index 00000000..24451bac --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_text_logits.py @@ -0,0 +1,369 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Validate Isaac text-only logits on Neuron against CPU reference. + +Loads the compiled Isaac model, runs all 5 text reference prompts, +and compares first-token logit distributions against saved CPU reference .pt files. + +Metrics: +- Top-1 token match +- Top-5 / Top-10 overlap +- Cosine similarity of full logit vectors +- Max absolute error + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + python validate_text_logits.py +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import json # noqa: E402 +import os # noqa: E402 +import sys # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, GenerationConfig # noqa: E402 + +from neuronx_distributed_inference.models.config import ( # noqa: E402 + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( # noqa: E402 + prepare_sampling_params, +) + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +# Configuration +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" +TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B" + +# Same prompts as capture_reference.py +TEXT_PROMPTS = [ + "The capital of France is", + "def fibonacci(n):", + "Explain quantum entanglement in simple terms:", + "The meaning of life is", + "List three primary colors:", +] + +# Thresholds +COSINE_SIM_THRESHOLD = 0.99 # BF16 quantization on Neuron vs FP32 CPU +TOP1_MUST_MATCH = True +TOP5_MIN_OVERLAP = 3 # At least 3 of 5 should match +TOP10_MIN_OVERLAP = 5 # At least 5 of 10 should match + +# Environment +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def create_neuron_configs(): + """Create text and vision neuron configurations (must match compilation).""" + text_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + async_mode=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + fused_qkv=False, + sequence_parallel_enabled=False, + attn_kernel_enabled=False, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + vision_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + world_size=1, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + return text_config, vision_config + + +def load_compiled_model(): + """Load the pre-compiled Isaac model from traced checkpoint.""" + text_config, vision_config = create_neuron_configs() + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + + print(f"Loading compiled model from {TRACED_MODEL_PATH}...") + model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config) + model.load(TRACED_MODEL_PATH, skip_warmup=True) + print("Model loaded successfully.") + + tokenizer = AutoTokenizer.from_pretrained( + MODEL_PATH, padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + + return model, tokenizer + + +def compare_logits(neuron_logits, ref_logits, prompt_name): + """Compare Neuron vs CPU reference logit vectors. + + Args: + neuron_logits: [vocab_size] float tensor from Neuron + ref_logits: [vocab_size] float tensor from CPU reference + prompt_name: string for logging + + Returns: + dict with all comparison metrics, and bool pass/fail + """ + neuron_f = neuron_logits.float() + ref_f = ref_logits.float() + + # Top-1 match + neuron_top1 = neuron_f.argmax().item() + ref_top1 = ref_f.argmax().item() + top1_match = neuron_top1 == ref_top1 + + # Top-5 overlap + neuron_top5 = set(torch.topk(neuron_f, 5).indices.tolist()) + ref_top5 = set(torch.topk(ref_f, 5).indices.tolist()) + top5_overlap = len(neuron_top5 & ref_top5) + + # Top-10 overlap + neuron_top10 = set(torch.topk(neuron_f, 10).indices.tolist()) + ref_top10 = set(torch.topk(ref_f, 10).indices.tolist()) + top10_overlap = len(neuron_top10 & ref_top10) + + # Cosine similarity + cosine_sim = F.cosine_similarity(neuron_f.unsqueeze(0), ref_f.unsqueeze(0)).item() + + # Max absolute error + max_abs_err = (neuron_f - ref_f).abs().max().item() + + # Mean absolute error + mean_abs_err = (neuron_f - ref_f).abs().mean().item() + + # Pass/fail + passed = True + failures = [] + if TOP1_MUST_MATCH and not top1_match: + passed = False + failures.append(f"Top-1 mismatch: Neuron={neuron_top1}, CPU={ref_top1}") + if top5_overlap < TOP5_MIN_OVERLAP: + passed = False + failures.append(f"Top-5 overlap {top5_overlap} < {TOP5_MIN_OVERLAP}") + if top10_overlap < TOP10_MIN_OVERLAP: + passed = False + failures.append(f"Top-10 overlap {top10_overlap} < {TOP10_MIN_OVERLAP}") + if cosine_sim < COSINE_SIM_THRESHOLD: + passed = False + failures.append(f"Cosine sim {cosine_sim:.6f} < {COSINE_SIM_THRESHOLD}") + + result = { + "prompt": prompt_name, + "passed": passed, + "top1_match": top1_match, + "neuron_top1": neuron_top1, + "ref_top1": ref_top1, + "top5_overlap": top5_overlap, + "top10_overlap": top10_overlap, + "cosine_sim": cosine_sim, + "max_abs_err": max_abs_err, + "mean_abs_err": mean_abs_err, + "failures": failures, + "neuron_top10_ids": sorted(neuron_top10), + "ref_top10_ids": sorted(ref_top10), + } + + return result, passed + + +def run_validation(): + """Main validation loop.""" + model, tokenizer = load_compiled_model() + generation_model = HuggingFaceGenerationAdapter(model) + + # Load reference results metadata + with open(os.path.join(REFERENCE_DIR, "reference_results.json")) as f: + ref_metadata = json.load(f) + + print(f"\n{'=' * 70}") + print("TEXT-ONLY LOGIT VALIDATION: Neuron vs CPU Reference") + print(f"{'=' * 70}") + print(f" Reference dir: {REFERENCE_DIR}") + print( + f" Thresholds: cosine>{COSINE_SIM_THRESHOLD}, top1_must_match={TOP1_MUST_MATCH}" + ) + print(f" Prompts: {len(TEXT_PROMPTS)}") + + results = [] + all_passed = True + + for i, prompt in enumerate(TEXT_PROMPTS): + print(f'\n--- Prompt {i}: "{prompt}" ---') + + # Load CPU reference logits + ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt") + if not os.path.exists(ref_path): + print(f" SKIP: Reference file not found: {ref_path}") + continue + ref_logits = torch.load(ref_path, map_location="cpu") # [151936] float32 + print( + f" CPU ref: top-1={ref_logits.argmax().item()}, shape={ref_logits.shape}" + ) + + # Tokenize with chat template (matching capture_reference.py) + messages = [{"role": "user", "content": prompt}] + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + attention_mask = torch.ones_like(input_ids) + seq_len = input_ids.shape[1] + print(f" Input seq_len: {seq_len}") + + # Generate with logit collection + # We only need 1 new token to get the first-token logits (CTE pass) + sampling_params = prepare_sampling_params( + batch_size=1, + top_k=[1], + top_p=[1.0], + temperature=[1.0], # temperature=1.0 so scores == raw logits + ) + + generation_config = GenerationConfig( + do_sample=False, + output_scores=True, + return_dict_in_generate=True, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=1, # Only need first token + ) + + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=generation_config, + max_new_tokens=1, + ) + + # Extract first-token logits from scores + # outputs.scores is a tuple of tensors, one per generated token + # outputs.scores[0] shape: [batch_size, vocab_size] + if ( + hasattr(outputs, "scores") + and outputs.scores is not None + and len(outputs.scores) > 0 + ): + neuron_logits = outputs.scores[0][0].float().cpu() # [vocab_size] + print( + f" Neuron: top-1={neuron_logits.argmax().item()}, shape={neuron_logits.shape}" + ) + else: + print( + " ERROR: No scores in output. Check output_logits=True in NeuronConfig." + ) + print(f" Output type: {type(outputs)}") + if hasattr(outputs, "__dict__"): + print(f" Output attrs: {list(outputs.__dict__.keys())}") + all_passed = False + continue + + # Compare + result, passed = compare_logits(neuron_logits, ref_logits, prompt) + results.append(result) + if not passed: + all_passed = False + + # Print result + status = "PASS" if passed else "FAIL" + print( + f" [{status}] cosine={result['cosine_sim']:.6f}, " + f"top1={'match' if result['top1_match'] else 'MISMATCH'}, " + f"top5={result['top5_overlap']}/5, top10={result['top10_overlap']}/10, " + f"max_abs_err={result['max_abs_err']:.4f}" + ) + if not passed: + for f in result["failures"]: + print(f" FAILURE: {f}") + + # Summary + print(f"\n{'=' * 70}") + print("SUMMARY") + print(f"{'=' * 70}") + passed_count = sum(1 for r in results if r["passed"]) + total = len(results) + print(f" Passed: {passed_count}/{total}") + + if results: + avg_cosine = sum(r["cosine_sim"] for r in results) / len(results) + avg_top5 = sum(r["top5_overlap"] for r in results) / len(results) + avg_top10 = sum(r["top10_overlap"] for r in results) / len(results) + print(f" Avg cosine sim: {avg_cosine:.6f}") + print(f" Avg top-5 overlap: {avg_top5:.1f}/5") + print(f" Avg top-10 overlap: {avg_top10:.1f}/10") + + if all_passed: + print("\n ALL TEXT PROMPTS PASSED") + else: + print("\n SOME PROMPTS FAILED — see details above") + sys.exit(1) + + # Save results + out_path = os.path.join(REFERENCE_DIR, "neuron_text_validation.json") + with open(out_path, "w") as f: + json.dump(results, f, indent=2) + print(f"\n Results saved to {out_path}") + + +if __name__ == "__main__": + run_validation() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_tkg.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_tkg.py new file mode 100644 index 00000000..20ed9469 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_tkg.py @@ -0,0 +1,710 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Validate Isaac TKG (token generation) on Neuron. + +Tests the full CTE+TKG generation loop with: +1. Multi-token text-only generation (50+ tokens, 5 prompts) +2. Multi-token image+text generation +3. Per-step logit extraction at max_new_tokens=32 +4. Edge cases: state reset, consecutive generates, vision clearing + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + python validate_tkg.py +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import json # noqa: E402 +import os # noqa: E402 +import sys # noqa: E402 +import time # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +import torchvision.transforms as T # noqa: E402 +from PIL import Image # noqa: E402 +from transformers import AutoConfig, AutoTokenizer, GenerationConfig # noqa: E402 +from transformers.image_utils import load_image # noqa: E402 + +from neuronx_distributed_inference.models.config import ( # noqa: E402 + NeuronConfig, + OnDeviceSamplingConfig, +) +from neuronx_distributed_inference.utils.hf_adapter import ( # noqa: E402 + load_pretrained_config, + HuggingFaceGenerationAdapter, +) +from neuronx_distributed_inference.modules.generation.sampling import ( # noqa: E402 + prepare_sampling_params, +) + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" +TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B" + +IMAGE_TOKEN_ID = 151655 # <|image_pad|> +IMAGE_SIZE = 256 +IMAGE_MEAN = [0.5, 0.5, 0.5] +IMAGE_STD = [0.5, 0.5, 0.5] +NUM_VISION_TOKENS = (IMAGE_SIZE // 16) ** 2 // 4 # 64 + +TEXT_PROMPTS = [ + "The capital of France is", + "def fibonacci(n):", + "Explain quantum entanglement in simple terms:", + "The meaning of life is", + "List three primary colors:", +] + +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def create_neuron_configs(): + """Create text and vision neuron configurations (must match compilation).""" + text_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + async_mode=False, + on_device_sampling_config=OnDeviceSamplingConfig( + dynamic=True, + do_sample=True, + deterministic=True, + temperature=1.0, + top_p=1.0, + top_k=1, + global_topk=256, + top_k_kernel_enabled=True, + ), + output_logits=True, + fused_qkv=False, + sequence_parallel_enabled=False, + attn_kernel_enabled=False, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + vision_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + world_size=1, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + return text_config, vision_config + + +def load_compiled_model(): + text_config, vision_config = create_neuron_configs() + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + config.image_token_index = IMAGE_TOKEN_ID + model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config) + model.load(TRACED_MODEL_PATH, skip_warmup=True) + tokenizer = AutoTokenizer.from_pretrained( + MODEL_PATH, padding_side="right", trust_remote_code=True + ) + tokenizer.pad_token = tokenizer.eos_token + return model, tokenizer + + +def preprocess_image(image: Image.Image) -> torch.Tensor: + transform = T.Compose( + [ + T.Resize( + (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC + ), + T.ToTensor(), + T.Normalize(mean=IMAGE_MEAN, std=IMAGE_STD), + ] + ) + return transform(image).unsqueeze(0) + + +def prepare_image_text_inputs(prompt, image, tokenizer): + """Prepare input_ids with image token placeholders.""" + messages_with_image = [{"role": "user", "content": f"\n{prompt}"}] + text_with_image = tokenizer.apply_chat_template( + messages_with_image, tokenize=False, add_generation_prompt=True + ) + full_ids = tokenizer.encode(text_with_image, return_tensors="pt")[0] + + # Find tokens and replace with IMAGE_TOKEN_ID placeholders + image_text_ids = tokenizer.encode("", add_special_tokens=False) + image_text_tensor = torch.tensor(image_text_ids) + + found_pos = -1 + for i in range(len(full_ids) - len(image_text_ids) + 1): + if torch.equal(full_ids[i : i + len(image_text_ids)], image_text_tensor): + found_pos = i + break + + if found_pos >= 0: + before = full_ids[:found_pos] + after = full_ids[found_pos + len(image_text_ids) :] + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + input_ids = torch.cat([before, image_tokens, after]).unsqueeze(0) + else: + image_tokens = torch.full( + (NUM_VISION_TOKENS,), IMAGE_TOKEN_ID, dtype=torch.long + ) + input_ids = torch.cat([full_ids[:3], image_tokens, full_ids[3:]]).unsqueeze(0) + + attention_mask = torch.ones_like(input_ids) + pixel_values = preprocess_image(image).to(torch.bfloat16) + vision_mask = (input_ids == IMAGE_TOKEN_ID).unsqueeze(-1).to(torch.bool) + return input_ids, attention_mask, pixel_values, vision_mask + + +def generate_text( + model, + tokenizer, + prompt, + max_new_tokens=50, + collect_logits=False, + pixel_values=None, + vision_mask=None, +): + """Run generation and optionally collect per-step logits.""" + generation_model = HuggingFaceGenerationAdapter(model) + + messages = [{"role": "user", "content": prompt}] + input_ids = tokenizer.apply_chat_template( + messages, tokenize=True, add_generation_prompt=True, return_tensors="pt" + ) + attention_mask = torch.ones_like(input_ids) + + sampling_params = prepare_sampling_params( + batch_size=1, + top_k=[1], + top_p=[1.0], + temperature=[1.0], + ) + + gen_config = GenerationConfig( + do_sample=False, + output_scores=collect_logits, + return_dict_in_generate=collect_logits, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=max_new_tokens, + ) + + kwargs = dict( + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=max_new_tokens, + ) + if pixel_values is not None: + kwargs["pixel_values"] = pixel_values + if vision_mask is not None: + kwargs["vision_mask"] = vision_mask + + start = time.time() + outputs = generation_model.generate(input_ids, **kwargs) + elapsed = time.time() - start + + if collect_logits and hasattr(outputs, "sequences"): + generated_ids = outputs.sequences[0, input_ids.shape[1] :] + scores = outputs.scores if outputs.scores else [] + else: + if hasattr(outputs, "sequences"): + generated_ids = outputs.sequences[0, input_ids.shape[1] :] + else: + generated_ids = outputs[0, input_ids.shape[1] :] + scores = [] + + gen_text = tokenizer.decode(generated_ids, skip_special_tokens=False) + gen_text_clean = tokenizer.decode(generated_ids, skip_special_tokens=True) + + return { + "input_ids": input_ids, + "generated_ids": generated_ids, + "text_raw": gen_text, + "text_clean": gen_text_clean, + "scores": scores, + "elapsed": elapsed, + "num_tokens": len(generated_ids), + "tokens_per_sec": len(generated_ids) / elapsed if elapsed > 0 else 0, + } + + +# =========================================================================== +# Test functions +# =========================================================================== + + +def test_multi_token_text(model, tokenizer): + """Test 1: Multi-token text-only generation for all 5 prompts.""" + print(f"\n{'=' * 70}") + print("TEST 1: Multi-token text-only generation (50 tokens)") + print(f"{'=' * 70}") + + results = [] + all_passed = True + + for i, prompt in enumerate(TEXT_PROMPTS): + print(f'\n--- Prompt {i}: "{prompt}" ---') + result = generate_text(model, tokenizer, prompt, max_new_tokens=50) + + # Validation + passed = True + failures = [] + + # Non-empty + if len(result["text_clean"].strip()) == 0: + passed = False + failures.append("Empty output") + + # Generated expected number of tokens (or hit EOS) + if result["num_tokens"] == 0: + passed = False + failures.append("Zero tokens generated") + + # Should start with (Isaac thinking model) + first_token = ( + result["generated_ids"][0].item() if result["num_tokens"] > 0 else -1 + ) + if first_token != 151667: + failures.append( + f"First token {first_token} != (151667) — may be normal" + ) + + # Check for repetition (degenerate TKG) + if result["num_tokens"] >= 10: + last_10 = result["generated_ids"][-10:].tolist() + if len(set(last_10)) <= 2: + passed = False + failures.append(f"Degenerate repetition in last 10 tokens: {last_10}") + + result["passed"] = passed + result["failures"] = failures + results.append(result) + if not passed: + all_passed = False + + status = "PASS" if passed else "FAIL" + print( + f" [{status}] {result['num_tokens']} tokens in {result['elapsed']:.2f}s ({result['tokens_per_sec']:.1f} tok/s)" + ) + print(f" Output: {result['text_clean'][:200]!r}") + for f in failures: + print(f" NOTE: {f}") + + return results, all_passed + + +def test_logit_collection(model, tokenizer): + """Test 2: Collect per-step logits at max_new_tokens=32.""" + print(f"\n{'=' * 70}") + print("TEST 2: Per-step logit collection (32 tokens)") + print(f"{'=' * 70}") + + results = [] + all_passed = True + + for i, prompt in enumerate(TEXT_PROMPTS[:3]): # First 3 prompts + print(f'\n--- Prompt {i}: "{prompt}" ---') + result = generate_text( + model, tokenizer, prompt, max_new_tokens=32, collect_logits=True + ) + + passed = True + failures = [] + + # Check we got scores + n_scores = len(result["scores"]) + print( + f" Generated {result['num_tokens']} tokens, collected {n_scores} score tensors" + ) + + if n_scores == 0: + passed = False + failures.append("No scores collected (output_logits may not be working)") + else: + # Check each score tensor + for step_idx, score in enumerate(result["scores"]): + s = score[0].float() # [vocab_size] + if torch.isnan(s).any(): + passed = False + failures.append(f"NaN at step {step_idx}") + break + if torch.isinf(s).any(): + passed = False + failures.append(f"Inf at step {step_idx}") + break + + # Compare first-token logits against saved reference + ref_path = os.path.join(REFERENCE_DIR, f"text_logits_{i:03d}.pt") + if os.path.exists(ref_path) and n_scores > 0: + ref_logits = torch.load(ref_path, map_location="cpu") + neuron_first = result["scores"][0][0].float().cpu() + cosine = F.cosine_similarity( + neuron_first.unsqueeze(0), ref_logits.unsqueeze(0) + ).item() + print(f" First-token cosine vs CPU ref: {cosine:.6f}") + if cosine < 0.99: + passed = False + failures.append(f"First-token cosine {cosine:.6f} < 0.99") + + # Check that later tokens also have reasonable logits + if n_scores >= 5: + for step in [0, n_scores // 2, n_scores - 1]: + s = result["scores"][step][0].float() + top1 = s.argmax().item() + top1_val = s.max().item() + print( + f" Step {step}: top-1={top1} ({tokenizer.decode([top1])!r}), logit={top1_val:.2f}" + ) + + result["passed"] = passed + result["failures"] = failures + result["n_scores"] = n_scores + results.append(result) + if not passed: + all_passed = False + + status = "PASS" if passed else "FAIL" + print(f" [{status}]") + for f in failures: + print(f" FAILURE: {f}") + + return results, all_passed + + +def test_state_reset(model, tokenizer): + """Test 3: Verify state resets between consecutive generate() calls.""" + print(f"\n{'=' * 70}") + print("TEST 3: State reset between consecutive generates") + print(f"{'=' * 70}") + + passed = True + failures = [] + + # Run same prompt twice — should get identical output + print("\n Running same prompt twice...") + r1 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20) + r2 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20) + + ids1 = r1["generated_ids"].tolist() + ids2 = r2["generated_ids"].tolist() + match = ids1 == ids2 + print(f" Run 1: {r1['text_clean'][:100]!r}") + print(f" Run 2: {r2['text_clean'][:100]!r}") + print(f" Token sequences match: {match}") + if not match: + # Check how many match + min_len = min(len(ids1), len(ids2)) + matching = sum(1 for a, b in zip(ids1[:min_len], ids2[:min_len]) if a == b) + print(f" Matching: {matching}/{min_len} tokens") + if matching < min_len * 0.9: + failures.append( + f"Same prompt gave different outputs: {matching}/{min_len} match" + ) + passed = False + + # Run different prompts — verify no cross-contamination + print("\n Running different prompts...") + r3 = generate_text(model, tokenizer, "def fibonacci(n):", max_new_tokens=20) + r4 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20) + + ids4 = r4["generated_ids"].tolist() + match_after = ids4 == ids2 + print(f" After different prompt, re-running 'France': {r4['text_clean'][:100]!r}") + print(f" Matches original: {match_after}") + if not match_after: + min_len = min(len(ids4), len(ids2)) + matching = sum(1 for a, b in zip(ids4[:min_len], ids2[:min_len]) if a == b) + if matching < min_len * 0.9: + failures.append( + f"State contamination: re-run after different prompt gives different output ({matching}/{min_len})" + ) + passed = False + + status = "PASS" if passed else "FAIL" + print(f"\n [{status}]") + for f in failures: + print(f" FAILURE: {f}") + + return {"passed": passed, "failures": failures} + + +def test_image_text_generation(model, tokenizer): + """Test 4: Multi-token image+text generation.""" + print(f"\n{'=' * 70}") + print("TEST 4: Image+text multi-token generation") + print(f"{'=' * 70}") + + passed = True + failures = [] + + try: + ref_img = load_image( + "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp" + ) + except Exception as e: + print(f" WARNING: Could not load reference image: {e}") + ref_img = Image.new("RGB", (256, 256), color="blue") + + prompt = "Describe this image in detail." + input_ids, attention_mask, pixel_values, vision_mask = prepare_image_text_inputs( + prompt, ref_img, tokenizer + ) + + print(f" Input: {input_ids.shape}, vision tokens: {vision_mask.sum().item()}") + + generation_model = HuggingFaceGenerationAdapter(model) + sampling_params = prepare_sampling_params( + batch_size=1, + top_k=[1], + top_p=[1.0], + temperature=[1.0], + ) + gen_config = GenerationConfig( + do_sample=False, + output_scores=False, + return_dict_in_generate=False, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=50, + ) + + start = time.time() + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=50, + pixel_values=pixel_values, + vision_mask=vision_mask, + ) + elapsed = time.time() - start + + generated_ids = outputs[0, input_ids.shape[1] :] + gen_text = tokenizer.decode(generated_ids, skip_special_tokens=True) + n_tokens = len(generated_ids) + + print( + f" Generated {n_tokens} tokens in {elapsed:.2f}s ({n_tokens / elapsed:.1f} tok/s)" + ) + print(f" Output: {gen_text[:300]!r}") + + if len(gen_text.strip()) == 0: + passed = False + failures.append("Empty image+text output") + + if n_tokens == 0: + passed = False + failures.append("Zero tokens generated") + + # Check for degenerate repetition + if n_tokens >= 10: + last_10 = generated_ids[-10:].tolist() + if len(set(last_10)) <= 2: + passed = False + failures.append(f"Degenerate repetition: {last_10}") + + status = "PASS" if passed else "FAIL" + print(f" [{status}]") + for f in failures: + print(f" FAILURE: {f}") + + return { + "passed": passed, + "failures": failures, + "text": gen_text[:300], + "n_tokens": n_tokens, + } + + +def test_vision_state_reset(model, tokenizer): + """Test 5: Vision state resets between image and text-only prompts.""" + print(f"\n{'=' * 70}") + print("TEST 5: Vision state reset (image -> text -> image)") + print(f"{'=' * 70}") + + passed = True + failures = [] + + # 1. Run text-only + r1 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20) + print(f" Text-only: {r1['text_clean'][:100]!r}") + + # 2. Run image+text + img = Image.new("RGB", (256, 256), color="red") + input_ids, attention_mask, pixel_values, vision_mask = prepare_image_text_inputs( + "Describe this image.", img, tokenizer + ) + + generation_model = HuggingFaceGenerationAdapter(model) + sampling_params = prepare_sampling_params( + batch_size=1, top_k=[1], top_p=[1.0], temperature=[1.0] + ) + gen_config = GenerationConfig( + do_sample=False, + pad_token_id=tokenizer.eos_token_id, + max_new_tokens=20, + ) + outputs = generation_model.generate( + input_ids, + attention_mask=attention_mask, + max_length=model.config.neuron_config.max_length, + sampling_params=sampling_params, + generation_config=gen_config, + max_new_tokens=20, + pixel_values=pixel_values, + vision_mask=vision_mask, + ) + img_text = tokenizer.decode( + outputs[0, input_ids.shape[1] :], skip_special_tokens=True + ) + print(f" Image+text: {img_text[:100]!r}") + + # 3. Run text-only again — should match run 1 + r3 = generate_text(model, tokenizer, "The capital of France is", max_new_tokens=20) + print(f" Text-only (after image): {r3['text_clean'][:100]!r}") + + ids1 = r1["generated_ids"].tolist() + ids3 = r3["generated_ids"].tolist() + match = ids1 == ids3 + print(f" Text outputs match (pre/post image): {match}") + + if not match: + min_len = min(len(ids1), len(ids3)) + matching = sum(1 for a, b in zip(ids1[:min_len], ids3[:min_len]) if a == b) + if matching < min_len * 0.9: + passed = False + failures.append( + f"Vision state leaked: text output changed after image prompt ({matching}/{min_len})" + ) + + status = "PASS" if passed else "FAIL" + print(f" [{status}]") + for f in failures: + print(f" FAILURE: {f}") + + return {"passed": passed, "failures": failures} + + +# =========================================================================== +# Main +# =========================================================================== + + +def main(): + print(f"{'=' * 70}") + print("TKG VALIDATION: Isaac on Neuron") + print(f"{'=' * 70}") + + model, tokenizer = load_compiled_model() + + # Run all tests + test_results = {} + + r1, p1 = test_multi_token_text(model, tokenizer) + test_results["multi_token_text"] = { + "results": [ + { + "prompt": TEXT_PROMPTS[i], + "passed": r["passed"], + "n_tokens": r["num_tokens"], + "text": r["text_clean"][:200], + "tok_per_sec": r["tokens_per_sec"], + } + for i, r in enumerate(r1) + ], + "all_passed": p1, + } + + r2, p2 = test_logit_collection(model, tokenizer) + test_results["logit_collection"] = { + "results": [ + { + "prompt": TEXT_PROMPTS[i], + "passed": r["passed"], + "n_scores": r.get("n_scores", 0), + } + for i, r in enumerate(r2) + ], + "all_passed": p2, + } + + r3 = test_state_reset(model, tokenizer) + test_results["state_reset"] = r3 + + r4 = test_image_text_generation(model, tokenizer) + test_results["image_text_generation"] = r4 + + r5 = test_vision_state_reset(model, tokenizer) + test_results["vision_state_reset"] = r5 + + # Overall summary + all_tests = [p1, p2, r3["passed"], r4["passed"], r5["passed"]] + all_passed = all(all_tests) + + print(f"\n{'=' * 70}") + print("OVERALL SUMMARY") + print(f"{'=' * 70}") + test_names = [ + "Multi-token text", + "Logit collection", + "State reset", + "Image+text generation", + "Vision state reset", + ] + for name, p in zip(test_names, all_tests): + print(f" {'PASS' if p else 'FAIL'}: {name}") + + if all_passed: + print(f"\n ALL TKG TESTS PASSED") + else: + print(f"\n SOME TESTS FAILED") + sys.exit(1) + + # Save results + out_path = os.path.join(REFERENCE_DIR, "neuron_tkg_validation.json") + with open(out_path, "w") as f: + json.dump(test_results, f, indent=2, default=str) + print(f" Results saved to {out_path}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/test/integration/validate_vision_encoder.py b/contrib/models/Isaac-0.2-2B/test/integration/validate_vision_encoder.py new file mode 100644 index 00000000..8cd31c06 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/test/integration/validate_vision_encoder.py @@ -0,0 +1,250 @@ +# Copyright 2025 © Amazon.com and Affiliates +"""Validate Isaac vision encoder on Neuron vs CPU reference. + +Approach: Since the HF Isaac model uses a different vision input format +(packed_seq_patches via tensor_stream) than the NxDI model (standard pixel_values +through Conv2d), we can't directly compare vision encoder outputs. + +Instead, we validate the Neuron vision encoder by: +1. Running the NxDI vision encoder on a test image +2. Checking that output embeddings are numerically reasonable (no NaN/Inf) +3. Checking that different images produce different embeddings (not degenerate) +4. Running a manual Conv2d + encoder comparison using reshaped weights + +Usage: + source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate + export PYTHONPATH=/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:$PYTHONPATH + python validate_vision_encoder.py +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import json # noqa: E402 +import os # noqa: E402 +import sys # noqa: E402 + +import torch # noqa: E402 +import torch.nn.functional as F # noqa: E402 +import torchvision.transforms as T # noqa: E402 +from PIL import Image # noqa: E402 +from transformers import AutoConfig # noqa: E402 +from transformers.image_utils import load_image # noqa: E402 + +from neuronx_distributed_inference.models.config import NeuronConfig # noqa: E402 +from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config # noqa: E402 + +from isaac_neuron.modeling_isaac import ( # noqa: E402 + NeuronIsaacForConditionalGeneration, + IsaacInferenceConfig, +) + +# --------------------------------------------------------------------------- +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" +TRACED_MODEL_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B" +REFERENCE_DIR = f"{DATA_PATH}/reference_outputs" + +IMAGE_SIZE = 256 +IMAGE_MEAN = [0.5, 0.5, 0.5] +IMAGE_STD = [0.5, 0.5, 0.5] + +os.environ["NEURON_RT_STOCHASTIC_ROUNDING_EN"] = "0" +torch.manual_seed(42) + + +def preprocess_image(image: Image.Image) -> torch.Tensor: + transform = T.Compose( + [ + T.Resize( + (IMAGE_SIZE, IMAGE_SIZE), interpolation=T.InterpolationMode.BICUBIC + ), + T.ToTensor(), + T.Normalize(mean=IMAGE_MEAN, std=IMAGE_STD), + ] + ) + return transform(image).unsqueeze(0) + + +def load_neuron_model(): + """Load the compiled Neuron model and return the full model object.""" + text_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + cp_degree=1, + save_sharded_checkpoint=True, + skip_sharding=False, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + context_encoding_buckets=[1024], + token_generation_buckets=[1024], + async_mode=False, + output_logits=True, + fused_qkv=False, + sequence_parallel_enabled=False, + attn_kernel_enabled=False, + attn_tkg_nki_kernel_enabled=False, + attn_tkg_builtin_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + vision_config = NeuronConfig( + batch_size=1, + seq_len=1024, + torch_dtype=torch.bfloat16, + tp_degree=1, + world_size=1, + save_sharded_checkpoint=True, + is_continuous_batching=True, + ctx_batch_size=1, + enable_bucketing=True, + buckets=[1], + fused_qkv=False, + attn_kernel_enabled=False, + qkv_kernel_enabled=False, + mlp_kernel_enabled=False, + ) + + hf_config = AutoConfig.from_pretrained(MODEL_PATH, trust_remote_code=True) + config = IsaacInferenceConfig( + text_neuron_config=text_config, + vision_neuron_config=vision_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + + model = NeuronIsaacForConditionalGeneration(TRACED_MODEL_PATH, config) + model.load(TRACED_MODEL_PATH, skip_warmup=True) + return model + + +def main(): + print(f"{'=' * 70}") + print("VISION ENCODER VALIDATION: Neuron") + print(f"{'=' * 70}") + + # Prepare test images + images = { + "red": Image.new("RGB", (256, 256), color="red"), + "blue": Image.new("RGB", (256, 256), color="blue"), + "black": Image.new("RGB", (256, 256), color="black"), + } + try: + images["reference"] = load_image( + "https://raw.githubusercontent.com/perceptron-ai-inc/perceptron/refs/heads/main/huggingface/assets/example.webp" + ) + except Exception as e: + print(f" WARNING: Could not load reference image: {e}") + + # Load model + print("\nLoading compiled Neuron model...") + model = load_neuron_model() + print(" Model loaded.") + + # Run vision encoder on each image + embeddings = {} + all_passed = True + results = [] + + for label, img in images.items(): + print(f"\n--- {label} ({img.size}) ---") + pixel_values = preprocess_image(img).to(torch.bfloat16) + print(f" pixel_values: {pixel_values.shape}") + + with torch.no_grad(): + output = model.vision_encoder_model(pixel_values) + + output_f = output.float().cpu() + embeddings[label] = output_f + + # Check 1: Shape + expected_tokens = (IMAGE_SIZE // 16) ** 2 // 4 # 64 + expected_dim = 2048 # text hidden size + shape_ok = output_f.shape == torch.Size([1, expected_tokens, expected_dim]) + print( + f" Output shape: {output_f.shape} (expected [1, {expected_tokens}, {expected_dim}]): {'OK' if shape_ok else 'FAIL'}" + ) + + # Check 2: No NaN + has_nan = torch.isnan(output_f).any().item() + print(f" NaN check: {'FAIL' if has_nan else 'OK'}") + + # Check 3: No Inf + has_inf = torch.isinf(output_f).any().item() + print(f" Inf check: {'FAIL' if has_inf else 'OK'}") + + # Check 4: Non-zero variance (not degenerate) + variance = output_f.var().item() + variance_ok = variance > 1e-6 + print( + f" Variance: {variance:.6f} {'OK' if variance_ok else 'FAIL (degenerate)'}" + ) + + # Check 5: Reasonable value range + val_min = output_f.min().item() + val_max = output_f.max().item() + val_mean = output_f.mean().item() + range_ok = abs(val_min) < 100 and abs(val_max) < 100 + print( + f" Range: [{val_min:.4f}, {val_max:.4f}], mean={val_mean:.4f} {'OK' if range_ok else 'SUSPICIOUS'}" + ) + + passed = shape_ok and not has_nan and not has_inf and variance_ok and range_ok + if not passed: + all_passed = False + results.append( + { + "label": label, + "passed": passed, + "shape": list(output_f.shape), + "has_nan": has_nan, + "has_inf": has_inf, + "variance": variance, + "range": [val_min, val_max], + "mean": val_mean, + } + ) + + # Cross-image comparison: different images should produce different embeddings + print(f"\n--- Cross-image comparison ---") + labels = list(embeddings.keys()) + for i in range(len(labels)): + for j in range(i + 1, len(labels)): + a, b = labels[i], labels[j] + cos = F.cosine_similarity( + embeddings[a].reshape(1, -1), embeddings[b].reshape(1, -1) + ).item() + different = cos < 0.999 # Different images should have cosine < 0.999 + print( + f" {a} vs {b}: cosine={cos:.6f} {'OK (different)' if different else 'WARNING (too similar)'}" + ) + if not different: + print(f" WARNING: Very similar embeddings for different images!") + + # Summary + print(f"\n{'=' * 70}") + print("SUMMARY") + print(f"{'=' * 70}") + for r in results: + status = "PASS" if r["passed"] else "FAIL" + print( + f" [{status}] {r['label']}: shape={r['shape']}, var={r['variance']:.6f}, range=[{r['range'][0]:.3f}, {r['range'][1]:.3f}]" + ) + + if all_passed: + print(f"\n ALL VISION ENCODER CHECKS PASSED") + else: + print(f"\n SOME CHECKS FAILED") + sys.exit(1) + + out_path = os.path.join(REFERENCE_DIR, "neuron_vision_encoder_validation.json") + with open(out_path, "w") as f: + json.dump(results, f, indent=2) + print(f" Results saved to {out_path}") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/vllm/README.md b/contrib/models/Isaac-0.2-2B/vllm/README.md new file mode 100644 index 00000000..e6541f52 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/vllm/README.md @@ -0,0 +1,162 @@ +# Running Isaac-0.2-2B with vLLM on AWS Neuron + +## Setup + +### 1. Download Model Weights + +```bash +huggingface-cli download PerceptronAI/Isaac-0.2-2B-Preview --local-dir /mnt/models/Isaac-0.2-2B-Preview +``` + +### 2. Activate vLLM Environment + +Use the DLAMI venv that includes vLLM 0.16.0 + vllm-neuron 0.5.0: + +```bash +source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate +``` + +### 3. Apply vLLM Patches + +Isaac is a contrib model and requires patching vllm-neuron to register the model: + +```bash +NXDI_ROOT="/mnt/models/neuronx-distributed-inference" +PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \ + python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py +``` + +This patches 3 files in the installed vllm-neuron package: +1. `constants.py` — Registers `IsaacForConditionalGeneration` as a multimodal model +2. `neuronx_distributed_model_loader.py` — Adds Isaac wrapper class with `load_weights()` and custom `execute_model()` override +3. `neuronx_distributed_model_runner.py` — Adds multimodal data routing for `"isaac"` model type + +### 3.5. Patch modular_isaac.py (Required) + +Isaac's HuggingFace `modular_isaac.py` imports the proprietary `perceptron.tensorstream` package, which +is unavailable on Neuron instances. This must be patched before vLLM can load the model config: + +```bash +NXDI_ROOT="/mnt/models/neuronx-distributed-inference" +python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py \ + /mnt/models/Isaac-0.2-2B-Preview/modular_isaac.py +``` + +**Important**: If HuggingFace has already cached the model code, also patch the cached copy: + +```bash +python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/gpu_benchmark/nuke_perceptron_import.py \ + ~/.cache/huggingface/modules/transformers_modules/Isaac_hyphen_0_dot_2_hyphen_2B_hyphen_Preview/modular_isaac.py +``` + +### 4. Compile Model (if not already compiled) + +The model must be compiled via NxDI before vLLM can serve it: + +```bash +source /opt/aws_neuronx_venv_pytorch_2_9_nxd_inference/bin/activate +PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \ + python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/test/integration/run_isaac.py compile +``` + +## Running + +### Offline Inference + +```bash +source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate +NXDI_ROOT="/mnt/models/neuronx-distributed-inference" +PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \ + python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py +``` + +### Online Serving + +1. Start the server: + +```bash +source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate +NXDI_ROOT="/mnt/models/neuronx-distributed-inference" +PYTHONPATH="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src:${NXDI_ROOT}/src:$PYTHONPATH" \ + bash ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh +``` + +2. Query the server: + +```bash +python ${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py --base-url http://localhost:8080 +``` + +Or use curl: + +```bash +curl http://localhost:8080/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "Isaac-0.2-2B-Preview", + "messages": [{"role": "user", "content": "What is quantum computing?"}], + "max_tokens": 100, + "temperature": 0 + }' +``` + +## Configuration + +Key vLLM parameters for Isaac: + +| Parameter | Value | Notes | +|-----------|-------|-------| +| `tensor-parallel-size` | 1 | 2B model fits on single core | +| `max-model-len` | 1024 | Adjust based on compiled buckets | +| `max-num-seqs` | 1 | VLM framework limitation | +| `trust-remote-code` | Required | Isaac uses custom model code | +| `attn_kernel_enabled` | true | CTE flash attention (+2%) | + +## Tested Results + +| Mode | Status | Throughput | Notes | +|------|--------|------------|-------| +| Text-only (offline) | **Working** | ~78 tok/s | Correct output verified | +| Image+text (offline) | Not working | N/A | pixel_values format mismatch | +| Online API server | Not tested | N/A | Text-only expected to work | + +**Example output** (text-only): +``` +Prompt: "What is the capital of France?" +Output: "\n\n\n\nThe capital of France is Paris." +``` + +## Known Limitations + +1. **Image+text is not supported via vLLM**: vLLM-neuron delivers `pixel_values` in pre-flattened + patch format `[num_patches, patch_dim]`, but Isaac's NxDI model expects raw image tensors + `[B, 3, 256, 256]`. Fixing this requires adapting vLLM's multimodal preprocessing or adding + a reshape layer in the wrapper. + +2. **On-device sampling mismatch**: Isaac's NxDI model returns logits (not on-device sampled tokens). + The `execute_model()` override in the wrapper handles this by extracting + `output.logits[:, -1, :]` and applying `torch.argmax()`. This means sampling parameters + like `temperature` and `top_p` are NOT respected — generation is always greedy. + +3. **`modular_isaac.py` must be patched**: The proprietary `perceptron.tensorstream` import must be + removed before vLLM can load the model. See step 3.5 above. + +4. **Single sequence only**: `max-num-seqs=1` is required due to the NxDI VLM framework limitation + (shared with all VLM contrib models). + +## Architecture + +The vLLM integration uses a 3-file patch approach: + +``` +vllm-neuron (installed package) +├── worker/constants.py + "IsaacForConditionalGeneration" in NEURON_MULTI_MODAL_MODELS +├── worker/neuronx_distributed_model_loader.py + NeuronIsaacForConditionalGeneration class +│ + get_neuron_model() dispatch +└── worker/neuronx_distributed_model_runner.py + "isaac" multimodal routing +``` + +The `NeuronIsaacForConditionalGeneration` wrapper: +- Loads the compiled NxDI Isaac model via `load_weights()` +- Overrides `execute_model()` to handle the logits→token ID conversion +- Uses `vision_token_id = 151655` (`<|image_pad|>`) for vision mask construction diff --git a/contrib/models/Isaac-0.2-2B/vllm/add_execute_model.py b/contrib/models/Isaac-0.2-2B/vllm/add_execute_model.py new file mode 100644 index 00000000..d003c0e2 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/vllm/add_execute_model.py @@ -0,0 +1,88 @@ +#!/usr/bin/env python3 +"""Add execute_model and forward overrides to NeuronIsaacForConditionalGeneration in model_loader.py.""" + +import sys + +path = ( + sys.argv[1] + if len(sys.argv) > 1 + else ("/vllm/vllm_neuron/worker/neuronx_distributed_model_loader.py") +) + +with open(path, "r") as f: + content = f.read() + +# We need to add execute_model and forward methods to NeuronIsaacForConditionalGeneration +# The class currently only has load_weights. + +# The old code ends with: +OLD_END = """ self.vision_token_id = tokenizer( + "<|image_pad|>", add_special_tokens=False + ).input_ids[0] + return success, compiled_model_path + + +def _get_model_configs""" + +# The new code adds execute_model and forward after load_weights +NEW_END = ''' self.vision_token_id = tokenizer( + "<|image_pad|>", add_special_tokens=False + ).input_ids[0] + return success, compiled_model_path + + def execute_model(self, model_input, **kwargs): + """Execute model forward pass for Isaac VLM. + + Unlike Llama4, Isaac uses vision_token_id (set during load_weights) + instead of model.config.image_token_index for vision mask creation. + """ + vision_mask = ( + model_input.input_tokens == self.vision_token_id + ).unsqueeze(-1) + + pixel_values = None + if ( + model_input.multi_modal_kwargs is not None + and model_input.multi_modal_kwargs.get("pixel_values") is not None + ): + pixel_values = model_input.multi_modal_kwargs["pixel_values"] + + # Call the base NeuronMultiModalCausalLM.forward directly + # (skip Llama4's forward which assumes Llama4-specific pixel_values format) + hidden_states = NeuronMultiModalCausalLM.forward( + self, + input_ids=model_input.input_tokens, + positions=model_input.position_ids, + input_block_ids=model_input.input_block_ids, + sampling_params=model_input.sampling_params, + pixel_values=pixel_values, + vision_mask=vision_mask, + ) + return hidden_states + + +def _get_model_configs''' + +if OLD_END in content: + content = content.replace(OLD_END, NEW_END) + with open(path, "w") as f: + f.write(content) + print( + f"SUCCESS: Added execute_model override to NeuronIsaacForConditionalGeneration in {path}" + ) +else: + print(f"ERROR: Could not find the expected code block in {path}") + # Show what's around the class + import re + + match = re.search( + r"class NeuronIsaacForConditionalGeneration.*?(?=\nclass |\ndef _get_model_configs)", + content, + re.DOTALL, + ) + if match: + print(f"Found class at positions {match.start()}-{match.end()}") + print("Last 200 chars of class:") + print(match.group()[-200:]) + else: + print("Could not find the class at all") diff --git a/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py b/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py new file mode 100644 index 00000000..9b0f932b --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/vllm/patch_vllm_isaac.py @@ -0,0 +1,346 @@ +#!/usr/bin/env python3 +# Copyright 2025 (c) Amazon.com and Affiliates +"""Patch vllm-neuron 0.5.0 to support Isaac-0.2-2B VLM. + +Applies the 4-layer registration: +1. constants.py — Add to NEURON_MULTI_MODAL_MODELS +2. model_loader.py — Add NeuronIsaacForConditionalGeneration wrapper class +3. model_loader.py — Add architecture dispatch in get_neuron_model() + fix Sampler import +4. model_runner.py — Add multimodal data routing + +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate + python patch_vllm_isaac.py +""" + +import importlib +import os +import sys + + +def find_vllm_neuron_path(): + """Find the installed vllm_neuron package path.""" + try: + spec = importlib.util.find_spec("vllm_neuron") + if spec and spec.origin: + return os.path.dirname(spec.origin) + except (ModuleNotFoundError, AttributeError): + pass + + # Fallback: search common locations + for base in sys.path: + candidate = os.path.join(base, "vllm_neuron") + if os.path.isdir(candidate): + return candidate + + raise FileNotFoundError( + "Cannot find vllm_neuron package. Is vllm-neuron installed?" + ) + + +def patch_constants(worker_dir): + """Layer 1: Add Isaac to NEURON_MULTI_MODAL_MODELS.""" + path = os.path.join(worker_dir, "constants.py") + with open(path, "r") as f: + content = f.read() + + if "IsaacForConditionalGeneration" in content: + print("[constants.py] Already patched — skipping") + return + + # Add Isaac to the NEURON_MULTI_MODAL_MODELS list + # Try various insertion points + for marker in [ + '"Qwen3VLForConditionalGeneration",', + '"Qwen2VLForConditionalGeneration",', + '"Llama4ForConditionalGeneration",', + '"LlavaForConditionalGeneration",', + ]: + if marker in content: + content = content.replace( + marker, + marker + '\n "IsaacForConditionalGeneration",', + ) + break + + if "IsaacForConditionalGeneration" not in content: + print("[constants.py] WARNING: Could not find insertion point") + return + + with open(path, "w") as f: + f.write(content) + print( + "[constants.py] Added IsaacForConditionalGeneration to NEURON_MULTI_MODAL_MODELS" + ) + + +def patch_model_loader(worker_dir): + """Layer 2+3: Fix Sampler import, add Isaac wrapper class, add architecture dispatch.""" + path = os.path.join(worker_dir, "neuronx_distributed_model_loader.py") + with open(path, "r") as f: + content = f.read() + + # Fix Sampler import (shared issue with Gemma3) + if "from vllm.v1.sample import sampler as Sampler" in content: + content = content.replace( + "from vllm.v1.sample import sampler as Sampler", + "from vllm.v1.sample.sampler import Sampler", + ) + print("[model_loader.py] Fixed Sampler import") + + if "NeuronIsaacForConditionalGeneration" in content: + print("[model_loader.py] Already patched — skipping") + with open(path, "w") as f: + f.write(content) + return + + # --- Add Isaac wrapper class before get_neuron_model or _get_model_configs --- + isaac_class = ''' + +class NeuronIsaacForConditionalGeneration(NeuronLlama4ForCausalLM): + """Isaac VLM using dynamically loaded NeuronIsaacForConditionalGeneration from contrib.""" + + def load_weights(self, model_name_or_path: str, architecture: str, **kwargs): + import importlib + + neuronx_module = importlib.import_module("isaac_neuron.modeling_isaac") + neuronx_model_cls = getattr(neuronx_module, "NeuronIsaacForConditionalGeneration") + + default_neuron_config = kwargs["neuron_config"] + override_neuron_config = _validate_image_to_text_override_neuron_config( + kwargs["override_neuron_config"] + ) + + vision_neuron_config = copy.deepcopy(default_neuron_config) + vision_neuron_config.update( + override_neuron_config.get("vision_neuron_config", {}) + ) + vision_neuron_config = neuronx_model_cls.get_neuron_config_cls()( + **vision_neuron_config + ) + + text_neuron_config = copy.deepcopy(default_neuron_config) + text_neuron_config.update(override_neuron_config.get("text_neuron_config", {})) + text_neuron_config = neuronx_model_cls.get_neuron_config_cls()( + **text_neuron_config + ) + + from transformers import AutoConfig + hf_config = AutoConfig.from_pretrained(model_name_or_path, trust_remote_code=True) + + config = neuronx_model_cls.get_config_cls()( + text_neuron_config=text_neuron_config, + vision_neuron_config=vision_neuron_config, + load_config=load_pretrained_config(hf_config=hf_config), + ) + + success, compiled_model_path, _ = self._load_weights_common( + model_name_or_path, neuronx_model_cls, config=config, **kwargs + ) + + if not success: + if not os.path.exists(model_name_or_path): + model_name_or_path = self._save_pretrained_model(model_name_or_path) + + self._compile_and_load_model( + model_name_or_path, neuronx_model_cls, config, compiled_model_path + ) + + # Load tokenizer to get vision token ID + from transformers import AutoTokenizer + tokenizer = AutoTokenizer.from_pretrained( + model_name_or_path, trust_remote_code=True + ) + self.vision_token_id = tokenizer( + "<|image_pad|>", add_special_tokens=False + ).input_ids[0] + return success, compiled_model_path + + def execute_model(self, model_input, **kwargs): + """Execute model forward pass for Isaac VLM. + + Uses vision_token_id for vision mask (not model.config.image_token_index), + calls base forward directly, and handles logits->token_id conversion since + the Isaac compiled model returns logits (not on-device sampled tokens). + """ + vision_mask = ( + model_input.input_tokens == self.vision_token_id + ).unsqueeze(-1) + + pixel_values = None + if ( + model_input.multi_modal_kwargs is not None + and model_input.multi_modal_kwargs.get("pixel_values") is not None + ): + pixel_values = model_input.multi_modal_kwargs["pixel_values"] + + # Call base forward with Isaac-specific args + with self._reordered( + model_input.input_block_ids, + input_ids=model_input.input_tokens, + positions=model_input.position_ids, + sampling_params=model_input.sampling_params, + pixel_values=pixel_values, + vision_mask=vision_mask, + ) as (sorted_ids, inputs, restore): + output = self.model( + inputs["input_ids"].to(torch.int32), + attention_mask=None, + position_ids=inputs["positions"].to(torch.int32), + seq_ids=sorted_ids.flatten().to(torch.int32), + pixel_values=inputs.get("pixel_values"), + vision_mask=inputs.get("vision_mask"), + sampling_params=inputs["sampling_params"], + ) + + # Isaac model returns logits (not on-device sampled tokens) + # Extract last-token logits and argmax to get token IDs + if hasattr(output, "hidden_states") and isinstance(output.hidden_states, torch.Tensor) and output.hidden_states.numel() > 0: + result = output.hidden_states + else: + logits = output.logits[:, -1, :] # [batch, vocab] + result = torch.argmax(logits, dim=-1) # [batch] - token IDs + + return restore(result) + +''' + + # Insert class before _get_model_configs or get_neuron_model + for marker in ["def _get_model_configs(", "def get_neuron_model("]: + if marker in content: + idx = content.index(marker) + content = content[:idx] + isaac_class + "\n" + content[idx:] + print("[model_loader.py] Added NeuronIsaacForConditionalGeneration class") + break + else: + print("[model_loader.py] WARNING: Could not find insertion point for class") + + # --- Add architecture dispatch in get_neuron_model() --- + # This function is in model_loader.py and dispatches based on architecture string + dispatch_markers = [ + 'elif architecture == "Qwen3VLForConditionalGeneration":', + 'elif architecture == "Qwen2VLForConditionalGeneration":', + 'elif architecture == "Llama4ForConditionalGeneration":', + ] + + for marker in dispatch_markers: + if marker in content: + # Find the line after this elif + its body + idx = content.index(marker) + # Find next elif or else + search_start = idx + len(marker) + next_elif = content.find("\n elif ", search_start) + next_else = content.find("\n else:", search_start) + + # Pick the closest one + candidates = [c for c in [next_elif, next_else] if c > 0] + if candidates: + insert_point = min(candidates) + insert_text = ( + '\n elif architecture == "IsaacForConditionalGeneration":' + "\n model = NeuronIsaacForConditionalGeneration(model_config.hf_config)" + ) + content = content[:insert_point] + insert_text + content[insert_point:] + print( + "[model_loader.py] Added Isaac architecture dispatch in get_neuron_model()" + ) + break + else: + print("[model_loader.py] WARNING: Could not find dispatch insertion point") + + with open(path, "w") as f: + f.write(content) + + +def patch_model_runner(worker_dir): + """Layer 4: Add multimodal data routing for Isaac model_type.""" + path = os.path.join(worker_dir, "neuronx_distributed_model_runner.py") + with open(path, "r") as f: + content = f.read() + + if '"isaac"' in content or "'isaac'" in content: + print("[model_runner.py] Already patched — skipping") + return + + changed = False + + # Add multimodal data routing for Isaac + # Isaac uses pass-through (no special multimodal preprocessing needed, like Llama4) + # Look for existing qwen3_vl routing and add after it + routing_markers = [ + 'elif self.model.model.config.model_type == "qwen3_vl":', + 'elif self.model.model.config.model_type == "qwen2_vl":', + 'elif self.model.model.config.model_type == "llava":', + ] + + for marker in routing_markers: + if marker in content: + # Find the line(s) after this elif + idx = content.index(marker) + search_start = idx + len(marker) + # Find next elif or else + next_elif = content.find("\n elif ", search_start) + next_else = content.find("\n else:", search_start) + + candidates = [c for c in [next_elif, next_else] if c > 0] + if candidates: + insert_point = min(candidates) + insert_text = ( + '\n elif self.model.model.config.model_type == "isaac":' + "\n pass # Isaac does not require special multimodal preprocessing" + ) + content = content[:insert_point] + insert_text + content[insert_point:] + print("[model_runner.py] Added Isaac multimodal data routing") + changed = True + break + + if not changed: + # Try alternative: check if there's a list-style routing + for list_marker in [ + "in ['llama4'", + 'in ["llama4"', + "in ['llama4', 'gemma3'", + 'in ["llama4", "gemma3"', + ]: + if list_marker in content: + content = content.replace( + list_marker, + list_marker.rstrip("'\"") + "', 'isaac'" + if "'" in list_marker + else list_marker.rstrip("'\"") + '", "isaac"', + ) + print("[model_runner.py] Added Isaac to multimodal list routing") + changed = True + break + + if not changed: + print( + "[model_runner.py] WARNING: Could not add multimodal routing — may need manual patch" + ) + + with open(path, "w") as f: + f.write(content) + + +def main(): + vllm_neuron_path = find_vllm_neuron_path() + worker_dir = os.path.join(vllm_neuron_path, "worker") + print(f"Found vllm_neuron at: {vllm_neuron_path}") + print(f"Worker directory: {worker_dir}") + print() + + patch_constants(worker_dir) + patch_model_loader(worker_dir) + patch_model_runner(worker_dir) + + print() + print("All patches applied. To use Isaac with vLLM:") + print(" export VLLM_NEURON_FRAMEWORK='neuronx-distributed-inference'") + print(" export NEURON_COMPILED_ARTIFACTS='/mnt/models/traced_model/Isaac-0.2-2B'") + print( + " PYTHONPATH='.../Isaac-0.2-2B/src:$PYTHONPATH' python -m vllm.entrypoints.openai.api_server ..." + ) + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py b/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py new file mode 100644 index 00000000..0182161a --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/vllm/run_offline_inference.py @@ -0,0 +1,129 @@ +#!/usr/bin/env python3 +# Copyright 2025 (c) Amazon.com and Affiliates +"""Offline inference for Isaac-0.2-2B via vLLM on Neuron. + +Usage: + source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate + export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference" + export NEURON_COMPILED_ARTIFACTS="/mnt/models/traced_model/Isaac-0.2-2B" + PYTHONPATH="/mnt/models/neuronx-distributed-inference/contrib/models/Isaac-0.2-2B/src:/mnt/models/neuronx-distributed-inference/src:$PYTHONPATH" \ + python run_offline_inference.py +""" + +from isaac_neuron.ndxi_patch import apply_patch + +apply_patch() + +import os # noqa: E402 +from pathlib import Path # noqa: E402 + +from vllm import LLM, SamplingParams # noqa: E402 + +HOME_DIR = Path.home() +DATA_PATH = os.getenv("DATA_HOME", "/mnt/models") +MODEL_PATH = f"{DATA_PATH}/Isaac-0.2-2B-Preview" +COMPILED_PATH = f"{DATA_PATH}/traced_model/Isaac-0.2-2B" + +os.environ["VLLM_NEURON_FRAMEWORK"] = "neuronx-distributed-inference" +os.environ["NEURON_COMPILED_ARTIFACTS"] = COMPILED_PATH + + +def main(max_seq_len: int = 1024) -> None: + llm = LLM( + model=MODEL_PATH, + max_num_seqs=1, + max_model_len=max_seq_len, + tensor_parallel_size=1, + limit_mm_per_prompt={"image": 1}, + allowed_local_media_path=HOME_DIR.as_posix(), + enable_prefix_caching=False, + enable_chunked_prefill=False, + trust_remote_code=True, + additional_config={ + "override_neuron_config": { + "text_neuron_config": { + "attn_kernel_enabled": True, + "enable_bucketing": True, + "context_encoding_buckets": [max_seq_len], + "token_generation_buckets": [max_seq_len], + "is_continuous_batching": True, + "async_mode": False, + }, + "vision_neuron_config": { + "enable_bucketing": True, + "buckets": [1], + "is_continuous_batching": True, + }, + }, + }, + ) + + sampling_params = SamplingParams(top_k=1, max_tokens=100) + + # Test 1: Text-only + print("=" * 60) + print("Test 1: Text-only") + print("=" * 60) + conversation = [ + { + "role": "user", + "content": [ + { + "type": "text", + "text": "What is the capital of France? Explain briefly.", + }, + ], + } + ] + for output in llm.chat(conversation, sampling_params): + print(f"Generated: {output.outputs[0].text!r}") + + # Test 2: Text-only (longer) + print("\n" + "=" * 60) + print("Test 2: Text-only (longer)") + print("=" * 60) + conversation = [ + { + "role": "user", + "content": [ + { + "type": "text", + "text": "Explain quantum entanglement in simple terms.", + }, + ], + } + ] + for output in llm.chat(conversation, sampling_params): + print(f"Generated: {output.outputs[0].text!r}") + + # Test 3: Image+text (requires a test image) + print("\n" + "=" * 60) + print("Test 3: Image+text") + print("=" * 60) + test_image = Path(__file__).resolve().parent / "data" / "test_image.jpg" + if test_image.exists(): + image_url = f"file://{test_image.as_posix()}" + else: + # Use a publicly accessible image URL + image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png" + + conversation = [ + { + "role": "user", + "content": [ + {"type": "image_url", "image_url": {"url": image_url}}, + {"type": "text", "text": "Describe this image in detail."}, + ], + } + ] + try: + for output in llm.chat(conversation, sampling_params): + print(f"Generated: {output.outputs[0].text!r}") + except Exception as e: + print(f"Image+text failed (may need local image): {e}") + + print("\nAll tests completed.") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py b/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py new file mode 100644 index 00000000..5b8f9eb1 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/vllm/run_online_inference.py @@ -0,0 +1,104 @@ +#!/usr/bin/env python3 +# Copyright 2025 (c) Amazon.com and Affiliates +"""Online inference client for Isaac vLLM server. + +Sends requests to a running vLLM OpenAI-compatible API server. + +Usage: + # Start server first (see start-vllm-server.sh) + python run_online_inference.py [--base-url http://localhost:8080] +""" + +import argparse +import json +import time + +import requests + + +def chat_completion(base_url, messages, max_tokens=100, temperature=0): + """Send a chat completion request to the vLLM server.""" + url = f"{base_url}/v1/chat/completions" + payload = { + "model": "Isaac-0.2-2B-Preview", + "messages": messages, + "max_tokens": max_tokens, + "temperature": temperature, + } + t0 = time.time() + response = requests.post(url, json=payload, timeout=120) + elapsed = time.time() - t0 + response.raise_for_status() + result = response.json() + return result, elapsed + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--base-url", default="http://localhost:8080") + args = parser.parse_args() + + # Test 1: Text-only + print("=" * 60) + print("Test 1: Text-only") + print("=" * 60) + messages = [ + {"role": "user", "content": "What is the capital of France? Explain briefly."} + ] + result, elapsed = chat_completion(args.base_url, messages) + text = result["choices"][0]["message"]["content"] + usage = result.get("usage", {}) + print(f"Response: {text[:200]}") + print(f"Latency: {elapsed:.2f}s") + print(f"Usage: {usage}") + + # Test 2: Text-only (longer) + print("\n" + "=" * 60) + print("Test 2: Text-only (longer)") + print("=" * 60) + messages = [ + { + "role": "user", + "content": "Explain quantum entanglement in simple terms.", + } + ] + result, elapsed = chat_completion(args.base_url, messages) + text = result["choices"][0]["message"]["content"] + usage = result.get("usage", {}) + print(f"Response: {text[:200]}") + print(f"Latency: {elapsed:.2f}s") + print(f"Usage: {usage}") + + # Test 3: Image+text + print("\n" + "=" * 60) + print("Test 3: Image+text") + print("=" * 60) + messages = [ + { + "role": "user", + "content": [ + { + "type": "image_url", + "image_url": { + "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/280px-PNG_transparency_demonstration_1.png" + }, + }, + {"type": "text", "text": "Describe this image."}, + ], + } + ] + try: + result, elapsed = chat_completion(args.base_url, messages) + text = result["choices"][0]["message"]["content"] + usage = result.get("usage", {}) + print(f"Response: {text[:200]}") + print(f"Latency: {elapsed:.2f}s") + print(f"Usage: {usage}") + except Exception as e: + print(f"Image+text failed: {e}") + + print("\nAll online tests completed.") + + +if __name__ == "__main__": + main() diff --git a/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh b/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh new file mode 100644 index 00000000..92e3a517 --- /dev/null +++ b/contrib/models/Isaac-0.2-2B/vllm/start-vllm-server.sh @@ -0,0 +1,32 @@ +#!/bin/bash +# Copyright 2025 (c) Amazon.com and Affiliates +# Start vLLM server for Isaac-0.2-2B on Neuron +# +# Prerequisites: +# 1. Apply vLLM patches: python patch_vllm_isaac.py +# 2. Model compiled at NEURON_COMPILED_ARTIFACTS path +# +# Usage: +# source /opt/aws_neuronx_venv_pytorch_inference_vllm_0_16/bin/activate +# bash start-vllm-server.sh + +export VLLM_NEURON_FRAMEWORK="neuronx-distributed-inference" +export NEURON_COMPILED_ARTIFACTS="/mnt/models/traced_model/Isaac-0.2-2B" +export VLLM_RPC_TIMEOUT=100000 + +NXDI_ROOT="/mnt/models/neuronx-distributed-inference" +ISAAC_SRC="${NXDI_ROOT}/contrib/models/Isaac-0.2-2B/src" +export PYTHONPATH="${ISAAC_SRC}:${NXDI_ROOT}/src:${PYTHONPATH}" + +python -m vllm.entrypoints.openai.api_server \ + --port=8080 \ + --model="/mnt/models/Isaac-0.2-2B-Preview" \ + --max-num-seqs=1 \ + --max-model-len=1024 \ + --limit-mm-per-prompt='{"image": 1}' \ + --allowed-local-media-path="/mnt/models" \ + --tensor-parallel-size=1 \ + --trust-remote-code \ + --no-enable-chunked-prefill \ + --no-enable-prefix-caching \ + --additional-config='{"override_neuron_config":{"text_neuron_config":{"attn_kernel_enabled":true,"enable_bucketing":true,"context_encoding_buckets":[1024],"token_generation_buckets":[1024],"is_continuous_batching":true,"async_mode":false},"vision_neuron_config":{"enable_bucketing":true,"buckets":[1],"is_continuous_batching":true}}}'