Add Isaac-0.2-2B-Preview VLM contrib model#154
Open
jimburtoft wants to merge 4 commits into aws-neuron:main from
Conversation
NxDI implementation of PerceptronAI/Isaac-0.2-2B-Preview VLM:
- Qwen3 text backbone with SigLIP2 vision encoder
- 2-layer MLP projector with pixel shuffle (64 vision tokens/image)
- Supports TP=1/2/4, seq_len up to 8192
- 110.7 tok/s text-only, 108.7 tok/s image+text on trn2.3xlarge
- 9.0 ms TPOT at seq_len=1024
- BF16, CTE flash attention enabled
- Validated: cosine 0.9999+ vs CPU reference across all configs
- vLLM-neuron integration with 3-file patch (text-only working, ~78 tok/s)
- GPU comparative benchmark: L40S at 52 tok/s vs trn2 at 111 tok/s (2.13x speedup)
- modular_isaac.py perceptron import fix (nuke_perceptron_import.py)
- execute_model override for logits-to-token-ID conversion
- Known limitation: image+text via vLLM not yet supported (pixel_values format mismatch)
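The pixel-shuffle step mentioned above cuts the vision token count by folding neighboring spatial patches into the channel dimension. A minimal sketch of the idea (the grid size and hidden dimension here are illustrative, not taken from the model config):

```python
import torch

def pixel_shuffle(x: torch.Tensor, scale: int = 2) -> torch.Tensor:
    """Space-to-depth rearrangement: merges each scale x scale block of
    vision tokens into the channel dimension, reducing the token count
    by scale**2.  x has shape (batch, h, w, c)."""
    b, h, w, c = x.shape
    x = x.reshape(b, h // scale, scale, w // scale, scale, c)
    x = x.permute(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h // scale, w // scale, c * scale * scale)

# e.g. a 16x16 grid of encoder tokens -> an 8x8 = 64-token grid
tokens = torch.randn(1, 16, 16, 1152)
out = pixel_shuffle(tokens, scale=2)
print(out.shape)  # torch.Size([1, 8, 8, 4608])
```

The total element count is unchanged; only the spatial/channel layout moves, which is what lets a 2-layer MLP projector map the wider channels into the text model's hidden size.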
The previous benchmark used enforce_eager=True, which capped the GPU at 52 tok/s. With CUDA graphs, torch.compile, and FlashAttention v2, the L40S achieves 174 tok/s. The GPU is 1.5x faster per core than a single NeuronCore, but trn2 with DP=4 is 2.5x faster at the device level.
Note: the template below includes items intended for model contributions only.
Description
Isaac-0.2-2B-Preview is a 2.57B vision-language model from PerceptronAI, combining a standard Qwen3 text backbone with a SigLIP2 vision encoder and a 2-layer MLP projector with pixel shuffle. Onboarded to Neuron via NxDI's NeuronBaseForImageToText framework. Validated on trn2.3xlarge (LNC=2, TP=1, BF16) with text-only cosine similarity of 0.999978 vs the CPU reference, achieving 110.7 tok/s text-only and 108.7 tok/s image+text generation.
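The cosine-similarity check used for validation can be sketched as follows (the function name and tensor shapes are illustrative, not the repo's actual test code):

```python
import torch

def logits_cosine_similarity(neuron_logits: torch.Tensor,
                             cpu_logits: torch.Tensor) -> float:
    """Flatten both logit tensors and compute a single cosine score;
    values near 1.0 indicate close numerical agreement between the
    Neuron output and the CPU reference."""
    a = neuron_logits.flatten().to(torch.float32)
    b = cpu_logits.flatten().to(torch.float32)
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

torch.manual_seed(0)
ref = torch.randn(1, 151936)                  # vocab-sized logits row
noisy = ref + 1e-3 * torch.randn_like(ref)    # small numerical perturbation
print(logits_cosine_similarity(noisy, ref))   # close to 1.0
```

Flattening across the vocabulary dimension gives one scalar per comparison, which is why a single figure like 0.999978 can summarize agreement for a whole forward pass.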
Model Information
Model Name: Isaac-0.2-2B-Preview
Model Architecture: Vision-language model (SigLIP2 encoder + pixel shuffle + 2-layer MLP projector + Qwen3 decoder)
HuggingFace: PerceptronAI/Isaac-0.2-2B-Preview
License: CC-BY-NC-4.0
Checklist
Required Components
- Accuracy Test (test/integration/validate_text_logits.py); also validate_image_text.py (3 image+text E2E tests), validate_vision_encoder.py, validate_tkg.py
- README.md with the following sections:
- Source Code (src/isaac_neuron/)
  - modeling_isaac.py: Top-level VLM orchestrator (NeuronBaseForImageToText)
  - modeling_isaac_text.py: Text backbone (NeuronBaseModel wrapping NxDI Qwen3 layers)
  - modeling_isaac_vision.py: Vision encoder wrapper (SigLIP2 + pixel shuffle + MLP projector)
  - siglip/modeling_siglip.py: SigLIP2 encoder (adapted from Gemma3-vision contrib)
  - siglip/layers.py: Parallel Conv2d for vision patch embedding
  - ndxi_patch.py: SDK 2.29 compatibility patches
  - utils.py: Shared utilities

Optional Components

- Integration Tests (test/integration/)
  - validate_text_logits.py: First-token logit accuracy (CPU vs Neuron)
  - validate_tkg.py: Token generation quality and throughput
  - validate_image_text.py: End-to-end multimodal generation
  - validate_vision_encoder.py: Vision encoder output validation
  - test_tp.py: Tensor parallelism at TP=1, 2, 4
  - test_kernels.py: NKI kernel compatibility sweep
  - test_scaling.py: Sequence length scaling (1024-8192)
  - test_weight_loading.py: State dict key mapping validation
  - benchmark.py: Formal benchmark harness (10 iterations, 3 warmup)
  - run_isaac.py: Quick compile + run utility
- vLLM Integration (vllm/)
  - patch_vllm_isaac.py: Automated 3-file vllm-neuron patch script
  - run_offline_inference.py: Offline inference example
  - run_online_inference.py: OpenAI-compatible API client
  - start-vllm-server.sh: Server launch script
  - README.md: Setup and usage documentation
- GPU Benchmark (gpu_benchmark/)
  - benchmark_gpu.py: L40S benchmark script (vLLM 0.20.0, CUDA graphs enabled)
  - gpu_benchmark_results.json: Full results (4 workloads)

Folder Structure
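The 10-iteration / 3-warmup scheme used by the benchmark harness follows a common warmup-then-measure pattern. A minimal, framework-agnostic sketch (function names are hypothetical, not the contents of benchmark.py):

```python
import time

def run_benchmark(generate_fn, n_iters: int = 10, n_warmup: int = 3) -> float:
    """Discard warmup runs (compilation, cache population), then report
    mean tokens/sec over the timed iterations.  generate_fn must return
    the number of tokens it generated."""
    for _ in range(n_warmup):
        generate_fn()
    rates = []
    for _ in range(n_iters):
        t0 = time.perf_counter()
        n_tokens = generate_fn()
        rates.append(n_tokens / (time.perf_counter() - t0))
    return sum(rates) / len(rates)

# Usage with a dummy generator that "produces" 128 tokens per call:
mean_tok_s = run_benchmark(lambda: 128)
print(f"{mean_tok_s:.1f} tok/s")
```

Warmup iterations matter especially on Neuron, where the first runs can include graph compilation and would otherwise skew the mean.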
Testing
How did you test this change?
All tests run on trn2.3xlarge (LNC=2, TP=1) with Neuron SDK 2.29 (DLAMI 20260410, NxDI 0.9.17334).
Benchmark Results (trn2.3xlarge, TP=1, BF16, seq_len=1024, 10 iterations):
GPU Comparison (L40S, BF16, vLLM 0.20.0, CUDA graphs enabled):
The L40S GPU is 1.5x faster per core than a single NeuronCore. At the device level (DP=4), trn2.3xlarge is 2.5x faster than the L40S.
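The per-core and device-level ratios follow directly from the measured rates, assuming device-level trn2 throughput scales as four independent NeuronCores under DP=4:

```python
# Sanity-check the speedup claims from the measured throughputs.
l40s_tok_s = 174.0          # L40S with CUDA graphs + FlashAttention v2
neuron_core_tok_s = 110.7   # single NeuronCore, text-only

per_core_ratio = l40s_tok_s / neuron_core_tok_s          # GPU vs one core
device_ratio = (4 * neuron_core_tok_s) / l40s_tok_s      # trn2 DP=4 vs GPU

print(f"GPU per-core advantage: {per_core_ratio:.2f}x")    # ~1.57x
print(f"trn2 DP=4 device advantage: {device_ratio:.2f}x")  # ~2.54x
```

The linear DP=4 scaling is an approximation; real aggregate throughput depends on request batching and any shared-resource contention across cores.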