
Add InternVL3-8B-Instruct contrib model#153

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/internvl3-8b-instruct

Conversation

@jimburtoft
Contributor

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Adds InternVL3-8B-Instruct, a vision-language model (VLM), for Neuron inference via the NeuronBaseForImageToText framework. Architecture: InternViT-300M vision encoder + pixel shuffle MLP projector + Qwen2.5-7B text backbone (~8B total parameters, BF16).

Validated on trn2.3xlarge (LNC=2, TP=4) with logit_validation passing, 75.1 tok/s text generation, and end-to-end multimodal inference.
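To make the projector's role concrete, here is a hedged back-of-the-envelope sketch of the token-count arithmetic. The 448x448 tile size comes from this PR; the 14x14 patch size and 0.5 pixel-shuffle ratio are assumptions based on standard InternVL configurations, not values stated here.

```python
# Token-count arithmetic for the pixel-shuffle projector.
# 448x448 tiles are from the PR; patch size 14 and scale 0.5 are
# assumed standard InternVL values, shown for illustration only.
tile = 448
patch = 14
scale = 0.5  # assumed pixel-shuffle downsampling ratio

side = tile // patch              # 32 patches per side
vit_tokens = side * side          # 1024 tokens out of the ViT

# Pixel shuffle halves each spatial side and folds the 2x2 neighborhood
# into the channel dimension, so the decoder sees 4x fewer tokens.
new_side = int(side * scale)      # 16
llm_tokens = new_side * new_side  # 256 tokens per tile into the decoder
```

Under these assumptions, each 448x448 tile contributes 256 embeddings to the Qwen2.5-7B backbone rather than 1024, which is why pixel shuffle is a cheap way to cut decoder context length.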

Model Information

Model Name: InternVL3-8B-Instruct

Model Architecture: Vision-language model (InternViT-300M encoder + Qwen2.5-7B decoder)

Purpose: Multimodal text generation (image-to-text, visual question answering, text-only chat)

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • Integration test validates model accuracy using torch_neuronx.testing.validation.logit_validation
    • Generates CPU FP32 reference logits via generate_expected_logits, compares against Neuron BF16 output
    • Handles TP vocab padding (151674 → 151676) by truncating to TP-aligned boundary
    • BF16-appropriate tolerance map (top-5: 0.05, top-50: 0.06, all: 0.08)
    • 2/2 tests pass on trn2.3xlarge
  • README.md with the following sections:

    • Usage Example: Compile and run examples for text-only and multimodal inference
    • Compatibility Matrix: Validated on trn2.3xlarge (LNC=2, TP=4) with SDK 2.29
    • Example Checkpoints: OpenGVLab/InternVL3-8B-Instruct
    • Testing Instructions: pytest test/integration/test_model.py -v --tb=short
  • Source Code (src/)

    • modeling_internvl3.py: Top-level VLM (NeuronBaseForImageToText)
    • modeling_internvl3_text.py: Text backbone (Qwen2.5-7B with vision embedding injection)
    • modeling_internvl3_vision.py: Vision encoder (InternViT-300M, torch_neuronx.trace)

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • test/unit/__init__.py exists (placeholder for future tests)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/InternVL3-8B-Instruct/
  README.md
  compile_internvl3_vlm.py
  /src
    __init__.py
    modeling_internvl3.py
    modeling_internvl3_text.py
    modeling_internvl3_vision.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests run on trn2.3xlarge (LNC=2, TP=4) with Neuron SDK 2.29.

  1. Model compiled using compile_internvl3_vlm.py (text context-encoding and token-generation NEFFs plus the vision-encoder NEFF)
  2. Integration test test_model.py runs:
    • test_config: Validates InternVL3 config matches expected Qwen2.5-7B architecture (7 assertions)
    • test_text_logit_validation: CPU FP32 reference logits (16 tokens) compared against Neuron BF16 via logit_validation() with per-tier tolerances

Test Results:

test/integration/test_model.py::TestInternVL3Integration::test_config PASSED
test/integration/test_model.py::TestInternVL3Integration::test_text_logit_validation PASSED

Summary: Max divergence difference = 0
  Top k = 5  max error = 0.036 (tol: 0.05)
  Top k = 50 max error = 0.048 (tol: 0.06)
  Top k = 1000 max error = 0.050 (tol: 0.06)
  Top k = None max error = 0.063 (tol: 0.08)

======================= 2 passed in 26.82s ========================

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (NxDI 0.9.17334, neuronx-cc 2.24.5133.0)
  • Instance Type(s): trn2.3xlarge (LNC=2, TP=4)
  • PyTorch Version: 2.9
  • Python Version: 3.12.3

Additional Information

  • Performance: 75.1 tok/s (BS=1, seq_len=2048), TTFT 138ms. 1.85x faster than an L40S GPU.
  • Vision encoder: Compiled via torch_neuronx.trace() with --auto-cast=matmult -O1. 34.5ms per 448x448 tile.
  • TP vocab padding: InternVL3 vocab_size (151674) is not divisible by TP=4, so lm_head pads to 151676. The test truncates logits to the TP-aligned boundary (151672) to avoid false failures from padding artifacts.
  • Maintainer: Jim Burtoft (@jimburtoft)
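The padding arithmetic described above can be verified directly. The vocab size (151674) and TP degree (4) are from this PR; the variable names are illustrative.

```python
# TP vocab-padding arithmetic from the PR: vocab_size 151674 is not
# divisible by TP=4, so lm_head is padded up and the test truncates down.
import math

vocab_size = 151674
tp = 4

padded = math.ceil(vocab_size / tp) * tp   # lm_head padded size: 151676
aligned = (vocab_size // tp) * tp          # truncation boundary: 151672

# The test compares only logits[..., :aligned], so the padding columns
# (which hold meaningless values) can never trigger false failures.
```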

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

vLLM integration requires patches to vllm-neuron's model loader to register the InternVL3 architecture. See README for details.


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

