
Add InternVL3-8B-Instruct contrib model#153

Open
jimburtoft wants to merge 1 commit into aws-neuron:main from jimburtoft:contrib/internvl3-8b-instruct

Conversation

@jimburtoft
Contributor

Note: The template below includes items meant for model contributions only. For other contributions, such as bug fixes or features, fill out only the relevant portions of the form.

Description

Adds InternVL3-8B-Instruct, a vision-language model (VLM), for Neuron inference via the NeuronBaseForImageToText framework. Architecture: InternViT-300M vision encoder + pixel shuffle MLP projector + Qwen2.5-7B text backbone (~8B total parameters, BF16).

Validated on trn2.3xlarge (LNC=2, TP=4) with logit_validation passing, 75.1 tok/s text generation, and end-to-end multimodal inference.
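To make the projector's role concrete, here is a hedged back-of-the-envelope sketch of the token-count arithmetic. The 448x448 tile size comes from this PR; the 14x14 patch size and 0.5 pixel-shuffle ratio are assumptions based on standard InternVL configurations, not values stated here.

```python
# Token-count arithmetic for the pixel-shuffle projector.
# 448x448 tiles are from the PR; patch size 14 and scale 0.5 are
# assumed standard InternVL values, shown for illustration only.
tile = 448
patch = 14
scale = 0.5  # assumed pixel-shuffle downsampling ratio

side = tile // patch              # 32 patches per side
vit_tokens = side * side          # 1024 tokens out of the ViT

# Pixel shuffle halves each spatial side and folds the 2x2 neighborhood
# into the channel dimension, so the decoder sees 4x fewer tokens.
new_side = int(side * scale)      # 16
llm_tokens = new_side * new_side  # 256 tokens per tile into the decoder
```

Under these assumptions, each 448x448 tile contributes 256 embeddings to the Qwen2.5-7B backbone rather than 1024, which is why pixel shuffle is a cheap way to cut decoder context length.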

Model Information

Model Name: InternVL3-8B-Instruct

Model Architecture: Vision-language model (InternViT-300M encoder + Qwen2.5-7B decoder)

Purpose: Multimodal text generation (image-to-text, visual question answering, text-only chat)

Checklist

Please ensure your PR includes the following items. Refer to the contrib/CONTRIBUTING.md for detailed guidelines.

Required Components

  • Accuracy Test (ex. test/integration/test_model.py)

    • Integration test validates model accuracy using torch_neuronx.testing.validation.logit_validation
    • Generates CPU FP32 reference logits via generate_expected_logits, compares against Neuron BF16 output
    • Handles TP vocab padding (151674 → 151676) by truncating to TP-aligned boundary
    • BF16-appropriate tolerance map (top-5: 0.05, top-50: 0.06, all: 0.08)
    • 2/2 tests pass on trn2.3xlarge
  • README.md with the following sections:

    • Usage Example: Compile and run examples for text-only and multimodal inference
    • Compatibility Matrix: Validated on trn2.3xlarge (LNC=2, TP=4) with SDK 2.29
    • Example Checkpoints: OpenGVLab/InternVL3-8B-Instruct
    • Testing Instructions: pytest test/integration/test_model.py -v --tb=short
  • Source Code (src/)

    • modeling_internvl3.py: Top-level VLM (NeuronBaseForImageToText)
    • modeling_internvl3_text.py: Text backbone (Qwen2.5-7B with vision embedding injection)
    • modeling_internvl3_vision.py: Vision encoder (InternViT-300M, torch_neuronx.trace)

Optional Components

  • Unit Tests (CPU or Neuron-based)
    • test/unit/__init__.py exists (placeholder for future tests)

Folder Structure

Confirm your contribution follows this structure:

/contrib/models/InternVL3-8B-Instruct/
  README.md
  compile_internvl3_vlm.py
  /src
    __init__.py
    modeling_internvl3.py
    modeling_internvl3_text.py
    modeling_internvl3_vision.py
  /test
    __init__.py
    /unit
      __init__.py
    /integration
      __init__.py
      test_model.py

Testing

How did you test this change?

All tests run on trn2.3xlarge (LNC=2, TP=4) with Neuron SDK 2.29.

  1. Model compiled using compile_internvl3_vlm.py (text context-encoding and token-generation NEFFs plus the vision-encoder NEFF)
  2. Integration test test_model.py runs:
    • test_config: Validates InternVL3 config matches expected Qwen2.5-7B architecture (7 assertions)
    • test_text_logit_validation: CPU FP32 reference logits (16 tokens) compared against Neuron BF16 via logit_validation() with per-tier tolerances

Test Results:

test/integration/test_model.py::TestInternVL3Integration::test_config PASSED
test/integration/test_model.py::TestInternVL3Integration::test_text_logit_validation PASSED

Summary: Max divergence difference = 0
  Top k = 5  max error = 0.036 (tol: 0.05)
  Top k = 50 max error = 0.048 (tol: 0.06)
  Top k = 1000 max error = 0.050 (tol: 0.06)
  Top k = None max error = 0.063 (tol: 0.08)

======================= 2 passed in 26.82s ========================

Compatibility

Tested with:

  • Neuron SDK Version(s): 2.29 (NxDI 0.9.17334, neuronx-cc 2.24.5133.0)
  • Instance Type(s): trn2.3xlarge (LNC=2, TP=4)
  • PyTorch Version: 2.9
  • Python Version: 3.12.3

Additional Information

  • Performance: 75.1 tok/s (BS=1, seq_len=2048), TTFT 138ms. 1.85x faster than an L40S GPU.
  • Vision encoder: Compiled via torch_neuronx.trace() with --auto-cast=matmult -O1. 34.5ms per 448x448 tile.
  • TP vocab padding: InternVL3 vocab_size (151674) is not divisible by TP=4, so lm_head pads to 151676. The test truncates logits to the TP-aligned boundary (151672) to avoid false failures from padding artifacts.
  • Maintainer: Jim Burtoft (@jimburtoft)
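The padding arithmetic described above can be verified directly. The vocab size (151674) and TP degree (4) are from this PR; the variable names are illustrative.

```python
# TP vocab-padding arithmetic from the PR: vocab_size 151674 is not
# divisible by TP=4, so lm_head is padded up and the test truncates down.
import math

vocab_size = 151674
tp = 4

padded = math.ceil(vocab_size / tp) * tp   # lm_head padded size: 151676
aligned = (vocab_size // tp) * tp          # truncation boundary: 151672

# The test compares only logits[..., :aligned], so the padding columns
# (which hold meaningless values) can never trigger false failures.
```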

Related Issues

N/A

vLLM Integration

  • This model/feature is intended for use with vLLM
  • Documentation includes vLLM registration instructions

vLLM integration requires patches to vllm-neuron's model loader to register the InternVL3 architecture. See README for details.


By submitting this PR, I confirm that:

  • I have read and followed the contributing guidelines
  • This is a community contribution and may have limited testing compared to officially-supported models
  • The code follows best practices and is well-documented
  • All required components listed above are included

