Conversation

@yongming-qin (Contributor) commented on Nov 30, 2025

Purpose

Add support for the OpenVLA model in vLLM. OpenVLA is a vision-language-action model that combines timm-based vision backbones (the Prismatic architecture) with an LLM backbone for action prediction tasks. This implementation follows the same pattern as DeepSeek-VL2, which also uses timm for vision processing.

This PR adds:

  • Model executor implementation (vllm/model_executor/models/openvla.py)
  • Configuration class (vllm/transformers_utils/configs/openvla.py)
  • Processor class (vllm/transformers_utils/processors/openvla.py)

The implementation supports (see the usage sketch after this list):

  • Single and fused vision backbones using timm ViT models
  • Multiple LLM backbones (Llama-2, Mistral, Phi-3)
  • Image embedding insertion via prompt updates
  • Tensor parallelism support for vision backbone
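
For reference, a minimal sketch of how offline inference with this model is expected to look once the PR lands. The prompt template, image path, and the need for `trust_remote_code` are assumptions, not confirmed in this PR; check the OpenVLA model card for the exact instruction format.

```python
from PIL import Image
from vllm import LLM, SamplingParams

# Assumed OpenVLA-style instruction prompt; verify against the model card.
prompt = "In: What action should the robot take to pick up the cup?\nOut:"
image = Image.open("frame.png")  # placeholder image path

# trust_remote_code is assumed here in case the custom config/processor is needed.
llm = LLM(model="openvla/openvla-7b", trust_remote_code=True)
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=16),
)
print(outputs[0].outputs[0].text)
```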

FIX #14739 ([New Model]: Can you support the VLA series models? For example, openVLA.)

Test Plan

  1. Basic inference test:

    vllm serve openvla/openvla-7b
  2. Test with image input (see the API request sketch after this list):

    • Use the OpenAI-compatible API to send requests with image data
    • Verify image embeddings are correctly processed and inserted
  3. Compare outputs with the HuggingFace implementation (see the comparison sketch after this list):

    • Run inference on the same inputs with both the vLLM and HF implementations
    • Verify the output logits/tokens match
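
API request sketch for step 2, assuming the server was started with `vllm serve openvla/openvla-7b` on the default port. The image path, instruction text, and base URL are placeholders:

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("frame.png", "rb") as f:  # placeholder image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openvla/openvla-7b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "What action should the robot take to pick up the cup?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
    temperature=0.0,
    max_tokens=32,
)
print(response.choices[0].message.content)
```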
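
Comparison sketch for step 3, running greedy decoding on the same prompt and image through both the HuggingFace checkpoint (via `trust_remote_code`) and vLLM, then comparing the generated token IDs. The prompt template and image path are again assumptions; the HF checkpoint also exposes `predict_action`, but plain `generate` is enough for a token-level check.

```python
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor
from vllm import LLM, SamplingParams

MODEL = "openvla/openvla-7b"
prompt = "In: What action should the robot take to pick up the cup?\nOut:"  # assumed template
image = Image.open("frame.png")  # placeholder image path

# HuggingFace reference (greedy decoding).
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)
hf_model = AutoModelForVision2Seq.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, trust_remote_code=True
).to("cuda")
inputs = processor(prompt, image).to("cuda", dtype=torch.bfloat16)
hf_ids = hf_model.generate(**inputs, max_new_tokens=16, do_sample=False)
# Strip the prompt tokens; exact alignment may differ slightly across implementations.
hf_new_tokens = hf_ids[0, inputs["input_ids"].shape[-1]:].tolist()

# vLLM (greedy decoding). Running both models in one process needs enough GPU memory;
# otherwise run the two halves as separate scripts.
llm = LLM(model=MODEL, trust_remote_code=True)
vllm_out = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": image}},
    SamplingParams(temperature=0.0, max_tokens=16),
)

print("HF tokens:  ", hf_new_tokens)
print("vLLM tokens:", list(vllm_out[0].outputs[0].token_ids))
```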

Test Result

[To be filled after testing]

…vision.

Signed-off-by: yongming-qin <yq0536@gmail.com>
@yongming-qin (Contributor, Author) commented:
Note: Currently the model can be loaded by vLLM and used to process an image plus a text instruction. However, the outputs from vLLM and the HuggingFace Transformers implementation differ. Comments and collaboration are welcome.

…penvla-7b

Signed-off-by: Luke <yq0536@gmail.com>
mergify bot added the new-model (Requests to new models) label on Dec 2, 2025