
Conversation


@tomtomjhj tomtomjhj commented Nov 28, 2025

Purpose

The scheduler may fail to schedule a batch if get_image_size_with_most_features is wrong and max_num_batched_tokens is not big enough:

  • Some models' get_image_size_with_most_features returns an image size that is encoded into a sequence shorter than the actual maximum. That is, max_tokens_per_mm_item < num_mm_input_tokens is possible.
  • encoder_compute_budget is the max of the sequence length of an image with size get_image_size_with_most_features and max_num_batched_tokens (= max_num_encoder_input_tokens):

        encoder_compute_budget = max(
            scheduler_config.max_num_encoder_input_tokens, max_tokens_per_mm_item
        )
  • Therefore, if max_num_batched_tokens < num_mm_input_tokens, it is possible that max_num_encoder_input_tokens < num_mm_input_tokens, which results in scheduling failure (see the sketch after this list):

        # Not enough compute budget
        if num_tokens > encoder_compute_budget:
            return False
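
A minimal sketch of the failing arithmetic, with illustrative numbers (the names mirror the fields quoted above; 1225 and 1280 are the Qwen2-VL values discussed later in this thread):

    # Illustrative values: the profiled maximum underestimates a real request.
    max_tokens_per_mm_item = 1225       # from a wrong get_image_size_with_most_features
    num_mm_input_tokens = 1280          # tokens for an actual request's image
    max_num_encoder_input_tokens = 512  # a small max_num_batched_tokens

    encoder_compute_budget = max(
        max_num_encoder_input_tokens, max_tokens_per_mm_item
    )

    # The budget check above rejects the item on every scheduling attempt,
    # so the request never runs and vLLM appears to hang.
    assert num_mm_input_tokens > encoder_compute_budget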

This PR fixes get_image_size_with_most_features of qwen2_vl and gemma3.

  • qwen2_vl
    • Problem: The image with the most features is not square when the number of features is not a perfect square.
    • Solution: Factorize the max number of features and construct the image shape closest to square that meets the aspect ratio constraint (see the sketch after this list).
  • gemma3
    • Problem: The image is not big enough to trigger cropping.
    • Solution: Use the native image size.
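
For qwen2_vl, a minimal sketch of the factorization idea, assuming the default budget of 1280 patches (max_pixels / (28 * 28)); the helper mirrors closest_factor_pair from the diff reviewed below:

    import math

    def closest_factor_pair(n: int) -> tuple[int, int]:
        """Return (h, w) with h * w == n, h <= w, and h as large as possible."""
        for d in range(math.isqrt(n), 0, -1):
            if n % d == 0:
                return d, n // d
        return 1, n

    # A square image caps out below the budget because 1280 is not a
    # perfect square: isqrt(1280) == 35 and 35 * 35 == 1225 < 1280.
    print(math.isqrt(1280) ** 2)      # 1225
    # The closest factor pair recovers the full budget: 32 * 40 == 1280.
    print(closest_factor_pair(1280))  # (32, 40)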

Test Plan

Apply the following patch

diff --git a/examples/offline_inference/vision_language.py b/examples/offline_inference/vision_language.py
index 8f72bf6f0..755aba99d 100755
--- a/examples/offline_inference/vision_language.py
+++ b/examples/offline_inference/vision_language.py
@@ -321,6 +321,7 @@ def run_gemma3(questions: list[str], modality: str) -> ModelRequestData:
         max_num_seqs=2,
         mm_processor_kwargs={"do_pan_and_scan": True},
         limit_mm_per_prompt={modality: 1},
+        max_num_batched_tokens=512,
     )
 
     prompts = [
@@ -1527,6 +1528,7 @@ def run_qwen2_5_vl(questions: list[str], modality: str) -> ModelRequestData:
             "fps": 1,
         },
         limit_mm_per_prompt={modality: 1},
+        max_num_batched_tokens=512,
     )
 
     if modality == "image":

Then run

python examples/offline_inference/vision_language.py --model-type qwen2_5_vl

and

python examples/offline_inference/vision_language.py --model-type gemma3

Test Result

Without this fix, vLLM hangs because the scheduler repeatedly fails to schedule the multimodal batch. With this PR applied, both commands run to completion.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Problem:
The image with the most features is not square when the number of
features is not a perfect square.

Solution:
Factorize the max number of features and construct the image shape that
is closest to square while meeting the aspect ratio constraint.

Signed-off-by: Jaehwang Jung <tomtomjhj@gmail.com>
Problem:
The image is not big enough to trigger cropping.

Solution:
Use the native image size.

Signed-off-by: Jaehwang Jung <tomtomjhj@gmail.com>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small, essential subset of tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the qwen Related to Qwen models label Nov 28, 2025
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses a scheduling failure bug caused by an incorrect implementation of get_image_size_with_most_features for the qwen2_vl and gemma3 models. The changes correctly calculate the image dimensions that produce the maximum number of features, which resolves the scheduling issue. For gemma3, the fix replaces a hardcoded image size with one derived from the model's native image size, ensuring that the pan-and-scan cropping logic is correctly triggered. For qwen2_vl, the implementation is updated to calculate the maximum number of features and then determines the optimal image dimensions by finding the factor pair of the feature count that is closest to a square, satisfying aspect ratio constraints. The changes are logical, well-implemented, and directly address the root cause of the bug. The code quality is good, and I have no further suggestions.

@DarkLight1337 DarkLight1337 left a comment

Thanks for fixing, can you add some regression tests using the image size that causes this failure?

Signed-off-by: Jaehwang Jung <tomtomjhj@gmail.com>
@mergify mergify bot added the multi-modality Related to multi-modality (#4194) label Nov 29, 2025
@tomtomjhj (Author)

Added unit tests for get_image_size_with_most_features. These tests fail without my fix.

If the seq len can't be factored into a pair that satisfies the aspect
ratio constraint, decrement the seq len and retry.

Signed-off-by: Jaehwang Jung <tomtomjhj@gmail.com>
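
A self-contained sketch of that fallback (the 200:1 bound matches the aspect-ratio check in the diff; 1279 is just a convenient prime):

    import math

    def closest_factor_pair(n: int) -> tuple[int, int]:
        for d in range(math.isqrt(n), 0, -1):
            if n % d == 0:
                return d, n // d
        return 1, n

    # 1279 is prime, so its only factor pair (1, 1279) violates the
    # width/height <= 200 constraint; one decrement reaches 1278 == 18 * 71.
    for seq_len in range(1279, 0, -1):
        height_factor, width_factor = closest_factor_pair(seq_len)
        if width_factor / height_factor <= 200:
            break
    print(seq_len, height_factor, width_factor)  # 1278 18 71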
@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 29, 2025
@ywang96 ywang96 (Member) commented Nov 29, 2025

Looks like there are some test failures

unsupported operand type(s) for //: 'NoneType' and 'int'
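
For context, that is the error Python raises when the left operand of // is None; presumably max_pixels is unset on some processor configs, which the new code would need to guard against:

    max_pixels = None  # e.g. an image processor without max_pixels configured
    unit = 28
    max_seq_len = max_pixels // (unit * unit)
    # TypeError: unsupported operand type(s) for //: 'NoneType' and 'int'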

Comment on lines -962 to +983
-        max_image_size, _ = self._get_vision_info(
-            image_width=9999999,
-            image_height=9999999,
-            num_frames=1,
-            image_processor=None,
-        )
-        return max_image_size
+        hf_config = self.get_hf_config()
+        vision_config = hf_config.vision_config
+        patch_size = vision_config.patch_size
+        merge_size = vision_config.spatial_merge_size
+        image_processor = self.get_image_processor()
+        max_pixels = image_processor.max_pixels
+        unit = patch_size * merge_size
+        max_seq_len = max_pixels // (unit * unit)
+
+        def closest_factor_pair(n: int) -> tuple[int, int]:
+            # left <= right
+            for d in range(math.isqrt(n), 0, -1):
+                if n % d == 0:
+                    return d, n // d
+            return 1, n
+
+        height_factor, width_factor = 1, max_seq_len
+        for seq_len in range(max_seq_len, 0, -1):
+            height_factor, width_factor = closest_factor_pair(seq_len)
+            if width_factor / height_factor <= 200:
+                break
@ywang96 ywang96 (Member) Nov 29, 2025

Thanks for the bugfix on Qwen2_VL but I don't think this is the right fix. IMO we should fix _get_vision_info itself instead.

I'm also curious if you can provide a repro example of when the current _get_vision_info impl is not accurate. We're already giving it a pretty big image size of 9999999 x 9999999, so I'm actually a bit surprised that it doesn't give us the max result image size.

@tomtomjhj (Author)

_get_vision_info uses smart_resize.
https://github.com/huggingface/transformers/blob/cac0a28c83cf87b7a05495de3177099c635ba852/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L76

smart_resize ensures that the number of pixels in the resized image is less than or equal to max_pixels. If max_pixels is 1280 * 28 * 28, the max number of patches in the resized image is 1280. Since 1280 is not a perfect square, a square image cannot reach the max number of patches regardless of its size; a huge square image is resized to 1225 (35 * 35) patches. My patch fixes the issue by factoring 1280 into 32 * 40 and constructing an image size that is 32 patches high and 40 patches wide.
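
A simplified rendition of that resizing logic (condensed from the linked smart_resize; the real function also rounds small images up to min_pixels):

    import math

    def smart_resize_simplified(
        height: int, width: int, factor: int = 28,
        max_pixels: int = 1280 * 28 * 28,
    ) -> tuple[int, int]:
        # Round each side to a multiple of `factor`, then shrink
        # proportionally if the total pixel count exceeds max_pixels.
        h = round(height / factor) * factor
        w = round(width / factor) * factor
        if h * w > max_pixels:
            beta = math.sqrt((height * width) / max_pixels)
            h = math.floor(height / beta / factor) * factor
            w = math.floor(width / beta / factor) * factor
        return h, w

    h, w = smart_resize_simplified(9999999, 9999999)
    print(h // 28, w // 28)  # 35 35 -> only 1225 of the 1280-patch budget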

@ywang96 ywang96 (Member) Nov 30, 2025

I think in this case we should be updating _get_vision_info instead since it's a "bug" of that function rather than get_image_size_with_most_features, correct?

@tomtomjhj (Author)

I think _get_vision_info is doing its job correctly. Qwen2 processes a huge square image into 35x35 patches when given the pixel limit 1280x28x28, so _get_vision_info should follow that behavior.
