-
-
Notifications
You must be signed in to change notification settings - Fork 11.7k
[Bugfix] Schedule failure due to wrong get_image_size_with_most_features #29692
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Problem: An image with with most features is not a square if the number of features is not a squre. Solution: Factorize the max number of features and contruct the image shape that is closest to square to meet the aspect ratio constraint. Signed-off-by: Jaehwang Jung <tomtomjhj@gmail.com>
Problem: The image is not big enough to trigger cropping. Solution: Use the native image size. Signed-off-by: Jaehwang Jung <tomtomjhj@gmail.com>
|
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run You ask your reviewers to trigger select CI tests on top of Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request addresses a scheduling failure bug caused by an incorrect implementation of get_image_size_with_most_features for the qwen2_vl and gemma3 models. The changes correctly calculate the image dimensions that produce the maximum number of features, which resolves the scheduling issue. For gemma3, the fix replaces a hardcoded image size with one derived from the model's native image size, ensuring that the pan-and-scan cropping logic is correctly triggered. For qwen2_vl, the implementation is updated to calculate the maximum number of features and then determines the optimal image dimensions by finding the factor pair of the feature count that is closest to a square, satisfying aspect ratio constraints. The changes are logical, well-implemented, and directly address the root cause of the bug. The code quality is good, and I have no further suggestions.
DarkLight1337
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing, can you add some regression tests using the image size that causes this failure?
|
Added unit tests for |
If the seq len can't be factored into a pair that satifies the aspect ratio contraint, decrement the seq len and retry. Signed-off-by: Jaehwang Jung <tomtomjhj@gmail.com>
|
Looks like there are some test failures |
| max_image_size, _ = self._get_vision_info( | ||
| image_width=9999999, | ||
| image_height=9999999, | ||
| num_frames=1, | ||
| image_processor=None, | ||
| ) | ||
| return max_image_size | ||
| hf_config = self.get_hf_config() | ||
| vision_config = hf_config.vision_config | ||
| patch_size = vision_config.patch_size | ||
| merge_size = vision_config.spatial_merge_size | ||
| image_processor = self.get_image_processor() | ||
| max_pixels = image_processor.max_pixels | ||
| unit = patch_size * merge_size | ||
| max_seq_len = max_pixels // (unit * unit) | ||
|
|
||
| def closest_factor_pair(n: int) -> tuple[int, int]: | ||
| # left <= right | ||
| for d in range(math.isqrt(n), 0, -1): | ||
| if n % d == 0: | ||
| return d, n // d | ||
| return 1, n | ||
|
|
||
| height_factor, width_factor = 1, max_seq_len | ||
| for seq_len in range(max_seq_len, 0, -1): | ||
| height_factor, width_factor = closest_factor_pair(seq_len) | ||
| if width_factor / height_factor <= 200: | ||
| break |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the bugfix on Qwen2_VL but I don't think this is the right fix. IMO we should fix _get_vision_info itself instead.
I'm also curious if you can provide a repro example on when the current _get_vision_info impl is not accurate? Here we're already giving it a pretty big image size of 9999999, 9999999 so I'm actually a bit surprised that it doesn't give us the max result image size.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_get_vision_info uses smart_resize.
https://github.com/huggingface/transformers/blob/cac0a28c83cf87b7a05495de3177099c635ba852/src/transformers/models/qwen2_vl/image_processing_qwen2_vl.py#L76
smart_resize ensures that the number of pixels in the resized image is less than or equal to max_pixels. If max_pixels is 1280 * 28 * 28, the max number of patches in the resize image is 1280. Since 1280 is not a square number, a square image cannot have the max number of patches regardless of the size. Specifically, the big square image is resized in to 1225 (35 * 35) patches. My patch fixes the issue by factoring 1280 into 32 * 40, and constructing an image size with 32 patches high and 40 patches wide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case we should be updating _get_vision_info instead since it's a "bug" of that function rather than get_image_size_with_most_features, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think _get_vision_info is doing its job correctly. Qwen2 processes huge square image into 35x35 patches when given pixel limit 1280x28x28, so _get_vision_info should follow that behavior.
Purpose
The scheduler may fail to schedule a batch if
get_image_size_with_most_featuresis wrong andmax_num_batched_tokensis not big enough:get_image_size_with_most_featuresreturns an image size that is encoded into a sequence shorter than the actual maximum. That is,max_tokens_per_mm_item < num_mm_input_tokensis possible.encoder_compute_budgetis the max of the sequence length of an image with sizeget_image_size_with_most_featuresandmax_num_batched_tokens(=max_num_encoder_input_tokens).vllm/vllm/v1/core/encoder_cache_manager.py
Lines 336 to 338 in 0808eb8
max_num_batched_tokens < num_mm_input_tokens, it is possible thatmax_num_encoder_input_tokens < num_mm_input_tokens, which results in scheduling failure.vllm/vllm/v1/core/encoder_cache_manager.py
Lines 139 to 141 in 0808eb8
This PR fixes
get_image_size_with_most_featuresof qwen2_vl and gemma3.Test Plan
Apply the following patch
Then run
and
Test Result
Without this PR's fix, vLLM will hang due to repeated scheduler failure. This PR fixes this problem.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.