
Conversation

@Isotr0py Isotr0py (Member) commented Nov 29, 2025

Purpose

Enable compressed-tensors AWQ quantization on Turing GPUs (compute capability 7.5) by lowering the minimum required compute capability from 8.0 to 7.5, so that AWQ checkpoints such as cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit can run on cards like the Tesla T4 (see the linked usage issue at the bottom of this page).
Test Plan

Tested with cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit on Tesla T4 GPUs (tensor parallel size 2):

python examples/offline_inference/vision_language.py -m qwen3_vl
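For reference, here is a minimal offline-inference sketch roughly equivalent to the command above. This is an illustrative snippet, not the example script itself: the script additionally exercises image inputs, and the parallelism/length settings below simply mirror the test log.

```python
# Minimal sketch: loading the AWQ compressed-tensors checkpoint on Turing GPUs
# with vLLM's offline LLM API. The quantization scheme is auto-detected from
# the checkpoint config; settings mirror the test log below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit",
    tensor_parallel_size=2,   # two Tesla T4s, as in the log
    dtype="float16",          # Turing (compute capability 7.5) lacks bfloat16
    max_model_len=4096,
    enforce_eager=True,
)

outputs = llm.generate(
    ["Describe a spring scene with cherry blossoms."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```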

Test Result

(EngineCore_DP0 pid=9615) INFO 11-29 16:05:38 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev393+g39e63dec7) with config: model='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', speculative_config=None, tokenizer='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'enable_fusion': False, 'enable_attn_fusion': False, 'enable_noop': False, 'enable_sequence_parallelism': False, 'enable_async_tp': False, 'enable_fi_allreduce_fusion': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=9615) WARNING 11-29 16:05:38 [multiproc_executor.py:880] Reducing Torch parallelism from 2 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=9615) ERROR 11-29 16:05:38 [fa_utils.py:72] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(EngineCore_DP0 pid=9615) ERROR 11-29 16:05:38 [fa_utils.py:72] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1200] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:59121 backend=nccl
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1200] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:59121 backend=nccl
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=9615) WARNING 11-29 16:05:39 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.5 not supported, communicator is not available.
(EngineCore_DP0 pid=9615) WARNING 11-29 16:05:39 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.5 not supported, communicator is not available.
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1408] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1408] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:05:46 [gpu_model_runner.py:3425] Starting to load model cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit...
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) WARNING 11-29 16:05:47 [compressed_tensors.py:717] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:05:47 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) WARNING 11-29 16:05:47 [compressed_tensors.py:717] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:05:47 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:05:47 [cuda.py:411] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:05:47 [cuda.py:411] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:10<00:42, 10.62s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:23<00:34, 11.66s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:35<00:23, 11.99s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:48<00:12, 12.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:54<00:00, 10.04s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:54<00:00, 10.84s/it]
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) 
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:42 [default_loader.py:308] Loading weights took 54.22 seconds
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:43 [gpu_model_runner.py:3507] Model loading took 11.0020 GiB memory and 56.037485 seconds
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:06:44 [gpu_model_runner.py:4264] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:44 [gpu_model_runner.py:4264] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:57 [gpu_worker.py:349] Available KV cache memory: 0.59 GiB
(EngineCore_DP0 pid=9615) INFO 11-29 16:06:58 [kv_cache_utils.py:1286] GPU KV cache size: 4,816 tokens
(EngineCore_DP0 pid=9615) INFO 11-29 16:06:58 [kv_cache_utils.py:1291] Maximum concurrency for 4,096 tokens per request: 1.18x
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:58 [kernel_warmup.py:65] Warming up FlashInfer attention.
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:06:58 [kernel_warmup.py:65] Warming up FlashInfer attention.
(EngineCore_DP0 pid=9615) INFO 11-29 16:06:58 [core.py:254] init engine (profile, create kv cache, warmup model) took 15.01 seconds
(EngineCore_DP0 pid=9615) WARNING 11-29 16:07:09 [vllm.py:596] Inductor compilation was disabled by user settings,Optimizations settings that are only active duringInductor compilation will be ignored.
(EngineCore_DP0 pid=9615) INFO 11-29 16:07:09 [vllm.py:695] Cudagraph is disabled under eager mode
INFO 11-29 16:07:09 [llm.py:346] Supported tasks: ['generate']
Adding requests: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.16s/it]
Processed prompts:   0%|                 | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 11-29 16:07:43 [loggers.py:236] Engine 000: Avg prompt throughput: 116.1 tokens/s, Avg generation throughput: 4.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 85.3%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
Processed prompts: 100%|███████| 4/4 [00:27<00:00,  6.91s/it, est. speed input: 141.59 toks/s, output: 9.26 toks/s]
--------------------------------------------------
The image captures a beautiful spring scene featuring **cherry blossoms** in full bloom, with a prominent **tower** visible through the branches against a **clear blue sky**.

### Key Elements:

1. **Cherry Blossoms (Sakura):**
   - The foreground and midground are filled with delicate
--------------------------------------------------

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables compressed-tensors AWQ quantization on Turing GPUs by lowering the minimum required compute capability from 8.0 to 7.5. This is a valuable enhancement that extends hardware support. My review identified a minor but important maintainability issue where a code comment was not updated to reflect this change, which could cause confusion for future developers. I've provided a suggestion to correct it.
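To make the scope of the change concrete, here is a hypothetical sketch of the kind of capability gate being relaxed (names and structure are illustrative only, not vLLM's actual code):

```python
# Hypothetical illustration, not vLLM's actual code: the PR lowers the
# compute-capability floor for compressed-tensors AWQ (weight-only W4A16)
# kernels from Ampere (8.0) to Turing (7.5).
MIN_CAPABILITY = (7, 5)  # previously (8, 0)

def awq_scheme_supported(device_capability: tuple[int, int]) -> bool:
    """Return True if the AWQ weight-only kernel can run on this device."""
    return device_capability >= MIN_CAPABILITY

assert awq_scheme_supported((7, 5))      # Tesla T4 (Turing) now passes
assert awq_scheme_supported((8, 0))      # Ampere and newer still pass
assert not awq_scheme_supported((7, 0))  # Volta remains unsupported
```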

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py enabled auto-merge (squash) November 29, 2025 17:29
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 29, 2025
@Isotr0py Isotr0py merged commit e1464c3 into vllm-project:main Nov 30, 2025
53 checks passed
@Isotr0py Isotr0py deleted the turing-awq branch November 30, 2025 07:43
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
…ject#29732)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

Successfully merging this pull request may close these issues.

[Usage]: Workaround to run model on GPUs with Compute Capability < 8.0?
