
Conversation

@Isotr0py Isotr0py (Member) commented Nov 29, 2025

Purpose

Enable compressed-tensors AWQ quantization on Turing GPUs (compute capability 7.5) by lowering the minimum required compute capability from 8.0 to 7.5, so that AWQ checkpoints such as cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit can run on cards like the Tesla T4 (see the linked usage issue at the bottom of this page).
Test Plan

Tested with cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit on Tesla T4 GPUs (tensor parallel size 2):

python examples/offline_inference/vision_language.py -m qwen3_vl
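For reference, here is a minimal offline-inference sketch roughly equivalent to the command above. This is an illustrative snippet, not the example script itself: the script additionally exercises image inputs, and the parallelism/length settings below simply mirror the test log.

```python
# Minimal sketch: loading the AWQ compressed-tensors checkpoint on Turing GPUs
# with vLLM's offline LLM API. The quantization scheme is auto-detected from
# the checkpoint config; settings mirror the test log below.
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit",
    tensor_parallel_size=2,   # two Tesla T4s, as in the log
    dtype="float16",          # Turing (compute capability 7.5) lacks bfloat16
    max_model_len=4096,
    enforce_eager=True,
)

outputs = llm.generate(
    ["Describe a spring scene with cherry blossoms."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```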

Test Result

(EngineCore_DP0 pid=9615) INFO 11-29 16:05:38 [core.py:93] Initializing a V1 LLM engine (v0.11.2.dev393+g39e63dec7) with config: model='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', speculative_config=None, tokenizer='cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': None, 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'enable_fusion': False, 'enable_attn_fusion': False, 'enable_noop': False, 'enable_sequence_parallelism': False, 'enable_async_tp': False, 'enable_fi_allreduce_fusion': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>}, 'local_cache_dir': None}
(EngineCore_DP0 pid=9615) WARNING 11-29 16:05:38 [multiproc_executor.py:880] Reducing Torch parallelism from 2 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=9615) ERROR 11-29 16:05:38 [fa_utils.py:72] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(EngineCore_DP0 pid=9615) ERROR 11-29 16:05:38 [fa_utils.py:72] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1200] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:59121 backend=nccl
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1200] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:59121 backend=nccl
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=9615) WARNING 11-29 16:05:39 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.5 not supported, communicator is not available.
(EngineCore_DP0 pid=9615) WARNING 11-29 16:05:39 [symm_mem.py:67] SymmMemCommunicator: Device capability 7.5 not supported, communicator is not available.
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1408] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=9615) INFO 11-29 16:05:39 [parallel_state.py:1408] rank 1 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:05:46 [gpu_model_runner.py:3425] Starting to load model cpatonn/Qwen3-VL-32B-Instruct-AWQ-4bit...
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) WARNING 11-29 16:05:47 [compressed_tensors.py:717] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:05:47 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) WARNING 11-29 16:05:47 [compressed_tensors.py:717] Acceleration for non-quantized schemes is not supported by Compressed Tensors. Falling back to UnquantizedLinearMethod
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:05:47 [compressed_tensors_wNa16.py:108] Using ExllamaLinearKernel for CompressedTensorsWNA16
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:05:47 [cuda.py:411] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:05:47 [cuda.py:411] Using FLASHINFER attention backend out of potential backends: ['FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']
Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:10<00:42, 10.62s/it]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:23<00:34, 11.66s/it]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:35<00:23, 11.99s/it]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:48<00:12, 12.27s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:54<00:00, 10.04s/it]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:54<00:00, 10.84s/it]
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) 
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:42 [default_loader.py:308] Loading weights took 54.22 seconds
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:43 [gpu_model_runner.py:3507] Model loading took 11.0020 GiB memory and 56.037485 seconds
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:06:44 [gpu_model_runner.py:4264] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:44 [gpu_model_runner.py:4264] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:57 [gpu_worker.py:349] Available KV cache memory: 0.59 GiB
(EngineCore_DP0 pid=9615) INFO 11-29 16:06:58 [kv_cache_utils.py:1286] GPU KV cache size: 4,816 tokens
(EngineCore_DP0 pid=9615) INFO 11-29 16:06:58 [kv_cache_utils.py:1291] Maximum concurrency for 4,096 tokens per request: 1.18x
(EngineCore_DP0 pid=9615) (Worker_TP0 pid=9621) INFO 11-29 16:06:58 [kernel_warmup.py:65] Warming up FlashInfer attention.
(EngineCore_DP0 pid=9615) (Worker_TP1 pid=9623) INFO 11-29 16:06:58 [kernel_warmup.py:65] Warming up FlashInfer attention.
(EngineCore_DP0 pid=9615) INFO 11-29 16:06:58 [core.py:254] init engine (profile, create kv cache, warmup model) took 15.01 seconds
(EngineCore_DP0 pid=9615) WARNING 11-29 16:07:09 [vllm.py:596] Inductor compilation was disabled by user settings,Optimizations settings that are only active duringInductor compilation will be ignored.
(EngineCore_DP0 pid=9615) INFO 11-29 16:07:09 [vllm.py:695] Cudagraph is disabled under eager mode
INFO 11-29 16:07:09 [llm.py:346] Supported tasks: ['generate']
Adding requests: 100%|███████████████████████████████████████████████████████████████| 4/4 [00:08<00:00,  2.16s/it]
Processed prompts:   0%|                 | 0/4 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]INFO 11-29 16:07:43 [loggers.py:236] Engine 000: Avg prompt throughput: 116.1 tokens/s, Avg generation throughput: 4.8 tokens/s, Running: 4 reqs, Waiting: 0 reqs, GPU KV cache usage: 85.3%, Prefix cache hit rate: 0.0%, MM cache hit rate: 0.0%
Processed prompts: 100%|███████| 4/4 [00:27<00:00,  6.91s/it, est. speed input: 141.59 toks/s, output: 9.26 toks/s]
--------------------------------------------------
The image captures a beautiful spring scene featuring **cherry blossoms** in full bloom, with a prominent **tower** visible through the branches against a **clear blue sky**.

### Key Elements:

1. **Cherry Blossoms (Sakura):**
   - The foreground and midground are filled with delicate
--------------------------------------------------

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing the test command.
  • The test results, such as pasting a before/after comparison or e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@gemini-code-assist gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request enables compressed-tensors AWQ quantization on Turing GPUs by lowering the minimum required compute capability from 8.0 to 7.5. This is a valuable enhancement that extends hardware support. My review identified a minor but important maintainability issue where a code comment was not updated to reflect this change, which could cause confusion for future developers. I've provided a suggestion to correct it.
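To make the scope of the change concrete, here is a hypothetical sketch of the kind of capability gate being relaxed (names and structure are illustrative only, not vLLM's actual code):

```python
# Hypothetical illustration, not vLLM's actual code: the PR lowers the
# compute-capability floor for compressed-tensors AWQ (weight-only W4A16)
# kernels from Ampere (8.0) to Turing (7.5).
MIN_CAPABILITY = (7, 5)  # previously (8, 0)

def awq_scheme_supported(device_capability: tuple[int, int]) -> bool:
    """Return True if the AWQ weight-only kernel can run on this device."""
    return device_capability >= MIN_CAPABILITY

assert awq_scheme_supported((7, 5))      # Tesla T4 (Turing) now passes
assert awq_scheme_supported((8, 0))      # Ampere and newer still pass
assert not awq_scheme_supported((7, 0))  # Volta remains unsupported
```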

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Isotr0py Isotr0py enabled auto-merge (squash) November 29, 2025 17:29
@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 29, 2025
@Isotr0py Isotr0py merged commit e1464c3 into vllm-project:main Nov 30, 2025
53 checks passed
@Isotr0py Isotr0py deleted the turing-awq branch November 30, 2025 07:43
kitaekatt pushed a commit to kitaekatt/vllm that referenced this pull request Dec 1, 2025
amd-hhashemi pushed a commit to amd-hhashemi/vllm that referenced this pull request Dec 2, 2025
…ject#29732)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Hashem Hashemi <hashem.hashemi@amd.com>

Successfully merging this pull request may close these issues.

[Usage]: Workaround to run model on GPUs with Compute Capability < 8.0?
