AutoRound is an advanced quantization library designed for Large Language Models (LLMs) and Vision-Language Models (VLMs). It delivers high accuracy at ultra-low bit widths (2–4 bits) with minimal tuning by leveraging sign-gradient descent and offering broad hardware compatibility. For more details, see our paper, and explore quantized models available on several Hugging Face Spaces, e.g. Intel, OPEA, Kaitchup and fbaldassarri. For usage instructions, please refer to the User Guide.
[2025/10] AutoRound has been integrated into SGLang. You can now run models in the AutoRound format directly with SGLang versions later than v0.5.4.
[2025/10] We enhanced the RTN mode (--iters 0) to significantly reduce quantization cost compared to the default tuning mode. Check out this doc for some accuracy results. If you don't have sufficient resources, you can use this mode for 4-bit quantization.
[2025/10] We proposed a fast algorithm to generate mixed bits/datatypes schemes in minutes. Please refer to the documentation for accuracy results and this guide for usage instructions.
[2025/09] AutoRound now includes experimental support for the mxfp4 and nvfp4 dtypes. For accuracy results, see the documentation. We currently recommend exporting to the LLM-Compressor format.
[2025/08] AutoRound now provides experimental support for an improved INT2 algorithm via --enable_alg_ext. See this documentation for some accuracy results.
[2025/07] AutoRound now offers experimental support for the GGUF format, and recommends using the optimized RTN mode (--iters 0) for all bit widths other than 3 bits. A more advanced algorithm tailored for specific configurations may be available in v0.8.1.
[2025/05] AutoRound has been integrated into vLLM. You can now run models in the AutoRound format directly with vLLM versions later than v0.8.5.post1.
[2025/04] AutoRound has been integrated into Transformers. You can run models in the AutoRound format directly with Transformers versions later than 4.51.3.
[2025/03] The INT2-mixed DeepSeek-R1 model (~200GB) retains 97.9% accuracy. Check out OPEA/DeepSeek-R1-int2-mixed-sym-inc.
✅ Superior Accuracy Delivers strong performance even at 2–3 bits (example models), with leading results at 4 bits (benchmark).
✅ Ecosystem Integration Seamlessly works with Transformers, vLLM, and more.
✅ Multiple Export Formats Supports AutoRound, AutoAWQ, AutoGPTQ, and GGUF for maximum compatibility. Details are shown in export formats.
✅ Affordable Quantization Cost Quantize 7B models in about 10 minutes on a single GPU. Details are shown in quantization costs.
✅ Fast Mixed Bits/Dtypes Scheme Generation Automatically generates a configuration in minutes, with about 2X–4X the model's BF16 VRAM size as overhead. Accuracy results and user guide.
✅ 10+ VLMs Support Out-of-the-box quantization for 10+ vision-language models (example models, support matrix).
✅ Layerwise Mixed Bits Quantization Assign different bits per layer for fine-grained accuracy/performance trade-offs. Details are shown in mixed bits quantization.
✅ Optimized Round-to-Nearest Mode Use --iters 0 for fast, calibration-free quantization with some accuracy drop at 4 bits. Details are shown in opt_rtn mode.
✅ Multiple Recipes Choose from auto-round-best, auto-round, and auto-round-light to suit your needs. Details are shown in quantization recipes.
✅ Advanced Utilities Includes multi-GPU quantization, multiple calibration datasets, and support for 10+ runtime backends.
✅ Beyond Weight-Only Quantization We are actively expanding support for additional datatypes such as MXFP, NVFP, W8A8, and more (a minimal export sketch follows this list).
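The experimental MXFP4/NVFP4 schemes and the recommended LLM-Compressor export can be combined roughly as in the sketch below (an illustration only, not a tuned recipe; the scheme name "MXFP4" and the "llm_compressor" format string come from the API examples later in this README, and the model is just a placeholder):
from auto_round import AutoRound

# Sketch: quantize with the experimental MXFP4 scheme (no real kernels yet)
# and export to the LLM-Compressor format, as recommended above.
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="MXFP4")
ar.quantize_and_save(output_dir="./qmodel_mxfp4", format="llm_compressor")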
# CPU/Intel GPU/CUDA
pip install auto-round
# HPU
pip install auto-round-lib
Build from Source
# CPU/Intel GPU/CUDA
pip install .
# HPU
python setup.py install lib
The full list of supported arguments is provided by calling auto-round -h on the terminal.
auto-round \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16" \
--format "auto_round" \
--output_dir ./tmp_autoround
We offer two additional recipes, auto-round-best and auto-round-light, designed for optimal accuracy and improved speed, respectively. Details are as follows.
Other Recipes
# Best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16" \
--low_gpu_mem_usage

# 2-3X speedup, slight accuracy drop at W4 and larger accuracy drop at W2
auto-round-light \
--model Qwen/Qwen3-0.6B \
--scheme "W4A16"
In conclusion, we recommend using auto-round for W4A16 and auto-round-best with enable_alg_ext for W2A16. However, you may adjust the
configuration to suit your specific requirements and available resources.
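For W2A16, a rough Python counterpart of the auto-round-best recipe with the algorithm extension enabled could look like the sketch below (an assumption-laden illustration: nsamples=512 and iters=1000 mirror the best-accuracy settings shown in the next section, and enable_alg_ext is the hyperparameter described under Important Hyperparameters):
from auto_round import AutoRound

# Sketch: W2A16 with the experimental algorithm extension enabled.
# low_gpu_mem_usage trades ~30% extra tuning time for ~20GB less VRAM.
ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W2A16",
               nsamples=512, iters=1000,
               enable_alg_ext=True, low_gpu_mem_usage=True)
ar.quantize_and_save(output_dir="./qmodel_w2a16", format="auto_round")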
from auto_round import AutoRound
# Load a model (supports FP8/BF16/FP16/FP32)
model_name_or_path = "Qwen/Qwen3-0.6B"
# Available schemes: "W2A16", "W3A16", "W4A16", "W8A16", "NVFP4", "MXFP4" (no real kernels), "GGUF:Q4_K_M", etc.
ar = AutoRound(model_name_or_path, scheme="W4A16")
# Highest accuracy (4–5x slower).
# `low_gpu_mem_usage=True` saves ~20GB VRAM but runs ~30% slower.
# ar = AutoRound(model_name_or_path, nsamples=512, iters=1000, low_gpu_mem_usage=True)
# Faster quantization (2–3x speedup) with slight accuracy drop at W4G128.
# ar = AutoRound(model_name_or_path, nsamples=128, iters=50, lr=5e-3)
# Supported formats: "auto_round" (default), "auto_gptq", "auto_awq", "llm_compressor", "gguf:q4_k_m", etc.
ar.quantize_and_save(output_dir="./qmodel", format="auto_round")
Please refer to the user guide for more details on AutoScheme.
from auto_round import AutoRound, AutoScheme
model_name = "Qwen/Qwen3-8B"
avg_bits = 3.0
scheme = AutoScheme(avg_bits=avg_bits, options=("GGUF:Q2_K_S", "GGUF:Q4_K_S"), ignore_scale_zp_bits=True)
layer_config = {"lm_head": "GGUF:Q6_K"}
# Change iters to 200 for non-GGUF schemes
ar = AutoRound(model=model_name, scheme=scheme, layer_config=layer_config, iters=0)
ar.quantize_and_save()
Important Hyperparameters
- scheme (str|dict|AutoScheme): The predefined quantization scheme, e.g. W4A16, MXFP4, NVFP4, GGUF:Q4_K_M.
- bits (int): Number of bits for quantization (default is None). If not None, it overrides the scheme setting.
- group_size (int): Size of the quantization group (default is None). If not None, it overrides the scheme setting.
- sym (bool): Whether to use symmetric quantization (default is None). If not None, it overrides the scheme setting.
- layer_config (dict): Configuration for weight quantization (default is None), mainly for mixed schemes.
- enable_alg_ext (bool): Enable algorithm variants for specific schemes (e.g., MXFP4/W2A16) that can bring notable improvements. Default is False.
- disable_opt_rtn (bool): Use pure RTN mode for specific schemes (e.g., GGUF and WOQ). Default is False (improved RTN enabled).
- iters (int): Number of tuning iterations (default is 200). Common values: 0 (RTN mode), 50 (with lr=5e-3 recommended), 1000. Higher values increase accuracy but slow down tuning.
- lr (float): Learning rate for the rounding values (default is None). When None, it is set to 1.0/iters automatically.
- batch_size (int): Batch size for training (default is 8). 4 is also commonly used.
- dataset (str|list|tuple|torch.utils.data.DataLoader): The dataset for tuning (default is "NeelNanda/pile-10k"). Supports local JSON files and dataset combinations, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".
- nsamples (int): Number of samples for tuning (default is 128).
- seqlen (int): Sequence length of the tuning data (default is 2048).
- enable_torch_compile (bool): Typically recommended to set to True for faster quantization with lower resource usage, provided no exception is raised.
- low_gpu_mem_usage (bool): Whether to offload intermediate features to CPU at the cost of ~20% more tuning time (default is False).
- device_map (str|dict|int): The device to be used for tuning, e.g. "cpu", "cuda", "0,1,2" (default is "0").
If you encounter issues during quantization, try setting iters=0 (to fall back to RTN mode) and group_size=32 for better results, as in the sketch below.
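The sketch below pulls several of the hyperparameters above into one call; the dataset combination, device list, and the commented RTN fallback are illustrative values, not recommendations:
from auto_round import AutoRound

# Sketch: combine a local JSON file with a Hugging Face dataset for calibration,
# tune on two GPUs, and enable torch.compile for faster quantization.
ar = AutoRound(
    "Qwen/Qwen3-0.6B",
    scheme="W4A16",
    dataset="./tmp.json,NeelNanda/pile-10k:train",
    nsamples=128,
    seqlen=2048,
    batch_size=8,
    device_map="0,1",
    enable_torch_compile=True,
)

# Fallback if tuning misbehaves: RTN mode with a smaller group size.
# ar = AutoRound("Qwen/Qwen3-0.6B", scheme="W4A16", iters=0, group_size=32)

ar.quantize_and_save(output_dir="./qmodel", format="auto_round")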
This feature is experimental and may be subject to change.
By default, AutoRoundMLLM only quantizes the text module of VLMs and uses NeelNanda/pile-10k for calibration. To quantize the entire model, you can enable quant_nontext_module by setting it to True, though support for this feature is limited. For more information, please refer to the AutoRoundMLLM readme.
from auto_round import AutoRoundMLLM
# Load the model
model_name_or_path = "Qwen/Qwen2.5-VL-7B-Instruct"
# Quantize the model
ar = AutoRoundMLLM(model_name_or_path, scheme="W4A16")
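# Assumption based on the note above: to also quantize the non-text modules
# (limited support), pass quant_nontext_module=True, e.g.
# ar = AutoRoundMLLM(model_name_or_path, scheme="W4A16", quant_nontext_module=True)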
output_dir = "./qmodel"
ar.quantize_and_save(output_dir)
from vllm import LLM, SamplingParams
prompts = [
"Hello, my name is",
]
sampling_params = SamplingParams(temperature=0.6, top_p=0.95)
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
llm = LLM(model=model_name)
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
Please note that support for MoE models and vision-language models is currently limited.
import sglang as sgl
llm = sgl.Engine(model_path="Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound")
prompts = [
"Hello, my name is",
]
sampling_params = {"temperature": 0.6, "top_p": 0.95}
outputs = llm.generate(prompts, sampling_params)
for prompt, output in zip(prompts, outputs):
print(f"Prompt: {prompt}\nGenerated text: {output['text']}")AutoRound support 10+ backends and automatically selects the best available backend based on the installed libraries and prompts the user to install additional libraries when a better backend is found.
Please avoid manually moving the quantized model to a different device (e.g., model.to('cpu')) during inference, as this may cause unexpected exceptions.
Support for the Gaudi device is limited.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound"
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
Special thanks to open-source low-precision libraries such as AutoGPTQ, AutoAWQ, GPTQModel, Triton, Marlin, and ExLLaMAV2 for providing low-precision CUDA kernels, which are leveraged in AutoRound.
If you find AutoRound helpful, please β star the repo and share it with your community!

