Skip to content

z-lab/paroquant

Repository files navigation

ParoQuant

Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Paper Blog Models PyPI

State-of-the-art INT4 quantization for LLMs. ParoQuant uses learned pairwise rotations to suppress weight outliers, closing the accuracy gap with FP16 while running at near-AWQ speed. Supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX).

Quick Start

Installation

# NVIDIA GPU
pip install "paroquant[vllm]"

# Apple Silicon
pip install "paroquant[mlx]"

Pick a model from our Hugging Face collection:

export MODEL=z-lab/Qwen3.5-4B-PARO

Interactive Chat

python -m paroquant.cli.chat --model $MODEL

OpenAI-Compatible API Server

python -m paroquant.cli.serve --model $MODEL --port 8000

Agent with Tool Calling

Start the API server first, then install the agent dependencies and run:

pip install "paroquant[agent]"
python -m paroquant.cli.agent --model $MODEL

Tool use (web fetch, filesystem, time) requires Node.js.

Docker (NVIDIA GPU)

# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  ghcr.io/z-lab/paroquant:chat --model $MODEL

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  ghcr.io/z-lab/paroquant:serve --model $MODEL

Models

All models are available on Hugging Face. Swap the model name in the commands above to try any of them.

Qwen3.5

Model Checkpoint
Qwen3.5-0.8B z-lab/Qwen3.5-0.8B-PARO
Qwen3.5-2B z-lab/Qwen3.5-2B-PARO
Qwen3.5-4B z-lab/Qwen3.5-4B-PARO
Qwen3.5-9B z-lab/Qwen3.5-9B-PARO

Qwen3

Model Checkpoint
Qwen3-0.6B z-lab/Qwen3-0.6B-PARO
Qwen3-1.7B z-lab/Qwen3-1.7B-PARO
Qwen3-4B z-lab/Qwen3-4B-PARO
Qwen3-8B z-lab/Qwen3-8B-PARO
Qwen3-14B z-lab/Qwen3-14B-PARO

Llama

Model Checkpoint
Llama-2-7B z-lab/Llama-2-7b-hf-PARO
Llama-3-8B z-lab/Meta-Llama-3-8B-PARO
Llama-3.1-8B-Instruct z-lab/Llama-3.1-8B-Instruct-PARO

Want a model that's not listed? Open an issue and let us know.

Reproduction

Note

The main branch of this repository is under active development, and reproducibility is not guaranteed. Please use the legacy branch to reproduce results from the paper.

Quantize Your Own Model

git clone https://github.com/z-lab/paroquant && cd paroquant
pip install -e ".[optim,eval]"

# 1. Optimize rotation parameters
experiments/optimize/4bit.sh Qwen/Qwen3-8B

# 2. Export to HF checkpoint (--mode real for INT4, --mode pseudo for FP16)
python -m paroquant.cli.convert \
  --model Qwen/Qwen3-8B \
  --result-dir output/Qwen3-8B \
  --output-path models/Qwen3-8B-PARO

Docker Images

Image Purpose
ghcr.io/z-lab/paroquant:chat Interactive chat
ghcr.io/z-lab/paroquant:chat-cu129 Interactive chat (CUDA 12.9)
ghcr.io/z-lab/paroquant:serve OpenAI-compatible API server
ghcr.io/z-lab/paroquant:latest Optimization & evaluation
ghcr.io/z-lab/paroquant:eval Reasoning task evaluation

Citation

@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}