
FBQuant: FeedBack Quantization for Large Language Models

Accepted to IJCAI 2025. Paper: arXiv:2501.16385.


Authors

Yijiang Liu, Hengyu Fang, Liulu He, Rongyu Zhang, Yichuan Bai, Yuan Du, Li Du

All authors are from Nanjing University.


Overview

Deploying Large Language Models (LLMs) on edge devices is increasingly important, as it eliminates reliance on network connections, reduces expensive API calls, and enhances user privacy. However, on-device deployment is challenging due to the limited computational resources of edge devices. In particular, the key bottleneck stems from memory-bandwidth constraints on weight loading. Weight-only quantization effectively reduces memory access, yet often induces significant accuracy degradation. Recent efforts to incorporate sub-branches have shown promise for mitigating quantization errors, but these methods either lack robust optimization strategies or rely on suboptimal objectives. To address these gaps, we propose FeedBack Quantization (FBQuant), a novel approach inspired by negative feedback mechanisms in automatic control. FBQuant inherently ensures that the reconstructed weights remain bounded by the quantization process, thereby reducing the risk of overfitting. To offset the additional latency introduced by the sub-branches, we further develop an efficient CUDA kernel that reduces the extra inference time by 60%. Comprehensive experiments demonstrate the efficiency and effectiveness of FBQuant across various LLMs. Notably, for 3-bit Llama2-7B, FBQuant improves zero-shot accuracy by 1.2%.
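
For readers unfamiliar with the sub-branch idea mentioned above, the following is a minimal PyTorch sketch of a linear layer that pairs a quantized weight with a small learnable correction branch. It illustrates the general structure only, not FBQuant's feedback rule: the round-to-nearest quantizer, the low-rank correction, and names such as QuantLinearWithSubBranch, n_bits, and rank are assumptions introduced here for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_weight(w: torch.Tensor, n_bits: int = 3) -> torch.Tensor:
    # Per-output-channel asymmetric round-to-nearest quantization, returned in
    # dequantized (floating-point) form. Placeholder for the quantizer in the paper.
    qmax = 2 ** n_bits - 1
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    scale = (w_max - w_min).clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round((w - w_min) / scale), 0, qmax)
    return q * scale + w_min

class QuantLinearWithSubBranch(nn.Module):
    # A frozen quantized weight plus a learnable low-rank sub-branch that can be
    # trained to compensate the quantization error (generic structure, not FBQuant itself).
    def __init__(self, linear: nn.Linear, n_bits: int = 3, rank: int = 16):
        super().__init__()
        w = linear.weight.data
        self.register_buffer("w_q", quantize_weight(w, n_bits))
        self.bias = linear.bias
        self.branch_down = nn.Parameter(torch.zeros(rank, w.shape[1]))  # rank x in_features
        self.branch_up = nn.Parameter(torch.zeros(w.shape[0], rank))    # out_features x rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        main = F.linear(x, self.w_q, self.bias)                                  # quantized main path
        correction = F.linear(F.linear(x, self.branch_down), self.branch_up)     # sub-branch
        return main + correction

At inference, the main path reads low-bit weights while the sub-branch adds a small amount of extra compute; the CUDA kernel mentioned in the overview and checklist is intended to reduce that extra time.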


Checklist

  • Clean the codebase: Ensure the code is organized, readable, and optimized.
  • Write detailed instructions: Provide a clear guide for running the program.
  • Open-source the CUDA kernel: Release the efficient CUDA kernel developed to reduce inference latency.

Acknowledgment

We extend our gratitude to the following works that inspired and supported this research:

  • GPTQ
  • AWQ
  • OmniQuant

Citation

@article{liu2025fbquant,
  title={FBQuant: FeedBack Quantization for Large Language Models},
  author={Liu, Yijiang and Fang, Hengyu and He, Liulu and Zhang, Rongyu and Bai, Yichuan and Du, Yuan and Du, Li},
  journal={arXiv preprint arXiv:2501.16385},
  year={2025}
}
