Feature request
Integrate AWQ models with TGI. AWQ is a quantization method that achieves better speedups than GPTQ. It mainly quantizes the linear layers, replacing them with an optimized GEMM kernel. It is W4A16 quantization (4-bit weights, fp16 activations); a reference sketch of the arithmetic is included below.
Code: https://github.com/mit-han-lab/llm-awq
Paper: https://arxiv.org/pdf/2306.00978.pdf
cc @michaelfeil @Atry
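To make the W4A16 idea concrete, here is a minimal reference sketch of the arithmetic (these function names are illustrative, not from llm-awq or TGI): weights are stored as 4-bit integers with per-group fp16 scales and zero points, and the matmul dequantizes them back to fp16 at compute time. The real AWQ implementation also rescales salient channels before quantizing and fuses the dequantization into a custom GEMM kernel, which is where the speedup comes from; this sketch only shows the reference math.

```python
import torch

def quantize_w4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Quantize an fp16 weight matrix [out_features, in_features] to 4-bit
    codes with per-group scales and zero points (asymmetric, 16 levels)."""
    out_f, in_f = w.shape
    assert in_f % group_size == 0
    g = w.reshape(out_f, in_f // group_size, group_size)
    w_max = g.amax(dim=-1, keepdim=True)
    w_min = g.amin(dim=-1, keepdim=True)
    scales = (w_max - w_min).clamp(min=1e-5) / 15  # 4 bits -> 16 levels
    zeros = (-w_min / scales).round()
    q = torch.clamp((g / scales).round() + zeros, 0, 15).to(torch.uint8)
    return q, scales.half(), zeros.half()

def w4a16_linear(x: torch.Tensor, q, scales, zeros) -> torch.Tensor:
    """W4A16 matmul: dequantize 4-bit weights to fp16 on the fly, keep
    activations in fp16. A fused kernel would do this inside the GEMM."""
    w = ((q.float() - zeros.float()) * scales.float())
    w = w.reshape(q.shape[0], -1).half()
    return x @ w.t()
```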
Motivation
The main motivation is simply to speed up models further. I achieved 134 tokens/s with AWQ quantization on an MPT 7B model on an RTX 4090 + i9-13900K (LLaMA reaches 100+ tokens/s).
Your contribution
Currently, I am not able to contribute.