Description
Requirements
Quantization Methods: Ensure compatibility with bitsandbytes to provide 8-bit and 4-bit quantization options within the existing model inference workflow (see the loading sketch after this list).
Documentation: Provide clear instructions on how to toggle quantization modes, list necessary dependencies, and specify supported/unsupported model architectures.
Examples & Benchmarks: Include integration examples and API usage code, plus a comparative analysis of model accuracy, inference speed, and memory usage before and after quantization (see the benchmark sketch after this list).
Apple Silicon Support (Optional): Include compatibility notes or specific configurations required for running quantized models on Apple Silicon (M-series) hardware.
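To make the requested toggle concrete, here is a minimal sketch of how 8-bit and 4-bit modes could be selected through the Hugging Face transformers bitsandbytes integration. It assumes the project loads models via transformers; the `load_model` helper, the `mode` flag, and the model id in the usage comment are illustrative, not part of this repository.

```python
# Minimal sketch: toggling 8-bit / 4-bit quantization via transformers' bitsandbytes support.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_model(model_id: str, mode: str = "none"):
    """Load `model_id` with optional bitsandbytes quantization ("8bit", "4bit", or "none")."""
    quant_config = None
    if mode == "8bit":
        quant_config = BitsAndBytesConfig(load_in_8bit=True)
    elif mode == "4bit":
        quant_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",             # NormalFloat4 generally preserves accuracy well
            bnb_4bit_compute_dtype=torch.bfloat16,  # compute dtype for dequantized matmuls
            bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
        )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # bitsandbytes currently requires a CUDA device; "auto" places layers
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return model, tokenizer

# Example usage (hypothetical model id):
# model, tok = load_model("meta-llama/Llama-2-7b-hf", mode="4bit")
```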
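For the benchmark requirement, a rough sketch of how memory footprint and generation throughput could be compared across modes is shown below. It assumes a CUDA device and reuses the hypothetical `load_model` helper above; no numbers here are measured results.

```python
# Rough sketch: compare peak memory and tokens/s before and after quantization.
import time
import torch

def benchmark(model, tokenizer, prompt: str = "Hello, world", max_new_tokens: int = 64):
    torch.cuda.reset_peak_memory_stats()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start

    generated = out.shape[-1] - inputs["input_ids"].shape[-1]
    return {
        "peak_mem_gb": torch.cuda.max_memory_allocated() / 1e9,
        "tokens_per_s": generated / elapsed,
    }

# for mode in ("none", "8bit", "4bit"):
#     model, tok = load_model("meta-llama/Llama-2-7b-hf", mode=mode)
#     print(mode, benchmark(model, tok))
```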
Motivation
Resource Efficiency: Lower VRAM/RAM consumption to allow the deployment of larger models on hardware with limited resources.
Inference Speed: Improve throughput so that deployed models respond faster in real-world applications.