8B Bitwise Autoregressive Generation on Edge GPUs
This repository is a specialized fork of the DeepCompressor framework, tailored specifically to democratize high-fidelity Visual Autoregressive (VAR) models for edge deployment.
- What this fork adds: We introduce a comprehensive W4A4 and INT8 KV-cache quantization pipeline specifically designed for the Infinity family of generative models (2B and 8B). It mitigates extreme activation outliers using SVDQuant and compresses the monotonically growing KV-cache via Asymmetric Per-Channel INT8 Quantization. This allows the 8B model to run natively on 16GB edge silicon.
- Paper: For full methodological details, evaluation metrics, and edge hardware deployment strategies on NVIDIA Jetson architectures, please refer to our paper: Enabling 8B Bitwise Autoregressive Image Generation on Edge GPUs (Available Soon).
This work builds upon exceptional foundational research. For additional insights, upstream features, and the original codebases, please refer to the following projects:
- SVDQuant & DeepCompressor: The foundational quantization engine used in this fork. SVDQuant absorbs activation outliers by shifting them from the activations into the weights, then offloads them to a high-precision low-rank branch obtained via Singular Value Decomposition (SVD); see the sketch after this list. For the original implementation, additional diffusion model support, and LLM quantization (QServe), visit the MIT HAN Lab DeepCompressor repository and read the SVDQuant paper.
- Infinity VAR: The target architecture of this fork. Infinity is a Bitwise Visual AutoRegressive Modeling framework capable of generating high-resolution, photorealistic images by predicting bitwise tokens across scales. It refactors visual generation with an infinite-vocabulary classifier and bitwise self-correction. For core model insights, visit the Infinity Project Page & GitHub and read the Infinity paper.
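To make the decomposition concrete, here is a minimal PyTorch sketch of the SVDQuant weight split (illustrative only, not the DeepCompressor implementation; `svdquant_decompose` is a hypothetical helper, and the real pipeline first folds the activation-smoothing scales into the weight):

```python
import torch

def svdquant_decompose(weight: torch.Tensor, rank: int = 32):
    """Split a (smoothed) weight into a 16-bit low-rank branch plus a
    residual that is friendlier to 4-bit quantization.

    Minimal sketch of the SVDQuant idea; the real pipeline first folds
    activation-smoothing scales into `weight` before the SVD.
    """
    U, S, Vh = torch.linalg.svd(weight.float(), full_matrices=False)
    L1 = U[:, :rank] * S[:rank]            # (out_features, rank)
    L2 = Vh[:rank, :]                      # (rank, in_features)
    residual = weight.float() - L1 @ L2    # this part is quantized to INT4
    return L1.to(torch.bfloat16), L2.to(torch.bfloat16), residual

# Inference then computes, conceptually:
#   y = quantized_linear(x, residual) + (x @ L2.T) @ L1.T
```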
- Clone this repository and navigate to the folder:
```bash
git clone https://github.com/Henvezz95/deepcompressor.git
cd deepcompressor
```

- Install dependencies:
```bash
pip install -e .
cd Infinity_rep
pip install -r requirements.txt
```

The quantization strategies implemented in this repository are driven by a deep structural analysis of the Infinity VAR architecture. Unlike standard LLMs, Visual Autoregressive models exhibit unique activation patterns that necessitate specialized treatment.
To reproduce our diagnostic findings, use the profiling suite:
```bash
python -m evaluation.activations_measurements configs/models/infinity-8b.yaml configs/collect/qdiff.yaml
```

Through our profiling, we identified extreme activation outliers in the FFN down-projections, with kurtosis values significantly exceeding those of Gaussian distributions.
- Max-to-Median Ratio: Reaches up to 353x in the 2B model.
- Implication: Standard Min-Max quantization would lead to massive precision loss; this justifies our use of SVDQuant to decouple these outliers into a high-precision low-rank branch.
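These statistics can be reproduced from captured activations in a few lines of PyTorch. This is a hedged sketch, not the repository's profiling code (`outlier_stats` is a hypothetical helper):

```python
import torch

def outlier_stats(acts: torch.Tensor):
    """Outlier diagnostics for activations captured at an FFN down-projection.

    Returns the excess kurtosis (0 for a Gaussian) and the
    max-to-median ratio of element magnitudes.
    """
    x = acts.float().flatten()
    z = (x - x.mean()) / x.std()
    excess_kurtosis = (z ** 4).mean() - 3.0   # heavy tails => large positive value
    mags = x.abs()
    max_to_median = mags.max() / mags.median()
    return excess_kurtosis.item(), max_to_median.item()
```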
Analysis of the monotonically growing KV-cache reveals that variance is not uniform across dimensions.
| Metric | Measured Value (8B) | Technical Requirement |
|---|---|---|
| Channel-to-Token Variance Ratio | > 1.2 | Per-Channel Scaling: variance is driven by specific channels rather than tokens. |
| Skewness | ~0.85 (Key Cache) | Asymmetric Mapping: distributions in some channels are highly skewed, requiring non-centered zero-points. |
By running the diagnostic script, users can verify that these structural characteristics are consistent across both 2B and 8B variants, validating the selection of Asymmetric Per-Channel INT8 for the cache pipeline.
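These diagnostics can be approximated with a short PyTorch routine. Note that `kv_cache_diagnostics` and the range-dispersion ratio below are illustrative proxies for the reported metrics, not the repository's exact measurement code:

```python
import torch

def kv_cache_diagnostics(cache: torch.Tensor):
    """Structural diagnostics for a captured KV-cache tensor.

    cache: (tokens, channels) Keys or Values accumulated during generation.
    """
    # Dispersion (coefficient of variation) of per-channel vs. per-token
    # dynamic ranges: a ratio > 1 suggests scales differ more across channels
    # than across tokens, favoring per-channel quantization scales.
    ch_range = cache.amax(dim=0) - cache.amin(dim=0)
    tok_range = cache.amax(dim=1) - cache.amin(dim=1)
    ratio = (ch_range.std() / ch_range.mean()) / (tok_range.std() / tok_range.mean())

    # Per-channel skewness: non-zero values motivate asymmetric zero-points.
    z = (cache - cache.mean(dim=0)) / cache.std(dim=0)
    skew = (z ** 3).mean(dim=0)
    return ratio.item(), skew
```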
The following command executes the baseline quantization pipeline for the Infinity 8B model, utilizing the specific calibration settings defined in qdiff.yaml and running the complete INT4 SVDQuant pipeline (incorporating both activation smoothing and low-rank weight branches):
```bash
python -m deepcompressor.app.diffusion.ptq_infinity configs/models/infinity-8b.yaml configs/collect/qdiff.yaml configs/svdquant/int4.yaml
```

Configuration Override Hierarchy: Positional arguments dictate the override order (files passed later in the command override overlapping keys in earlier files). Ensure all files remain within the relative paths of your working directory. For rapid debugging, you can append additional override configurations (e.g., reducing calibration steps via fast.yaml) at the end of the execution chain:
```bash
python -m deepcompressor.app.diffusion.ptq_infinity configs/models/infinity-8b.yaml configs/collect/qdiff.yaml configs/svdquant/int4.yaml configs/svdquant/fast.yaml
```

(Note: End-to-end evaluation and image generation using the quantized models are handled via separate evaluation scripts, not during this initial PTQ pass.)
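Conceptually, the override hierarchy above behaves like a recursive dictionary merge applied to the YAML files in command-line order. The sketch below is an assumption about the merge semantics, not the framework's actual loader (`load_yaml` is a hypothetical helper):

```python
from functools import reduce

def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; later values win."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

# configs = [load_yaml(p) for p in cli_paths]  # hypothetical loader, CLI order
# final = reduce(deep_merge, configs)          # fast.yaml keys override int4.yaml keys
```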
The repository provides several example configurations to demonstrate different quantization strategies for the Infinity models.
- Base Model Configurations:
  - `configs/models/infinity-8b.yaml`: Defines the pipeline architecture, precision (W4A4 + SVDQuant LoRA), and paths for the 8B model.
  - `configs/models/infinity-2b.yaml`: Defines the pipeline architecture, precision (W4A4 + SVDQuant LoRA), and paths for the 2B model.
- Other Quantization Strategies (Ablation Studies):
  - `configs/models/infinity-2b-smoothquant.yaml`: Enables activation smoothing to mitigate outliers without utilizing the low-rank branch for weights.
  - `configs/models/infinity-2b-naive.yaml`: Performs standard block-wise quantization (e.g., 64-group) on the weights. This is useful as a baseline but may cause degradation, especially in the 2B model.
To generate the optimal Asymmetric Per-Channel INT8 quantization scales for the KV-cache, execute the calibrate_cache_quantization module. Unlike standard LLM cache quantization, our analysis of VAR models indicates that variance is predominantly channel-driven across both Keys and Values.
The script employs a Golden-Section Search to optimize clipping bounds per channel, minimizing the reconstruction Mean Squared Error (MSE). This deterministic strategy accommodates highly skewed distributions (peaking at 11.56 in the 2B Key Cache) without the control-flow overhead of dynamic token pruning.
It requires the base model configuration and the calibration collection parameters:
```bash
python -m deepcompressor.app.diffusion.calibrate_cache_quantization configs/models/infinity-8b.yaml configs/collect/qdiff.yaml
```

Key Implementation Details:
- Asymmetric Mapping: Uses affine quantization to align scaling factors with the axes of highest variance and shift zero-points to accommodate skewed dynamic ranges.
- Optimization: Scans a logarithmic grid of percentiles before refining the optimal clipping bounds using a coarse-to-fine search.
- Output: Generates the `scale` and `zero_point` parameters saved to `kv_scales/kv_quant_calib.pt`, which are required to run the full W4A4+KV8 inference pipeline.
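For illustration, here is a minimal sketch of the asymmetric affine objective and a golden-section refinement of a single channel's clipping bound (assumptions: a 1-D channel tensor with positive maximum and a simplified search interval; the real module additionally scans a logarithmic percentile grid first and runs per channel):

```python
import torch

GOLDEN = (5 ** 0.5 - 1) / 2  # ~0.618

def quant_mse(x: torch.Tensor, lo: float, hi: float) -> float:
    """MSE after asymmetric (affine) INT8 quantize-dequantize with clip range [lo, hi]."""
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale)
    q = torch.clamp(torch.round(x / scale) + zero_point, 0, 255)
    return torch.mean((x - (q - zero_point) * scale) ** 2).item()

def golden_section_clip(x: torch.Tensor, iters: int = 30):
    """Refine one channel's upper clipping bound by golden-section search on the MSE."""
    lo = x.min().item()
    a, b = 0.5 * x.max().item(), x.max().item()  # simplified candidate interval
    for _ in range(iters):
        c = b - GOLDEN * (b - a)
        d = a + GOLDEN * (b - a)
        if quant_mse(x, lo, c) < quant_mse(x, lo, d):
            b = d  # minimum lies in the lower sub-interval
        else:
            a = c  # minimum lies in the upper sub-interval
    hi = 0.5 * (a + b)
    scale = (hi - lo) / 255.0
    return scale, round(-lo / scale)  # per-channel scale, zero_point
```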
To assess the generative fidelity (FID, ImageReward, CLIP-IQA) before deploying to edge hardware, the benchmark_assembled_model.py script provides a bit-accurate simulation of the quantization noise. By using fake-quantization, the framework applies low-bit logic (e.g., INT4 or INT8) to the model weights and activations while performing the underlying computation in bfloat16.
This allows for granular ablation studies—independently toggling Weight, Activation, and KV-cache quantization to identify the precise impact on aesthetic quality.
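In essence, fake-quantization rounds a tensor onto the low-bit grid and immediately dequantizes it, so the rounding noise is bit-accurate while the surrounding matmuls stay in bfloat16. A generic sketch (not the script's exact code path):

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Symmetric quantize-dequantize: injects low-bit rounding noise,
    but returns a bfloat16 tensor so compute kernels stay high-precision."""
    qmax = 2 ** (bits - 1) - 1                     # e.g., 7 for INT4
    scale = x.abs().amax().clamp(min=1e-8) / qmax  # per-tensor scale for brevity
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return (q * scale).to(torch.bfloat16)
```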
The following command evaluates an Infinity 8B model on the MJHQ and DCI benchmarks. It simulates a complete W4A4 + KV8 pipeline by fusing the SVD low-rank branches and activation scales generated during the PTQ phase:
```bash
python -m evaluation.benchmark_assembled_model \
    configs/models/infinity-8b.yaml \
    configs/svdquant/int4.yaml \
    --ref-root ./evaluation_output/infinity_fp16_8b \
    --gen-root ./evaluation_output/infinity_w4a4_kv8_8b \
    --base-path ./runs/diffusion/int4_rank32_8b/ \
    --enable_weight_quant true \
    --enable_activation_quant true \
    --enable_kv_quant true \
    --eval-benchmarks MJHQ DCI \
    --eval-num-samples 5000 \
    --eval-gt-metrics clip_iqa clip_score fid image_reward psnr ssim lpips
```

| Argument | Type | Description |
|---|---|---|
| `--base-path` | str | Directory containing the PTQ artifacts: `model.pt`, `smooth.pt`, and `branch.pt`. |
| `--enable_weight_quant` | bool | Enables fake-quantization for transformer weights. |
| `--enable_activation_quant` | bool | Enables fake-quantization for linear layer input activations. |
| `--enable_kv_quant` | bool | Enables Asymmetric Per-Channel INT8 simulation for the KV-cache using optimized scales. |
| `--gen-root` | str | Destination for generated images and the final `results.json`. |
| `--ref-root` | str | Path to the ground-truth reference dataset for metrics that require a reference (e.g., SSIM, PSNR, LPIPS). |
Note on Artifacts: The script automatically looks for cache scales in `runs/kv_scales/kv_quant_calib.pt`. Ensure you have run the `calibrate_cache_quantization` script before enabling the `--enable_kv_quant` flag.
To measure the actual memory savings and inference speed on edge hardware (e.g., NVIDIA Jetson), use the infinity_w4a4_test.py script. Unlike the quality evaluation script, this routine swaps standard layers for real SVDQuantLinear modules and executes optimized low-bit kernels.
This script requires specialized hardware-accelerated kernels for 4-bit weight and 4-bit activation computation. You must install the following dependency:
- Nunchaku (Specialized Fork): Henvezz95/nunchaku-fork
The script performs the following hardware validation steps:
- Model Transformation: Swaps standard `nn.Linear` layers for `SVDQuantLinear` (defaulting to Rank-32) while excluding sensitive layers like the transformer head and embeddings.
- Artifact Injection: Loads the real quantized weights (`model.pt`) and the high-precision SVD branches (`branch.pt`) directly into the specialized modules.
- Cache Activation: Integrates the calibrated INT8 KV-cache parameters via the `attach_kv_qparams` utility.
- Footprint Profiling: Executes multiple generation loops and reports the absolute peak GPU memory usage using `torch.cuda.max_memory_allocated()` (see the sketch below).
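The footprint measurement itself relies on standard PyTorch CUDA accounting, roughly along these lines (a simplified sketch; `generate` stands in for the model's generation call):

```python
import torch

def profile_peak_memory(generate, prompt: str, loops: int = 3) -> float:
    """Run repeated generations and report absolute peak GPU memory in GiB."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()   # clear prior peak statistics
    for _ in range(loops):
        with torch.no_grad():
            generate(prompt)
    torch.cuda.synchronize()               # ensure all kernels have finished
    return torch.cuda.max_memory_allocated() / 1024 ** 3
```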
The real-quantization benchmark now follows the same configuration hierarchy as the rest of the VAR-Compressor pipeline. You must provide the model YAML file and the path to your PTQ artifacts:
```bash
# Benchmark the 8B model with real W4A4 + KV8 kernels
python infinity_w4a4_test.py configs/models/infinity-8b.yaml \
    --base-path ./runs/diffusion/int4_rank32_8b/ \
    --enable_kv_quant true \
    --prompt "A cinematic photo of a robot in Zurich"
```

Custom Arguments:

- `--base-path`: (Required) The directory containing your `model.pt`, `smooth.pt`, and `branch.pt` artifacts.
- `--enable_kv_quant`: Set to `true` to enable real INT8 KV-cache kernels.
- `--prompt`: The text description used for the generation benchmark.
- `--seed`: Fixed seed for reproducibility during latency measurement.
Visual Autoregressive models achieve state-of-the-art fidelity, but the monotonically growing KV-cache introduces a severe Memory Wall, confining these systems to data-center infrastructure. This fork provides a specialized compression pipeline to break that wall.
Through structural profiling, we diagnosed extreme activation outliers in the FFN down-projections of the Infinity architecture (peaking at 353x the median). To resolve this, ptq_infinity.py extends the SVDQuant paradigm to VAR models, decoupling outliers via a low-rank branch. To mitigate the cache footprint without runtime overhead, we implement Asymmetric Per-Channel INT8 Quantization, mapping highly skewed channel variances to static 8-bit limits optimized via Golden-Section Search.
This pipeline reduces the peak memory of the Infinity 8B model by 64% (from 37.1 GB to 13.3 GB), enabling local execution on mid-range edge devices.
Below is the generation quality evaluated with 5,000 samples from the MJHQ-30K dataset. Our quantization pipeline retains near-FP16 aesthetic alignment (ImageReward) despite the aggressive compression.
| Model | Precision | Method | FID (↓) | ImageReward (↑) | CLIP-IQA (↑) |
|---|---|---|---|---|---|
| Infinity 8B | FP16 | -- | 19.6 | 1.18 | 0.945 |
| | INT W4A4 | SVDQuant + KV8 | 19.0 | 1.13 | 0.935 |
| Infinity 2B | FP16 | -- | 21.3 | 0.981 | 0.947 |
| | INT W4A4 | SVDQuant + KV8 | 20.2 | 0.840 | 0.919 |
System footprint and end-to-end latency measured on an NVIDIA Jetson AGX Orin 64GB. The "Feasible HW" tier indicates the minimum commercial module required to run the model natively in memory.
| Model | Precision | Peak Memory | Latency | Feasible HW |
|---|---|---|---|---|
| Flux.1-dev | INT W4A4 | 11.8 GB | 112.0 s | Orin NX (16GB) |
| Infinity 8B | FP16 | 37.1 GB | 25.1 s | AGX Orin (64GB) |
| Infinity 8B | INT W4A4 + KV8 | 13.3 GB | 27.0 s | Orin NX (16GB) |
| Infinity 2B | FP16 | 16.0 GB | 8.46 s | AGX Orin (32GB) |
| Infinity 2B | INT W4A4 + KV8 | 7.71 GB | 11.5 s | Orin Nano (8GB) |
Qualitative Comparison of our compressed Infinity 8B model. We evaluate fidelity across four scenarios: Detailed Portrait, Architectural Geometry, Landscape Gradients, and Object Representation. Columns compare Infinity 8B in FP16 vs. our W4A4 quantization pipeline against Flux.1-dev (quantized via SVDQuant W4A4 INT4). The quantized Infinity 8B retains near-FP16 quality, showing comparable or even superior output quality to the baseline Flux.1-dev.

Prompts are: (1) "Portrait, photograph, canon 5d, magazine, editorial, full profile shot, photorealism, Annie Lebowitz, middle aged man, realistic, accurate"; (2) "A photograph of an intricate wooden gazebo with a traditional Asian-style tiled roof, set in a dense wooded forest clearing. Towering mountains in the background under a partly cloudy sky. Natural daylight, highly detailed wood grain and foliage, 8k"; (3) "Photorealistic. 4k. A hidden beach accessible only by boat, surrounded by towering rock formations, lush vegetation, and colorful coral reefs, the sun sets behind the mountains, subtle warm orange glow close over the water, creating a peaceful and romantic setting, Multiple light sources. high detail. ultra realistic"; (4) "A detailed close-up photograph of a small shrine against a solid black background. A weathered stone statue in the center, surrounded by fresh colorful flowers, several burning candles casting warm light, and fruit wrapped in clear cellophane plastic. Macro lens, highly textured, cinematic lighting, 8k". (1) and (4) are taken from MJHQ, (2) and (3) from sDCI. The same seed is used for all models.

