A neural image codec exploring Polar Vector Quantization — decomposing latent representations into radius (magnitude) and direction (unit hypersphere), then applying Product Quantization + Residual VQ for learned image compression.
Status: Research prototype. Demonstrates a working end-to-end neural codec pipeline with a novel quantization approach. Does not yet match JPEG at equal bitrate — see honest assessment and what we learned.
Traditional VQ treats all dimensions equally. Polar-VQ separates each latent vector into:

- **Radius** `r = ‖v‖`: how strong a feature is (4-bit scalar quantization)
- **Direction** `d = v / ‖v‖`: what kind of feature it is (codebook lookup on the unit hypersphere)
This separation is motivated by the observation that in high-dimensional spaces, most information is in the direction of vectors (cosine similarity), while magnitude varies smoothly and can be cheaply quantized.
```
Latent vector v ∈ ℝ²⁵⁶
│
├── Radius:    r = ‖v‖   → 4-bit scalar (16 learned levels)
│
└── Direction: d = v/‖v‖ → Product Quantization (8 heads × 32-D)
        │                → 4-stage Residual VQ
        │                      Stage 1:    Spherical (cosine sim)
        │                      Stages 2–4: Cartesian (L2)
        └── Entropy coding via checkerboard context model
```
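A minimal sketch of the decomposition step, using the shapes from the diagram (function names and the radius/lookup helpers are illustrative; the real implementation lives in `polar_vq/quantizer.py`):

```python
import torch
import torch.nn.functional as F

def polar_decompose(v: torch.Tensor, num_heads: int = 8, eps: float = 1e-8):
    """Split (B, 256) latent vectors into radius + per-head unit directions."""
    r = v.norm(dim=-1, keepdim=True)                  # (B, 1) radius
    d = v / (r + eps)                                 # (B, 256) unit direction
    # Product Quantization: split the direction into 8 heads of 32-D each.
    heads = d.view(-1, num_heads, 256 // num_heads)   # (B, 8, 32)
    # Each head of a unit vector has expected norm ~1/sqrt(num_heads),
    # so rescale to ~unit norm before the spherical codebook lookup.
    heads = heads * num_heads ** 0.5
    return r, heads

def quantize_radius(r: torch.Tensor, levels: torch.Tensor):
    """4-bit scalar quantization: snap each radius to the nearest of
    16 learned levels (`levels` is a (16,) tensor)."""
    idx = (r - levels).abs().argmin(dim=-1, keepdim=True)
    return levels[idx], idx

def spherical_lookup(heads: torch.Tensor, codebook: torch.Tensor):
    """Stage-1 RVQ lookup by cosine similarity; codebook is (K, 32)."""
    sims = F.normalize(heads, dim=-1) @ F.normalize(codebook, dim=-1).T
    return sims.argmax(dim=-1)                        # (B, 8) code indices
```

Stages 2–4 would then quantize the residual `heads - codebook[idx]` with ordinary L2 nearest-neighbour lookups, per the diagram.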
| Component | Design | Details |
|---|---|---|
| Encoder | CNN: 3→128→192→256, 8× downsample | 3 strided conv + 1 refinement |
| Polar-VQ | Radius quant + PQ + RVQ | 8 heads × 32-D, 1024 codebook entries, 4 stages |
| Context Model | Checkerboard two-pass entropy predictor | Spatial CNN + per-stage MLPs |
| Decoder | Transposed CNN, Sigmoid output | Mirror of encoder |
See docs/ARCHITECTURE.md for the full technical deep-dive.
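For orientation, here is a minimal sketch of the encoder as the table describes it (channel widths and 8× downsampling are from the table; kernel sizes and activations are assumptions — see `polar_vq/encoder.py` for the actual module):

```python
import torch.nn as nn

class EncoderSketch(nn.Module):
    """3 strided convs (3 -> 128 -> 192 -> 256, 8x downsample) + 1 refinement."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=5, stride=2, padding=2),    # 2x down
            nn.GELU(),
            nn.Conv2d(128, 192, kernel_size=5, stride=2, padding=2),  # 4x down
            nn.GELU(),
            nn.Conv2d(192, 256, kernel_size=5, stride=2, padding=2),  # 8x down
            nn.GELU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),  # refinement
        )

    def forward(self, x):
        return self.net(x)  # (B, 256, H/8, W/8) grid of latent vectors
```

Per the table, the decoder mirrors this with transposed convolutions and a final Sigmoid.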
Evaluated on the Kodak dataset (24 images, 768×512).
| Metric | Polar-VQ V4 | JPEG Q90 (similar BPP) |
|---|---|---|
| BPP | 3.35 | 2.35 |
| PSNR | 34.60 dB | 38.03 dB |
| MS-SSIM | 0.9857 | 0.9930 |
| Version | BPP | PSNR (dB) | MS-SSIM | Training Data | Key Change |
|---|---|---|---|---|---|
| V1 | 2.36 | 26.06 | 0.9488 | DIV2K (800) | Baseline implementation |
| V2 | 2.56 | 15.00 | 0.6965 | DIV2K (800) | ❌ Aggressive λ warmup → rate collapse |
| V3 | 4.44 | 26.32 | 0.9609 | DIV2K (800) | Context pre-training, gradual warmup |
| V4 | 3.35 | 34.60 | 0.9857 | COCO+DIV2K (124K) | Bug fixes, larger dataset, AMP |
V4 achieved +8.5 dB PSNR over V1 and is still training (epoch 15 of 100). Full results and analysis in docs/RESULTS.md.
```bash
git clone https://github.com/acieslik/Polar-VQ-Codec.git
cd Polar-VQ-Codec
pip install -r requirements.txt

# For .pvq file compression/decompression:
pip install -e ".[full]"
```

Requires Python 3.10+ and PyTorch 2.0+ with CUDA.
```bash
# Kodak benchmark only (~15 MB)
python scripts/download_data.py --kodak-only

# Full training data (COCO 2017 Unlabeled + DIV2K, ~22 GB)
python scripts/download_data.py
```

```bash
# Default: 3-stage curriculum, 100 epochs
python scripts/train.py --data-dir data/train --epochs 100 --target-lambda 0.01

# With validation on Kodak every 5 epochs
python scripts/train.py --data-dir data/train --target-lambda 0.01 \
    --val-dir data/kodak --val-interval 5

# Resume from checkpoint
python scripts/train.py --data-dir data/train --resume checkpoints/latest.pth
```

Training stages:
| Stage | Epochs | Focus | λ |
|---|---|---|---|
| A | 0–10 | Geometric foundation (MSE only) | 1e-6 |
| B | 10–60 | Rate-Distortion optimization | 0 → target λ (15-epoch warmup) |
| C | 60–100 | Perceptual fine-tuning (MS-SSIM+L1) | target λ / 20 |
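The table's schedule, written out as code (a sketch: the epoch boundaries, λ values, and 15-epoch linear warmup come from the table and the lessons below; the function name is illustrative, and the real logic lives in `scripts/train.py`):

```python
def lambda_schedule(epoch: int, target_lambda: float = 0.01) -> float:
    """Rate weight per the 3-stage curriculum above."""
    if epoch < 10:                      # Stage A: geometric foundation
        # Effectively MSE-only; BPP gradients are also detached here
        # (see "what we learned" below).
        return 1e-6
    if epoch < 60:                      # Stage B: rate-distortion optimization
        # 15-epoch linear warmup from 0 to the target rate penalty.
        warmup = min((epoch - 10) / 15, 1.0)
        return warmup * target_lambda
    return target_lambda / 20           # Stage C: perceptual fine-tuning
```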
```bash
python scripts/benchmark.py --dataset data/kodak --weights checkpoints/latest.pth
```

Generates R-D curves (BPP vs PSNR/MS-SSIM) comparing against JPEG, WebP, and PNG.
```bash
python scripts/compress.py photo.png photo.pvq --weights checkpoints/latest.pth
python scripts/decompress.py photo.pvq decoded.png --weights checkpoints/latest.pth
```

```
Polar-VQ-Codec/
├── polar_vq/ # Core library
│ ├── encoder.py # CNN encoder (8× downsample)
│ ├── decoder.py # CNN decoder (Sigmoid output)
│ ├── quantizer.py # PolarVQ: radius + PQ + hybrid RVQ
│ ├── context_model.py # Checkerboard entropy predictor
│ ├── codec.py # Full pipeline + .pvq bitstream
│ └── losses.py # MSE / MS-SSIM+L1 R-D loss
├── scripts/
│ ├── train.py # Multi-stage training with curriculum
│ ├── benchmark.py # R-D curves vs JPEG/WebP/PNG
│ ├── compress.py # Image → .pvq
│ ├── decompress.py # .pvq → image
│ └── download_data.py # Dataset downloader
├── docs/
│ ├── ARCHITECTURE.md # Technical deep-dive
│ ├── RESULTS.md # Benchmark analysis (V1–V4)
│ └── ROADMAP.md # Future directions
├── tests/ # Unit tests (pytest)
├── checkpoints/ # Saved model weights
└── results/             # Benchmark outputs
```
This is an honest assessment of where the project stands:

- **Not yet competitive with JPEG.** At 3.35 BPP, JPEG achieves ~38 dB vs our 34.6 dB. The context model hasn't yet learned to compress the index entropy well enough: entropy coding only brings the raw index bit budget of 5.06 BPP down to 3.35 BPP.
- **Only one operating point per trained model.** JPEG/WebP can sweep quality parameters at encode time. Each Polar-VQ quality level requires a separately trained model (~4 days each on a single GPU).
- **Training is ongoing.** V4 has completed only 15 of 100 epochs. Stage B (rate optimization) has just begun, and BPP is actively declining (5.13 → 3.35 over 5 epochs).
- **Simple CNN architecture.** State-of-the-art neural codecs use attention mechanisms, hyperprior networks, and deeper architectures. Our 7-layer CNN is deliberately minimal.
- **GPU required for decode.** No hardware decoder exists; inference requires PyTorch + CUDA.
The iterative development from V1 to V4 produced several insights about training VQ-based neural codecs:

- **Dataset diversity matters more than quantity for VQ.** Switching from 800 DIV2K images to 124K COCO+DIV2K images gave the largest single improvement (+8.5 dB). Codebooks need semantic diversity to populate properly.
- **λ warmup is critical.** Jumping straight to the full rate penalty caused catastrophic collapse (V2: 15 dB). A 15-epoch linear warmup into Stage B prevented this.
- **Detach BPP gradients during geometric training.** Stage A trains encoder/decoder/codebooks without rate pressure, letting the quantizer establish a stable latent topology first.
- **Dead-codebook restarts only in Stage A.** Restarting entries during rate optimization (Stage B) destabilizes the learned distributions.
- **The PQ scale factor is not optional.** Each head of a unit vector has expected norm 1/√num_heads, not 1. Without scaling, RVQ stages waste capacity correcting a scale error (see the quick check below).
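A quick numeric check of that last point (a standalone demo, not repo code):

```python
import torch

# For random unit vectors in R^256 split into 8 heads of 32-D,
# each head's norm concentrates near 1/sqrt(8) ≈ 0.354, not 1.
torch.manual_seed(0)
v = torch.randn(10_000, 256)
d = v / v.norm(dim=-1, keepdim=True)           # unit directions
head_norms = d.view(-1, 8, 32).norm(dim=-1)    # (10000, 8)
print(head_norms.mean().item())                # ≈ 0.35
print(1 / 8 ** 0.5)                            # 0.3535...
```

Without the √num_heads correction, a unit-norm direction codebook would sit roughly 3× too far from every head, and the residual stages would spend their capacity on that constant offset.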
The Polar-VQ quantization approach may be more impactful outside image compression. See docs/ROADMAP.md for analysis of:
- Completing the image codec (hyperprior, attention, multi-λ sweep)
- LLM weight compression — Polar-VQ preserves directional information that scalar quantizers (GPTQ, AWQ) discard
- Vector database compression — cosine similarity is the standard metric, and Polar-VQ directly optimizes for angular preservation
- Satellite/medical imaging — high-dimensional multispectral data is a natural fit
If you use this code in your research, please cite:

```bibtex
@software{polar_vq_codec,
  title={Polar-VQ Codec: Neural Image Compression with Hyperspherical Vector Quantization},
  year={2026},
  url={https://github.com/acieslik/Polar-VQ-Codec}
}
```

AGPL-3.0 (see LICENSE). Free for research and personal use. Commercial use requires opening your source code or obtaining a commercial license.

