
Polar-VQ Codec

A neural image codec exploring Polar Vector Quantization — decomposing latent representations into radius (magnitude) and direction (unit hypersphere), then applying Product Quantization + Residual VQ for learned image compression.

Rate-Distortion curve on Kodak dataset

Status: Research prototype. Demonstrates a working end-to-end neural codec pipeline with a novel quantization approach. Does not yet match JPEG at equal bitrate — see honest assessment and what we learned.


Key Idea

Traditional VQ treats all dimensions equally. Polar-VQ separates each latent vector into:

  • Radius r = ‖v‖: how strong a feature is (4-bit scalar quantization)
  • Direction d = v / ‖v‖: what kind of feature it is (codebook lookup on the unit hypersphere)

This separation is motivated by the observation that in high-dimensional spaces, most information is in the direction of vectors (cosine similarity), while magnitude varies smoothly and can be cheaply quantized.

Latent vector v ∈ ℝ²⁵⁶
    │
    ├── Radius: r = ‖v‖           → 4-bit scalar (16 learned levels)
    │
    └── Direction: d = v/‖v‖      → Product Quantization (8 heads × 32-D)
         │                            → 4-stage Residual VQ
         │                               Stage 1: Spherical (cosine sim)
         │                               Stages 2-4: Cartesian (L2)
         └── Entropy coding via checkerboard context model
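In code, the decomposition sketched above looks like this (a minimal illustration; the function name and the exact clamping are assumptions, not the repository's `quantizer.py` API):

```python
import torch

def polar_decompose(v: torch.Tensor, num_heads: int = 8):
    """Split latents of shape (..., 256) into radius and per-head unit directions."""
    r = v.norm(dim=-1, keepdim=True)                 # radius: feature strength
    d = v / r.clamp_min(1e-8)                        # direction on the unit hypersphere
    heads = d.reshape(*d.shape[:-1], num_heads, -1)  # (..., 8, 32) sub-vectors for PQ
    return r, heads

v = torch.randn(4, 256)
r, heads = polar_decompose(v)
# Before quantization, the decomposition is lossless:
recon = heads.reshape(4, 256) * r
```

The radius then goes to the 4-bit scalar quantizer and each 32-D head to its own RVQ stack.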

Architecture

| Component | Design | Parameters |
| --- | --- | --- |
| Encoder | CNN: 3→128→192→256, 8× downsample | 3 strided convs + 1 refinement |
| Polar-VQ | Radius quant + PQ + RVQ | 8 heads × 32-D, 1024 codebook entries, 4 stages |
| Context Model | Checkerboard two-pass entropy predictor | Spatial CNN + per-stage MLPs |
| Decoder | Transposed CNN, Sigmoid output | Mirror of encoder |
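A minimal sketch of the encoder row above: 3→128→192→256 channels with three stride-2 convolutions (8× downsampling) plus one refinement layer. The channel counts and stride pattern come from the table; kernel sizes and activations are assumptions.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """3 -> 128 -> 192 -> 256 channels; three stride-2 convs give 8x downsampling."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv2d(128, 192, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv2d(192, 256, kernel_size=5, stride=2, padding=2), nn.GELU(),
            nn.Conv2d(256, 256, kernel_size=3, stride=1, padding=1),  # refinement
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

z = Encoder()(torch.randn(1, 3, 256, 256))  # -> (1, 256, 32, 32): one 256-D latent per 8x8 block
```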

See docs/ARCHITECTURE.md for the full technical deep-dive.

Results

Evaluated on the Kodak dataset (24 images, 768×512).
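For intuition, BPP maps to file size as bits × pixels ÷ 8; at a Kodak image's 768×512 resolution:

```python
# File size implied by a BPP figure at Kodak resolution.
width, height = 768, 512
bpp = 3.35
size_kb = width * height * bpp / 8 / 1024
print(round(size_kb, 1))  # 160.8 KB per image at 3.35 BPP
```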

Current Best (V4 — 15 epochs of 100)

| Metric | Polar-VQ V4 | JPEG Q90 (similar BPP) |
| --- | --- | --- |
| BPP | 3.35 | 2.35 |
| PSNR | 34.60 dB | 38.03 dB |
| MS-SSIM | 0.9857 | 0.9930 |

Training Evolution

V1 through V4 evolution on RD curve

| Version | BPP | PSNR (dB) | MS-SSIM | Training Data | Key Change |
| --- | --- | --- | --- | --- | --- |
| V1 | 2.36 | 26.06 | 0.9488 | DIV2K (800) | Baseline implementation |
| V2 | 2.56 | 15.00 | 0.6965 | DIV2K (800) | ❌ Aggressive λ warmup → rate collapse |
| V3 | 4.44 | 26.32 | 0.9609 | DIV2K (800) | Context pre-training, gradual warmup |
| V4 | 3.35 | 34.60 | 0.9857 | COCO+DIV2K (124K) | Bug fixes, larger dataset, AMP |

V4 achieved +8.5 dB PSNR over V1 and is still training (epoch 15 of 100). Full results and analysis in docs/RESULTS.md.

Installation

git clone https://github.com/acieslik/Polar-VQ-Codec.git
cd Polar-VQ-Codec
pip install -r requirements.txt

# For .pvq file compression/decompression:
pip install -e ".[full]"

Requires Python 3.10+ and PyTorch 2.0+ with CUDA.

Quick Start

Download Data

# Kodak benchmark only (~15 MB)
python scripts/download_data.py --kodak-only

# Full training data (COCO 2017 Unlabeled + DIV2K, ~22 GB)
python scripts/download_data.py

Train

# Default: 3-stage curriculum, 100 epochs
python scripts/train.py --data-dir data/train --epochs 100 --target-lambda 0.01

# With validation on Kodak every 5 epochs
python scripts/train.py --data-dir data/train --target-lambda 0.01 \
    --val-dir data/kodak --val-interval 5

# Resume from checkpoint
python scripts/train.py --data-dir data/train --resume checkpoints/latest.pth

Training stages:

| Stage | Epochs | Focus | λ |
| --- | --- | --- | --- |
| A | 0–10 | Geometric foundation (MSE only) | 1e-6 |
| B | 10–60 | Rate-Distortion optimization | 0 → target λ (15-epoch warmup) |
| C | 60–100 | Perceptual fine-tuning (MS-SSIM+L1) | target λ / 20 |
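The schedule in the table can be sketched as a single function of the epoch (the piecewise boundaries come from the table; the exact interpolation inside the warmup is an assumption):

```python
def lambda_schedule(epoch: int, target_lambda: float = 0.01) -> float:
    """Per-epoch rate-penalty weight for the three-stage curriculum."""
    if epoch < 10:                          # Stage A: geometric foundation
        return 1e-6
    if epoch < 60:                          # Stage B: 15-epoch linear warmup, then hold
        warmup = min((epoch - 10) / 15, 1.0)
        return warmup * target_lambda
    return target_lambda / 20               # Stage C: perceptual fine-tuning
```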

Benchmark

python scripts/benchmark.py --dataset data/kodak --weights checkpoints/latest.pth

Generates R-D curves (BPP vs PSNR/MS-SSIM) comparing against JPEG, WebP, and PNG.
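A JPEG anchor point on such a curve can be computed with Pillow and NumPy; this is a hypothetical sketch of the idea, not the repository's `benchmark.py`:

```python
import io
import numpy as np
from PIL import Image

def jpeg_rd_point(img: Image.Image, quality: int):
    """Return (bpp, psnr_db) for one JPEG quality setting."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    bpp = buf.getbuffer().nbytes * 8 / (img.width * img.height)
    x = np.asarray(img, dtype=np.float64)
    y = np.asarray(Image.open(buf), dtype=np.float64)
    mse = max(((x - y) ** 2).mean(), 1e-10)   # guard against a lossless round-trip
    psnr = 10 * np.log10(255.0 ** 2 / mse)
    return bpp, psnr

# Example on a synthetic image; sweeping `quality` traces the full JPEG curve.
rng = np.random.default_rng(0)
img = Image.fromarray(rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8))
bpp, psnr = jpeg_rd_point(img, quality=90)
```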

Compress / Decompress

python scripts/compress.py photo.png photo.pvq --weights checkpoints/latest.pth
python scripts/decompress.py photo.pvq decoded.png --weights checkpoints/latest.pth

Project Structure

Polar-VQ-Codec/
├── polar_vq/                  # Core library
│   ├── encoder.py             # CNN encoder (8× downsample)
│   ├── decoder.py             # CNN decoder (Sigmoid output)
│   ├── quantizer.py           # PolarVQ: radius + PQ + hybrid RVQ
│   ├── context_model.py       # Checkerboard entropy predictor
│   ├── codec.py               # Full pipeline + .pvq bitstream
│   └── losses.py              # MSE / MS-SSIM+L1 R-D loss
├── scripts/
│   ├── train.py               # Multi-stage training with curriculum
│   ├── benchmark.py           # R-D curves vs JPEG/WebP/PNG
│   ├── compress.py            # Image → .pvq
│   ├── decompress.py          # .pvq → image
│   └── download_data.py       # Dataset downloader
├── docs/
│   ├── ARCHITECTURE.md        # Technical deep-dive
│   ├── RESULTS.md             # Benchmark analysis (V1–V4)
│   └── ROADMAP.md             # Future directions
├── tests/                     # Unit tests (pytest)
├── checkpoints/               # Saved model weights
└── results/                   # Benchmark outputs

Current Limitations

This is an honest assessment of where the project stands:

  1. Not yet competitive with JPEG. At 3.35 BPP, JPEG achieves ~38 dB vs our 34.6 dB. The context model has not yet learned to compress the index entropy well enough: the 3.35 BPP output is still too close to the raw index bit budget of 5.06 BPP.

  2. Only one operating point per trained model. JPEG/WebP can sweep quality parameters at encode time. Each Polar-VQ quality level requires a separately trained model (~4 days each on a single GPU).

  3. Training is ongoing. V4 has only completed 15 of 100 epochs. Stage B (rate optimization) has just begun — BPP is actively declining (5.13 → 3.35 over 5 epochs).

  4. Simple CNN architecture. State-of-the-art neural codecs use attention mechanisms, hyperprior networks, and deeper architectures. Our 7-layer CNN is deliberately minimal.

  5. GPU required for decode. No hardware decoder exists — inference requires PyTorch + CUDA.
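The 5.06 BPP raw bit budget in point 1 follows directly from the architecture: 4 radius bits plus one 10-bit index (log₂ 1024) per head per RVQ stage, amortized over the 8×8 pixel block that each latent vector covers.

```python
import math

radius_bits = 4
heads, stages = 8, 4
index_bits = int(math.log2(1024))                    # 10 bits per codebook index
bits_per_latent = radius_bits + heads * stages * index_bits  # 4 + 320 = 324
pixels_per_latent = 8 * 8                            # 8x downsampling in each dimension
raw_bpp = bits_per_latent / pixels_per_latent
print(raw_bpp)  # 5.0625 — the ~5.06 BPP ceiling the context model must beat
```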

What We Learned

The iterative development from V1 to V4 produced several insights about training VQ-based neural codecs:

  • Dataset diversity matters more than quantity for VQ. Switching from 800 DIV2K images to 124K COCO+DIV2K images gave the largest single improvement (+8.5 dB). Codebooks need semantic diversity to populate properly.

  • λ warmup is critical. Jumping to full rate penalty caused catastrophic collapse (V2: 15 dB). A 15-epoch linear warmup into Stage B prevented this.

  • Detach BPP gradients during geometric training. Stage A trains encoder/decoder/codebooks without rate pressure, letting the quantizer establish a stable latent topology first.

  • Dead codebook restarts only in Stage A. Restarting entries during rate optimization (Stage B) destabilizes the learned distributions.

  • The PQ scale factor is not optional. Each head of a unit vector has expected norm 1/√num_heads, not 1. Without scaling, RVQ stages waste capacity correcting a scale error.
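The last point can be checked numerically: a 32-D slice of a random unit vector in ℝ²⁵⁶ has expected squared norm 1/8, so its typical norm is near 1/√8 ≈ 0.3536 rather than 1 (illustrative stdlib-only simulation, not repository code):

```python
import math
import random

random.seed(0)
num, dim, heads = 2000, 256, 8
total = 0.0
for _ in range(num):
    v = [random.gauss(0, 1) for _ in range(dim)]
    n = math.sqrt(sum(x * x for x in v))
    d = [x / n for x in v]                       # unit direction
    # Norm of the first 32-D head of the unit direction:
    total += math.sqrt(sum(x * x for x in d[: dim // heads]))
mean_head_norm = total / num
print(mean_head_norm)  # close to 1/sqrt(8) ≈ 0.3536
# Multiplying each head by sqrt(num_heads) restores unit scale for the RVQ stages.
```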

Roadmap

The Polar-VQ quantization approach may be more impactful outside image compression. See docs/ROADMAP.md for analysis of:

  • Completing the image codec (hyperprior, attention, multi-λ sweep)
  • LLM weight compression — Polar-VQ preserves directional information that scalar quantizers (GPTQ, AWQ) discard
  • Vector database compression — cosine similarity is the standard metric, and Polar-VQ directly optimizes for angular preservation
  • Satellite/medical imaging — high-dimensional multispectral data is a natural fit

Citation

If you use this code in your research, please cite:

@software{polar_vq_codec,
  title={Polar-VQ Codec: Neural Image Compression with Hyperspherical Vector Quantization},
  year={2026},
  url={https://github.com/acieslik/Polar-VQ-Codec}
}

License

AGPL-3.0 — see LICENSE. Free for research and personal use. Commercial use requires opening your source code or obtaining a commercial license.
