rANS Entropy Coding on Top of 4-bit Quantization #3040

drxddy · 2026-01-22T10:00:35Z

I've been experimenting with rANS entropy coding applied to MLX's 4-bit quantized weights.

Key Finding

4-bit quantized LLM weights have a Shannon entropy of only ~1.5 bits per weight (not 4 bits), because the quantized values are roughly Gaussian-distributed rather than uniform across the 16 levels. This means we can losslessly compress them further.
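
To make the entropy argument concrete, here is a minimal sketch of how the per-weight entropy of 4-bit codes can be measured. It uses synthetic Gaussian values and a plain uniform quantizer with a hypothetical `clip` range. Both are assumptions for illustration, not MLX's group-wise quantization or real model weights, so the printed number should not be expected to match the ~1.5 bits reported above.

```python
import math
import random
from collections import Counter

def shannon_entropy_bits(symbols):
    """Empirical Shannon entropy, in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def quantize_4bit(values, clip):
    """Uniform 16-level quantizer over [-clip, clip].

    Assumed stand-in for illustration only; not MLX's quantization scheme.
    """
    step = 2 * clip / 15
    return [min(15, max(0, round((v + clip) / step))) for v in values]

rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(100_000)]  # synthetic weights
codes = quantize_4bit(weights, clip=6.0)  # wide range, as if set by outliers

entropy = shannon_entropy_bits(codes)
print(f"{entropy:.2f} bits per weight")
print(f"ideal lossless ratio over 4-bit: {4 / entropy:.2f}x")
```

The central levels absorb most of the mass while the outer ones are rarely used, which is what pulls the entropy below 4 bits. An entropy coder can approach this bound but not beat it.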

Results (M2 Pro)

Additional compression: 1.84x over 4-bit
Measured speedup: 2.32x (bandwidth-bound inference)
Decode overhead: 29% of memory bandwidth

Prototype

I have a working Metal kernel with fused decode+dequantize+GEMV:

Physical interleaving for coalesced GPU memory access
Register-cached frequency tables
Tested on Qwen2.5-0.5B
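
For readers unfamiliar with rANS, the core encode/decode step can be sketched in a few lines of Python. This is not the Metal kernel from the prototype: it keeps the whole coder state in one arbitrary-precision integer instead of renormalizing into a byte stream, skips interleaving entirely, and runs on synthetic 4-bit codes. The names `SCALE_BITS`, `build_table`, `rans_encode`, and `rans_decode` are hypothetical.

```python
import random
from collections import Counter

SCALE_BITS = 12  # frequencies sum to 2**12 (assumed table precision)

def build_table(symbols):
    """Quantize symbol counts to integer frequencies summing to 2**SCALE_BITS."""
    total = 1 << SCALE_BITS
    counts = Counter(symbols)
    n = len(symbols)
    # Every symbol that occurs keeps a frequency of at least 1.
    freqs = {s: max(1, (c * total) // n) for s, c in counts.items()}
    # Push the rounding error onto the most frequent symbol.
    top = max(freqs, key=freqs.get)
    freqs[top] += total - sum(freqs.values())
    assert freqs[top] > 0
    # Cumulative start of each symbol's slot range.
    starts, acc = {}, 0
    for s in sorted(freqs):
        starts[s] = acc
        acc += freqs[s]
    return freqs, starts

def rans_encode(symbols, freqs, starts):
    """Encode in reverse so that decoding yields symbols in forward order."""
    x = 1
    for s in reversed(symbols):
        f = freqs[s]
        x = ((x // f) << SCALE_BITS) + starts[s] + (x % f)
    return x

def rans_decode(x, count, freqs, starts):
    mask = (1 << SCALE_BITS) - 1
    out = []
    for _ in range(count):
        slot = x & mask  # low bits identify the symbol
        s = next(k for k in freqs if starts[k] <= slot < starts[k] + freqs[k])
        x = freqs[s] * (x >> SCALE_BITS) + slot - starts[s]
        out.append(s)
    return out

rng = random.Random(0)
# Synthetic, roughly Gaussian 4-bit codes (not real model weights).
codes = [min(15, max(0, round(rng.gauss(7.5, 1.5)))) for _ in range(10_000)]
freqs, starts = build_table(codes)
state = rans_encode(codes, freqs, starts)
assert rans_decode(state, len(codes), freqs, starts) == codes
print(f"{state.bit_length() / len(codes):.2f} bits per code vs 4 raw")
```

With only 16 symbols the frequency and cumulative tables are tiny, which is presumably what makes caching them in registers practical. A GPU version would additionally need fixed-width state with renormalization and multiple independent streams so threads can decode in parallel.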

Questions

Is there interest in this direction?
Would this complement or conflict with the ASTC approach (Feature request: Add ASTC weight compression + hardware decoding support #2418)?
Happy to contribute a PR if there's appetite

Here's my prototype and the experiments I ran: https://github.com/drxddy/ecq

Replies: 0 comments
