A Training-Free Fix for KV Cache INT4 Failures
Naive INT4 quantization of KV caches fails catastrophically on some models (ΔPPL = +8293 on Qwen2-7B at 4096 tokens). We fix this by separating the L2 norm before quantizing:
```python
# Before: naive INT4 (can fail catastrophically)
scale = x.abs().amax(dim=-1, keepdim=True) / 7
x_q = (x / scale).round().clamp(-7, 7) * scale

# After: norm-separated per-channel INT4 (always safe)
norm = x.norm(dim=-1, keepdim=True)
direction = x / norm
scale = direction.abs().amax(dim=0, keepdim=True) / 7  # per-channel
direction_q = (direction / scale).round().clamp(-7, 7) * scale
direction_q = direction_q / direction_q.norm(dim=-1, keepdim=True)
x_q = norm * direction_q
```

Result: ΔPPL +8293 → +0.19 at 4096 tokens (a 44,000x improvement). The method never hurts models where naive INT4 already works.
| Model | Params | naive INT4 ΔPPL | nsep+pchan ΔPPL | Improvement |
|---|---|---|---|---|
| GPT-2 | 124M | +1.60 | +1.13 | 1.4x |
| Pythia-410M | 410M | +171.13 | +23.46 | 7.3x |
| Pythia-2.8B | 2.8B | +14.74 | +1.99 | 7.4x |
| Pythia-6.9B | 6.9B | +22.56 | +1.21 | 18.7x |
| Mistral-7B | 7B | +0.10 | +0.04 | 2.6x |
| Qwen2-7B | 7B | +811.61 | +0.43 | 1885x |
| Pythia-12B | 12B | +34.22 | +4.01 | 8.5x |
| Qwen2.5-14B | 14B | +0.55 | +0.45 | 1.2x |
Long-context results at 4096 tokens:
| Model | naive INT4 ΔPPL | nsep+pchan ΔPPL | Improvement |
|---|---|---|---|
| Qwen2-7B | +8293 | +0.19 | 44,000x |
| Mistral-7B | +0.11 | +0.12 | 1x (no harm) |
Does KV cache quantization make models forget facts buried in context?
Single-needle (secret code hidden in 185-1413 token haystack):
| Model | Outlier ratio | naive INT4 | nsep+pchan |
|---|---|---|---|
| Qwen2-7B | 8.6x | 0/15 | 15/15 |
| Pythia-6.9B | 4.6x | 15/15 | 15/15 |
| Mistral-7B | 3.1x | 15/15 | 15/15 |
Multi-needle (3-5 secrets across 280-2459 tokens):
| Model | Outlier ratio | baseline | naive INT4 | nsep+pchan |
|---|---|---|---|---|
| Qwen2-7B | 8.6x | 26/26 | 0/26 | 26/26 |
| Qwen2.5-14B | 3.5x | 26/26 | 26/26 | 26/26 |
On Qwen2-7B, naive INT4 causes complete loss of all embedded facts (0/26), while nsep+pchan fully recovers every needle (26/26). Models with low outlier ratios (Qwen2.5-14B: 3.5x) retain all needles even under naive INT4.
Validated on GPT-2, OPT (125M, 1.3B, 13B), Pythia (410M, 2.8B, 6.9B, 12B), Qwen2 (0.5B, 7B), Qwen2.5-14B, Mistral-7B, and Falcon-40B. See paper Table 2.
Two independent problems cause naive INT4 to fail:
1. Token-wise norm variation: KV vector norms vary 2-5x across tokens, making per-row quantization scales inconsistent
2. Activation outlier channels: specific dimensions have values 10-100x larger than average, corrupting the quantization scale
Norm separation fixes (1) by decoupling magnitude from direction. Per-channel quantization fixes (2) by giving each dimension its own scale. Neither alone is sufficient: on Qwen2-7B, nsep alone gives a 4.1x improvement and pchan alone gives 2.4x, but the combination gives 744x.
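Both failure modes are easy to reproduce on synthetic data. The sketch below is illustrative only: the tensor shape, outlier channel index, and magnitude factors are assumptions chosen to mimic the two problems, not the paper's measured activations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic KV-like activations with both failure modes:
# per-token norms varying ~5x, plus one outlier channel ~50x larger.
x = rng.normal(size=(64, 128)).astype(np.float32)
x *= rng.uniform(1.0, 5.0, size=(64, 1)).astype(np.float32)  # norm variation
x[:, 7] *= 50.0                                              # outlier channel

def naive_int4(x):
    # One scale per token (row); the outlier channel inflates it,
    # so all other channels are quantized far too coarsely.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 7
    return np.clip(np.round(x / scale), -7, 7) * scale

def nsep_pchan_int4(x):
    # Separate magnitude from direction, then quantize the unit
    # directions with one scale per channel (column).
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    d = x / norm
    scale = np.abs(d).max(axis=0, keepdims=True) / 7
    d_q = np.clip(np.round(d / scale), -7, 7) * scale
    d_q /= np.linalg.norm(d_q, axis=-1, keepdims=True)
    return norm * d_q

def rel_err(x, x_q):
    return np.linalg.norm(x - x_q) / np.linalg.norm(x)

print("naive INT4 relative error:", rel_err(x, naive_int4(x)))
print("nsep+pchan relative error:", rel_err(x, nsep_pchan_int4(x)))
```

On this synthetic tensor the norm-separated per-channel variant gives a much smaller relative reconstruction error than the naive per-row quantizer, mirroring the ΔPPL gap in the tables above.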
The main experiment code uses simulated (fake) quantization: values are quantized to INT4 and immediately dequantized back to floating point. This is standard practice in quantization research (KIVI, SmoothQuant, GPTQ use the same approach) and accurately measures the quality impact (ΔPPL) of quantization.
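What "simulated (fake) quantization" means can be shown in a few lines: the tensor stays float32 throughout, but after the quantize-dequantize round trip each row takes at most 15 distinct levels. This is a minimal numpy sketch, not the repo's experiment code.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16)).astype(np.float32)

# Fake quantization: quantize to INT4 codes, dequantize immediately.
scale = np.abs(x).max(axis=-1, keepdims=True) / 7
x_fake = np.clip(np.round(x / scale), -7, 7) * scale

assert x_fake.dtype == np.float32  # no integer storage, no memory saving
# ...but each row is restricted to at most 15 representable levels:
for row, s in zip(x_fake, scale):
    assert len(np.unique(np.round(row / s))) <= 15
```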
We also provide a real INT4 packing implementation (experiments/poc_real_int4.py, poc_real_int4_7b.py) that stores quantized values in packed uint8 tensors (2 values per byte), achieving actual memory reduction:
| Model | FP16 KV | Real INT4 | Compression | naive INT4 ΔPPL | nsep+pchan ΔPPL (real) |
|---|---|---|---|---|---|
| GPT-2 | 1.4 MB | 0.4 MB | 3.43x | — | -3.86 |
| Qwen2-7B | 3.6 MB | 1.0 MB | 3.65x | +401.3 | +0.29 |
Fake vs. real ΔPPL difference: < 0.005, i.e. packing is lossless. The quality results from simulated quantization are fully reproduced with actual memory reduction.
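The packing step itself is a simple bit-level transform. The sketch below is illustrative (it is not the implementation in poc_real_int4.py): signed INT4 codes in [-8, 7] are biased to [0, 15] and stored two per uint8, one in the high nibble and one in the low nibble.

```python
import numpy as np

def pack_int4(codes):
    # Bias signed codes [-8, 7] to unsigned [0, 15], then pack pairs.
    u = (codes.astype(np.int8) + 8).astype(np.uint8)
    u = u.reshape(-1, 2)
    return (u[:, 0] << 4) | u[:, 1]  # high nibble | low nibble

def unpack_int4(packed):
    # Recover both nibbles and remove the bias.
    hi = (packed >> 4).astype(np.int8) - 8
    lo = (packed & 0x0F).astype(np.int8) - 8
    return np.stack([hi, lo], axis=1).reshape(-1)

codes = np.array([-7, 0, 3, 7, -8, 5], dtype=np.int8)
packed = pack_int4(codes)  # 6 values stored in 3 bytes
assert np.array_equal(unpack_int4(packed), codes)  # lossless round-trip
```

Because the codes round-trip exactly, packing adds no quantization error; the only quality impact comes from the INT4 rounding itself, which is why the fake and real ΔPPL numbers agree.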
A production deployment would additionally require a fused CUDA kernel for quantize-on-write and dequantize-on-read to eliminate the Python overhead.
📄 Norm-Separated Quantization: A Training-Free Fix for KV Cache INT4 Failures (18 pages, 5 figures, 7 appendices)
```
norm-separated-quantization/
├── paper/                    # LaTeX source, PDF, figures
├── experiments/
│   ├── phase0_*.py           # Structure verification (local, M1)
│   ├── phase1_*.py           # Hidden-state compression (local)
│   ├── phase4*_*.py          # KV cache compression (local)
│   ├── phase5*_*.py          # 7B+ scaling (Colab GPU)
│   ├── phase6*_*.py          # WikiText-2 benchmarks
│   ├── phase7*_*.py          # Appendix experiments (long ctx, KIVI, memory)
│   ├── phase8_*.py           # Post-LN control experiment
│   ├── poc_real_int4*.py     # Real INT4 packing PoC
│   ├── poc_needle*.py        # Needle-in-Haystack experiments
│   ├── poc_multi_needle*.py  # Multi-needle retrieval experiments
│   ├── generate_figures.py   # Reproduce all paper figures
│   ├── compressors.py        # Compression primitives
│   └── requirements.txt
├── results/                  # All experiment results (JSON)
├── docs/                     # Experiment report, plan
├── LICENSE                   # Apache 2.0
└── README.md
```
```bash
cd experiments
pip install -r requirements.txt
python generate_figures.py

# Phase 0: Verify arc structure
python phase0_structure_verification.py
# Phase 4b: KV cache quantization (GPT-2)
python phase4b_asymmetric_quantization.py
# WikiText-2 benchmark (GPT-2)
python phase6_figure1_wikitext.py
# Real INT4 packing PoC
python poc_real_int4.py
```

Copy-paste scripts from experiments/phase5*.py, phase7*.py, or poc_*.py into Google Colab cells, splitting at the `# === CELL 1 ===` / `# === CELL 2 ===` markers.
| Model | Params | KV Heads | head_dim | Arch | Source |
|---|---|---|---|---|---|
| GPT-2 | 124M | 12 | 64 | MHA | gpt2 |
| OPT-125m | 125M | 12 | 64 | MHA | facebook/opt-125m |
| Pythia-410M | 410M | 16 | 64 | MHA | EleutherAI/pythia-410m |
| Qwen2-0.5B | 0.5B | 2 | 64 | GQA | Qwen/Qwen2-0.5B |
| OPT-1.3B | 1.3B | 32 | 64 | MHA | facebook/opt-1.3b |
| Pythia-2.8B | 2.8B | 32 | 80 | MHA | EleutherAI/pythia-2.8b |
| Pythia-6.9B | 6.9B | 32 | 128 | MHA | EleutherAI/pythia-6.9b |
| Mistral-7B | 7B | 8 | 128 | GQA | mistralai/Mistral-7B-v0.1 |
| Qwen2-7B | 7B | 4 | 128 | GQA | Qwen/Qwen2-7B |
| Pythia-12B | 12B | 40 | 128 | MHA | EleutherAI/pythia-12b |
| OPT-13B | 13B | 40 | 128 | MHA | facebook/opt-13b |
| Qwen2.5-14B | 14B | 8 | 128 | GQA | Qwen/Qwen2.5-14B |
| Falcon-40B | 40B | 128 | 64 | MHA | tiiuae/falcon-40b |
```bibtex
@article{sato2026normsep,
  title={Norm-Separated Quantization: A Training-Free Fix for KV Cache INT4 Failures},
  author={Sato, Kentaro},
  year={2026},
  doi={10.5281/zenodo.19602981},
  url={https://doi.org/10.5281/zenodo.19602981}
}
```

This work builds on the geometric observations from The Arc and Its Thickness (Sato, 2026), which established that Pre-LN Transformer hidden states concentrate on a norm-dominant subspace.
Apache License 2.0. See LICENSE.