Norm correction validated on vLLM + Nemotron hybrid #55
Description
Validated your norm correction finding on vLLM (PR #38479) with Nemotron-Cascade-2-30B-A3B on 8x RTX A4000. The correction lowered reconstruction error as expected, but it did not change pass/fail on our 14-check reasoning benchmark. The dominant quality factor turned out to be value precision: 2-bit values drop to a 71.4% pass rate (10/14), 4-bit to 85.7% (12/14), and FP8 values pass at 100% (14/14). Keys at 3-bit with norm correction are fine.
The key finding for you: value quantization precision is the bottleneck, not key reconstruction error. FP8 values + 3-bit keys = lossless quality at 2x KV compression on a hybrid Mamba+MoE+Attention model.
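For context, here is a minimal NumPy sketch of the kind of round-trip (fake-quant) experiment behind the bit-width comparison above. It is illustrative only: the symmetric per-channel absmax scaling is an assumption for the sketch, not the actual vLLM kernel's quantization scheme.

```python
import numpy as np

def fake_quant(x, bits, axis=-1):
    """Round-trip symmetric uniform quantization at `bits` per element.

    Uses a per-channel absmax scale along `axis` (an assumption for this
    sketch); quantizes, then dequantizes, returning the reconstruction.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero channels
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
v = rng.standard_normal((8, 64)).astype(np.float32)  # stand-in for a V-cache tile

for bits in (2, 4, 8):
    err = np.linalg.norm(v - fake_quant(v, bits)) / np.linalg.norm(v)
    print(f"{bits}-bit values: relative reconstruction error {err:.3f}")
```

On Gaussian-like activations the relative error falls steeply from 2-bit to 4-bit to 8-bit, which is consistent with values (not keys) being the precision bottleneck we observed.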
Thought you would want the cross-implementation signal. Thanks for the norm correction work; it pointed us in the right direction even though the benchmark impact ultimately came from a different axis.