Norm correction validated on vLLM + Nemotron hybrid #55
Description
Validated your norm correction finding on vLLM (PR #38479) with Nemotron-Cascade-2-30B-A3B on 8x RTX A4000. The correction lowered reconstruction error as expected, but it did not change pass/fail on our 14-check reasoning benchmark. The dominant quality factor turned out to be value precision: 2-bit values drop to a 71.4% pass rate (10/14), 4-bit to 85.7% (12/14), and FP8 values pass at 100% (14/14). Keys at 3-bit with norm correction are fine.
The key finding for you: value quantization precision is the bottleneck, not key reconstruction error. FP8 values + 3-bit keys = lossless quality at 2x KV compression on a hybrid Mamba+MoE+Attention model.
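For context, here is a minimal NumPy sketch of the kind of round-trip (fake-quant) experiment behind the bit-width comparison above. It is illustrative only: the symmetric per-channel absmax scaling is an assumption for the sketch, not the actual vLLM kernel's quantization scheme.

```python
import numpy as np

def fake_quant(x, bits, axis=-1):
    """Round-trip symmetric uniform quantization at `bits` per element.

    Uses a per-channel absmax scale along `axis` (an assumption for this
    sketch); quantizes, then dequantizes, returning the reconstruction.
    """
    qmax = 2 ** (bits - 1) - 1                    # e.g. 7 for 4-bit
    scale = np.abs(x).max(axis=axis, keepdims=True) / qmax
    scale = np.where(scale == 0, 1.0, scale)      # guard against all-zero channels
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
v = rng.standard_normal((8, 64)).astype(np.float32)  # stand-in for a V-cache tile

for bits in (2, 4, 8):
    err = np.linalg.norm(v - fake_quant(v, bits)) / np.linalg.norm(v)
    print(f"{bits}-bit values: relative reconstruction error {err:.3f}")
```

On Gaussian-like activations the relative error falls steeply from 2-bit to 4-bit to 8-bit, which is consistent with values (not keys) being the precision bottleneck we observed.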
Thought you would want the cross-implementation signal. Thanks for the norm correction work; it pointed us in the right direction even though the benchmark impact ultimately came from a different axis.