Norm correction validated on vLLM + Nemotron hybrid #55

@MidasMining

Description

Validated your norm correction finding on vLLM (PR #38479) with Nemotron-Cascade-2-30B-A3B on 8x RTX A4000. The correction lowered key reconstruction error as expected, but it did not change pass/fail on our 14-check reasoning benchmark. The dominant quality factor turned out to be value precision: 2-bit values pass only 71.4% of checks, 4-bit values 85.7%, and FP8 values pass 100%. Keys at 3-bit with norm correction are fine.
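For anyone reproducing this, here is a minimal standalone sketch of the kind of measurement involved. It is not vLLM's kernel: `quantize_uniform` is a plain symmetric per-token uniform quantizer, and `norm_correct` assumes "norm correction" means rescaling each reconstructed key row back to the original row's L2 norm. Both names are mine, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize_uniform(x, bits):
    """Symmetric per-token uniform quantization (toy stand-in, not vLLM's kernel)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def norm_correct(x_hat, x):
    """Rescale each reconstructed row to match the original row's L2 norm
    (one plausible reading of 'norm correction')."""
    target = np.linalg.norm(x, axis=-1, keepdims=True)
    current = np.linalg.norm(x_hat, axis=-1, keepdims=True)
    return x_hat * (target / np.maximum(current, 1e-12))

keys = rng.standard_normal((128, 64)).astype(np.float32)  # [tokens, head_dim]

for bits in (2, 3, 4):
    k_hat = quantize_uniform(keys, bits)
    err_raw = np.linalg.norm(keys - k_hat) / np.linalg.norm(keys)
    err_fix = np.linalg.norm(keys - norm_correct(k_hat, keys)) / np.linalg.norm(keys)
    print(f"{bits}-bit keys: rel. error {err_raw:.4f} -> {err_fix:.4f} with norm correction")
```

The point of the sketch is only that reconstruction error is what the correction moves; whether that shows up on a downstream benchmark is a separate question, as the results above indicate.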

The key finding for you: value quantization precision is the bottleneck, not key reconstruction error. FP8 values + 3-bit keys = lossless quality at 2x KV compression on a hybrid Mamba+MoE+Attention model.
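A toy way to see why value precision can dominate: value error enters the attention output linearly through the weighted sum, while key error is filtered through the softmax. The sketch below (my own illustration, not the benchmark setup) quantizes keys only vs. values only at the same bit width and compares the resulting attention-output error against an unquantized reference.

```python
import numpy as np

rng = np.random.default_rng(1)

def quantize_uniform(x, bits):
    """Symmetric per-token uniform quantization (toy stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max(axis=-1, keepdims=True) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax) * scale

def attention(q, k, v):
    """Single-head scaled dot-product attention with a stable softmax."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

q = rng.standard_normal((32, 64))   # [queries, head_dim]
k = rng.standard_normal((256, 64))  # [cached tokens, head_dim]
v = rng.standard_normal((256, 64))
ref = attention(q, k, v)

for bits in (2, 3, 4):
    err_k = np.linalg.norm(ref - attention(q, quantize_uniform(k, bits), v)) / np.linalg.norm(ref)
    err_v = np.linalg.norm(ref - attention(q, k, quantize_uniform(v, bits))) / np.linalg.norm(ref)
    print(f"{bits}-bit: key-only output error {err_k:.4f}, value-only output error {err_v:.4f}")
```

This is only a mechanism illustration on random tensors; the actual pass/fail numbers above come from the real model and benchmark.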

Thought you would want the cross-implementation signal. Thanks for the norm correction work — it pointed us in the right direction even though the benchmark impact came from a different axis.
