[Bug] Numerical instability (NaN/Infinite Loop) with Qwen 2.5/3.5 models using turbo4 on RTX 4070Ti

Describe the bug
I found a compatibility issue that appears specific to Qwen models when the KV cache is quantized with turbo4 (4.25-bit).
Hardware Environment:
CPU: i9-13900KF
GPU: NVIDIA GeForce RTX 4070Ti (12GB VRAM)
OS: Ubuntu (WSL2)
Reproduction Steps:
1. Run Llama-3.1-8B-Instruct with -c 131072 -ctv turbo4. Result: perfectly stable, fast, and coherent.
2. Run Qwen3.5-9B-Instruct with -c 65536 -ctv turbo4. Result: the model collapses and outputs only "?" characters (apparently NaN).
3. Run Qwen2.5-14B-Instruct with -c 32768 -ctv turbo4. Result: infinite loop, repeating "Thank you/Goodbye" phrases endlessly.
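The three runs above can be scripted so they are easy to repeat side by side. This is a minimal sketch, not the exact commands used: the binary path `./llama-cli`, the GGUF file names, the prompt, and the `-m`, `-p`, and `-n` arguments are assumptions based on a llama.cpp-style CLI. Only `-c`, `-ctv`, and their values come from the steps above.

```python
import shlex

# Hypothetical binary and model paths; adjust to the local build and files.
LLAMA_BIN = "./llama-cli"
RUNS = [
    ("models/Llama-3.1-8B-Instruct.gguf", 131072),
    ("models/Qwen3.5-9B-Instruct.gguf", 65536),
    ("models/Qwen2.5-14B-Instruct.gguf", 32768),
]


def build_command(model_path, ctx_size, v_cache_type="turbo4",
                  prompt="Hello", n_predict=64):
    """Build the argument list for one run (assumed llama.cpp-style flags)."""
    return [
        LLAMA_BIN,
        "-m", model_path,        # model file (assumed flag)
        "-c", str(ctx_size),     # context size, as in the report
        "-ctv", v_cache_type,    # V-cache quantization type, as in the report
        "-p", prompt,            # prompt (assumed flag)
        "-n", str(n_predict),    # tokens to generate (assumed flag)
    ]


if __name__ == "__main__":
    # Print the commands; pass each list to subprocess.run to execute them.
    for model_path, ctx_size in RUNS:
        print(shlex.join(build_command(model_path, ctx_size)))
```

Passing `v_cache_type="turbo2"` to `build_command` gives the comparison run with the other cache type.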
Hypothesis:
The current CUDA kernels for turbo4 may overflow numerically with Qwen's RoPE base (1,000,000) or its attention scaling, whereas Llama 3.1 is unaffected. Switching to turbo2 improves stability for Qwen, but turbo4 is currently unusable for these models on the Ada architecture (SM 8.9).
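To illustrate why an overflow would produce NaN output rather than merely degraded text, here is a minimal sketch in plain Python. It does not reproduce the turbo4 kernel, and whether the kernel overflows this way is unverified. `to_fp16_range` is a hypothetical helper that roughly approximates half-precision overflow by mapping any magnitude beyond 65504 (the largest finite fp16 value) to infinity.

```python
import math

FP16_MAX = 65504.0  # largest finite half-precision value


def to_fp16_range(x):
    """Hypothetical helper: crude model of fp16 overflow (no rounding)."""
    if abs(x) > FP16_MAX:
        return math.copysign(math.inf, x)
    return x


def softmax(scores):
    """Numerically stabilized softmax over a list of floats."""
    m = max(scores)
    # If m is +inf, then inf - inf is NaN for that entry, and the NaN
    # propagates through the sum into every output probability.
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    healthy = softmax([1.0, 2.0, 3.0])
    overflowed = softmax([to_fp16_range(1.0), to_fp16_range(70000.0)])
    print(healthy)
    print(overflowed)  # every entry is NaN
```

A single overflowed attention score is enough to turn the whole row into NaN, which would match the "?" output in the Qwen3.5 run if an overflow is in fact the cause.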
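Screenshots:
(looping output, translated from Chinese)
You're welcome, I'm glad I could help! If you have any new questions or needs in the future, feel free to come back and ask here. Good luck, and I look forward to future discussions!
Goodbye! I hope everything goes well!
If you have any new ideas or need further help, remember to come back and ask. Thank you for your feedback!
Goodbye! I look forward to talking with you again next time.
Thank you very much for your reply and support. I hope we can have more exchanges and help in the future! Goodbye!
Goodbye! If you have any questions or needs in the future, feel free to come back here for support. Good luck!
I look forward to future discussions, and thank you again for your patient explanation and support! Goodbye!
Goodbye! I hope everything goes well, and you are welcome to come back and ask at any time in the future. Thank you!
Goodbye! I look forward to talking with you again next time.
Thank you very much for your reply and support. I hope we can have more exchanges and help in the future! Goodbye!
Goodbye! If you have any questions or needs in the future, feel free to come back here for support. Good luck!
Thank you again for your patient answers and guidance. If you have new ideas or need further help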
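The output above repeats several lines verbatim, which makes the loop easy to flag automatically when comparing cache types. This is a minimal sketch of such a check; the function names and the 0.2 threshold are illustrative assumptions, not part of any existing tool.

```python
def repeated_line_ratio(text):
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    return (len(lines) - len(set(lines))) / len(lines)


def looks_like_loop(text, threshold=0.2):
    """Flag output whose duplicate-line ratio reaches the threshold.

    The threshold is a hypothetical starting point, not a tuned value.
    """
    return repeated_line_ratio(text) >= threshold


if __name__ == "__main__":
    sample = "Goodbye!\nThank you!\nGoodbye!\nThank you!\n"
    print(repeated_line_ratio(sample))
    print(looks_like_loop(sample))
```

Exact-match counting only catches verbatim repeats; paraphrased loops like some of the lines above would need a fuzzier comparison.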
