turbo3/turbo4 cache produces garbled output on NVIDIA Blackwell GPU (RTX 5070 Laptop, compute capability 12.0)

## Environment
- OS: CachyOS Linux (kernel 6.19.10)
- GPU: NVIDIA GeForce RTX 5070 Laptop GPU
- VRAM: 8GB (7707 MiB)
- CUDA Version: 13.2
- Driver: 595.58.03
- Compute Capability: 12.0 (Blackwell)
- Build: 8665 (5364f8a1d) with GNU 15.2.1

## Model
- qwen2.5-coder:7b-instruct-q6_K (GGUF)

## Command
llama-server \
-m qwen2.5-coder-7b-q6_K.gguf \
-ngl 99 -c 32768 -fa on \
--cache-type-k turbo3 \
--cache-type-v turbo3 \
--host 0.0.0.0 --port 8080

## Expected behavior
Coherent text output, matching what the paper reports on Apple Silicon.

## Actual behavior
Garbled, repetitive output. Examples:
- Prompt: "Write a hello world in Python"
- Response: `"Here is a simple simple simple Python program that world world:\n\nprint(\"Hello world\")"`
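
For anyone trying to triage similar reports, the repetition in the response above can be flagged mechanically. The following is only a rough sketch: `repeated_word_runs` is a hypothetical helper written for this issue, not part of llama.cpp, and immediate word repetition is just one symptom of a broken cache, so a clean result does not prove the output is correct.

```python
import re


def repeated_word_runs(text: str, min_run: int = 2) -> list[tuple[str, int]]:
    """Return (word, run_length) for every run of identical adjacent words.

    Hypothetical helper: a crude heuristic for the "simple simple simple"
    style of degeneration shown in this report.
    """
    # Lowercase and split into word-like tokens; punctuation is ignored.
    words = re.findall(r"[A-Za-z0-9_']+", text.lower())
    runs = []
    i = 0
    while i < len(words):
        j = i
        # Extend j while the next word repeats the current one.
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        length = j - i + 1
        if length >= min_run:
            runs.append((words[i], length))
        i = j + 1
    return runs


# The response quoted in this issue, with the escapes expanded.
response = (
    "Here is a simple simple simple Python program that world world:"
    '\n\nprint("Hello world")'
)
print(repeated_word_runs(response))
```

On the quoted response this reports a run of three for "simple" and a run of two for "world".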

Cache type combinations tested:
- turbo3/turbo4 on both K and V: broken output.
- K=turbo3 + V=q8_0: also broken output (only 3 tokens generated).
- K=q8_0 + V=q8_0: works correctly.
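
Only the three combinations listed were tested here. To narrow down whether the K path, the V path, or both are affected, the remaining combinations could be enumerated with something like the sketch below. It only builds the command lines from the flags already shown in the Command section. `CACHE_TYPES` and `build_command` are names made up for this example, and the untested combinations may behave differently from what the reported ones suggest.

```python
from itertools import product

# Cache types mentioned in this report; q8_0 is the known-good baseline.
CACHE_TYPES = ["q8_0", "turbo3", "turbo4"]


def build_command(k_type: str, v_type: str,
                  model: str = "qwen2.5-coder-7b-q6_K.gguf") -> list[str]:
    """Build the llama-server argument list from the Command section,
    varying only the K and V cache types (hypothetical helper)."""
    return [
        "llama-server", "-m", model,
        "-ngl", "99", "-c", "32768", "-fa", "on",
        "--cache-type-k", k_type,
        "--cache-type-v", v_type,
        "--host", "0.0.0.0", "--port", "8080",
    ]


# Every K/V pairing: 3 x 3 = 9 command lines.
matrix = [build_command(k, v) for k, v in product(CACHE_TYPES, repeat=2)]

for cmd in matrix:
    print(" ".join(cmd))
```

Running each printed command in turn with the same prompt would show which single-sided settings (for example K=q8_0 + V=turbo3) reproduce the problem.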

## Notes
This appears to be the first test of TurboQuant CUDA kernels on Blackwell (sm_120).
The CUDA build succeeded without errors (ARCHS=1200).
The likely cause is that the CUDA dequantization kernels have not been validated for sm_120.
