turbo3/turbo4 cache produces garbled output on NVIDIA Blackwell GPU (RTX 5070 Laptop, compute capability 12.0) #56

@arifkurnaz

Description

Environment

  • OS: CachyOS Linux (kernel 6.19.10)
  • GPU: NVIDIA GeForce RTX 5070 Laptop GPU
  • VRAM: 8 GB (7707 MiB)
  • CUDA Version: 13.2
  • Driver: 595.58.03
  • Compute Capability: 12.0 (Blackwell)
  • Build: 8665 (5364f8a) with GNU 15.2.1

Model

  • qwen2.5-coder:7b-instruct-q6_K (GGUF)

Command

llama-server \
  -m qwen2.5-coder-7b-q6_K.gguf \
  -ngl 99 -c 32768 -fa on \
  --cache-type-k turbo3 \
  --cache-type-v turbo3 \
  --host 0.0.0.0 --port 8080

Expected behavior

Coherent text output, comparable to the results the paper reports on Apple Silicon.

Actual behavior

Garbled, repetitive output. Examples:

  • Prompt: "Write a hello world in Python"
  • Response: "Here is a simple simple simple Python program that world world:\n\nprint(\"Hello world\")"
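
To make "garbled, repetitive" checkable across runs, a small detector for runs of identical consecutive words may help. This is only a sketch: the function name `repeated_runs` and the word-splitting rule are my own assumptions, not part of llama.cpp or TurboQuant, and the input string is the response quoted above.

```python
import re

def repeated_runs(text, min_run=2):
    """Return (word, run_length) for every run of identical consecutive words.

    Hypothetical helper for this report; words are lowercased
    alphanumeric sequences, punctuation and whitespace are ignored.
    """
    words = re.findall(r"[A-Za-z0-9_]+", text.lower())
    runs = []
    i = 0
    while i < len(words):
        j = i
        # Extend j while the next word equals the word at the start of the run.
        while j + 1 < len(words) and words[j + 1] == words[i]:
            j += 1
        length = j - i + 1
        if length >= min_run:
            runs.append((words[i], length))
        i = j + 1
    return runs

# The response quoted in this report.
reported = ('Here is a simple simple simple Python program that world world:'
            '\n\nprint("Hello world")')
print(repeated_runs(reported))  # [('simple', 3), ('world', 2)]
```

An empty list does not prove the output is correct, but a non-empty one on a short prompt like this is a cheap signal that the cache configuration is producing the stutter described here.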

Results by cache configuration:

  • K=turbo3/turbo4 + V=turbo3/turbo4: broken output.
  • K=turbo3 + V=q8_0: broken output (only 3 tokens generated).
  • K=q8_0 + V=q8_0: works correctly.
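
Only three K/V combinations were tried above. To narrow down whether the K path, the V path, or both are affected, the full matrix could be enumerated along these lines. This is a sketch under assumptions: the flags and values are copied from the command in this report, `cache_matrix` is a hypothetical helper, and the script only prints commands rather than launching the server.

```python
from itertools import product

# Base command taken from this report; the cache-type flags are appended per run.
BASE = ["llama-server", "-m", "qwen2.5-coder-7b-q6_K.gguf",
        "-ngl", "99", "-c", "32768", "-fa", "on",
        "--host", "0.0.0.0", "--port", "8080"]

# Cache types mentioned in this report.
CACHE_TYPES = ["q8_0", "turbo3", "turbo4"]

def cache_matrix(types=CACHE_TYPES):
    """Yield ((k_type, v_type), argv) for every K/V cache-type pair."""
    for k, v in product(types, repeat=2):
        yield (k, v), BASE + ["--cache-type-k", k, "--cache-type-v", v]

for (k, v), cmd in cache_matrix():
    print(f"K={k:<6} V={v:<6} {' '.join(cmd)}")
```

The untested cell K=q8_0 + V=turbo3 is the most informative one: if it works, the fault is probably confined to the K-side kernels.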

Notes

This appears to be the first test of TurboQuant CUDA kernels on Blackwell (sm_120).
The CUDA build succeeded without errors (ARCHS=1200).
The likely cause is that the CUDA dequantization kernels have not been validated for sm_120.
