Question on Performance Comparison using Different Cache Bit Precision #46

@soumendukrg

Description

Testing the impact of KV cache quantization on the performance of the Llama 2 model shows a decrease in tokens/sec as the cache bit width is reduced, although the expected reduction in cache memory is observed.

Command (run once per --cache_bits value):
python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16

Bits: 4

  • Decode tokens per sec: 13.57
  • Cache memory used: 0.07 GB

Bits: 8

  • Decode tokens per sec: 17.56
  • Cache memory used: 0.13 GB

Bits: 16

  • Decode tokens per sec: 26.09
  • Cache memory used: 0.26 GB

Is this reduction in throughput expected? Is it caused by the extra quantize-dequantize operations performed at each decode step?
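For context on where the extra work comes from: with a quantized cache, each newly generated K/V entry must be quantized on write, and the stored entries must be dequantized back to floating point before every attention matmul. The numpy sketch below (my own illustration, not the repo's actual kernel, and simplified to per-row symmetric int8 quantization) shows the round-trip that a sub-16-bit cache adds to every decode step, alongside the memory saving it buys:

```python
import numpy as np

np.random.seed(0)

def quantize_int8(x):
    # Per-row symmetric quantization: scale maps the row's max |value| to 127.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # This runs on every decode step before the attention matmul --
    # it is extra work that a plain fp16/fp32 cache does not pay.
    return q.astype(np.float32) * scale

# Stand-in for a small slice of a KV cache (4 heads x 64-dim values).
x = np.random.randn(4, 64).astype(np.float32)
q, scale = quantize_int8(x)

# Memory: the int8 cache is 4x smaller than float32 (plus tiny scale overhead),
# matching the roughly-halving memory you see going 16 -> 8 -> 4 bits.
print(x.nbytes, q.nbytes)

# Round-trip error is small, but the dequantize cost recurs per token.
err = np.abs(dequantize_int8(q, scale) - x).max()
print(err)
```

Unless the quantize/dequantize steps are fused into the attention kernel, this round-trip sits on the critical path of decoding, so lower tokens/sec at lower bit widths is consistent with memory-saving (rather than speed-oriented) quantization.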
