Question on Performance Comparison using Different Cache Bit Precision #46

@soumendukrg

Description

Testing the impact of KV cache quantization on the performance of the Llama 2 model shows a decrease in tokens/sec as the cache bit width is reduced, although the expected reduction in cache memory is observed.

Command (run once per --cache_bits value):
python generate.py --cache_strategy full --prompt "What is a cold compress?" --checkpoint_path ./checkpoints/meta-llama/Llama-2-7b-chat-hf/model.pth --device cuda:0 --cache_bits 4/8/16

Bits: 4

  • Decode tokens per sec: 13.57
  • Cache memory used: 0.07 GB

Bits: 8

  • Decode tokens per sec: 17.56
  • Cache memory used: 0.13 GB

Bits: 16

  • Decode tokens per sec: 26.09
  • Cache memory used: 0.26 GB

Is this reduction in throughput expected? Is it caused by the extra quantize-dequantize operations performed at each decode step?
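For context on where the extra work comes from: with a quantized cache, each newly generated K/V entry must be quantized on write, and the stored entries must be dequantized back to floating point before every attention matmul. The numpy sketch below (my own illustration, not the repo's actual kernel, and simplified to per-row symmetric int8 quantization) shows the round-trip that a sub-16-bit cache adds to every decode step, alongside the memory saving it buys:

```python
import numpy as np

np.random.seed(0)

def quantize_int8(x):
    # Per-row symmetric quantization: scale maps the row's max |value| to 127.
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    # This runs on every decode step before the attention matmul --
    # it is extra work that a plain fp16/fp32 cache does not pay.
    return q.astype(np.float32) * scale

# Stand-in for a small slice of a KV cache (4 heads x 64-dim values).
x = np.random.randn(4, 64).astype(np.float32)
q, scale = quantize_int8(x)

# Memory: the int8 cache is 4x smaller than float32 (plus tiny scale overhead),
# matching the roughly-halving memory you see going 16 -> 8 -> 4 bits.
print(x.nbytes, q.nbytes)

# Round-trip error is small, but the dequantize cost recurs per token.
err = np.abs(dequantize_int8(q, scale) - x).max()
print(err)
```

Unless the quantize/dequantize steps are fused into the attention kernel, this round-trip sits on the critical path of decoding, so lower tokens/sec at lower bit widths is consistent with memory-saving (rather than speed-oriented) quantization.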
