
PackKV

This is the official repository for the paper "PackKV: Reducing KV Cache Memory Footprint through LLM-Aware Lossy Compression" (IPDPS 2026). [Paper]

PackKV is a high-performance framework designed to reduce the memory footprint of the KV cache during large language model (LLM) inference. By combining custom CUDA kernels with LLM-aware lossy compression, PackKV reduces memory usage and improves inference throughput.

Installation

Prerequisites

  • Linux
  • NVIDIA GPU with CUDA support (CUDA 13.0 for the RTX Pro 6000 Blackwell Workstation Edition, CUDA 12.8 for 4×A100)
  • Anaconda or Miniconda or Miniforge

Step 1: Set up the Environment

Create a new Conda environment using the provided configuration file:

For the RTX Pro 6000 machine:

conda env create -f environment.yml
conda activate packkv_pub
pip install torch==2.10.0 --index-url https://download.pytorch.org/whl/cu130 # to support RTX Pro 6000
pip install -r requirements.txt
pip install flash-attn==2.8.1 --no-build-isolation # this may take quite a while, since flash-attn is compiled from source

For the 4×A100 machine:

conda env create -f environment.yml
conda activate packkv_pub
pip install torch==2.9.0 --index-url https://download.pytorch.org/whl/cu128
pip install -r requirements.txt
pip install "https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.1/flash_attn-2.8.1+cu12torch2.9cxx11abiTRUE-cp312-cp312-linux_x86_64.whl"

Step 2: Install CUDA Extensions

Compile and install the custom CUDA kernels required for PackKV:

cd packkv_cuda_ext
pip install -e . --no-build-isolation
cd ..

If compilation fails, try editing setup.py so that the nvcc -gencode flag matches your GPU's compute capability:

'nvcc': [
    '-O3',
    # '-gencode=arch=compute_70,code=compute_70',
    # '-gencode=arch=compute_80,code=sm_80', # this works for A100
    # '-gencode=arch=compute_89,code=sm_89',
    '-gencode=arch=compute_120,code=sm_120', # this works for RTX Pro 6000
]
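To pick the right flag for another GPU, query its compute capability (for example with torch.cuda.get_device_capability()) and build the matching architecture number. A minimal Python sketch (the gencode_flag helper is illustrative, not part of PackKV):

```python
# Illustrative helper (not part of PackKV): map a CUDA compute
# capability (major, minor) to the corresponding nvcc -gencode flag.
# Query the local capability with torch.cuda.get_device_capability().

def gencode_flag(major, minor):
    """Return the nvcc -gencode flag for a given compute capability."""
    arch = major * 10 + minor
    return f"-gencode=arch=compute_{arch},code=sm_{arch}"

# A100 has compute capability (8, 0); RTX Pro 6000 Blackwell has (12, 0).
print(gencode_flag(8, 0))   # -gencode=arch=compute_80,code=sm_80
print(gencode_flag(12, 0))  # -gencode=arch=compute_120,code=sm_120
```

The flag printed for your GPU is the one to keep uncommented in the nvcc list above.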

Step 3: Test Run

cd scripts
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=.. python ./rebuttal_throughout.py
cd ..

Project Structure

  • packkv_cuda_ext/: C++ and CUDA source code for the custom PackKV extension.
  • models/: Implementations of the LLMs used in the PackKV experiments.
  • scripts/: Scripts for automation, benchmarking, and generating experimental results.
  • evaluation/: Code related to model evaluation.
  • utils/: Helper functions and utilities.

License

This project is licensed under the MIT License - see the LICENSE file for details.
