LightMem is a high-performance KV cache management library designed for large language model (LLM) inference systems. It provides efficient disk-based caching solutions for key-value pairs, enabling memory-efficient long-context processing with minimal performance overhead.
LightMem serves as a storage optimization layer for LLM inference frameworks, offering:
- Disk-Based KV Cache: Persistent storage of key-value cache with efficient read/write operations
- Asynchronous I/O: Non-blocking cache operations using multi-threaded task queues
- Memory Efficiency: Reduced GPU/CPU memory footprint by offloading KV cache to disk
- Scalability: Support for large-scale inference workloads with configurable storage sharding
| Module | Description |
|---|---|
| Storage | Pluggable storage engine interface with local file system implementation |
| Service | Cache service layer managing read/write operations with task scheduling |
| Task Queue | Asynchronous task processing system with configurable worker threads |
| Core | Cache block management and task state tracking for reliable operations |
- Block-Level Management: KV cache divided into fixed-size blocks for efficient I/O
- Hash-Based Indexing: Fast cache lookup using content-based hashing
- Zero-Copy Design: Direct memory mapping between PyTorch tensors and storage
- Thread-Safe Operations: Concurrent read/write support with fine-grained locking
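As a small illustration of the zero-copy idea (this is not LightMem internals, only the tensor-layout property it relies on): with a contiguous 2D uint8 tensor, each page occupies one contiguous byte range that can be handed to the storage layer without copying.

```python
import torch

# Contiguous [num_pages, page_size] uint8 tensor: page i starts at
# data_ptr() + i * page_size, so its bytes can be read or written in place.
num_pages, page_size = 8, 4096
kv_cache = torch.zeros((num_pages, page_size), dtype=torch.uint8)

page_index = 3
page_view = kv_cache[page_index]  # a view into the same memory, no copy
assert page_view.data_ptr() == kv_cache.data_ptr() + page_index * page_size
```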
- Python 3.10 or higher
- CMake 3.25 or higher
- C++17 compatible compiler
- PyTorch (with CPU support)
- Boost C++ Libraries
- pybind11 (automatically installed via pip dependencies)
Platform Notes:
- Linux: Full support with optimized page cache management via `posix_fadvise`
- macOS: Supported, but without the `posix_fadvise` optimization (not available on macOS)
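On Linux, dropping already-flushed cache-block pages from the OS page cache keeps the resident memory of the inference process small. The snippet below only illustrates that kernel hint using Python's `os.posix_fadvise`; it is not LightMem's implementation, and the advice flag LightMem actually passes is not specified in this document. The file path is a placeholder.

```python
import os

# Illustration only (Linux): hint the kernel that a just-written cache file's
# pages will not be reused soon, so they can be evicted from the page cache.
fd = os.open("./cache_storage.shard0", os.O_RDWR | os.O_CREAT, 0o644)  # placeholder path
os.write(fd, b"\x00" * 4096)
os.fsync(fd)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # length 0 = whole file
os.close(fd)
```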
On Ubuntu/Debian:
```bash
sudo apt-get update
sudo apt-get install cmake build-essential libboost-all-dev
```
On macOS:
```bash
brew install cmake boost
```
Using Conda (cross-platform):
```bash
conda install -c conda-forge cmake cxx-compiler boost libboost-devel
```
Install PyTorch:
```bash
pip install torch
```
Install LightMem from source:
```bash
pip install -v .
```
Or build and install a wheel package:
```bash
# Build wheel package
python -m build --wheel
# Install the built wheel
pip install dist/*.whl
```
`LIGHTMEM_MAX_BLOCK_SIZE_MB` controls the maximum size of each cache block in megabytes (MB).
- Default: `64` (64MB)
- Purpose: Determines the granularity of cache I/O operations. Each cache block is read from or written to disk as a single unit.
- Usage: `export LIGHTMEM_MAX_BLOCK_SIZE_MB=32` (set to 32MB)
- Considerations:
  - Larger blocks (e.g., 128): Reduce overhead, better for sequential access, but may increase latency for small operations
  - Smaller blocks (e.g., 16): More fine-grained control, better for random access, but higher overhead per operation
  - Must be set before starting the cache service (see the Python sketch below)
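Since the value must be in place before the cache service starts, one option from Python is to set it in the process environment before constructing the service. A minimal sketch, assuming the variable is read when `PyLocalCacheService` is constructed:

```python
import os

# Assumption: LIGHTMEM_MAX_BLOCK_SIZE_MB is read when the cache service starts,
# so set it before constructing PyLocalCacheService.
os.environ["LIGHTMEM_MAX_BLOCK_SIZE_MB"] = "32"  # 32MB blocks
```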
Data, Hashes, Pages, and Blocks:
- Each data element is hashed using 128-bit cumulative hashing (xxHash3)
- Each cumulative hash corresponds to one page in the KV cache (one-to-one mapping via `kv_page_indexer`)
- Cumulative hash: Each position contains the hash of all data from the start up to that position (see the sketch below)
- Hashes are automatically grouped into blocks for I/O operations
- Block size = `LIGHTMEM_MAX_BLOCK_SIZE_MB` (default 64MB)
- Pages per block = block_size / page_size
- Example: With 64MB blocks and 16KB pages, each block contains ~4096 pages
- For block operations: The last cumulative hash of each block represents the entire block
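A minimal sketch of cumulative 128-bit hashing using the separate `xxhash` pip package (not part of LightMem; the exact bytes LightMem hashes per position are not specified in this document):

```python
import xxhash

def cumulative_hashes(chunks: list[bytes]) -> list[int]:
    # The hash at position i covers all data from the start up to and including
    # position i, so identical prefixes always produce identical hashes.
    hasher = xxhash.xxh3_128()
    out = []
    for chunk in chunks:
        hasher.update(chunk)
        out.append(hasher.intdigest())  # 128-bit integer for this prefix
    return out

hash_128s = cumulative_hashes([b"page-0-tokens", b"page-1-tokens", b"page-2-tokens"])
```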
Basic usage:
```python
import torch
from light_mem import PyLocalCacheService

# Create a CPU-based KV cache tensor
# Shape: [num_pages, page_size] - must be a 2D uint8 tensor
kv_cache = torch.zeros((1000, 40 * 8192), dtype=torch.float16).view(dtype=torch.uint8)

# Initialize cache service
cache_service = PyLocalCacheService(
    kvcache_tensor=kv_cache,     # KV cache tensor (2D uint8)
    file="./cache_storage",      # Storage directory
    storage_size=10 * 1024**3,   # 10GB storage limit
    num_shard=4,                 # Number of storage shards
    num_worker=8                 # Number of worker threads
)

# Service starts automatically after initialization

# Compute cumulative hashes from data
hash_128s = [hash_1, hash_2, hash_3, hash_4]  # list of 128-bit integers (cumulative hashes)

# Query if caches exist (returns one boolean per block)
exists_list = cache_service.query(hash_128s)

# Create write/read tasks
# Note: hash_128s and kv_page_indexer must have the same length (one-to-one mapping)
task = cache_service.create(
    hash_128s=hash_128s,                                            # List of 128-bit cumulative hash integers
    kv_page_indexer=torch.tensor([0, 1, 2, 3], dtype=torch.int32),  # Page indices (same length as hash_128s)
    mode="w"                                                        # "w" for write, "r" for read
)

# Check task status
if task.ready():
    print("Task completed!")

# Check task state
states = task.state()  # Returns a PyState enum for each block

# Abort a running task
cache_service.abort(task)

# Get pages already cached on disk (write mode only)
cached_pages = task.page_already_list  # Property, not method

# Check if data is safe to modify (for write tasks)
if task.data_safe():
    # Safe to modify the source tensor
    pass
```
LightMem has a layered architecture with a C++ core and Python bindings:
- PyLocalCacheService: Main Python interface for cache operations
- PyTask: Python wrapper for task management
- PyState: Enum for task state tracking (Initial, Working, Finished, Aborted)
- StorageEngine: Abstract interface for pluggable storage backends
- LocalStorageEngine: File-based storage implementation with sharding support
- CacheService: Base class defining cache service interface
- LocalCacheService: Concrete implementation managing local disk cache
- CacheTask: Represents a complete read/write operation
- CacheBlock: Individual block within a task, processed independently
- TaskQueue: Thread pool managing asynchronous task execution
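The actual C++ `StorageEngine` interface is not reproduced in this document; the sketch below is a purely hypothetical Python analogue, included only to illustrate what a pluggable storage backend behind the cache service could look like. All method names here are invented, not LightMem's API.

```python
from abc import ABC, abstractmethod

class HypotheticalStorageBackend(ABC):
    """Invented illustration of a pluggable backend; not LightMem's actual interface."""

    @abstractmethod
    def write_block(self, block_hash: int, data: bytes) -> None:
        """Persist one cache block, keyed by the block's representative hash."""

    @abstractmethod
    def read_block(self, block_hash: int) -> bytes:
        """Load one cache block back from storage."""

    @abstractmethod
    def contains(self, block_hash: int) -> bool:
        """Answer existence queries at block granularity."""
```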
The block size determines the granularity of cache operations:
```bash
# Default: 64MB per block
# Override via environment variable (value in MB)
export LIGHTMEM_MAX_BLOCK_SIZE_MB=128  # Set to 128MB
```
Distribute cache files across multiple shards for better I/O parallelism:
```python
num_shard=8  # Creates 8 separate storage files
```
- Worker Threads: More workers improve I/O parallelism but increase CPU overhead
- Block Size: Larger blocks reduce overhead but may increase latency for small operations
- Storage Sharding: More shards improve concurrent access but increase file descriptor usage
- Memory Alignment: KV cache tensors must be contiguous for optimal performance
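The sketch below only shows where each of these knobs plugs in; the numbers are illustrative placeholders, not recommendations:

```python
import os
import torch

os.environ["LIGHTMEM_MAX_BLOCK_SIZE_MB"] = "128"  # larger blocks for sequential-heavy workloads

from light_mem import PyLocalCacheService

# Contiguous 2D uint8 tensor, as required for optimal performance
kv_cache = torch.zeros((4096, 16384), dtype=torch.uint8)

cache_service = PyLocalCacheService(
    kvcache_tensor=kv_cache,
    file="./cache_storage",
    storage_size=32 * 1024**3,  # 32GB total, spread across the shards
    num_shard=8,                # more shards: better concurrent access, more file descriptors
    num_worker=16,              # more workers: better I/O parallelism, more CPU overhead
)
```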
Constructor:
```python
PyLocalCacheService(
    kvcache_tensor: torch.Tensor,
    file: str,
    storage_size: int = 32 * 1024**3,  # Default: 32GB
    num_shard: int = 32,
    num_worker: int = 16
)
```
Parameters:
- `kvcache_tensor`: 2D uint8 tensor with shape `[num_pages, page_size]`; must be CPU and contiguous
- `file`: Path to the storage directory/file
- `storage_size`: Total storage size in bytes (distributed across shards)
- `num_shard`: Number of storage file shards
- `num_worker`: Number of worker threads
`query(hash_128s: List[int]) -> List[bool]`: Check if caches exist for the given cumulative hashes (see the sketch below)
- Input: List of 128-bit cumulative hash integers
- Returns one boolean per block (not per hash)
- Data is grouped into blocks internally based on block_size / page_size
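For example, with the default 64MB blocks and 16KB pages (4096 pages per block), 8192 cumulative hashes span two blocks, so `query` returns two booleans. A sketch reusing `cache_service` from above; `compute_cumulative_hashes` is a hypothetical helper:

```python
# Assumes 64MB blocks and 16KB pages, i.e. 4096 pages per block.
hash_128s = compute_cumulative_hashes(tokens)  # hypothetical helper yielding 8192 hashes
exists_per_block = cache_service.query(hash_128s)
assert len(exists_per_block) == 2              # one boolean per block, not per hash
```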
`create(hash_128s: List[int], kv_page_indexer: torch.Tensor, mode: str, start_pos: int = 0) -> PyTask`: Create a cache task
- `hash_128s`: List of 128-bit cumulative hash integers
- `kv_page_indexer`: Int32 tensor containing page indices; must have the same length as `hash_128s` (one-to-one mapping)
- `mode`: `"w"` for write, `"r"` for read
- `start_pos`: Optional starting position in the token list (default: 0)
`abort(task: PyTask)`: Cancel a running task

`active_threads(mode: str) -> int`: Get the count of active read/write tasks (`"w"` or `"r"`)
`ready() -> bool`: Check if all blocks are finished

`data_safe() -> bool`: Check if the source data can be safely modified (write: data copied; read: equivalent to `ready()`)

`state() -> List[PyState]`: Get a PyState enum for each block
- `PyState.Initial` (0): Task just created
- `PyState.Working` (1): Task in progress
- `PyState.Finished` (2): Task completed successfully
- `PyState.Aborted` (3): Task aborted (possibly due to error)
`page_already_list -> List[int]`: Get the list of page indices already on disk (write mode: pages found in cache via hash query)
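Putting the task API together, here is a minimal read-path sketch. It reuses `cache_service` and `hash_128s` from the examples above and assumes `PyState` is importable from `light_mem` alongside `PyLocalCacheService`; the polling interval is arbitrary:

```python
import time
import torch
from light_mem import PyState  # assumption: exported alongside PyLocalCacheService

# If every block for these hashes is already on disk, load the pages back
# into the KV cache tensor at the given page indices.
if all(cache_service.query(hash_128s)):
    read_task = cache_service.create(
        hash_128s=hash_128s,
        kv_page_indexer=torch.tensor([0, 1, 2, 3], dtype=torch.int32),
        mode="r",
    )
    while not read_task.ready():
        time.sleep(0.001)  # arbitrary polling interval
    assert all(s == PyState.Finished for s in read_task.state())
```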
Contributions are welcome! Please ensure:
- Code follows C++17 and Python 3.10+ standards
- All tests pass before submitting PRs
- Documentation is updated for new features
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
LightMem is developed as part of the ModelTC ecosystem for efficient LLM inference.