LightMem is a high-performance KV cache management library designed for large language model (LLM) inference systems. It provides efficient disk-based caching solutions for key-value pairs, enabling memory-efficient long-context processing with minimal performance overhead.
LightMem serves as a storage optimization layer for LLM inference frameworks, offering:
- Disk-Based KV Cache: Persistent storage of key-value cache with efficient read/write operations
- Asynchronous I/O: Non-blocking cache operations using multi-threaded task queues
- Memory Efficiency: Reduced GPU/CPU memory footprint by offloading KV cache to disk
- Scalability: Support for large-scale inference workloads with configurable storage sharding
| Module | Description |
|---|---|
| Storage | Pluggable storage engine interface with local file system implementation |
| Service | Cache service layer managing read/write operations with task scheduling |
| Task Queue | Asynchronous task processing system with configurable worker threads |
| Core | Cache block management and task state tracking for reliable operations |
- Block-Level Management: KV cache divided into fixed-size blocks for efficient I/O
- Hash-Based Indexing: Fast cache lookup using content-based hashing
- Zero-Copy Design: Direct memory mapping between PyTorch tensors and storage
- Thread-Safe Operations: Concurrent read/write support with fine-grained locking
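As a small illustration of the zero-copy idea (this is not LightMem internals, only the tensor-layout property it relies on): with a contiguous 2D uint8 tensor, each page occupies one contiguous byte range that can be handed to the storage layer without copying.

```python
import torch

# Contiguous [num_pages, page_size] uint8 tensor: page i starts at
# data_ptr() + i * page_size, so its bytes can be read or written in place.
num_pages, page_size = 8, 4096
kv_cache = torch.zeros((num_pages, page_size), dtype=torch.uint8)

page_index = 3
page_view = kv_cache[page_index]  # a view into the same memory, no copy
assert page_view.data_ptr() == kv_cache.data_ptr() + page_index * page_size
```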
- Python 3.10 or higher
- CMake 3.25 or higher
- C++17 compatible compiler
- PyTorch (with CPU support)
- Boost C++ Libraries
- pybind11 (automatically installed via pip dependencies)
Platform Notes:
- Linux: Full support with optimized page cache management via `posix_fadvise`
- macOS: Supported, but without the `posix_fadvise` optimization (not available on macOS)
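On Linux, dropping already-flushed cache-block pages from the OS page cache keeps the resident memory of the inference process small. The snippet below only illustrates that kernel hint using Python's `os.posix_fadvise`; it is not LightMem's implementation, and the advice flag LightMem actually passes is not specified in this document. The file path is a placeholder.

```python
import os

# Illustration only (Linux): hint the kernel that a just-written cache file's
# pages will not be reused soon, so they can be evicted from the page cache.
fd = os.open("./cache_storage.shard0", os.O_RDWR | os.O_CREAT, 0o644)  # placeholder path
os.write(fd, b"\x00" * 4096)
os.fsync(fd)
os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)  # length 0 = whole file
os.close(fd)
```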
On Ubuntu/Debian:
```bash
sudo apt-get update
sudo apt-get install cmake build-essential libboost-all-dev
```
On macOS:
```bash
brew install cmake boost
```
Using Conda (cross-platform):
```bash
conda install -c conda-forge cmake cxx-compiler boost libboost-devel
```
Install PyTorch:
```bash
pip install torch
```
Install LightMem from source:
```bash
pip install -v .
```
Or build and install a wheel package:
```bash
# Build wheel package
python -m build --wheel
# Install the built wheel
pip install dist/*.whl
```
`LIGHTMEM_MAX_BLOCK_SIZE_MB` controls the maximum size of each cache block in megabytes (MB).
- Default: `64` (64MB)
- Purpose: Determines the granularity of cache I/O operations. Each cache block is read from or written to disk as a single unit.
- Usage: `export LIGHTMEM_MAX_BLOCK_SIZE_MB=32` (set to 32MB)
- Considerations:
  - Larger blocks (e.g., 128): Reduce overhead, better for sequential access, but may increase latency for small operations
  - Smaller blocks (e.g., 16): More fine-grained control, better for random access, but higher overhead per operation
  - Must be set before starting the cache service (see the Python sketch below)
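Since the value must be in place before the cache service starts, one option from Python is to set it in the process environment before constructing the service. A minimal sketch, assuming the variable is read when `PyLocalCacheService` is constructed:

```python
import os

# Assumption: LIGHTMEM_MAX_BLOCK_SIZE_MB is read when the cache service starts,
# so set it before constructing PyLocalCacheService.
os.environ["LIGHTMEM_MAX_BLOCK_SIZE_MB"] = "32"  # 32MB blocks
```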
Data, Hashes, Pages, and Blocks:
- Each data element is hashed using 128-bit cumulative hashing (xxHash3)
- Each cumulative hash corresponds to one page in the KV cache (one-to-one mapping via `kv_page_indexer`)
- Cumulative hash: Each position contains the hash of all data from the start up to that position (see the sketch below)
- Hashes are automatically grouped into blocks for I/O operations
- Block size = `LIGHTMEM_MAX_BLOCK_SIZE_MB` (default 64MB)
- Pages per block = block_size / page_size
- Example: With 64MB blocks and 16KB pages, each block contains ~4096 pages
- For block operations: The last cumulative hash of each block represents the entire block
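A minimal sketch of cumulative 128-bit hashing using the separate `xxhash` pip package (not part of LightMem; the exact bytes LightMem hashes per position are not specified in this document):

```python
import xxhash

def cumulative_hashes(chunks: list[bytes]) -> list[int]:
    # The hash at position i covers all data from the start up to and including
    # position i, so identical prefixes always produce identical hashes.
    hasher = xxhash.xxh3_128()
    out = []
    for chunk in chunks:
        hasher.update(chunk)
        out.append(hasher.intdigest())  # 128-bit integer for this prefix
    return out

hash_128s = cumulative_hashes([b"page-0-tokens", b"page-1-tokens", b"page-2-tokens"])
```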
Basic usage:
```python
import torch
from light_mem import PyLocalCacheService

# Create a CPU-based KV cache tensor
# Shape: [num_pages, page_size] - must be a 2D uint8 tensor
kv_cache = torch.zeros((1000, 40 * 8192), dtype=torch.float16).view(dtype=torch.uint8)

# Initialize cache service
cache_service = PyLocalCacheService(
    kvcache_tensor=kv_cache,     # KV cache tensor (2D uint8)
    file="./cache_storage",      # Storage directory
    storage_size=10 * 1024**3,   # 10GB storage limit
    num_shard=4,                 # Number of storage shards
    num_worker=8                 # Number of worker threads
)

# Service starts automatically after initialization

# Compute cumulative hashes from data
hash_128s = [hash_1, hash_2, hash_3, hash_4]  # list of 128-bit integers (cumulative hashes)

# Query if caches exist (returns one boolean per block)
exists_list = cache_service.query(hash_128s)

# Create write/read tasks
# Note: hash_128s and kv_page_indexer must have the same length (one-to-one mapping)
task = cache_service.create(
    hash_128s=hash_128s,                                            # List of 128-bit cumulative hash integers
    kv_page_indexer=torch.tensor([0, 1, 2, 3], dtype=torch.int32),  # Page indices (same length as hash_128s)
    mode="w"                                                        # "w" for write, "r" for read
)

# Check task status
if task.ready():
    print("Task completed!")

# Check task state
states = task.state()  # Returns a PyState enum for each block

# Abort a running task
cache_service.abort(task)

# Get pages already cached on disk (write mode only)
cached_pages = task.page_already_list  # Property, not method

# Check if data is safe to modify (for write tasks)
if task.data_safe():
    # Safe to modify the source tensor
    pass
```
LightMem has a layered architecture with a C++ core and Python bindings:
- PyLocalCacheService: Main Python interface for cache operations
- PyTask: Python wrapper for task management
- PyState: Enum for task state tracking (Initial, Working, Finished, Aborted)
- StorageEngine: Abstract interface for pluggable storage backends
- LocalStorageEngine: File-based storage implementation with sharding support
- CacheService: Base class defining cache service interface
- LocalCacheService: Concrete implementation managing local disk cache
- CacheTask: Represents a complete read/write operation
- CacheBlock: Individual block within a task, processed independently
- TaskQueue: Thread pool managing asynchronous task execution
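The actual C++ `StorageEngine` interface is not reproduced in this document; the sketch below is a purely hypothetical Python analogue, included only to illustrate what a pluggable storage backend behind the cache service could look like. All method names here are invented, not LightMem's API.

```python
from abc import ABC, abstractmethod

class HypotheticalStorageBackend(ABC):
    """Invented illustration of a pluggable backend; not LightMem's actual interface."""

    @abstractmethod
    def write_block(self, block_hash: int, data: bytes) -> None:
        """Persist one cache block, keyed by the block's representative hash."""

    @abstractmethod
    def read_block(self, block_hash: int) -> bytes:
        """Load one cache block back from storage."""

    @abstractmethod
    def contains(self, block_hash: int) -> bool:
        """Answer existence queries at block granularity."""
```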
The block size determines the granularity of cache operations:
```bash
# Default: 64MB per block
# Override via environment variable (value in MB)
export LIGHTMEM_MAX_BLOCK_SIZE_MB=128  # Set to 128MB
```
Distribute cache files across multiple shards for better I/O parallelism:
```python
num_shard=8  # Creates 8 separate storage files
```
- Worker Threads: More workers improve I/O parallelism but increase CPU overhead
- Block Size: Larger blocks reduce overhead but may increase latency for small operations
- Storage Sharding: More shards improve concurrent access but increase file descriptor usage
- Memory Alignment: KV cache tensors must be contiguous for optimal performance
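The sketch below only shows where each of these knobs plugs in; the numbers are illustrative placeholders, not recommendations:

```python
import os
import torch

os.environ["LIGHTMEM_MAX_BLOCK_SIZE_MB"] = "128"  # larger blocks for sequential-heavy workloads

from light_mem import PyLocalCacheService

# Contiguous 2D uint8 tensor, as required for optimal performance
kv_cache = torch.zeros((4096, 16384), dtype=torch.uint8)

cache_service = PyLocalCacheService(
    kvcache_tensor=kv_cache,
    file="./cache_storage",
    storage_size=32 * 1024**3,  # 32GB total, spread across the shards
    num_shard=8,                # more shards: better concurrent access, more file descriptors
    num_worker=16,              # more workers: better I/O parallelism, more CPU overhead
)
```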
Constructor:
```python
PyLocalCacheService(
    kvcache_tensor: torch.Tensor,
    file: str,
    storage_size: int = 32 * 1024**3,  # Default: 32GB
    num_shard: int = 32,
    num_worker: int = 16
)
```
Parameters:
- `kvcache_tensor`: 2D uint8 tensor with shape `[num_pages, page_size]`; must be CPU and contiguous
- `file`: Path to the storage directory/file
- `storage_size`: Total storage size in bytes (distributed across shards)
- `num_shard`: Number of storage file shards
- `num_worker`: Number of worker threads
`query(hash_128s: List[int]) -> List[bool]`: Check if caches exist for the given cumulative hashes (see the sketch below)
- Input: List of 128-bit cumulative hash integers
- Returns one boolean per block (not per hash)
- Data is grouped into blocks internally based on block_size / page_size
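For example, with the default 64MB blocks and 16KB pages (4096 pages per block), 8192 cumulative hashes span two blocks, so `query` returns two booleans. A sketch reusing `cache_service` from above; `compute_cumulative_hashes` is a hypothetical helper:

```python
# Assumes 64MB blocks and 16KB pages, i.e. 4096 pages per block.
hash_128s = compute_cumulative_hashes(tokens)  # hypothetical helper yielding 8192 hashes
exists_per_block = cache_service.query(hash_128s)
assert len(exists_per_block) == 2              # one boolean per block, not per hash
```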
`create(hash_128s: List[int], kv_page_indexer: torch.Tensor, mode: str, start_pos: int = 0) -> PyTask`: Create a cache task
- `hash_128s`: List of 128-bit cumulative hash integers
- `kv_page_indexer`: Int32 tensor containing page indices; must have the same length as `hash_128s` (one-to-one mapping)
- `mode`: `"w"` for write, `"r"` for read
- `start_pos`: Optional starting position in the token list (default: 0)
`abort(task: PyTask)`: Cancel a running task

`active_threads(mode: str) -> int`: Get the count of active read/write tasks (`"w"` or `"r"`)
`ready() -> bool`: Check if all blocks are finished

`data_safe() -> bool`: Check if the source data can be safely modified (write: data copied; read: equivalent to `ready()`)

`state() -> List[PyState]`: Get a PyState enum for each block
- `PyState.Initial` (0): Task just created
- `PyState.Working` (1): Task in progress
- `PyState.Finished` (2): Task completed successfully
- `PyState.Aborted` (3): Task aborted (possibly due to error)
`page_already_list -> List[int]`: Get the list of page indices already on disk (write mode: pages found in cache via hash query)
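Putting the task API together, here is a minimal read-path sketch. It reuses `cache_service` and `hash_128s` from the examples above and assumes `PyState` is importable from `light_mem` alongside `PyLocalCacheService`; the polling interval is arbitrary:

```python
import time
import torch
from light_mem import PyState  # assumption: exported alongside PyLocalCacheService

# If every block for these hashes is already on disk, load the pages back
# into the KV cache tensor at the given page indices.
if all(cache_service.query(hash_128s)):
    read_task = cache_service.create(
        hash_128s=hash_128s,
        kv_page_indexer=torch.tensor([0, 1, 2, 3], dtype=torch.int32),
        mode="r",
    )
    while not read_task.ready():
        time.sleep(0.001)  # arbitrary polling interval
    assert all(s == PyState.Finished for s in read_task.state())
```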
Contributions are welcome! Please ensure:
- Code follows C++17 and Python 3.10+ standards
- All tests pass before submitting PRs
- Documentation is updated for new features
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
LightMem is developed as part of the ModelTC ecosystem for efficient LLM inference.