LightMem


LightMem is a high-performance KV cache management library designed for large language model (LLM) inference systems. It provides efficient disk-based caching solutions for key-value pairs, enabling memory-efficient long-context processing with minimal performance overhead.

Project Overview

LightMem serves as a storage optimization layer for LLM inference frameworks, offering:

  • Disk-Based KV Cache: Persistent storage of key-value cache with efficient read/write operations
  • Asynchronous I/O: Non-blocking cache operations using multi-threaded task queues
  • Memory Efficiency: Reduced GPU/CPU memory footprint by offloading KV cache to disk
  • Scalability: Support for large-scale inference workloads with configurable storage sharding

Key Features

Core Modules

  • Storage: Pluggable storage engine interface with a local file system implementation
  • Service: Cache service layer managing read/write operations with task scheduling
  • Task Queue: Asynchronous task processing system with configurable worker threads
  • Core: Cache block management and task state tracking for reliable operations

Architecture Highlights

  • Block-Level Management: KV cache divided into fixed-size blocks for efficient I/O
  • Hash-Based Indexing: Fast cache lookup using content-based hashing
  • Zero-Copy Design: Direct memory mapping between PyTorch tensors and storage
  • Thread-Safe Operations: Concurrent read/write support with fine-grained locking

Installation

System Requirements

  • Python 3.10 or higher
  • CMake 3.25 or higher
  • C++17 compatible compiler
  • PyTorch (with CPU support)
  • Boost C++ Libraries
  • pybind11 (automatically installed via pip dependencies)

Platform Notes:

  • Linux: Full support with optimized page cache management via posix_fadvise
  • macOS: Supported, but without the posix_fadvise page cache optimization (the call is unavailable on macOS)

Installation Methods

Install system dependencies

On Ubuntu/Debian:

sudo apt-get update
sudo apt-get install cmake build-essential libboost-all-dev

On macOS:

brew install cmake boost

Using Conda (Cross-platform):

conda install -c conda-forge cmake cxx-compiler boost libboost-devel

Install PyTorch:

pip install torch

Using pip (Recommended)

# From the repository root
pip install -v .

Build and install from source

# Build wheel package (requires the 'build' package: pip install build)
python -m build --wheel

# Install the built wheel
pip install dist/*.whl

Environment Variables

LIGHTMEM_MAX_BLOCK_SIZE_MB

Controls the maximum size of each cache block in megabytes (MB).

  • Default: 64 (64MB)
  • Purpose: Determines the granularity of cache I/O operations. Each cache block is read from or written to disk as a single unit.
  • Usage:
    export LIGHTMEM_MAX_BLOCK_SIZE_MB=32  # Set to 32MB
  • Considerations:
    • Larger blocks (e.g., 128): Reduce overhead, better for sequential access, but may increase latency for small operations
    • Smaller blocks (e.g., 16): More fine-grained control, better for random access, but higher overhead per operation
    • Must be set before starting the cache service
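
Because the variable must be in place before the service starts, it can also be set from Python ahead of construction (a sketch; it assumes setting the variable before importing and constructing the service is sufficient, per the note above):

import os

# Must run before the cache service is created (per the note above).
os.environ["LIGHTMEM_MAX_BLOCK_SIZE_MB"] = "32"

from light_mem import PyLocalCacheService  # import and construct after setting the variable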

Quick Start

Key Concepts

Data, Hashes, Pages, and Blocks:

  • Each data element is hashed using 128-bit cumulative hashing (xxHash3)
  • Each cumulative hash corresponds to one page in the KV cache (one-to-one mapping via kv_page_indexer)
  • Cumulative hash: Each position contains the hash of all data from the start up to that position
  • Hashes are automatically grouped into blocks for I/O operations
  • Block size = LIGHTMEM_MAX_BLOCK_SIZE_MB (default 64MB)
  • Pages per block = block_size / page_size
  • Example: With 64MB blocks and 16KB pages, each full block contains 4096 pages (the final block may hold fewer); see the arithmetic spelled out after this list
  • For block operations: The last cumulative hash of each block represents the entire block
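
The grouping arithmetic from the list above, spelled out (64MB blocks and 16KB pages are the example values):

# Pages per block for the example values above.
block_size_bytes = 64 * 1024 * 1024     # LIGHTMEM_MAX_BLOCK_SIZE_MB = 64
page_size_bytes = 16 * 1024             # 16KB pages
pages_per_block = block_size_bytes // page_size_bytes
print(pages_per_block)  # 4096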

Basic Usage

import torch
from light_mem import PyLocalCacheService

# Create a CPU-based KV cache tensor
# Shape: [num_pages, page_size] - must be a 2D uint8 tensor
# (the uint8 view doubles the second dimension: page_size becomes 40 * 8192 * 2 bytes)
kv_cache = torch.zeros((1000, 40 * 8192), dtype=torch.float16).view(dtype=torch.uint8)

# Initialize cache service
cache_service = PyLocalCacheService(
    kvcache_tensor=kv_cache,     # KV cache tensor (2D uint8)
    file="./cache_storage",      # Storage directory
    storage_size=10 * 1024**3,   # 10GB storage limit
    num_shard=4,                 # Number of storage shards
    num_worker=8                 # Number of worker threads
)

# Service starts automatically after initialization

# Compute cumulative hashes from data (placeholders here; see the hashing sketch after this example)
hash_128s = [hash_1, hash_2, hash_3, hash_4]  # list of 128-bit integers (cumulative hashes)

# Query which blocks are already cached (one boolean per block, not per hash)
exists_list = cache_service.query(hash_128s)

# Create write/read tasks
# Note: hash_128s and kv_page_indexer must have the same length (one-to-one mapping)
task = cache_service.create(
    hash_128s=hash_128s,          # List of 128-bit cumulative hash integers
    kv_page_indexer=torch.tensor([0, 1, 2, 3], dtype=torch.int32),  # Page indices (same length as hash_128s)
    mode="w"                      # "w" for write, "r" for read
)

# Check task status
if task.ready():
    print("Task completed!")

Task Management

# Check task state
states = task.state()  # Returns PyState enum list for each block

# Abort a running task
cache_service.abort(task)

# Get pages already cached on disk (write mode only)
cached_pages = task.page_already_list  # Property, not method

# Check if data is safe to modify (for write tasks)
if task.data_safe():
    # Safe to modify source tensor
    pass
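
Tasks execute asynchronously on the worker threads, so completion is typically polled (a minimal sketch; a real integration would overlap this wait with other work):

import time

# Poll until every block of the task has finished.
while not task.ready():
    time.sleep(0.001)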

Architecture

LightMem has a layered architecture with a C++ core and Python bindings:

Python Layer

  • PyLocalCacheService: Main Python interface for cache operations
  • PyTask: Python wrapper for task management
  • PyState: Enum for task state tracking (Initial, Working, Finished, Aborted)

C++ Core (Internal)

Storage Layer

  • StorageEngine: Abstract interface for pluggable storage backends
  • LocalStorageEngine: File-based storage implementation with sharding support

Service Layer

  • CacheService: Base class defining cache service interface
  • LocalCacheService: Concrete implementation managing local disk cache

Task Processing

  • CacheTask: Represents a complete read/write operation
  • CacheBlock: Individual block within a task, processed independently
  • TaskQueue: Thread pool managing asynchronous task execution

Configuration

Block Size Configuration

The block size determines the granularity of cache operations:

# Default: 64MB per block
# Override via environment variable (value in MB)
export LIGHTMEM_MAX_BLOCK_SIZE_MB=128  # Set to 128MB

Storage Sharding

Distribute cache files across multiple shards for better I/O parallelism:

num_shard=8  # Creates 8 separate storage files
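
The total storage budget is divided across the shard files (a sketch of the split implied by the storage_size parameter below; an equal split is an assumption, and the exact placement policy is internal):

storage_size = 10 * 1024**3                  # total budget in bytes
num_shard = 8
per_shard_bytes = storage_size // num_shard  # roughly 1.25GB per shard file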

Performance Considerations

  • Worker Threads: More workers improve I/O parallelism but increase CPU overhead
  • Block Size: Larger blocks reduce overhead but may increase latency for small operations
  • Storage Sharding: More shards improve concurrent access but increase file descriptor usage
  • Memory Alignment: KV cache tensors must be contiguous for optimal performance
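
The tensor requirements can be checked up front (a sketch of the constraints listed in the API reference below):

import torch

kv_cache = torch.zeros((1000, 16 * 1024), dtype=torch.uint8)

# Constraints from the PyLocalCacheService constructor:
assert kv_cache.dtype == torch.uint8    # uint8 element type
assert kv_cache.dim() == 2              # [num_pages, page_size]
assert kv_cache.device.type == "cpu"    # CPU-resident
assert kv_cache.is_contiguous()         # contiguous for zero-copy mapping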

API Reference

PyLocalCacheService

Constructor

PyLocalCacheService(
    kvcache_tensor: torch.Tensor,
    file: str,
    storage_size: int = 32 * 1024**3,  # Default: 32GB
    num_shard: int = 32,
    num_worker: int = 16
)

Parameters:

  • kvcache_tensor: 2D uint8 tensor with shape [num_pages, page_size], must be CPU and contiguous
  • file: Path to storage directory/file
  • storage_size: Total storage size in bytes (distributed across shards)
  • num_shard: Number of storage file shards
  • num_worker: Number of worker threads

Methods

  • query(hash_128s: List[int]) -> List[bool]: Check if caches exist for given cumulative hashes
    • Input: List of 128-bit cumulative hash integers
    • Returns one boolean per block (not per hash)
    • Data is grouped into blocks internally based on block_size / page_size
  • create(hash_128s: List[int], kv_page_indexer: torch.Tensor, mode: str, start_pos: int = 0) -> PyTask: Create cache task
    • hash_128s: List of 128-bit cumulative hash integers
    • kv_page_indexer: Int32 tensor containing page indices, must have the same length as hash_128s (one-to-one mapping)
    • mode: "w" for write, "r" for read
    • start_pos: Optional starting position in token list (default: 0)
  • abort(task: PyTask): Cancel a running task
  • active_threads(mode: str) -> int: Get count of active read/write tasks ("w" or "r")
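
A sketch of a read path built from these methods (hash_128s reuses the Quick Start placeholders; gating the read on every block being present is one possible policy, not a library requirement):

import torch

# Only issue a read task when every block is already on disk.
if all(cache_service.query(hash_128s)):
    read_task = cache_service.create(
        hash_128s=hash_128s,
        kv_page_indexer=torch.tensor([0, 1, 2, 3], dtype=torch.int32),
        mode="r",
    )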

PyTask

Methods

  • ready() -> bool: Check if all blocks are finished
  • data_safe() -> bool: Check if source data can be safely modified (write: data copied; read: equivalent to ready())
  • state() -> List[PyState]: Get PyState enum for each block
    • PyState.Initial (0): Task just created
    • PyState.Working (1): Task in progress
    • PyState.Finished (2): Task completed successfully
    • PyState.Aborted (3): Task aborted (possibly due to error)

Properties

  • page_already_list -> List[int]: Get list of page indices already on disk (write mode: pages found in cache via hash query)
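
A sketch of per-block error handling with these methods (assumes PyState can be imported from light_mem, as listed under the Python Layer above):

from light_mem import PyState

# Inspect each block's state; an Aborted block may indicate an I/O error.
for block_idx, state in enumerate(task.state()):
    if state == PyState.Aborted:
        print(f"block {block_idx} aborted; consider re-issuing the write")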

Contributing

Contributions are welcome! Please ensure:

  • Code follows C++17 and Python 3.10+ standards
  • All tests pass before submitting PRs
  • Documentation is updated for new features

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Acknowledgments

LightMem is developed as part of the ModelTC ecosystem for efficient LLM inference.
