Overview
Implement memory-mapped file loading for large SafeTensors models.
Motivation
Large models (70B+ parameters) often exceed the VRAM of a single GPU. Memory mapping enables:
- Streaming weights from disk to GPU
- Layer-by-layer loading without holding the full model in RAM
- Fast model switching without full reload
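As a rough illustration of the mechanism, the sketch below memory-maps a file in the SafeTensors layout (8-byte little-endian header length, JSON header, raw tensor bytes) and reads a single tensor's byte range without reading the rest. It uses only the Python standard library. `MappedSafeTensors` and `write_demo_file` are hypothetical names made up for this example and are not part of pygpukit or the safetensors package.

```python
import json
import mmap
import struct
import tempfile
from pathlib import Path


def write_demo_file(path, tensors):
    """Hypothetical helper: write a tiny file in the SafeTensors layout."""
    header, payload = {}, b""
    for name, (dtype, shape, raw) in tensors.items():
        # Offsets are relative to the start of the data section.
        header[name] = {
            "dtype": dtype,
            "shape": shape,
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    Path(path).write_bytes(struct.pack("<Q", len(header_bytes)) + header_bytes + payload)


class MappedSafeTensors:
    """Hypothetical reader: maps the file and parses only the header up front."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._mm = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)
        (header_len,) = struct.unpack("<Q", self._mm[:8])
        self.header = json.loads(self._mm[8:8 + header_len].decode("utf-8"))
        self.header.pop("__metadata__", None)  # optional free-form metadata entry
        self._data_start = 8 + header_len

    def tensor_bytes(self, name):
        # Slicing copies only this tensor's bytes; other tensors are not read.
        begin, end = self.header[name]["data_offsets"]
        return self._mm[self._data_start + begin:self._data_start + end]

    def close(self):
        self._mm.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.safetensors"
    write_demo_file(demo_path, {
        "embed.weight": ("F32", [2, 2], struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)),
        "blocks.0.weight": ("F32", [2], struct.pack("<2f", 5.0, 6.0)),
    })
    mapped = MappedSafeTensors(demo_path)
    values = struct.unpack("<2f", mapped.tensor_bytes("blocks.0.weight"))
    mapped.close()
```

A real loader would hand the byte range to the GPU upload path instead of unpacking it on the CPU; that step is omitted here because it depends on the backend.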
Features
Design
from pygpukit.llm import LazyModel
# With lazy=True, model weights stay on disk until accessed
model = LazyModel.from_safetensors("path/to/model", lazy=True)
# Loads only the embedding layer (input_ids: token IDs prepared by the caller)
embeddings = model.embed_tokens(input_ids)
# Loads layer 0 on demand
hidden = model.blocks[0](embeddings)
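One possible way to get the on-demand behavior of `model.blocks[0]` is a sequence-like container that calls a loader the first time an index is accessed and caches the result. The sketch below is only an assumption about how this could work, not the proposed implementation: `LazyBlocks`, `evict`, and `fake_load` are hypothetical names, and the loader stands in for code that would read a block's weights from the mapped file.

```python
class LazyBlocks:
    """Hypothetical container: loads a block on first access, then caches it."""

    def __init__(self, num_blocks, load_block):
        self._num_blocks = num_blocks
        self._load_block = load_block  # callable: index -> block
        self._cache = {}

    def __len__(self):
        return self._num_blocks

    def __getitem__(self, index):
        if not 0 <= index < self._num_blocks:
            raise IndexError(index)
        if index not in self._cache:
            self._cache[index] = self._load_block(index)
        return self._cache[index]

    def evict(self, index):
        # Drop a cached block so its memory can be reclaimed.
        self._cache.pop(index, None)

    def loaded(self):
        return sorted(self._cache)


load_calls = []


def fake_load(index):
    # Stand-in for reading block weights from disk; records each load.
    load_calls.append(index)
    return lambda x: x + index


blocks = LazyBlocks(4, fake_load)
first = blocks[0](10)    # triggers a load of block 0
again = blocks[0](10)    # served from the cache, no second load
third = blocks[2](1)     # triggers a load of block 2
loaded_before_evict = blocks.loaded()
blocks.evict(0)
loaded_after_evict = blocks.loaded()
```

An explicit `evict` keeps the memory policy in the caller's hands; an LRU bound on the cache would be an alternative if callers should not have to manage it.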
Related
- SafeTensors mmap support
- HuggingFace Accelerate disk offload