# Mini-Infer

A lightweight yet powerful LLM inference engine with PagedAttention. Inspired by vLLM, optimized for learning and performance.
# Features

- ⚡ High Performance: Target 85-90% of vLLM throughput
- 💾 Memory Efficient: PagedAttention-based KV Cache management
- 🔧 Custom Kernels: Triton implementations targeting 5-8x speedups over PyTorch
- 📊 Continuous Batching: Dynamic request scheduling for better throughput
- 🎯 Well-Tested: Comprehensive unit tests (target 85%+ coverage)
# Performance Targets

- RMSNorm Kernel: 5-8x speedup vs PyTorch
- Memory Utilization: 85%+ (PagedAttention)
- End-to-End: 80-90% of vLLM performance

# Installation

```bash
# Clone repository
git clone https://github.com/psmarter/mini-infer.git
cd mini-infer
# Create virtual environment
conda create -n mini-infer python=3.10
conda activate mini-infer
# Install dependencies (coming soon)
pip install -r requirements.txt
```

# Quick Start

```python
from mini_infer import LLMEngine
from mini_infer.config import EngineConfig
# Initialize engine
config = EngineConfig(
    model="meta-llama/Llama-2-7b-hf",
    max_num_seqs=64,
    block_size=16,
)
engine = LLMEngine(config)
# Generate
prompts = ["Hello, how are you?"]
outputs = engine.generate(prompts, max_tokens=100)
print(outputs[0].text)
```

# Performance Comparison

| Component | Baseline | Mini-Infer | Target Speedup |
|---|---|---|---|
| RMSNorm Kernel | PyTorch | Triton | 5-8x |
| RoPE Kernel | PyTorch | Triton | 6-8x |
| Memory Util | 40% | PagedAttention | 85%+ |
| Throughput | Static Batch | Continuous Batch | 2-3x |
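
The kernel speedups above come from fusing normalization into a single GPU pass instead of launching several PyTorch ops. As a rough illustration of the technique (not this project's actual kernel, which is still in progress; `rmsnorm_kernel` and `rmsnorm` are hypothetical names), a minimal Triton RMSNorm might look like:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_kernel(x_ptr, w_ptr, out_ptr, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row; the whole row fits in one block.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * weight, accumulated in fp32.
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(w_ptr + cols, mask=mask, other=0.0).to(tl.float32)
    y = x / rms * w
    tl.store(out_ptr + row * n_cols + cols, y.to(out_ptr.dtype.element_ty), mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # x: (n_rows, n_cols), contiguous; one kernel launch covers all rows.
    out = torch.empty_like(x)
    n_rows, n_cols = x.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_kernel[(n_rows,)](x, weight, out, n_cols, eps, BLOCK_SIZE=BLOCK_SIZE)
    return out
```

The win over eager PyTorch is that each row of `x` is read from global memory once, rather than materializing intermediate tensors for the square, mean, and scale steps.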
# Architecture

```
┌─────────────────────────────────────┐
│ LLM Engine │
├─────────────────────────────────────┤
│ ┌──────────┐ ┌──────────┐ │
│ │Scheduler │───▶│Model │ │
│ │(C-Batch) │ │Runner │ │
│ └──────────┘ └──────────┘ │
├─────────────────────────────────────┤
│ Memory Management │
│ ┌──────────────────────────┐ │
│ │ Block Manager │ │
│ │ (PagedAttention) │ │
│ └──────────────────────────┘ │
├─────────────────────────────────────┤
│ Custom Kernels (Triton) │
│ ┌───────┐ ┌────┐ ┌──────────┐ │
│ │RMSNorm│ │RoPE│ │Attention │ │
│ └───────┘ └────┘ └──────────┘ │
└─────────────────────────────────────┘
```
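
To make the Block Manager box concrete: PagedAttention stores the KV cache in fixed-size physical blocks and gives each sequence a block table mapping its logical token positions to those blocks. Below is a minimal sketch of that bookkeeping, with hypothetical class and method names rather than this project's final API:

```python
class BlockManager:
    """Maps sequences to fixed-size KV-cache blocks, PagedAttention-style."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))    # physical block IDs
        self.block_tables: dict[int, list[int]] = {}  # seq_id -> block IDs
        self.seq_lens: dict[int, int] = {}            # seq_id -> tokens stored

    def can_allocate(self, num_tokens: int) -> bool:
        needed = -(-num_tokens // self.block_size)    # ceil division
        return needed <= len(self.free_blocks)

    def allocate(self, seq_id: int, num_tokens: int) -> None:
        # Reserve just enough blocks for the prompt; no large up-front slab.
        needed = -(-num_tokens // self.block_size)
        self.block_tables[seq_id] = [self.free_blocks.pop() for _ in range(needed)]
        self.seq_lens[seq_id] = num_tokens

    def append_token(self, seq_id: int) -> None:
        # Called once per decoded token; grows the table one block at a time,
        # so a sequence never holds memory it has not used yet.
        self.seq_lens[seq_id] += 1
        if self.seq_lens[seq_id] > len(self.block_tables[seq_id]) * self.block_size:
            self.block_tables[seq_id].append(self.free_blocks.pop())

    def free(self, seq_id: int) -> None:
        # Recycle every block of a finished sequence immediately.
        self.free_blocks.extend(self.block_tables.pop(seq_id))
        del self.seq_lens[seq_id]
```

Because fragmentation is at most one partially filled block per sequence, utilization can approach the 85%+ target instead of the ~40% typical of contiguous preallocation.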
# Roadmap

- Project structure
- Basic documentation
- Development environment setup
- Triton kernels (RMSNorm, RoPE)
- PagedAttention Block Manager
- Continuous Batching Scheduler (see the sketch after this roadmap)
- End-to-end inference engine
- Comprehensive benchmarks
- Unit tests (85%+ coverage)
- Performance optimization
- API documentation
- Usage examples
- Technical blog posts
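
As referenced in the roadmap, a continuous-batching scheduler admits and retires sequences on every step instead of waiting for an entire batch to finish. A simplified sketch of one scheduler iteration follows; the `Sequence` fields, `scheduler_step` signature, and `model.forward` call are illustrative assumptions, not this project's API:

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Sequence:
    seq_id: int
    prompt_tokens: list[int]
    max_tokens: int
    output_tokens: list[int] = field(default_factory=list)

def scheduler_step(waiting: deque, running: list, block_mgr, model) -> list:
    # Admit waiting requests while KV-cache blocks remain (see BlockManager above).
    while waiting and block_mgr.can_allocate(len(waiting[0].prompt_tokens)):
        seq = waiting.popleft()
        block_mgr.allocate(seq.seq_id, len(seq.prompt_tokens))
        running.append(seq)

    # One forward pass over the mixed batch yields one new token per sequence.
    next_tokens = model.forward(running)
    finished = []
    for seq, tok in zip(running, next_tokens):
        seq.output_tokens.append(tok)
        block_mgr.append_token(seq.seq_id)  # may claim one more block
        if tok == model.eos_token_id or len(seq.output_tokens) >= seq.max_tokens:
            finished.append(seq)

    # Finished sequences leave mid-batch and their blocks are reused right away;
    # this is where the 2-3x throughput gain over static batching comes from.
    for seq in finished:
        running.remove(seq)
        block_mgr.free(seq.seq_id)
    return finished
```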
# Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
# License

This project is licensed under the MIT License - see the LICENSE file for details.
# Acknowledgments

This project draws inspiration and techniques from:
- vLLM - PagedAttention and continuous batching
- FlashAttention - Efficient attention mechanisms
- Triton - GPU programming framework
⭐ If you find this project helpful, please star it! ⭐