Add PagedAttention support, basic continuous batching support, a KVCacheManager for KV cache management, and scheduler-driven request scheduling #24
Open
jinTang-coder wants to merge 10 commits into zjhellofss:main from
Implement physical block allocation management for the KV cache.
Implement KV cache management on top of BlockAllocator, supporting paged cache allocation and release.
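The paged allocation described above can be sketched as a per-sequence block table that maps logical token positions to physical cache slots. Everything below (`SimpleKVCacheManager`, `reserve`, `slot`, `release`) is a hypothetical illustration and is not taken from the PR; the free list is inlined so the sketch stays self-contained.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical paged KV cache bookkeeping; names are illustrative only.
class SimpleKVCacheManager {
 public:
  SimpleKVCacheManager(int32_t num_blocks, int32_t block_size)
      : block_size_(block_size) {
    assert(num_blocks > 0 && block_size > 0);
    for (int32_t id = num_blocks - 1; id >= 0; --id) free_blocks_.push_back(id);
  }

  // Grows the sequence's block table so it can hold `num_tokens` tokens.
  // Returns false and allocates nothing if not enough blocks are free.
  bool reserve(int32_t seq_id, int32_t num_tokens) {
    std::vector<int32_t>& table = block_tables_[seq_id];
    const int32_t needed = (num_tokens + block_size_ - 1) / block_size_;
    const int32_t missing = needed - static_cast<int32_t>(table.size());
    if (missing > static_cast<int32_t>(free_blocks_.size())) return false;
    for (int32_t i = 0; i < missing; ++i) {
      table.push_back(free_blocks_.back());
      free_blocks_.pop_back();
    }
    return true;
  }

  // Maps a token position to its slot in the physical cache:
  // slot = physical_block * block_size + offset_in_block.
  int32_t slot(int32_t seq_id, int32_t token_pos) const {
    const std::vector<int32_t>& table = block_tables_.at(seq_id);
    const int32_t logical_block = token_pos / block_size_;
    assert(token_pos >= 0 && logical_block < static_cast<int32_t>(table.size()));
    return table[logical_block] * block_size_ + token_pos % block_size_;
  }

  // Returns all blocks of a finished or preempted sequence to the pool.
  void release(int32_t seq_id) {
    auto it = block_tables_.find(seq_id);
    if (it == block_tables_.end()) return;
    for (int32_t id : it->second) free_blocks_.push_back(id);
    block_tables_.erase(it);
  }

  int32_t num_free_blocks() const {
    return static_cast<int32_t>(free_blocks_.size());
  }

 private:
  int32_t block_size_;
  std::vector<int32_t> free_blocks_;  // stack of free physical block ids
  std::unordered_map<int32_t, std::vector<int32_t>> block_tables_;
};
```

With 4 blocks of 16 tokens each, reserving 20 tokens for one sequence takes two blocks, and token 17 of that sequence lands at offset 1 of its second block. Because blocks are allocated on demand, a sequence never holds more than one partially filled block.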
Manage metadata for batched requests, covering both the Prefill and Decode phases.
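Implement the request scheduling mechanism, including:
- Sequence: sequence state management
- SchedulerConfig: scheduling configuration
- Scheduler: core scheduler logic (Prefill/Decode phase scheduling, preemption and eviction policy)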
- PagedAttention kernel: attention computation over a paged KV cache
- GEMM kernel: optimized matrix multiplication
- RoPE kernel: rotary position embedding with batch support
- Update the kernels_interface interface
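What distinguishes paged attention from ordinary attention is that keys and values are gathered through a block table rather than read from one contiguous buffer. The CPU reference below sketches this for a single head and a single decode query. The function name and the cache layout `cache[(block * block_size + offset) * head_dim + d]` are assumptions made for illustration; the PR's CUDA kernel may use a different layout and signature.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// CPU reference for single-head decode attention over a paged KV cache.
// Assumed (hypothetical) layout: cache[(block * block_size + offset) * head_dim + d].
std::vector<float> paged_attention_ref(const std::vector<float>& q,
                                       const std::vector<float>& k_cache,
                                       const std::vector<float>& v_cache,
                                       const std::vector<int32_t>& block_table,
                                       int32_t context_len, int32_t block_size) {
  const int32_t head_dim = static_cast<int32_t>(q.size());
  assert(context_len > 0);
  assert(context_len <= static_cast<int32_t>(block_table.size()) * block_size);
  const float scale = 1.0f / std::sqrt(static_cast<float>(head_dim));

  // Pass 1: scaled dot-product scores, gathered through the block table.
  std::vector<float> scores(static_cast<size_t>(context_len));
  float max_score = -INFINITY;
  for (int32_t t = 0; t < context_len; ++t) {
    const int32_t slot = block_table[t / block_size] * block_size + t % block_size;
    float dot = 0.0f;
    for (int32_t d = 0; d < head_dim; ++d) dot += q[d] * k_cache[slot * head_dim + d];
    scores[t] = dot * scale;
    max_score = std::max(max_score, scores[t]);
  }

  // Pass 2: numerically stable softmax (subtract the max before exp).
  float sum = 0.0f;
  for (float& s : scores) {
    s = std::exp(s - max_score);
    sum += s;
  }

  // Pass 3: weighted sum of values, again gathered through the block table.
  std::vector<float> out(static_cast<size_t>(head_dim), 0.0f);
  for (int32_t t = 0; t < context_len; ++t) {
    const int32_t slot = block_table[t / block_size] * block_size + t % block_size;
    const float w = scores[t] / sum;
    for (int32_t d = 0; d < head_dim; ++d) out[d] += w * v_cache[slot * head_dim + d];
  }
  return out;
}
```

A reference like this is mainly useful for checking a GPU kernel: filling unused physical blocks with sentinel values shows quickly whether the kernel reads only the slots named by the block table.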
- test_cu_gemm: unit test for the GEMM kernel
- test_cu_rope_batch: unit test for the batched RoPE kernel
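Kernel unit tests of this kind usually compare GPU output against a simple CPU reference. As an illustration, a batched RoPE reference might look like the sketch below, where each row carries its own position. The function name, the adjacent-pair rotation `(x[2i], x[2i+1])`, and the default base of 10000 are assumptions; the PR's kernel may pair dimensions differently (for example, splitting the head into two halves).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for batched RoPE.
// x holds positions.size() rows of head_dim floats; positions[t] is the
// sequence position of row t, so rows from different requests can be mixed.
void rope_batch_ref(std::vector<float>& x, const std::vector<int32_t>& positions,
                    int32_t head_dim, float theta = 10000.0f) {
  assert(head_dim > 0 && head_dim % 2 == 0);
  assert(x.size() == positions.size() * static_cast<size_t>(head_dim));
  for (size_t t = 0; t < positions.size(); ++t) {
    float* row = x.data() + t * static_cast<size_t>(head_dim);
    for (int32_t i = 0; i < head_dim; i += 2) {
      // Frequency for pair i/2: theta^(-i / head_dim).
      const float freq = std::pow(theta, -static_cast<float>(i) / head_dim);
      const float angle = static_cast<float>(positions[t]) * freq;
      const float c = std::cos(angle);
      const float s = std::sin(angle);
      const float x0 = row[i];
      const float x1 = row[i + 1];
      row[i] = x0 * c - x1 * s;      // 2D rotation of the pair by `angle`
      row[i + 1] = x0 * s + x1 * c;
    }
  }
}
```

Two properties make convenient test assertions: a row at position 0 must come back unchanged, and every rotation must preserve the norm of each pair.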
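- Add the Request structure
- Add paged inference interfaces to the Model base class
- model_paged.cpp: implement the paged inference logic
- LLaMA3: adapt to PagedAttention and continuous batching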
- base.h: add related type definitions
- tensor: support the new memory layout
- swiglu: operator optimization and adjustments
- mainpaged.cpp: single-batch paged inference example
- main_continuous.cpp: continuous batching inference example
- main_unified.cpp: unified inference interface example
- Update the demo CMakeLists.txt
- CMakeLists.txt: add the new modules to the build
- .vscode: IDE configuration files
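Add PagedAttention, Scheduler, and KVCacheManager support
mainpaged.cpp: single-batch paged inference example
main_continuous.cpp: continuous batching inference example
main_unified.cpp: unified inference interface example
Update the demo CMakeLists.txt
Add the Request structure
Add paged inference interfaces to the Model base class
LLaMA3: adapt to PagedAttention and continuous batching
PagedAttention kernel: attention computation over a paged KV cache
GEMM kernel: optimized matrix multiplication
Update the kernels_interface interface
Implement the request scheduling mechanism, including:
Sequence: sequence state management
SchedulerConfig: scheduling configuration
Scheduler: core scheduler logic (Prefill/Decode phase scheduling, preemption and eviction policy)
Manage metadata for batched requests, covering both the Prefill and Decode phases
feat: add KVCacheManager, the KV cache manager
Implement KV cache management on top of BlockAllocator, supporting paged cache allocation and release
Implement physical block allocation management for the KV cache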