Open
Labels
enhancement (New feature or request), good first issue (Good for newcomers)
Description
Currently, the prefill stage runs on the NPU. Because the NPU requires a static input shape and the decode-stage model accepts one token at a time, prefill has to push the prompt through that model sequentially, one token per call, which is suboptimal.
A possible solution is to offload prefill to a CPU or GPU model, or alternatively to a fixed-size prefill model on the NPU. Either option would require extending the converter to create a second model for the prefill stage, and updating LLMBase and Pipeline to use it.
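
As a rough illustration of the fixed-size NPU option, the sketch below splits a prompt into chunks of one static length and pads the last chunk, so the number of model calls drops from one per token to one per chunk. It is only a sketch under assumptions: `PrefillChunk`, `plan_prefill_chunks`, `chunk_size` and `pad_id` are hypothetical names and are not part of the converter, LLMBase or Pipeline. It also assumes right padding, with the padded positions masked out of attention and the KV cache.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PrefillChunk:
    """One fixed-shape input for a hypothetical static-shape prefill model."""
    input_ids: List[int]       # always exactly chunk_size entries
    attention_mask: List[int]  # 1 for real tokens, 0 for padding
    position_offset: int       # prompt index of the first token in this chunk


def plan_prefill_chunks(
    token_ids: Sequence[int], chunk_size: int, pad_id: int = 0
) -> List[PrefillChunk]:
    """Split a prompt into right-padded chunks of a single static length."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = []
    for start in range(0, len(token_ids), chunk_size):
        real = list(token_ids[start:start + chunk_size])
        pad = chunk_size - len(real)  # non-zero only for the final chunk
        chunks.append(
            PrefillChunk(
                input_ids=real + [pad_id] * pad,
                attention_mask=[1] * len(real) + [0] * pad,
                position_offset=start,
            )
        )
    return chunks


if __name__ == "__main__":
    prompt = list(range(100, 110))  # stand-in for 10 tokenized prompt ids
    chunks = plan_prefill_chunks(prompt, chunk_size=4)
    print(f"sequential calls: {len(prompt)}, fixed-size calls: {len(chunks)}")
    for chunk in chunks:
        print(chunk)
```

The chunk size is a trade-off: a larger value means fewer calls but more wasted compute on padding for short prompts. The same planning step would not be needed for a CPU or GPU prefill model that accepts dynamic shapes.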