
Offload Prefill to CPU/GPU #23

@ayf7

Description


Currently, the prefill stage runs on the NPU. Because the NPU requires static shapes and the decode stage accepts one token at a time, pushing the prompt through prefill sequentially, token by token, is suboptimal.

A possible solution is to offload prefill to a CPU or GPU model, or alternatively to run it with a fixed chunk size on the NPU. This would require extending the converter to produce a second model for the prefill stage, and updating LLMBase and Pipeline accordingly.
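A minimal sketch of the split this proposes, assuming hypothetical names throughout (`Pipeline`, `PrefillBackend`, `DecodeBackend`, and `KVCache` are illustrative, not the project's actual API): the prefill backend consumes the whole prompt in one batched, dynamic-shape pass and populates the KV cache, while the decode backend keeps the one-token-per-step loop that matches the NPU's static-shape constraint.

```python
# Hypothetical sketch only; class and method names are illustrative,
# not the repository's actual LLMBase/Pipeline interfaces.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # Toy key/value cache: just records which positions have been filled.
    positions: list = field(default_factory=list)


class PrefillBackend:
    """Processes the whole prompt in one batched pass (e.g. on CPU/GPU,
    where dynamic shapes are fine)."""

    def prefill(self, tokens, cache):
        # Fill the cache for every prompt position in a single call.
        cache.positions.extend(range(len(tokens)))
        return tokens[-1]  # last prompt token seeds decoding


class DecodeBackend:
    """Processes exactly one token per step (static-shape NPU constraint)."""

    def step(self, token, cache):
        cache.positions.append(len(cache.positions))
        return token + 1  # stand-in for real next-token sampling


class Pipeline:
    def __init__(self, prefill_backend, decode_backend):
        self.prefill_backend = prefill_backend
        self.decode_backend = decode_backend

    def generate(self, prompt_tokens, max_new_tokens):
        cache = KVCache()
        # One batched prefill pass instead of len(prompt) sequential steps
        # through the one-token decode path.
        token = self.prefill_backend.prefill(prompt_tokens, cache)
        out = []
        for _ in range(max_new_tokens):
            token = self.decode_backend.step(token, cache)
            out.append(token)
        return out


pipe = Pipeline(PrefillBackend(), DecodeBackend())
generated = pipe.generate([10, 11, 12], max_new_tokens=3)
```

The converter would then need to emit two artifacts, one per backend, and `Pipeline` would wire them together as above, with the real decode model replacing the `token + 1` placeholder.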
