
Offload Prefill to CPU/GPU #23

@ayf7

Description


Currently, the prefill stage runs on the NPU. Because the NPU requires static shapes and the decode stage accepts one token at a time, pushing the prompt through prefill sequentially, token by token, is suboptimal.

A possible solution is to offload prefill to a CPU or GPU model, or alternatively to run it with a fixed chunk size on the NPU. This would require extending the converter to produce a second model for the prefill stage, and updating LLMBase and Pipeline accordingly.
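A minimal sketch of the split this proposes, assuming hypothetical names throughout (`Pipeline`, `PrefillBackend`, `DecodeBackend`, and `KVCache` are illustrative, not the project's actual API): the prefill backend consumes the whole prompt in one batched, dynamic-shape pass and populates the KV cache, while the decode backend keeps the one-token-per-step loop that matches the NPU's static-shape constraint.

```python
# Hypothetical sketch only; class and method names are illustrative,
# not the repository's actual LLMBase/Pipeline interfaces.
from dataclasses import dataclass, field


@dataclass
class KVCache:
    # Toy key/value cache: just records which positions have been filled.
    positions: list = field(default_factory=list)


class PrefillBackend:
    """Processes the whole prompt in one batched pass (e.g. on CPU/GPU,
    where dynamic shapes are fine)."""

    def prefill(self, tokens, cache):
        # Fill the cache for every prompt position in a single call.
        cache.positions.extend(range(len(tokens)))
        return tokens[-1]  # last prompt token seeds decoding


class DecodeBackend:
    """Processes exactly one token per step (static-shape NPU constraint)."""

    def step(self, token, cache):
        cache.positions.append(len(cache.positions))
        return token + 1  # stand-in for real next-token sampling


class Pipeline:
    def __init__(self, prefill_backend, decode_backend):
        self.prefill_backend = prefill_backend
        self.decode_backend = decode_backend

    def generate(self, prompt_tokens, max_new_tokens):
        cache = KVCache()
        # One batched prefill pass instead of len(prompt) sequential steps
        # through the one-token decode path.
        token = self.prefill_backend.prefill(prompt_tokens, cache)
        out = []
        for _ in range(max_new_tokens):
            token = self.decode_backend.step(token, cache)
            out.append(token)
        return out


pipe = Pipeline(PrefillBackend(), DecodeBackend())
generated = pipe.generate([10, 11, 12], max_new_tokens=3)
```

The converter would then need to emit two artifacts, one per backend, and `Pipeline` would wire them together as above, with the real decode model replacing the `token + 1` placeholder.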
