The Llama architecture relies on KV caching, and each node has to cache the K and V matrices for every generated sample, which increases memory usage on the device.
By storing the inactive KV caches in RAM instead of VRAM, it is possible to save memory, especially when the number of samples is high.
Implementation idea: an --offload-kv flag.
This could slow down inference, however, since it requires transferring data between CPU and GPU memory at each local processing step.
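A minimal sketch of what such a flag could do, keeping only the active sample's KV cache "on device" and parking the rest in host RAM. All names here (`KVCacheManager`, `activate`) are hypothetical, and the device/host stores are simulated with plain dictionaries; a real implementation would move tensors between VRAM and (ideally pinned) host memory, which is exactly the transfer that could slow inference:

```python
import numpy as np

class KVCacheManager:
    """Hypothetical sketch: swap inactive KV caches out of device memory."""

    def __init__(self):
        self.device = {}  # caches resident in VRAM (simulated)
        self.host = {}    # caches offloaded to RAM (simulated)

    def store(self, sample_id, k, v):
        # Newly written caches start on the device.
        self.device[sample_id] = (k, v)

    def activate(self, sample_id):
        """Bring one sample's cache onto the device, offloading the others."""
        for sid in list(self.device):
            if sid != sample_id:
                # This pop/assign stands in for a VRAM -> RAM transfer.
                self.host[sid] = self.device.pop(sid)
        if sample_id in self.host:
            # And this one for the RAM -> VRAM transfer back.
            self.device[sample_id] = self.host.pop(sample_id)
        return self.device[sample_id]

mgr = KVCacheManager()
for sid in range(3):
    mgr.store(sid, np.zeros((8, 64)), np.zeros((8, 64)))
mgr.activate(1)  # only sample 1 stays on the device; 0 and 2 are offloaded
```

Because every `activate` call implies a round trip through host memory, the per-step transfer cost is what would make this a trade-off between memory savings and throughput.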