
Add possibility to offload inactive KV Caches to RAM #33

@davmacario

Description


With the Llama architecture came KV caching: each node has to cache the K and V matrices for every generated sample, which increases memory usage on the device. By storing the inactive KV caches in RAM instead of VRAM, it is possible to save GPU memory, especially when the number of samples is high.

Implementation idea: add an `--offload-kv` flag to enable this behavior.

This could slow down inference, since it requires transferring data between CPU and GPU memory at each local processing step.
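A minimal sketch of the idea in PyTorch, assuming one KV cache per sample: only the active sample's cache lives on the GPU, and switching samples moves the old cache to host RAM with `Tensor.to("cpu")`. The class and method names here are hypothetical, not from this repository.

```python
import torch


class KVCacheManager:
    """Hypothetical sketch: keep the active sample's KV cache on the GPU
    and offload all other samples' caches to host RAM."""

    def __init__(self, gpu_device=None):
        # Fall back to CPU if no GPU is available, so the sketch still runs.
        self.gpu = gpu_device or ("cuda" if torch.cuda.is_available() else "cpu")
        self.caches = {}  # sample_id -> (K, V) tensor pair
        self.active = None

    def put(self, sample_id, k, v):
        # Newly produced caches start on the GPU (they were just computed there).
        self.caches[sample_id] = (k.to(self.gpu), v.to(self.gpu))

    def activate(self, sample_id):
        # Offload the previously active cache to RAM, then bring the
        # requested one back onto the GPU for the next forward pass.
        if self.active is not None and self.active != sample_id:
            k, v = self.caches[self.active]
            self.caches[self.active] = (k.to("cpu"), v.to("cpu"))
        k, v = self.caches[sample_id]
        self.caches[sample_id] = (k.to(self.gpu), v.to(self.gpu))
        self.active = sample_id
        return self.caches[sample_id]


if __name__ == "__main__":
    mgr = KVCacheManager()
    mgr.put(0, torch.zeros(2, 4), torch.zeros(2, 4))
    mgr.put(1, torch.zeros(2, 4), torch.zeros(2, 4))
    mgr.activate(0)
    mgr.activate(1)  # sample 0's cache is now offloaded to RAM
    print(mgr.caches[0][0].device.type)
```

To partially hide the transfer cost mentioned above, the CPU-side tensors could be kept in pinned memory and moved with `non_blocking=True`.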

Metadata

Labels

enhancement: New feature or request
extras: Not directly related to the thesis, low priority
idea: A new idea, may or may not bring improvements
