The Llama architecture relies on KV caching, and each node has to cache the K and V matrices for every generated sample, which increases memory usage on the device.
By storing the inactive KV caches in RAM instead of VRAM, it is possible to save memory, especially when the number of samples is high.
Implementation idea: an --offload-kv flag.
This could slow down inference, however, since it requires transferring data between CPU and GPU memory at each local processing step.
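A minimal sketch of what such a flag could do, keeping only the active sample's KV cache "on device" and parking the rest in host RAM. All names here (`KVCacheManager`, `activate`) are hypothetical, and the device/host stores are simulated with plain dictionaries; a real implementation would move tensors between VRAM and (ideally pinned) host memory, which is exactly the transfer that could slow inference:

```python
import numpy as np

class KVCacheManager:
    """Hypothetical sketch: swap inactive KV caches out of device memory."""

    def __init__(self):
        self.device = {}  # caches resident in VRAM (simulated)
        self.host = {}    # caches offloaded to RAM (simulated)

    def store(self, sample_id, k, v):
        # Newly written caches start on the device.
        self.device[sample_id] = (k, v)

    def activate(self, sample_id):
        """Bring one sample's cache onto the device, offloading the others."""
        for sid in list(self.device):
            if sid != sample_id:
                # This pop/assign stands in for a VRAM -> RAM transfer.
                self.host[sid] = self.device.pop(sid)
        if sample_id in self.host:
            # And this one for the RAM -> VRAM transfer back.
            self.device[sample_id] = self.host.pop(sample_id)
        return self.device[sample_id]

mgr = KVCacheManager()
for sid in range(3):
    mgr.store(sid, np.zeros((8, 64)), np.zeros((8, 64)))
mgr.activate(1)  # only sample 1 stays on the device; 0 and 2 are offloaded
```

Because every `activate` call implies a round trip through host memory, the per-step transfer cost is what would make this a trade-off between memory savings and throughput.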