# config.yml
models:
  - name: "llama3.1_8b_instruct_q5km"
    display_name: "LLaMA 3.1 8B Instruct Q5_K_M"
    file_name: "Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf"
    model_context_size: 128000 # Currently informational only.
    kv_cache_size: 16000 # Size of the llama.cpp KV cache.
    kv_cache_quants: "q8_0" # KV cache quantization, e.g. "q8_0", "q4_0" or "f16"; requires flash attention.
    flash_attention: true # Not supported by all models.
    mlock: true # Lock the model in RAM so it is not swapped out.
    server_slots: 1 # Number of requests processed in parallel. Note: each slot gets kv_cache_size / server_slots of the cache.
    seed: 42 # Seed for random number generation.
    n_gpu_layers: 200 # Number of layers to offload to the GPU. Try to offload all of them, e.g. 33 for Llama 3.1 8B or 82 for Llama 3.1 70B. A large value such as 200 ensures all layers are offloaded for (almost) all models.
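
# Illustrative sketch only, not part of the original configuration: a second,
# hypothetical entry that shows how server_slots divides the KV cache. With
# kv_cache_size: 16000 and server_slots: 2, each slot would get
# 16000 / 2 = 8000. The entry name and display name are made up for this
# example; the model file is the same one used above. To try it, remove the
# leading "# " from each line so the entry sits under `models:`.
#
#   - name: "llama3.1_8b_instruct_q5km_2slots" # hypothetical name
#     display_name: "LLaMA 3.1 8B Instruct Q5_K_M (2 slots)" # hypothetical label
#     file_name: "Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf"
#     model_context_size: 128000
#     kv_cache_size: 16000
#     kv_cache_quants: "q8_0"
#     flash_attention: true
#     mlock: true
#     server_slots: 2 # each slot: 16000 / 2 = 8000
#     seed: 42
#     n_gpu_layers: 200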