@taronaeo taronaeo commented Nov 30, 2025

Fixes #17601

Introduces the `LLAMA_LOG_FILE` environment variable, which directs llama.cpp log output to the specified file, as requested in the issue above.

Tests

$ LLAMA_LOG_FILE="test.log" build/bin/llama-cli -hf ggml-org/gpt-oss-20b-GGUF -no-cnv -p "Write me a dog walking business idea 1. " -n 0
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
* Host huggingface.co:443 was resolved.
* IPv6: (none)
* IPv4: 13.35.202.34, 13.35.202.97, 13.35.202.40, 13.35.202.121
*   Trying 13.35.202.34:443...
* Connected to huggingface.co (13.35.202.34) port 443
* ALPN: curl offers h2,http/1.1
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256 / [blank] / UNDEF
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=huggingface.co
*  start date: Apr 13 00:00:00 2025 GMT
*  expire date: May 12 23:59:59 2026 GMT
*  subjectAltName: host "huggingface.co" matched cert's "huggingface.co"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://huggingface.co/v2/ggml-org/gpt-oss-20b-GGUF/manifests/latest
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: huggingface.co]
* [HTTP/2] [1] [:path: /v2/ggml-org/gpt-oss-20b-GGUF/manifests/latest]
* [HTTP/2] [1] [user-agent: llama-cpp]
* [HTTP/2] [1] [accept: application/json]
> GET /v2/ggml-org/gpt-oss-20b-GGUF/manifests/latest HTTP/2
Host: huggingface.co
User-Agent: llama-cpp
Accept: application/json

* Request completely sent off
< HTTP/2 200 
< content-type: application/json; charset=utf-8
< content-length: 982
< date: Sun, 30 Nov 2025 03:13:27 GMT
< etag: W/"3d6-fo+LSZBH+fXLeln9FK2QfwBO31Y"
< x-powered-by: huggingface-moon
< x-request-id: Root=1-692bb657-291d00093b2f577043d84f8d
< ratelimit: "pages";r=95;t=21
< ratelimit-policy: "fixed window";"pages";q=100;w=300
< cross-origin-opener-policy: same-origin
< referrer-policy: strict-origin-when-cross-origin
< access-control-max-age: 86400
< access-control-allow-origin: https://huggingface.co
< vary: Origin
< access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
< x-cache: Miss from cloudfront
< via: 1.1 1c4964336b4fc412a86181b6d86b042e.cloudfront.net (CloudFront)
< x-amz-cf-pop: SIN2-P7
< x-amz-cf-id: uzGxvfZsEokh4NbKEpLPxB9C44A5InxeKvTmL7Ag8LNk_l-gi84nZg==
< 
* Connection #0 to host huggingface.co left intact
common_download_file_single_online: using cached file: /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
build: 7205 (fa0465954) with Homebrew clang version 19.1.7 for arm64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M1 Pro) (unknown id) - 21844 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 459 tensors from /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gpt Oss 20b
llama_model_loader: - kv   3:                           general.basename str              = gpt-oss
llama_model_loader: - kv   4:                         general.size_label str              = 20B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   7:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   8:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   9:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  10:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  11:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  12:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  14:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  16:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  17:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  18:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  19:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  20:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  21:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  22:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  23: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 38
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q8_0:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = MXFP4 MoE
print_info: file size   = 11.27 GiB (4.63 BPW) 
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_embd_inp       = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 20B
print_info: model params     = 20.91 B
print_info: general.name     = Gpt Oss 20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 200007 '<|end|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
load_tensors: Metal_Mapped model buffer size = 11536.18 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:      Metal KV buffer size =    96.00 MiB
llama_kv_cache: size =   96.00 MiB (  4096 cells,  12 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 768 cells
llama_kv_cache:      Metal KV buffer size =    18.00 MiB
llama_kv_cache: size =   18.00 MiB (   768 cells,  12 layers,  1/1 seqs), K (f16):    9.00 MiB, V (f16):    9.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      Metal compute buffer size =   404.00 MiB
llama_context:        CPU compute buffer size =    15.15 MiB
llama_context: graph nodes  = 1352
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 10 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 2399610517
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 0, n_keep = 0



common_perf_print:    sampling time =       0.00 ms
common_perf_print:    samplers time =       0.00 ms /     0 tokens
common_perf_print:        load time =     913.96 ms
common_perf_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
common_perf_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
common_perf_print:       total time =       0.46 ms /     2 tokens
common_perf_print: unaccounted time =       0.46 ms / 100.0 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB]   | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Metal (Apple M1 Pro) | 21845 = 9790 + (12054 = 11536 +     114 +     404) +           0 |
llama_memory_breakdown_print: |   - Host                 |                   601 =   586 +       0 +      15                |
ggml_metal_free: deallocating
$ cat test.log
common_download_file_single_online: using cached file: /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
build: 7205 (fa0465954) with Homebrew clang version 19.1.7 for arm64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M1 Pro) (unknown id) - 21844 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 459 tensors from /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gpt Oss 20b
llama_model_loader: - kv   3:                           general.basename str              = gpt-oss
llama_model_loader: - kv   4:                         general.size_label str              = 20B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   7:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   8:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   9:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  10:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  11:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  12:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  14:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  16:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  17:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  18:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  19:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  20:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  21:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  22:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  23: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 38
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q8_0:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = MXFP4 MoE
print_info: file size   = 11.27 GiB (4.63 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 200018 '<|endofprompt|>' is not marked as EOG
load: control token: 200013 '<|reserved_200013|>' is not marked as EOG
load: control token: 200009 '<|reserved_200009|>' is not marked as EOG
load: control token: 200008 '<|message|>' is not marked as EOG
load: control token: 200003 '<|constrain|>' is not marked as EOG
load: control token: 200001 '<|reserved_200001|>' is not marked as EOG
load: control token: 200000 '<|reserved_200000|>' is not marked as EOG
load: control token: 200005 '<|channel|>' is not marked as EOG
load: control token: 200010 '<|reserved_200010|>' is not marked as EOG
load: control token: 199998 '<|startoftext|>' is not marked as EOG
load: control token: 200006 '<|start|>' is not marked as EOG
load: control token: 200017 '<|reserved_200017|>' is not marked as EOG
load: control token: 200016 '<|reserved_200016|>' is not marked as EOG
load: control token: 200004 '<|reserved_200004|>' is not marked as EOG
load: control token: 200014 '<|reserved_200014|>' is not marked as EOG
load: control token: 200015 '<|reserved_200015|>' is not marked as EOG
load: control token: 200011 '<|reserved_200011|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_embd_inp       = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 20B
print_info: model params     = 20.91 B
print_info: general.name     = Gpt Oss 20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 200007 '<|end|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal, is_swa = 1
load_tensors: layer   1 assigned to device Metal, is_swa = 0
load_tensors: layer   2 assigned to device Metal, is_swa = 1
load_tensors: layer   3 assigned to device Metal, is_swa = 0
load_tensors: layer   4 assigned to device Metal, is_swa = 1
load_tensors: layer   5 assigned to device Metal, is_swa = 0
load_tensors: layer   6 assigned to device Metal, is_swa = 1
load_tensors: layer   7 assigned to device Metal, is_swa = 0
load_tensors: layer   8 assigned to device Metal, is_swa = 1
load_tensors: layer   9 assigned to device Metal, is_swa = 0
load_tensors: layer  10 assigned to device Metal, is_swa = 1
load_tensors: layer  11 assigned to device Metal, is_swa = 0
load_tensors: layer  12 assigned to device Metal, is_swa = 1
load_tensors: layer  13 assigned to device Metal, is_swa = 0
load_tensors: layer  14 assigned to device Metal, is_swa = 1
load_tensors: layer  15 assigned to device Metal, is_swa = 0
load_tensors: layer  16 assigned to device Metal, is_swa = 1
load_tensors: layer  17 assigned to device Metal, is_swa = 0
load_tensors: layer  18 assigned to device Metal, is_swa = 1
load_tensors: layer  19 assigned to device Metal, is_swa = 0
load_tensors: layer  20 assigned to device Metal, is_swa = 1
load_tensors: layer  21 assigned to device Metal, is_swa = 0
load_tensors: layer  22 assigned to device Metal, is_swa = 1
load_tensors: layer  23 assigned to device Metal, is_swa = 0
load_tensors: layer  24 assigned to device Metal, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_sinks.weight
create_tensor: loading tensor blk.0.ffn_gate_inp.weight
create_tensor: loading tensor blk.0.ffn_gate_exps.weight
create_tensor: loading tensor blk.0.ffn_down_exps.weight
create_tensor: loading tensor blk.0.ffn_up_exps.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.attn_output.bias
create_tensor: loading tensor blk.0.ffn_gate_inp.bias
create_tensor: loading tensor blk.0.ffn_gate_exps.bias
create_tensor: loading tensor blk.0.ffn_down_exps.bias
create_tensor: loading tensor blk.0.ffn_up_exps.bias
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_sinks.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.attn_output.bias
create_tensor: loading tensor blk.1.ffn_gate_inp.bias
create_tensor: loading tensor blk.1.ffn_gate_exps.bias
create_tensor: loading tensor blk.1.ffn_down_exps.bias
create_tensor: loading tensor blk.1.ffn_up_exps.bias
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.post_attention_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_sinks.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.2.attn_q.bias
create_tensor: loading tensor blk.2.attn_k.bias
create_tensor: loading tensor blk.2.attn_v.bias
create_tensor: loading tensor blk.2.attn_output.bias
create_tensor: loading tensor blk.2.ffn_gate_inp.bias
create_tensor: loading tensor blk.2.ffn_gate_exps.bias
create_tensor: loading tensor blk.2.ffn_down_exps.bias
create_tensor: loading tensor blk.2.ffn_up_exps.bias
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.post_attention_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_sinks.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.3.attn_q.bias
create_tensor: loading tensor blk.3.attn_k.bias
create_tensor: loading tensor blk.3.attn_v.bias
create_tensor: loading tensor blk.3.attn_output.bias
create_tensor: loading tensor blk.3.ffn_gate_inp.bias
create_tensor: loading tensor blk.3.ffn_gate_exps.bias
create_tensor: loading tensor blk.3.ffn_down_exps.bias
create_tensor: loading tensor blk.3.ffn_up_exps.bias
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.post_attention_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_sinks.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.4.attn_q.bias
create_tensor: loading tensor blk.4.attn_k.bias
create_tensor: loading tensor blk.4.attn_v.bias
create_tensor: loading tensor blk.4.attn_output.bias
create_tensor: loading tensor blk.4.ffn_gate_inp.bias
create_tensor: loading tensor blk.4.ffn_gate_exps.bias
create_tensor: loading tensor blk.4.ffn_down_exps.bias
create_tensor: loading tensor blk.4.ffn_up_exps.bias
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.post_attention_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_sinks.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.5.attn_q.bias
create_tensor: loading tensor blk.5.attn_k.bias
create_tensor: loading tensor blk.5.attn_v.bias
create_tensor: loading tensor blk.5.attn_output.bias
create_tensor: loading tensor blk.5.ffn_gate_inp.bias
create_tensor: loading tensor blk.5.ffn_gate_exps.bias
create_tensor: loading tensor blk.5.ffn_down_exps.bias
create_tensor: loading tensor blk.5.ffn_up_exps.bias
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.post_attention_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_sinks.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.6.attn_q.bias
create_tensor: loading tensor blk.6.attn_k.bias
create_tensor: loading tensor blk.6.attn_v.bias
create_tensor: loading tensor blk.6.attn_output.bias
create_tensor: loading tensor blk.6.ffn_gate_inp.bias
create_tensor: loading tensor blk.6.ffn_gate_exps.bias
create_tensor: loading tensor blk.6.ffn_down_exps.bias
create_tensor: loading tensor blk.6.ffn_up_exps.bias
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.post_attention_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_sinks.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.7.attn_q.bias
create_tensor: loading tensor blk.7.attn_k.bias
create_tensor: loading tensor blk.7.attn_v.bias
create_tensor: loading tensor blk.7.attn_output.bias
create_tensor: loading tensor blk.7.ffn_gate_inp.bias
create_tensor: loading tensor blk.7.ffn_gate_exps.bias
create_tensor: loading tensor blk.7.ffn_down_exps.bias
create_tensor: loading tensor blk.7.ffn_up_exps.bias
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.post_attention_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_sinks.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.8.attn_q.bias
create_tensor: loading tensor blk.8.attn_k.bias
create_tensor: loading tensor blk.8.attn_v.bias
create_tensor: loading tensor blk.8.attn_output.bias
create_tensor: loading tensor blk.8.ffn_gate_inp.bias
create_tensor: loading tensor blk.8.ffn_gate_exps.bias
create_tensor: loading tensor blk.8.ffn_down_exps.bias
create_tensor: loading tensor blk.8.ffn_up_exps.bias
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.post_attention_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_sinks.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.9.attn_q.bias
create_tensor: loading tensor blk.9.attn_k.bias
create_tensor: loading tensor blk.9.attn_v.bias
create_tensor: loading tensor blk.9.attn_output.bias
create_tensor: loading tensor blk.9.ffn_gate_inp.bias
create_tensor: loading tensor blk.9.ffn_gate_exps.bias
create_tensor: loading tensor blk.9.ffn_down_exps.bias
create_tensor: loading tensor blk.9.ffn_up_exps.bias
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.post_attention_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_sinks.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.10.attn_q.bias
create_tensor: loading tensor blk.10.attn_k.bias
create_tensor: loading tensor blk.10.attn_v.bias
create_tensor: loading tensor blk.10.attn_output.bias
create_tensor: loading tensor blk.10.ffn_gate_inp.bias
create_tensor: loading tensor blk.10.ffn_gate_exps.bias
create_tensor: loading tensor blk.10.ffn_down_exps.bias
create_tensor: loading tensor blk.10.ffn_up_exps.bias
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.post_attention_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_sinks.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.11.attn_q.bias
create_tensor: loading tensor blk.11.attn_k.bias
create_tensor: loading tensor blk.11.attn_v.bias
create_tensor: loading tensor blk.11.attn_output.bias
create_tensor: loading tensor blk.11.ffn_gate_inp.bias
create_tensor: loading tensor blk.11.ffn_gate_exps.bias
create_tensor: loading tensor blk.11.ffn_down_exps.bias
create_tensor: loading tensor blk.11.ffn_up_exps.bias
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.post_attention_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_sinks.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.12.attn_q.bias
create_tensor: loading tensor blk.12.attn_k.bias
create_tensor: loading tensor blk.12.attn_v.bias
create_tensor: loading tensor blk.12.attn_output.bias
create_tensor: loading tensor blk.12.ffn_gate_inp.bias
create_tensor: loading tensor blk.12.ffn_gate_exps.bias
create_tensor: loading tensor blk.12.ffn_down_exps.bias
create_tensor: loading tensor blk.12.ffn_up_exps.bias
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.post_attention_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_sinks.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.13.attn_q.bias
create_tensor: loading tensor blk.13.attn_k.bias
create_tensor: loading tensor blk.13.attn_v.bias
create_tensor: loading tensor blk.13.attn_output.bias
create_tensor: loading tensor blk.13.ffn_gate_inp.bias
create_tensor: loading tensor blk.13.ffn_gate_exps.bias
create_tensor: loading tensor blk.13.ffn_down_exps.bias
create_tensor: loading tensor blk.13.ffn_up_exps.bias
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.post_attention_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_sinks.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.14.attn_q.bias
create_tensor: loading tensor blk.14.attn_k.bias
create_tensor: loading tensor blk.14.attn_v.bias
create_tensor: loading tensor blk.14.attn_output.bias
create_tensor: loading tensor blk.14.ffn_gate_inp.bias
create_tensor: loading tensor blk.14.ffn_gate_exps.bias
create_tensor: loading tensor blk.14.ffn_down_exps.bias
create_tensor: loading tensor blk.14.ffn_up_exps.bias
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.post_attention_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_sinks.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.15.attn_q.bias
create_tensor: loading tensor blk.15.attn_k.bias
create_tensor: loading tensor blk.15.attn_v.bias
create_tensor: loading tensor blk.15.attn_output.bias
create_tensor: loading tensor blk.15.ffn_gate_inp.bias
create_tensor: loading tensor blk.15.ffn_gate_exps.bias
create_tensor: loading tensor blk.15.ffn_down_exps.bias
create_tensor: loading tensor blk.15.ffn_up_exps.bias
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.post_attention_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_sinks.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.16.attn_q.bias
create_tensor: loading tensor blk.16.attn_k.bias
create_tensor: loading tensor blk.16.attn_v.bias
create_tensor: loading tensor blk.16.attn_output.bias
create_tensor: loading tensor blk.16.ffn_gate_inp.bias
create_tensor: loading tensor blk.16.ffn_gate_exps.bias
create_tensor: loading tensor blk.16.ffn_down_exps.bias
create_tensor: loading tensor blk.16.ffn_up_exps.bias
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.post_attention_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_sinks.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.17.attn_q.bias
create_tensor: loading tensor blk.17.attn_k.bias
create_tensor: loading tensor blk.17.attn_v.bias
create_tensor: loading tensor blk.17.attn_output.bias
create_tensor: loading tensor blk.17.ffn_gate_inp.bias
create_tensor: loading tensor blk.17.ffn_gate_exps.bias
create_tensor: loading tensor blk.17.ffn_down_exps.bias
create_tensor: loading tensor blk.17.ffn_up_exps.bias
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.post_attention_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_sinks.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.18.attn_q.bias
create_tensor: loading tensor blk.18.attn_k.bias
create_tensor: loading tensor blk.18.attn_v.bias
create_tensor: loading tensor blk.18.attn_output.bias
create_tensor: loading tensor blk.18.ffn_gate_inp.bias
create_tensor: loading tensor blk.18.ffn_gate_exps.bias
create_tensor: loading tensor blk.18.ffn_down_exps.bias
create_tensor: loading tensor blk.18.ffn_up_exps.bias
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.post_attention_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_sinks.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.19.attn_q.bias
create_tensor: loading tensor blk.19.attn_k.bias
create_tensor: loading tensor blk.19.attn_v.bias
create_tensor: loading tensor blk.19.attn_output.bias
create_tensor: loading tensor blk.19.ffn_gate_inp.bias
create_tensor: loading tensor blk.19.ffn_gate_exps.bias
create_tensor: loading tensor blk.19.ffn_down_exps.bias
create_tensor: loading tensor blk.19.ffn_up_exps.bias
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.post_attention_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_sinks.weight
create_tensor: loading tensor blk.20.ffn_gate_inp.weight
create_tensor: loading tensor blk.20.ffn_gate_exps.weight
create_tensor: loading tensor blk.20.ffn_down_exps.weight
create_tensor: loading tensor blk.20.ffn_up_exps.weight
create_tensor: loading tensor blk.20.attn_q.bias
create_tensor: loading tensor blk.20.attn_k.bias
create_tensor: loading tensor blk.20.attn_v.bias
create_tensor: loading tensor blk.20.attn_output.bias
create_tensor: loading tensor blk.20.ffn_gate_inp.bias
create_tensor: loading tensor blk.20.ffn_gate_exps.bias
create_tensor: loading tensor blk.20.ffn_down_exps.bias
create_tensor: loading tensor blk.20.ffn_up_exps.bias
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.post_attention_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_sinks.weight
create_tensor: loading tensor blk.21.ffn_gate_inp.weight
create_tensor: loading tensor blk.21.ffn_gate_exps.weight
create_tensor: loading tensor blk.21.ffn_down_exps.weight
create_tensor: loading tensor blk.21.ffn_up_exps.weight
create_tensor: loading tensor blk.21.attn_q.bias
create_tensor: loading tensor blk.21.attn_k.bias
create_tensor: loading tensor blk.21.attn_v.bias
create_tensor: loading tensor blk.21.attn_output.bias
create_tensor: loading tensor blk.21.ffn_gate_inp.bias
create_tensor: loading tensor blk.21.ffn_gate_exps.bias
create_tensor: loading tensor blk.21.ffn_down_exps.bias
create_tensor: loading tensor blk.21.ffn_up_exps.bias
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.post_attention_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_sinks.weight
create_tensor: loading tensor blk.22.ffn_gate_inp.weight
create_tensor: loading tensor blk.22.ffn_gate_exps.weight
create_tensor: loading tensor blk.22.ffn_down_exps.weight
create_tensor: loading tensor blk.22.ffn_up_exps.weight
create_tensor: loading tensor blk.22.attn_q.bias
create_tensor: loading tensor blk.22.attn_k.bias
create_tensor: loading tensor blk.22.attn_v.bias
create_tensor: loading tensor blk.22.attn_output.bias
create_tensor: loading tensor blk.22.ffn_gate_inp.bias
create_tensor: loading tensor blk.22.ffn_gate_exps.bias
create_tensor: loading tensor blk.22.ffn_down_exps.bias
create_tensor: loading tensor blk.22.ffn_up_exps.bias
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.post_attention_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_sinks.weight
create_tensor: loading tensor blk.23.ffn_gate_inp.weight
create_tensor: loading tensor blk.23.ffn_gate_exps.weight
create_tensor: loading tensor blk.23.ffn_down_exps.weight
create_tensor: loading tensor blk.23.ffn_up_exps.weight
create_tensor: loading tensor blk.23.attn_q.bias
create_tensor: loading tensor blk.23.attn_k.bias
create_tensor: loading tensor blk.23.attn_v.bias
create_tensor: loading tensor blk.23.attn_output.bias
create_tensor: loading tensor blk.23.ffn_gate_inp.bias
create_tensor: loading tensor blk.23.ffn_gate_exps.bias
create_tensor: loading tensor blk.23.ffn_down_exps.bias
create_tensor: loading tensor blk.23.ffn_up_exps.bias
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CPU_REPACK, using CPU instead
ggml_metal_log_allocated_size: allocated buffer, size = 11536.20 MiB, (11536.58 / 21845.34)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
load_tensors: Metal_Mapped model buffer size = 11536.18 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: dev = Metal
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: dev = Metal
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: dev = Metal
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: dev = Metal
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: dev = Metal
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = Metal
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: dev = Metal
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: dev = Metal
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: dev = Metal
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: dev = Metal
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: dev = Metal
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = Metal
llama_kv_cache:      Metal KV buffer size =    96.00 MiB
llama_kv_cache: size =   96.00 MiB (  4096 cells,  12 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 768 cells
llama_kv_cache: layer   0: dev = Metal
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: dev = Metal
llama_kv_cache: layer   3: filtered
llama_kv_cache: layer   4: dev = Metal
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: dev = Metal
llama_kv_cache: layer   7: filtered
llama_kv_cache: layer   8: dev = Metal
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: dev = Metal
llama_kv_cache: layer  11: filtered
llama_kv_cache: layer  12: dev = Metal
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: dev = Metal
llama_kv_cache: layer  15: filtered
llama_kv_cache: layer  16: dev = Metal
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: dev = Metal
llama_kv_cache: layer  19: filtered
llama_kv_cache: layer  20: dev = Metal
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: dev = Metal
llama_kv_cache: layer  23: filtered
llama_kv_cache:      Metal KV buffer size =    18.00 MiB
llama_kv_cache: size =   18.00 MiB (   768 cells,  12 layers,  1/1 seqs), K (f16):    9.00 MiB, V (f16):    9.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 3672
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   404.00 MiB
llama_context:        CPU compute buffer size =    15.15 MiB
llama_context: graph nodes  = 1352
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x113ebebe0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q8_0_f32', name = 'kernel_mul_mv_q8_0_f32_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q8_0_f32_nsg=4                  0x113ec1860 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x113e748f0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32_imrope=0'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32_imrope=0                 0x113ec2600 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64                       0x113ec2160 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16                            0x113ec3010 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk64_dv64', name = 'kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=1_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=1_nwg=32      0x113ec3e50 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_reduce', name = 'kernel_flash_attn_ext_vec_reduce_dv=64_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_reduce_dv=64_nwg=32      0x113ec4830 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_ext_q8_0_f32_r1_2', name = 'kernel_mul_mv_ext_q8_0_f32_r1_2_nsg=2_nxpsg=16'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_ext_q8_0_f32_r1_2_nsg=2_nxpsg=16      0x113ec5b80 | th_max =  832 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1                             0x113ec36d0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f32_f32_4', name = 'kernel_mul_mv_f32_f32_4_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f32_f32_4_nsg=4                 0x113ec6170 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_argsort_f32_i32_desc', name = 'kernel_argsort_f32_i32_desc'
ggml_metal_library_compile_pipeline: loaded kernel_argsort_f32_i32_desc                   0x113ec7660 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_argsort_merge_f32_i32_desc', name = 'kernel_argsort_merge_f32_i32_desc'
ggml_metal_library_compile_pipeline: loaded kernel_argsort_merge_f32_i32_desc             0x113ec44b0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32                           0x113ec8d90 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_soft_max_f32_4', name = 'kernel_soft_max_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_soft_max_f32_4                         0x113ec96d0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_id_mxfp4_f32', name = 'kernel_mul_mv_id_mxfp4_f32_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_id_mxfp4_f32_nsg=2              0x113ec9f60 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_id', name = 'kernel_add_id'
ggml_metal_library_compile_pipeline: loaded kernel_add_id                                 0x113eca1c0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_oai_f32', name = 'kernel_swiglu_oai_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_oai_f32                         0x113ecb0e0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_fuse_1', name = 'kernel_mul_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_fuse_1                             0x113ecbea0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_3', name = 'kernel_add_fuse_3'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_3                             0x113ecc8d0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_3', name = 'kernel_add_row_c4_fuse_3'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_3                      0x113ed6f90 | th_max = 1024 | th_width =   32
set_warmup: value = 0
main: llama threadpool init, n_threads = 8
attach_threadpool: call

system_info: n_threads = 8 (n_threads_batch = 8) / 10 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 2399610517
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 0, n_keep = 0



common_perf_print:    sampling time =       0.00 ms
common_perf_print:    samplers time =       0.00 ms /     0 tokens
common_perf_print:        load time =     913.96 ms
common_perf_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
common_perf_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
common_perf_print:       total time =       0.46 ms /     2 tokens
common_perf_print: unaccounted time =       0.46 ms / 100.0 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB]   | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Metal (Apple M1 Pro) | 21845 = 9790 + (12054 = 11536 +     114 +     404) +           0 |
llama_memory_breakdown_print: |   - Host                 |                   601 =   586 +       0 +      15                |
ggml_metal_free: deallocating

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@ngxson ngxson merged commit def5404 into ggml-org:master Nov 30, 2025
72 of 74 checks passed
Successfully merging this pull request may close these issues.

Feature Request: LAMA_LOG_FILE env variable.
