@taronaeo taronaeo commented Nov 30, 2025

Fixes #17601

Introduces the `LLAMA_LOG_FILE` environment variable, which directs llama.cpp log output to the specified file, as requested in the issue above.

Tests

$ LLAMA_LOG_FILE="test.log" build/bin/llama-cli -hf ggml-org/gpt-oss-20b-GGUF -no-cnv -p "Write me a dog walking business idea 1. " -n 0
ggml_metal_device_init: tensor API disabled for pre-M5 and pre-A19 devices
ggml_metal_library_init: using embedded metal library
ggml_metal_library_init: loaded in 0.008 sec
ggml_metal_device_init: GPU name:   Apple M1 Pro
ggml_metal_device_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_device_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_device_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_device_init: simdgroup reduction   = true
ggml_metal_device_init: simdgroup matrix mul. = true
ggml_metal_device_init: has unified memory    = true
ggml_metal_device_init: has bfloat            = true
ggml_metal_device_init: has tensor            = false
ggml_metal_device_init: use residency sets    = true
ggml_metal_device_init: use shared buffers    = true
ggml_metal_device_init: recommendedMaxWorkingSetSize  = 22906.50 MB
* Host huggingface.co:443 was resolved.
* IPv6: (none)
* IPv4: 13.35.202.34, 13.35.202.97, 13.35.202.40, 13.35.202.121
*   Trying 13.35.202.34:443...
* Connected to huggingface.co (13.35.202.34) port 443
* ALPN: curl offers h2,http/1.1
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* SSL connection using TLSv1.3 / AEAD-AES128-GCM-SHA256 / [blank] / UNDEF
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=huggingface.co
*  start date: Apr 13 00:00:00 2025 GMT
*  expire date: May 12 23:59:59 2026 GMT
*  subjectAltName: host "huggingface.co" matched cert's "huggingface.co"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M02
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://huggingface.co/v2/ggml-org/gpt-oss-20b-GGUF/manifests/latest
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: huggingface.co]
* [HTTP/2] [1] [:path: /v2/ggml-org/gpt-oss-20b-GGUF/manifests/latest]
* [HTTP/2] [1] [user-agent: llama-cpp]
* [HTTP/2] [1] [accept: application/json]
> GET /v2/ggml-org/gpt-oss-20b-GGUF/manifests/latest HTTP/2
Host: huggingface.co
User-Agent: llama-cpp
Accept: application/json

* Request completely sent off
< HTTP/2 200 
< content-type: application/json; charset=utf-8
< content-length: 982
< date: Sun, 30 Nov 2025 03:13:27 GMT
< etag: W/"3d6-fo+LSZBH+fXLeln9FK2QfwBO31Y"
< x-powered-by: huggingface-moon
< x-request-id: Root=1-692bb657-291d00093b2f577043d84f8d
< ratelimit: "pages";r=95;t=21
< ratelimit-policy: "fixed window";"pages";q=100;w=300
< cross-origin-opener-policy: same-origin
< referrer-policy: strict-origin-when-cross-origin
< access-control-max-age: 86400
< access-control-allow-origin: https://huggingface.co
< vary: Origin
< access-control-expose-headers: X-Repo-Commit,X-Request-Id,X-Error-Code,X-Error-Message,X-Total-Count,ETag,Link,Accept-Ranges,Content-Range,X-Linked-Size,X-Linked-ETag,X-Xet-Hash
< x-cache: Miss from cloudfront
< via: 1.1 1c4964336b4fc412a86181b6d86b042e.cloudfront.net (CloudFront)
< x-amz-cf-pop: SIN2-P7
< x-amz-cf-id: uzGxvfZsEokh4NbKEpLPxB9C44A5InxeKvTmL7Ag8LNk_l-gi84nZg==
< 
* Connection #0 to host huggingface.co left intact
common_download_file_single_online: using cached file: /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
build: 7205 (fa0465954) with Homebrew clang version 19.1.7 for arm64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M1 Pro) (unknown id) - 21844 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 459 tensors from /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gpt Oss 20b
llama_model_loader: - kv   3:                           general.basename str              = gpt-oss
llama_model_loader: - kv   4:                         general.size_label str              = 20B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   7:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   8:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   9:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  10:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  11:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  12:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  14:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  16:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  17:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  18:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  19:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  20:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  21:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  22:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  23: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 38
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q8_0:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = MXFP4 MoE
print_info: file size   = 11.27 GiB (4.63 BPW) 
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_embd_inp       = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 20B
print_info: model params     = 20.91 B
print_info: general.name     = Gpt Oss 20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 200007 '<|end|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
load_tensors: Metal_Mapped model buffer size = 11536.18 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
llama_context:        CPU  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache:      Metal KV buffer size =    96.00 MiB
llama_kv_cache: size =   96.00 MiB (  4096 cells,  12 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 768 cells
llama_kv_cache:      Metal KV buffer size =    18.00 MiB
llama_kv_cache: size =   18.00 MiB (   768 cells,  12 layers,  1/1 seqs), K (f16):    9.00 MiB, V (f16):    9.00 MiB
llama_context: Flash Attention was auto, set to enabled
llama_context:      Metal compute buffer size =   404.00 MiB
llama_context:        CPU compute buffer size =    15.15 MiB
llama_context: graph nodes  = 1352
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 8

system_info: n_threads = 8 (n_threads_batch = 8) / 10 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 2399610517
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 0, n_keep = 0



common_perf_print:    sampling time =       0.00 ms
common_perf_print:    samplers time =       0.00 ms /     0 tokens
common_perf_print:        load time =     913.96 ms
common_perf_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
common_perf_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
common_perf_print:       total time =       0.46 ms /     2 tokens
common_perf_print: unaccounted time =       0.46 ms / 100.0 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB]   | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Metal (Apple M1 Pro) | 21845 = 9790 + (12054 = 11536 +     114 +     404) +           0 |
llama_memory_breakdown_print: |   - Host                 |                   601 =   586 +       0 +      15                |
ggml_metal_free: deallocating
$ cat test.log
common_download_file_single_online: using cached file: /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf
build: 7205 (fa0465954) with Homebrew clang version 19.1.7 for arm64-apple-darwin24.1.0
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device Metal (Apple M1 Pro) (unknown id) - 21844 MiB free
llama_model_loader: loaded meta data with 35 key-value pairs and 459 tensors from /Users/taronaeo/Library/Caches/llama.cpp/ggml-org_gpt-oss-20b-GGUF_gpt-oss-20b-mxfp4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gpt-oss
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gpt Oss 20b
llama_model_loader: - kv   3:                           general.basename str              = gpt-oss
llama_model_loader: - kv   4:                         general.size_label str              = 20B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["vllm", "text-generation"]
llama_model_loader: - kv   7:                        gpt-oss.block_count u32              = 24
llama_model_loader: - kv   8:                     gpt-oss.context_length u32              = 131072
llama_model_loader: - kv   9:                   gpt-oss.embedding_length u32              = 2880
llama_model_loader: - kv  10:                gpt-oss.feed_forward_length u32              = 2880
llama_model_loader: - kv  11:               gpt-oss.attention.head_count u32              = 64
llama_model_loader: - kv  12:            gpt-oss.attention.head_count_kv u32              = 8
llama_model_loader: - kv  13:                     gpt-oss.rope.freq_base f32              = 150000.000000
llama_model_loader: - kv  14:   gpt-oss.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  15:                       gpt-oss.expert_count u32              = 32
llama_model_loader: - kv  16:                  gpt-oss.expert_used_count u32              = 4
llama_model_loader: - kv  17:               gpt-oss.attention.key_length u32              = 64
llama_model_loader: - kv  18:             gpt-oss.attention.value_length u32              = 64
llama_model_loader: - kv  19:           gpt-oss.attention.sliding_window u32              = 128
llama_model_loader: - kv  20:         gpt-oss.expert_feed_forward_length u32              = 2880
llama_model_loader: - kv  21:                  gpt-oss.rope.scaling.type str              = yarn
llama_model_loader: - kv  22:                gpt-oss.rope.scaling.factor f32              = 32.000000
llama_model_loader: - kv  23: gpt-oss.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  24:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  25:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  26:                      tokenizer.ggml.tokens arr[str,201088]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  27:                  tokenizer.ggml.token_type arr[i32,201088]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.merges arr[str,446189]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  29:                tokenizer.ggml.bos_token_id u32              = 199998
llama_model_loader: - kv  30:                tokenizer.ggml.eos_token_id u32              = 200002
llama_model_loader: - kv  31:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  32:                    tokenizer.chat_template str              = {#-\n  In addition to the normal input...
llama_model_loader: - kv  33:               general.quantization_version u32              = 2
llama_model_loader: - kv  34:                          general.file_type u32              = 38
llama_model_loader: - type  f32:  289 tensors
llama_model_loader: - type q8_0:   98 tensors
llama_model_loader: - type mxfp4:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = MXFP4 MoE
print_info: file size   = 11.27 GiB (4.63 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 200018 '<|endofprompt|>' is not marked as EOG
load: control token: 200013 '<|reserved_200013|>' is not marked as EOG
load: control token: 200009 '<|reserved_200009|>' is not marked as EOG
load: control token: 200008 '<|message|>' is not marked as EOG
load: control token: 200003 '<|constrain|>' is not marked as EOG
load: control token: 200001 '<|reserved_200001|>' is not marked as EOG
load: control token: 200000 '<|reserved_200000|>' is not marked as EOG
load: control token: 200005 '<|channel|>' is not marked as EOG
load: control token: 200010 '<|reserved_200010|>' is not marked as EOG
load: control token: 199998 '<|startoftext|>' is not marked as EOG
load: control token: 200006 '<|start|>' is not marked as EOG
load: control token: 200017 '<|reserved_200017|>' is not marked as EOG
load: control token: 200016 '<|reserved_200016|>' is not marked as EOG
load: control token: 200004 '<|reserved_200004|>' is not marked as EOG
load: control token: 200014 '<|reserved_200014|>' is not marked as EOG
load: control token: 200015 '<|reserved_200015|>' is not marked as EOG
load: control token: 200011 '<|reserved_200011|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 199999 ('<|endoftext|>')
load:   - 200002 ('<|return|>')
load:   - 200007 ('<|end|>')
load:   - 200012 ('<|call|>')
load: special_eog_ids contains both '<|return|>' and '<|call|>' tokens, removing '<|end|>' token from EOG list
load: special tokens cache size = 21
load: token to piece cache size = 1.3332 MB
print_info: arch             = gpt-oss
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 2880
print_info: n_embd_inp       = 2880
print_info: n_layer          = 24
print_info: n_head           = 64
print_info: n_head_kv        = 8
print_info: n_rot            = 64
print_info: n_swa            = 128
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 64
print_info: n_embd_head_v    = 64
print_info: n_gqa            = 8
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 2880
print_info: n_expert         = 32
print_info: n_expert_used    = 4
print_info: n_expert_groups  = 0
print_info: n_group_used     = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = yarn
print_info: freq_base_train  = 150000.0
print_info: freq_scale_train = 0.03125
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: model type       = 20B
print_info: model params     = 20.91 B
print_info: general.name     = Gpt Oss 20b
print_info: n_ff_exp         = 2880
print_info: vocab type       = BPE
print_info: n_vocab          = 201088
print_info: n_merges         = 446189
print_info: BOS token        = 199998 '<|startoftext|>'
print_info: EOS token        = 200002 '<|return|>'
print_info: EOT token        = 200007 '<|end|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200002 '<|return|>'
print_info: EOG token        = 200012 '<|call|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal, is_swa = 1
load_tensors: layer   1 assigned to device Metal, is_swa = 0
load_tensors: layer   2 assigned to device Metal, is_swa = 1
load_tensors: layer   3 assigned to device Metal, is_swa = 0
load_tensors: layer   4 assigned to device Metal, is_swa = 1
load_tensors: layer   5 assigned to device Metal, is_swa = 0
load_tensors: layer   6 assigned to device Metal, is_swa = 1
load_tensors: layer   7 assigned to device Metal, is_swa = 0
load_tensors: layer   8 assigned to device Metal, is_swa = 1
load_tensors: layer   9 assigned to device Metal, is_swa = 0
load_tensors: layer  10 assigned to device Metal, is_swa = 1
load_tensors: layer  11 assigned to device Metal, is_swa = 0
load_tensors: layer  12 assigned to device Metal, is_swa = 1
load_tensors: layer  13 assigned to device Metal, is_swa = 0
load_tensors: layer  14 assigned to device Metal, is_swa = 1
load_tensors: layer  15 assigned to device Metal, is_swa = 0
load_tensors: layer  16 assigned to device Metal, is_swa = 1
load_tensors: layer  17 assigned to device Metal, is_swa = 0
load_tensors: layer  18 assigned to device Metal, is_swa = 1
load_tensors: layer  19 assigned to device Metal, is_swa = 0
load_tensors: layer  20 assigned to device Metal, is_swa = 1
load_tensors: layer  21 assigned to device Metal, is_swa = 0
load_tensors: layer  22 assigned to device Metal, is_swa = 1
load_tensors: layer  23 assigned to device Metal, is_swa = 0
load_tensors: layer  24 assigned to device Metal, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.post_attention_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.attn_sinks.weight
create_tensor: loading tensor blk.0.ffn_gate_inp.weight
create_tensor: loading tensor blk.0.ffn_gate_exps.weight
create_tensor: loading tensor blk.0.ffn_down_exps.weight
create_tensor: loading tensor blk.0.ffn_up_exps.weight
create_tensor: loading tensor blk.0.attn_q.bias
create_tensor: loading tensor blk.0.attn_k.bias
create_tensor: loading tensor blk.0.attn_v.bias
create_tensor: loading tensor blk.0.attn_output.bias
create_tensor: loading tensor blk.0.ffn_gate_inp.bias
create_tensor: loading tensor blk.0.ffn_gate_exps.bias
create_tensor: loading tensor blk.0.ffn_down_exps.bias
create_tensor: loading tensor blk.0.ffn_up_exps.bias
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.post_attention_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.attn_sinks.weight
create_tensor: loading tensor blk.1.ffn_gate_inp.weight
create_tensor: loading tensor blk.1.ffn_gate_exps.weight
create_tensor: loading tensor blk.1.ffn_down_exps.weight
create_tensor: loading tensor blk.1.ffn_up_exps.weight
create_tensor: loading tensor blk.1.attn_q.bias
create_tensor: loading tensor blk.1.attn_k.bias
create_tensor: loading tensor blk.1.attn_v.bias
create_tensor: loading tensor blk.1.attn_output.bias
create_tensor: loading tensor blk.1.ffn_gate_inp.bias
create_tensor: loading tensor blk.1.ffn_gate_exps.bias
create_tensor: loading tensor blk.1.ffn_down_exps.bias
create_tensor: loading tensor blk.1.ffn_up_exps.bias
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.post_attention_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.attn_sinks.weight
create_tensor: loading tensor blk.2.ffn_gate_inp.weight
create_tensor: loading tensor blk.2.ffn_gate_exps.weight
create_tensor: loading tensor blk.2.ffn_down_exps.weight
create_tensor: loading tensor blk.2.ffn_up_exps.weight
create_tensor: loading tensor blk.2.attn_q.bias
create_tensor: loading tensor blk.2.attn_k.bias
create_tensor: loading tensor blk.2.attn_v.bias
create_tensor: loading tensor blk.2.attn_output.bias
create_tensor: loading tensor blk.2.ffn_gate_inp.bias
create_tensor: loading tensor blk.2.ffn_gate_exps.bias
create_tensor: loading tensor blk.2.ffn_down_exps.bias
create_tensor: loading tensor blk.2.ffn_up_exps.bias
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.post_attention_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.attn_sinks.weight
create_tensor: loading tensor blk.3.ffn_gate_inp.weight
create_tensor: loading tensor blk.3.ffn_gate_exps.weight
create_tensor: loading tensor blk.3.ffn_down_exps.weight
create_tensor: loading tensor blk.3.ffn_up_exps.weight
create_tensor: loading tensor blk.3.attn_q.bias
create_tensor: loading tensor blk.3.attn_k.bias
create_tensor: loading tensor blk.3.attn_v.bias
create_tensor: loading tensor blk.3.attn_output.bias
create_tensor: loading tensor blk.3.ffn_gate_inp.bias
create_tensor: loading tensor blk.3.ffn_gate_exps.bias
create_tensor: loading tensor blk.3.ffn_down_exps.bias
create_tensor: loading tensor blk.3.ffn_up_exps.bias
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.post_attention_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.attn_sinks.weight
create_tensor: loading tensor blk.4.ffn_gate_inp.weight
create_tensor: loading tensor blk.4.ffn_gate_exps.weight
create_tensor: loading tensor blk.4.ffn_down_exps.weight
create_tensor: loading tensor blk.4.ffn_up_exps.weight
create_tensor: loading tensor blk.4.attn_q.bias
create_tensor: loading tensor blk.4.attn_k.bias
create_tensor: loading tensor blk.4.attn_v.bias
create_tensor: loading tensor blk.4.attn_output.bias
create_tensor: loading tensor blk.4.ffn_gate_inp.bias
create_tensor: loading tensor blk.4.ffn_gate_exps.bias
create_tensor: loading tensor blk.4.ffn_down_exps.bias
create_tensor: loading tensor blk.4.ffn_up_exps.bias
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.post_attention_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.attn_sinks.weight
create_tensor: loading tensor blk.5.ffn_gate_inp.weight
create_tensor: loading tensor blk.5.ffn_gate_exps.weight
create_tensor: loading tensor blk.5.ffn_down_exps.weight
create_tensor: loading tensor blk.5.ffn_up_exps.weight
create_tensor: loading tensor blk.5.attn_q.bias
create_tensor: loading tensor blk.5.attn_k.bias
create_tensor: loading tensor blk.5.attn_v.bias
create_tensor: loading tensor blk.5.attn_output.bias
create_tensor: loading tensor blk.5.ffn_gate_inp.bias
create_tensor: loading tensor blk.5.ffn_gate_exps.bias
create_tensor: loading tensor blk.5.ffn_down_exps.bias
create_tensor: loading tensor blk.5.ffn_up_exps.bias
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.post_attention_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.attn_sinks.weight
create_tensor: loading tensor blk.6.ffn_gate_inp.weight
create_tensor: loading tensor blk.6.ffn_gate_exps.weight
create_tensor: loading tensor blk.6.ffn_down_exps.weight
create_tensor: loading tensor blk.6.ffn_up_exps.weight
create_tensor: loading tensor blk.6.attn_q.bias
create_tensor: loading tensor blk.6.attn_k.bias
create_tensor: loading tensor blk.6.attn_v.bias
create_tensor: loading tensor blk.6.attn_output.bias
create_tensor: loading tensor blk.6.ffn_gate_inp.bias
create_tensor: loading tensor blk.6.ffn_gate_exps.bias
create_tensor: loading tensor blk.6.ffn_down_exps.bias
create_tensor: loading tensor blk.6.ffn_up_exps.bias
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.post_attention_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.attn_sinks.weight
create_tensor: loading tensor blk.7.ffn_gate_inp.weight
create_tensor: loading tensor blk.7.ffn_gate_exps.weight
create_tensor: loading tensor blk.7.ffn_down_exps.weight
create_tensor: loading tensor blk.7.ffn_up_exps.weight
create_tensor: loading tensor blk.7.attn_q.bias
create_tensor: loading tensor blk.7.attn_k.bias
create_tensor: loading tensor blk.7.attn_v.bias
create_tensor: loading tensor blk.7.attn_output.bias
create_tensor: loading tensor blk.7.ffn_gate_inp.bias
create_tensor: loading tensor blk.7.ffn_gate_exps.bias
create_tensor: loading tensor blk.7.ffn_down_exps.bias
create_tensor: loading tensor blk.7.ffn_up_exps.bias
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.post_attention_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.attn_sinks.weight
create_tensor: loading tensor blk.8.ffn_gate_inp.weight
create_tensor: loading tensor blk.8.ffn_gate_exps.weight
create_tensor: loading tensor blk.8.ffn_down_exps.weight
create_tensor: loading tensor blk.8.ffn_up_exps.weight
create_tensor: loading tensor blk.8.attn_q.bias
create_tensor: loading tensor blk.8.attn_k.bias
create_tensor: loading tensor blk.8.attn_v.bias
create_tensor: loading tensor blk.8.attn_output.bias
create_tensor: loading tensor blk.8.ffn_gate_inp.bias
create_tensor: loading tensor blk.8.ffn_gate_exps.bias
create_tensor: loading tensor blk.8.ffn_down_exps.bias
create_tensor: loading tensor blk.8.ffn_up_exps.bias
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.post_attention_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.attn_sinks.weight
create_tensor: loading tensor blk.9.ffn_gate_inp.weight
create_tensor: loading tensor blk.9.ffn_gate_exps.weight
create_tensor: loading tensor blk.9.ffn_down_exps.weight
create_tensor: loading tensor blk.9.ffn_up_exps.weight
create_tensor: loading tensor blk.9.attn_q.bias
create_tensor: loading tensor blk.9.attn_k.bias
create_tensor: loading tensor blk.9.attn_v.bias
create_tensor: loading tensor blk.9.attn_output.bias
create_tensor: loading tensor blk.9.ffn_gate_inp.bias
create_tensor: loading tensor blk.9.ffn_gate_exps.bias
create_tensor: loading tensor blk.9.ffn_down_exps.bias
create_tensor: loading tensor blk.9.ffn_up_exps.bias
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.post_attention_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.attn_sinks.weight
create_tensor: loading tensor blk.10.ffn_gate_inp.weight
create_tensor: loading tensor blk.10.ffn_gate_exps.weight
create_tensor: loading tensor blk.10.ffn_down_exps.weight
create_tensor: loading tensor blk.10.ffn_up_exps.weight
create_tensor: loading tensor blk.10.attn_q.bias
create_tensor: loading tensor blk.10.attn_k.bias
create_tensor: loading tensor blk.10.attn_v.bias
create_tensor: loading tensor blk.10.attn_output.bias
create_tensor: loading tensor blk.10.ffn_gate_inp.bias
create_tensor: loading tensor blk.10.ffn_gate_exps.bias
create_tensor: loading tensor blk.10.ffn_down_exps.bias
create_tensor: loading tensor blk.10.ffn_up_exps.bias
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.post_attention_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.attn_sinks.weight
create_tensor: loading tensor blk.11.ffn_gate_inp.weight
create_tensor: loading tensor blk.11.ffn_gate_exps.weight
create_tensor: loading tensor blk.11.ffn_down_exps.weight
create_tensor: loading tensor blk.11.ffn_up_exps.weight
create_tensor: loading tensor blk.11.attn_q.bias
create_tensor: loading tensor blk.11.attn_k.bias
create_tensor: loading tensor blk.11.attn_v.bias
create_tensor: loading tensor blk.11.attn_output.bias
create_tensor: loading tensor blk.11.ffn_gate_inp.bias
create_tensor: loading tensor blk.11.ffn_gate_exps.bias
create_tensor: loading tensor blk.11.ffn_down_exps.bias
create_tensor: loading tensor blk.11.ffn_up_exps.bias
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.post_attention_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.attn_sinks.weight
create_tensor: loading tensor blk.12.ffn_gate_inp.weight
create_tensor: loading tensor blk.12.ffn_gate_exps.weight
create_tensor: loading tensor blk.12.ffn_down_exps.weight
create_tensor: loading tensor blk.12.ffn_up_exps.weight
create_tensor: loading tensor blk.12.attn_q.bias
create_tensor: loading tensor blk.12.attn_k.bias
create_tensor: loading tensor blk.12.attn_v.bias
create_tensor: loading tensor blk.12.attn_output.bias
create_tensor: loading tensor blk.12.ffn_gate_inp.bias
create_tensor: loading tensor blk.12.ffn_gate_exps.bias
create_tensor: loading tensor blk.12.ffn_down_exps.bias
create_tensor: loading tensor blk.12.ffn_up_exps.bias
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.post_attention_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.attn_sinks.weight
create_tensor: loading tensor blk.13.ffn_gate_inp.weight
create_tensor: loading tensor blk.13.ffn_gate_exps.weight
create_tensor: loading tensor blk.13.ffn_down_exps.weight
create_tensor: loading tensor blk.13.ffn_up_exps.weight
create_tensor: loading tensor blk.13.attn_q.bias
create_tensor: loading tensor blk.13.attn_k.bias
create_tensor: loading tensor blk.13.attn_v.bias
create_tensor: loading tensor blk.13.attn_output.bias
create_tensor: loading tensor blk.13.ffn_gate_inp.bias
create_tensor: loading tensor blk.13.ffn_gate_exps.bias
create_tensor: loading tensor blk.13.ffn_down_exps.bias
create_tensor: loading tensor blk.13.ffn_up_exps.bias
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.post_attention_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.attn_sinks.weight
create_tensor: loading tensor blk.14.ffn_gate_inp.weight
create_tensor: loading tensor blk.14.ffn_gate_exps.weight
create_tensor: loading tensor blk.14.ffn_down_exps.weight
create_tensor: loading tensor blk.14.ffn_up_exps.weight
create_tensor: loading tensor blk.14.attn_q.bias
create_tensor: loading tensor blk.14.attn_k.bias
create_tensor: loading tensor blk.14.attn_v.bias
create_tensor: loading tensor blk.14.attn_output.bias
create_tensor: loading tensor blk.14.ffn_gate_inp.bias
create_tensor: loading tensor blk.14.ffn_gate_exps.bias
create_tensor: loading tensor blk.14.ffn_down_exps.bias
create_tensor: loading tensor blk.14.ffn_up_exps.bias
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.post_attention_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.attn_sinks.weight
create_tensor: loading tensor blk.15.ffn_gate_inp.weight
create_tensor: loading tensor blk.15.ffn_gate_exps.weight
create_tensor: loading tensor blk.15.ffn_down_exps.weight
create_tensor: loading tensor blk.15.ffn_up_exps.weight
create_tensor: loading tensor blk.15.attn_q.bias
create_tensor: loading tensor blk.15.attn_k.bias
create_tensor: loading tensor blk.15.attn_v.bias
create_tensor: loading tensor blk.15.attn_output.bias
create_tensor: loading tensor blk.15.ffn_gate_inp.bias
create_tensor: loading tensor blk.15.ffn_gate_exps.bias
create_tensor: loading tensor blk.15.ffn_down_exps.bias
create_tensor: loading tensor blk.15.ffn_up_exps.bias
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.post_attention_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.attn_sinks.weight
create_tensor: loading tensor blk.16.ffn_gate_inp.weight
create_tensor: loading tensor blk.16.ffn_gate_exps.weight
create_tensor: loading tensor blk.16.ffn_down_exps.weight
create_tensor: loading tensor blk.16.ffn_up_exps.weight
create_tensor: loading tensor blk.16.attn_q.bias
create_tensor: loading tensor blk.16.attn_k.bias
create_tensor: loading tensor blk.16.attn_v.bias
create_tensor: loading tensor blk.16.attn_output.bias
create_tensor: loading tensor blk.16.ffn_gate_inp.bias
create_tensor: loading tensor blk.16.ffn_gate_exps.bias
create_tensor: loading tensor blk.16.ffn_down_exps.bias
create_tensor: loading tensor blk.16.ffn_up_exps.bias
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.post_attention_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.attn_sinks.weight
create_tensor: loading tensor blk.17.ffn_gate_inp.weight
create_tensor: loading tensor blk.17.ffn_gate_exps.weight
create_tensor: loading tensor blk.17.ffn_down_exps.weight
create_tensor: loading tensor blk.17.ffn_up_exps.weight
create_tensor: loading tensor blk.17.attn_q.bias
create_tensor: loading tensor blk.17.attn_k.bias
create_tensor: loading tensor blk.17.attn_v.bias
create_tensor: loading tensor blk.17.attn_output.bias
create_tensor: loading tensor blk.17.ffn_gate_inp.bias
create_tensor: loading tensor blk.17.ffn_gate_exps.bias
create_tensor: loading tensor blk.17.ffn_down_exps.bias
create_tensor: loading tensor blk.17.ffn_up_exps.bias
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.post_attention_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.attn_sinks.weight
create_tensor: loading tensor blk.18.ffn_gate_inp.weight
create_tensor: loading tensor blk.18.ffn_gate_exps.weight
create_tensor: loading tensor blk.18.ffn_down_exps.weight
create_tensor: loading tensor blk.18.ffn_up_exps.weight
create_tensor: loading tensor blk.18.attn_q.bias
create_tensor: loading tensor blk.18.attn_k.bias
create_tensor: loading tensor blk.18.attn_v.bias
create_tensor: loading tensor blk.18.attn_output.bias
create_tensor: loading tensor blk.18.ffn_gate_inp.bias
create_tensor: loading tensor blk.18.ffn_gate_exps.bias
create_tensor: loading tensor blk.18.ffn_down_exps.bias
create_tensor: loading tensor blk.18.ffn_up_exps.bias
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.post_attention_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.attn_sinks.weight
create_tensor: loading tensor blk.19.ffn_gate_inp.weight
create_tensor: loading tensor blk.19.ffn_gate_exps.weight
create_tensor: loading tensor blk.19.ffn_down_exps.weight
create_tensor: loading tensor blk.19.ffn_up_exps.weight
create_tensor: loading tensor blk.19.attn_q.bias
create_tensor: loading tensor blk.19.attn_k.bias
create_tensor: loading tensor blk.19.attn_v.bias
create_tensor: loading tensor blk.19.attn_output.bias
create_tensor: loading tensor blk.19.ffn_gate_inp.bias
create_tensor: loading tensor blk.19.ffn_gate_exps.bias
create_tensor: loading tensor blk.19.ffn_down_exps.bias
create_tensor: loading tensor blk.19.ffn_up_exps.bias
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.post_attention_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.attn_sinks.weight
create_tensor: loading tensor blk.20.ffn_gate_inp.weight
create_tensor: loading tensor blk.20.ffn_gate_exps.weight
create_tensor: loading tensor blk.20.ffn_down_exps.weight
create_tensor: loading tensor blk.20.ffn_up_exps.weight
create_tensor: loading tensor blk.20.attn_q.bias
create_tensor: loading tensor blk.20.attn_k.bias
create_tensor: loading tensor blk.20.attn_v.bias
create_tensor: loading tensor blk.20.attn_output.bias
create_tensor: loading tensor blk.20.ffn_gate_inp.bias
create_tensor: loading tensor blk.20.ffn_gate_exps.bias
create_tensor: loading tensor blk.20.ffn_down_exps.bias
create_tensor: loading tensor blk.20.ffn_up_exps.bias
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.post_attention_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.attn_sinks.weight
create_tensor: loading tensor blk.21.ffn_gate_inp.weight
create_tensor: loading tensor blk.21.ffn_gate_exps.weight
create_tensor: loading tensor blk.21.ffn_down_exps.weight
create_tensor: loading tensor blk.21.ffn_up_exps.weight
create_tensor: loading tensor blk.21.attn_q.bias
create_tensor: loading tensor blk.21.attn_k.bias
create_tensor: loading tensor blk.21.attn_v.bias
create_tensor: loading tensor blk.21.attn_output.bias
create_tensor: loading tensor blk.21.ffn_gate_inp.bias
create_tensor: loading tensor blk.21.ffn_gate_exps.bias
create_tensor: loading tensor blk.21.ffn_down_exps.bias
create_tensor: loading tensor blk.21.ffn_up_exps.bias
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.post_attention_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.attn_sinks.weight
create_tensor: loading tensor blk.22.ffn_gate_inp.weight
create_tensor: loading tensor blk.22.ffn_gate_exps.weight
create_tensor: loading tensor blk.22.ffn_down_exps.weight
create_tensor: loading tensor blk.22.ffn_up_exps.weight
create_tensor: loading tensor blk.22.attn_q.bias
create_tensor: loading tensor blk.22.attn_k.bias
create_tensor: loading tensor blk.22.attn_v.bias
create_tensor: loading tensor blk.22.attn_output.bias
create_tensor: loading tensor blk.22.ffn_gate_inp.bias
create_tensor: loading tensor blk.22.ffn_gate_exps.bias
create_tensor: loading tensor blk.22.ffn_down_exps.bias
create_tensor: loading tensor blk.22.ffn_up_exps.bias
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.post_attention_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.attn_sinks.weight
create_tensor: loading tensor blk.23.ffn_gate_inp.weight
create_tensor: loading tensor blk.23.ffn_gate_exps.weight
create_tensor: loading tensor blk.23.ffn_down_exps.weight
create_tensor: loading tensor blk.23.ffn_up_exps.weight
create_tensor: loading tensor blk.23.attn_q.bias
create_tensor: loading tensor blk.23.attn_k.bias
create_tensor: loading tensor blk.23.attn_v.bias
create_tensor: loading tensor blk.23.attn_output.bias
create_tensor: loading tensor blk.23.ffn_gate_inp.bias
create_tensor: loading tensor blk.23.ffn_gate_exps.bias
create_tensor: loading tensor blk.23.ffn_down_exps.bias
create_tensor: loading tensor blk.23.ffn_up_exps.bias
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CPU_REPACK, using CPU instead
ggml_metal_log_allocated_size: allocated buffer, size = 11536.20 MiB, (11536.58 / 21845.34)
load_tensors: offloading 24 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 25/25 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   586.82 MiB
load_tensors: Metal_Mapped model buffer size = 11536.18 MiB
.................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_seq     = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = auto
llama_context: kv_unified    = false
llama_context: freq_base     = 150000.0
llama_context: freq_scale    = 0.03125
llama_context: n_ctx_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: use fusion         = true
ggml_metal_init: use concurrency    = true
ggml_metal_init: use graph optimize = true
set_abort_callback: call
llama_context:        CPU  output buffer size =     0.77 MiB
llama_kv_cache_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache: layer   0: filtered
llama_kv_cache: layer   1: dev = Metal
llama_kv_cache: layer   2: filtered
llama_kv_cache: layer   3: dev = Metal
llama_kv_cache: layer   4: filtered
llama_kv_cache: layer   5: dev = Metal
llama_kv_cache: layer   6: filtered
llama_kv_cache: layer   7: dev = Metal
llama_kv_cache: layer   8: filtered
llama_kv_cache: layer   9: dev = Metal
llama_kv_cache: layer  10: filtered
llama_kv_cache: layer  11: dev = Metal
llama_kv_cache: layer  12: filtered
llama_kv_cache: layer  13: dev = Metal
llama_kv_cache: layer  14: filtered
llama_kv_cache: layer  15: dev = Metal
llama_kv_cache: layer  16: filtered
llama_kv_cache: layer  17: dev = Metal
llama_kv_cache: layer  18: filtered
llama_kv_cache: layer  19: dev = Metal
llama_kv_cache: layer  20: filtered
llama_kv_cache: layer  21: dev = Metal
llama_kv_cache: layer  22: filtered
llama_kv_cache: layer  23: dev = Metal
llama_kv_cache:      Metal KV buffer size =    96.00 MiB
llama_kv_cache: size =   96.00 MiB (  4096 cells,  12 layers,  1/1 seqs), K (f16):   48.00 MiB, V (f16):   48.00 MiB
llama_kv_cache_iswa: creating     SWA KV cache, size = 768 cells
llama_kv_cache: layer   0: dev = Metal
llama_kv_cache: layer   1: filtered
llama_kv_cache: layer   2: dev = Metal
llama_kv_cache: layer   3: filtered
llama_kv_cache: layer   4: dev = Metal
llama_kv_cache: layer   5: filtered
llama_kv_cache: layer   6: dev = Metal
llama_kv_cache: layer   7: filtered
llama_kv_cache: layer   8: dev = Metal
llama_kv_cache: layer   9: filtered
llama_kv_cache: layer  10: dev = Metal
llama_kv_cache: layer  11: filtered
llama_kv_cache: layer  12: dev = Metal
llama_kv_cache: layer  13: filtered
llama_kv_cache: layer  14: dev = Metal
llama_kv_cache: layer  15: filtered
llama_kv_cache: layer  16: dev = Metal
llama_kv_cache: layer  17: filtered
llama_kv_cache: layer  18: dev = Metal
llama_kv_cache: layer  19: filtered
llama_kv_cache: layer  20: dev = Metal
llama_kv_cache: layer  21: filtered
llama_kv_cache: layer  22: dev = Metal
llama_kv_cache: layer  23: filtered
llama_kv_cache:      Metal KV buffer size =    18.00 MiB
llama_kv_cache: size =   18.00 MiB (   768 cells,  12 layers,  1/1 seqs), K (f16):    9.00 MiB, V (f16):    9.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 3
llama_context: max_nodes = 3672
llama_context: reserving full memory module
llama_context: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
llama_context: Flash Attention was auto, set to enabled
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
llama_context:      Metal compute buffer size =   404.00 MiB
llama_context:        CPU compute buffer size =    15.15 MiB
llama_context: graph nodes  = 1352
llama_context: graph splits = 2
clear_adapter_lora: call
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|return|> logit bias = -inf
common_init_from_params: added <|call|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
set_warmup: value = 1
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rms_norm_mul_f32_4', name = 'kernel_rms_norm_mul_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_rms_norm_mul_f32_4                     0x113ebebe0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_q8_0_f32', name = 'kernel_mul_mv_q8_0_f32_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_q8_0_f32_nsg=4                  0x113ec1860 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_1', name = 'kernel_add_row_c4_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_1                      0x113e748f0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_rope_neox_f32', name = 'kernel_rope_neox_f32_imrope=0'
ggml_metal_library_compile_pipeline: loaded kernel_rope_neox_f32_imrope=0                 0x113ec2600 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_set_rows_f16_i64', name = 'kernel_set_rows_f16_i64'
ggml_metal_library_compile_pipeline: loaded kernel_set_rows_f16_i64                       0x113ec2160 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_cpy_f32_f16', name = 'kernel_cpy_f32_f16'
ggml_metal_library_compile_pipeline: loaded kernel_cpy_f32_f16                            0x113ec3010 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_f16_dk64_dv64', name = 'kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=1_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_f16_dk64_dv64_mask=1_sink=1_bias=0_scap=0_kvpad=0_ns10=512_ns20=512_nsg=1_nwg=32      0x113ec3e50 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_flash_attn_ext_vec_reduce', name = 'kernel_flash_attn_ext_vec_reduce_dv=64_nwg=32'
ggml_metal_library_compile_pipeline: loaded kernel_flash_attn_ext_vec_reduce_dv=64_nwg=32      0x113ec4830 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_ext_q8_0_f32_r1_2', name = 'kernel_mul_mv_ext_q8_0_f32_r1_2_nsg=2_nxpsg=16'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_ext_q8_0_f32_r1_2_nsg=2_nxpsg=16      0x113ec5b80 | th_max =  832 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_1', name = 'kernel_add_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_1                             0x113ec36d0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_f32_f32_4', name = 'kernel_mul_mv_f32_f32_4_nsg=4'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_f32_f32_4_nsg=4                 0x113ec6170 | th_max =  768 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_argsort_f32_i32_desc', name = 'kernel_argsort_f32_i32_desc'
ggml_metal_library_compile_pipeline: loaded kernel_argsort_f32_i32_desc                   0x113ec7660 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_argsort_merge_f32_i32_desc', name = 'kernel_argsort_merge_f32_i32_desc'
ggml_metal_library_compile_pipeline: loaded kernel_argsort_merge_f32_i32_desc             0x113ec44b0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_get_rows_f32', name = 'kernel_get_rows_f32'
ggml_metal_library_compile_pipeline: loaded kernel_get_rows_f32                           0x113ec8d90 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_soft_max_f32_4', name = 'kernel_soft_max_f32_4'
ggml_metal_library_compile_pipeline: loaded kernel_soft_max_f32_4                         0x113ec96d0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_mv_id_mxfp4_f32', name = 'kernel_mul_mv_id_mxfp4_f32_nsg=2'
ggml_metal_library_compile_pipeline: loaded kernel_mul_mv_id_mxfp4_f32_nsg=2              0x113ec9f60 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_id', name = 'kernel_add_id'
ggml_metal_library_compile_pipeline: loaded kernel_add_id                                 0x113eca1c0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_swiglu_oai_f32', name = 'kernel_swiglu_oai_f32'
ggml_metal_library_compile_pipeline: loaded kernel_swiglu_oai_f32                         0x113ecb0e0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_mul_fuse_1', name = 'kernel_mul_fuse_1'
ggml_metal_library_compile_pipeline: loaded kernel_mul_fuse_1                             0x113ecbea0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_fuse_3', name = 'kernel_add_fuse_3'
ggml_metal_library_compile_pipeline: loaded kernel_add_fuse_3                             0x113ecc8d0 | th_max = 1024 | th_width =   32
ggml_metal_library_compile_pipeline: compiling pipeline: base = 'kernel_add_row_c4_fuse_3', name = 'kernel_add_row_c4_fuse_3'
ggml_metal_library_compile_pipeline: loaded kernel_add_row_c4_fuse_3                      0x113ed6f90 | th_max = 1024 | th_width =   32
set_warmup: value = 0
main: llama threadpool init, n_threads = 8
attach_threadpool: call

system_info: n_threads = 8 (n_threads_batch = 8) / 10 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | OPENMP = 1 | REPACK = 1 | 

sampler seed: 2399610517
sampler params: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
	top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-n-sigma -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist 
generate: n_ctx = 4096, n_batch = 2048, n_predict = 0, n_keep = 0



common_perf_print:    sampling time =       0.00 ms
common_perf_print:    samplers time =       0.00 ms /     0 tokens
common_perf_print:        load time =     913.96 ms
common_perf_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
common_perf_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
common_perf_print:       total time =       0.46 ms /     2 tokens
common_perf_print: unaccounted time =       0.46 ms / 100.0 %      (total - sampling - prompt eval - eval) / (total)
common_perf_print:    graphs reused =          0
llama_memory_breakdown_print: | memory breakdown [MiB]   | total   free     self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Metal (Apple M1 Pro) | 21845 = 9790 + (12054 = 11536 +     114 +     404) +           0 |
llama_memory_breakdown_print: |   - Host                 |                   601 =   586 +       0 +      15                |
ggml_metal_free: deallocating

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
@ngxson ngxson merged commit def5404 into ggml-org:master Nov 30, 2025
72 of 74 checks passed
Successfully merging this pull request may close these issues.

Feature Request: LAMA_LOG_FILE env variable.
