My benchmark on a MacBook Air M4 for Qwen3.5-2B-Q8_0.gguf:
TTFT: 496.3ms
Generated: 256 tokens
Decode: 25.3 tok/s
Commands used:
huggingface-cli download unsloth/Qwen3.5-0.8B-GGUF \
--include "Qwen3.5-0.8B-Q8_0.gguf" \
--local-dir ~/models \
--local-dir-use-symlinks False
% ./target/release/ane-infer generate \
-m /Users/loretoparisi/models/Qwen3.5-2B-Q8_0.gguf \
-p "The capital of France is" \
--max-tokens 256 \
--temp 0.7
Loading model from /Users/loretoparisi/models/Qwen3.5-2B-Q8_0.gguf
Architecture: qwen35
Layers: 24
Dim: 2048
Tensors: 320
Qwen3.5 hybrid: 2048d, 8h, 2kv, 6144ff, 248320v
DeltaNet: 128 state, 4 conv, 16 groups, interval=4
Layer types: 18 DeltaNet + 6 FullAttention
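The 18/6 split in the log is consistent with the reported interval=4: one full-attention layer every 4 layers over 24 layers. A minimal sketch of that arithmetic, assuming every 4th layer is full attention (the exact placement convention in ane-infer may differ):

```python
# Derive the DeltaNet / FullAttention layer split from the log above.
# Assumption: with interval=4, every 4th layer uses full attention;
# ane-infer's actual placement rule may differ, but the counts match.
n_layers, interval = 24, 4
kinds = [
    "FullAttention" if (i + 1) % interval == 0 else "DeltaNet"
    for i in range(n_layers)
]
print(kinds.count("DeltaNet"), kinds.count("FullAttention"))  # 18 6
```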
Loading weights (Q8_0 direct)...
Layer 0/24... ok
Layer 1/24... ok
Layer 2/24... ok
Layer 3/24... ok
Layer 4/24... ok
Layer 5/24... ok
Layer 6/24... ok
Layer 7/24... ok
Layer 8/24... ok
Layer 9/24... ok
Layer 10/24... ok
Layer 11/24... ok
Layer 12/24... ok
Layer 13/24... ok
Layer 14/24... ok
Layer 15/24... ok
Layer 16/24... ok
Layer 17/24... ok
Layer 18/24... ok
Layer 19/24... ok
Layer 20/24... ok
Layer 21/24... ok
Layer 22/24... ok
Layer 23/24... ok
Loaded in 1.6s
Building tokenizer...
Vocab: 248320 tokens, EOS: Some(248046)
Prompt: "The capital of France is"
Encoded: 5 tokens → [760, 6511, 314, 9338, 369]
Decoded: "The capital of France is"
Generating (CPU decode)...
Prefill: 5/5
TTFT: 496.3ms
...
--- Stats ---
TTFT: 496.3ms
Generated: 256 tokens
Decode: 25.3 tok/s
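For context, a decode rate like the one above is usually tokens divided by the wall time after the first token. A small sketch back-calculating the run duration from the reported stats (assuming that formula; ane-infer's exact measurement may differ):

```python
# Back-calculate run duration from the reported stats.
# Assumption: decode tok/s = generated tokens / (wall time after TTFT);
# the exact formula used by ane-infer may differ.
ttft_s = 0.4963          # reported TTFT
generated = 256          # reported token count
decode_tok_s = 25.3      # reported decode rate

decode_time_s = generated / decode_tok_s
total_s = ttft_s + decode_time_s
print(f"~{decode_time_s:.1f}s decoding, ~{total_s:.1f}s total")
```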
Surprisingly, the smaller unsloth Qwen3.5-2B-Q4_0.gguf appears to actually use a Q5_K quant.