
Benchmarks on M4 MacBook Air #2

@loretoparisi

Description


My benchmark on a MacBook Air M4 for Qwen3.5-2B-Q8_0.gguf:

  TTFT: 496.3ms
  Generated: 256 tokens
  Decode: 25.3 tok/s
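For context, the three stats relate directly: the decode rate implies roughly ten seconds of wall time for the 256 generated tokens, on top of the ~0.5 s TTFT. A minimal sketch of that arithmetic, using the numbers from the run above:

```python
# Relate the reported stats: TTFT, token count, and decode throughput.
ttft_s = 0.4963          # time to first token, from the run above
generated_tokens = 256   # tokens generated
decode_tok_s = 25.3      # reported decode throughput

# Decode time = tokens / throughput; total wall time adds TTFT on top.
decode_time_s = generated_tokens / decode_tok_s
total_time_s = ttft_s + decode_time_s

print(f"decode: {decode_time_s:.1f}s, total: {total_time_s:.1f}s")
# → decode: 10.1s, total: 10.6s
```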

Commands used

huggingface-cli download unsloth/Qwen3.5-0.8B-GGUF \
  --include "Qwen3.5-0.8B-Q8_0.gguf" \
  --local-dir ~/models \
  --local-dir-use-symlinks False

% ./target/release/ane-infer generate \
  -m /Users/loretoparisi/models/Qwen3.5-2B-Q8_0.gguf \
  -p "The capital of France is" \
  --max-tokens 256 \
  --temp 0.7
Loading model from /Users/loretoparisi/models/Qwen3.5-2B-Q8_0.gguf
  Architecture: qwen35
  Layers: 24
  Dim: 2048
  Tensors: 320
  Qwen3.5 hybrid: 2048d, 8h, 2kv, 6144ff, 248320v
  DeltaNet: 128 state, 4 conv, 16 groups, interval=4
  Layer types: 18 DeltaNet + 6 FullAttention
Loading weights (Q8_0 direct)...
  Layer 0/24... ok
  Layer 1/24... ok
  Layer 2/24... ok
  Layer 3/24... ok
  Layer 4/24... ok
  Layer 5/24... ok
  Layer 6/24... ok
  Layer 7/24... ok
  Layer 8/24... ok
  Layer 9/24... ok
  Layer 10/24... ok
  Layer 11/24... ok
  Layer 12/24... ok
  Layer 13/24... ok
  Layer 14/24... ok
  Layer 15/24... ok
  Layer 16/24... ok
  Layer 17/24... ok
  Layer 18/24... ok
  Layer 19/24... ok
  Layer 20/24... ok
  Layer 21/24... ok
  Layer 22/24... ok
  Layer 23/24... ok
  Loaded in 1.6s
Building tokenizer...
  Vocab: 248320 tokens, EOS: Some(248046)
Prompt: "The capital of France is"
  Encoded: 5 tokens → [760, 6511, 314, 9338, 369]
  Decoded: "The capital of France is"
Generating (CPU decode)...
  Prefill: 5/5
  TTFT: 496.3ms

...

--- Stats ---
  TTFT: 496.3ms
  Generated: 256 tokens
  Decode: 25.3 tok/s
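The "18 DeltaNet + 6 FullAttention" split in the loader output follows from the `interval=4` line: with one full-attention layer per group of four across 24 layers, you get exactly 6 full-attention and 18 DeltaNet layers. A sketch of that arithmetic (the exact indexing convention, i.e. whether full attention is the first or last layer of each group of four, is an assumption here):

```python
n_layers = 24
interval = 4  # "interval=4" from the loader output

# Assume every `interval`-th layer is full attention, the rest DeltaNet.
layer_types = [
    "FullAttention" if (i + 1) % interval == 0 else "DeltaNet"
    for i in range(n_layers)
]

print(layer_types.count("DeltaNet"), layer_types.count("FullAttention"))
# → 18 6
```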

Surprisingly, unsloth's smaller Qwen3.5-2B-Q4_0.gguf build actually appears to use a Q5_K quant.
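One way to sanity-check that observation without special tooling is file size: llama.cpp's block quant formats have fixed bytes-per-weight, so a ~2B-parameter file quantized as Q4_0 should be noticeably smaller than one quantized as Q5_K. A rough estimate (block sizes are from llama.cpp's format definitions; real GGUF files mix quant types across tensors, so treat these as ballpark figures):

```python
# Bytes per block / weights per block for some llama.cpp quant formats.
formats = {
    "Q4_0": (18, 32),    # 16 nibble bytes + fp16 scale -> 4.5 bits/weight
    "Q5_K": (176, 256),  # 256-weight super-blocks -> 5.5 bits/weight
    "Q8_0": (34, 32),    # 32 int8 bytes + fp16 scale -> 8.5 bits/weight
}

params = 2e9  # ~2B parameters
for name, (block_bytes, block_weights) in formats.items():
    bpw = 8 * block_bytes / block_weights
    gib = params * block_bytes / block_weights / 2**30
    print(f"{name}: {bpw} bits/weight, ~{gib:.2f} GiB for 2B params")
```

If a file named Q4_0 weighs in near the Q5_K estimate rather than the Q4_0 one, that would support the observation above.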
