Generation Speed #13

@CarlosS7

Description

Device & OS

  • Hardware: Raspberry Pi 3B+
  • OS: Raspberry Pi OS 64-bit, Debian 1:6.12.62-1+rpt1 (2025-12-18) aarch64 GNU/Linux
  • Compiler: gcc 14.2.0

Model

  • Model file: tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf
  • Quantization: Q4_K_M

What happened?
I'm getting nowhere near the 4 tok/s expected for the Raspberry Pi 3B+.

Command you ran

picolm models/tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf -p "The capital of France is" -n 10 -t 0 -j 4

Expected output
Generation speed close to 4 tok/s.

Actual output

Model config:
  n_embd=2048, n_ffn=5632, n_heads=32, n_kv_heads=4
  n_layers=22, vocab_size=32000, max_seq=2048
  head_dim=64, rope_base=10000.0
Allocating 1.17 MB for runtime state (+ 44.00 MB FP16 KV cache)
Tokenizer loaded: 32000 tokens, bos=1, eos=2
Prompt: 6 tokens, generating up to 10 (temp=0.00, top_p=0.90, threads=4)
---
 Paris.

2. B.C. The
---
Prefill: 6 tokens in 166.62s (0.0 tok/s)
Generation: 11 tokens in 278.45s (0.0 tok/s)
Total: 445.07s
Memory: 45.17 MB runtime state (FP16 KV cache)
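For reference, working the logged numbers out by hand (the "0.0 tok/s" in the log is just one-decimal rounding, not a broken counter) shows the actual rates are roughly two orders of magnitude below the 4 tok/s target:

```python
# Back-of-the-envelope check using the token counts and times from the log above.
prefill_tokens, prefill_s = 6, 166.62
gen_tokens, gen_s = 11, 278.45

prefill_rate = prefill_tokens / prefill_s   # ~0.036 tok/s
gen_rate = gen_tokens / gen_s               # ~0.040 tok/s

# Both values round to 0.0 at one decimal place, matching the log output,
# and sit roughly 100x below the expected ~4 tok/s.
print(f"prefill: {prefill_rate:.3f} tok/s, generation: {gen_rate:.3f} tok/s")
```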


Labels: bug (Something isn't working)
