Throughput improvement for small batch sizes #17342
I came across a core-scaling issue while running llama-bench on a large core-count machine with small batch sizes. During the investigation I found cache-line contention that limited scaling; this patch removes the contention.
With this patch I've seen throughput improvements ranging from 2% to 44% when running the Qwen3 30B parameter model.
Results were obtained with the following command:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
where "n" is the number of threads.
Here are the results:
Speedup ratios (patched vs. baseline): 1.03, 1.09, 1.16, 1.20, 1.44, 1.34, 1.27, 1.20, 1.16, 1.11, 1.07, 1.05, 1.02, 1.02
===== Test Results =====
Test project llama.cpp/build-ci-debug
      Start  1: test-tokenizer-0-bert-bge
 1/35 Test  #1: test-tokenizer-0-bert-bge ......... Passed 0.11 sec
      Start  2: test-tokenizer-0-command-r
 2/35 Test  #2: test-tokenizer-0-command-r ........ Passed 1.37 sec
      Start  3: test-tokenizer-0-deepseek-coder
 3/35 Test  #3: test-tokenizer-0-deepseek-coder ... Passed 0.24 sec
      Start  4: test-tokenizer-0-deepseek-llm
 4/35 Test  #4: test-tokenizer-0-deepseek-llm ..... Passed 0.60 sec
      Start  5: test-tokenizer-0-falcon
 5/35 Test  #5: test-tokenizer-0-falcon ........... Passed 0.35 sec
      Start  6: test-tokenizer-0-gpt-2
 6/35 Test  #6: test-tokenizer-0-gpt-2 ............ Passed 0.27 sec
      Start  7: test-tokenizer-0-llama-bpe
 7/35 Test  #7: test-tokenizer-0-llama-bpe ........ Passed 0.90 sec
      Start  8: test-tokenizer-0-llama-spm
 8/35 Test  #8: test-tokenizer-0-llama-spm ........ Passed 0.10 sec
      Start  9: test-tokenizer-0-mpt
 9/35 Test  #9: test-tokenizer-0-mpt .............. Passed 0.27 sec
      Start 10: test-tokenizer-0-phi-3
10/35 Test #10: test-tokenizer-0-phi-3 ............ Passed 0.10 sec
      Start 11: test-tokenizer-0-qwen2
11/35 Test #11: test-tokenizer-0-qwen2 ............ Passed 0.95 sec
      Start 12: test-tokenizer-0-refact
12/35 Test #12: test-tokenizer-0-refact ........... Passed 0.27 sec
      Start 13: test-tokenizer-0-starcoder
13/35 Test #13: test-tokenizer-0-starcoder ........ Passed 0.27 sec
      Start 14: test-tokenizers-ggml-vocabs
14/35 Test #14: test-tokenizers-ggml-vocabs ....... Passed 6.92 sec
      Start 15: test-sampling
15/35 Test #15: test-sampling ..................... Passed 3.66 sec
      Start 16: test-grammar-parser
16/35 Test #16: test-grammar-parser ............... Passed 0.00 sec
      Start 17: test-grammar-integration
17/35 Test #17: test-grammar-integration .......... Passed 0.02 sec
      Start 18: test-llama-grammar
18/35 Test #18: test-llama-grammar ................ Passed 0.00 sec
      Start 19: test-chat
19/35 Test #19: test-chat ......................... Passed 7.52 sec
      Start 20: test-json-schema-to-grammar
20/35 Test #20: test-json-schema-to-grammar ....... Passed 1.50 sec
      Start 21: test-tokenizer-1-llama-spm
21/35 Test #21: test-tokenizer-1-llama-spm ........ Passed 0.44 sec
      Start 22: test-chat-parser
22/35 Test #22: test-chat-parser .................. Passed 0.01 sec
      Start 23: test-chat-template
23/35 Test #23: test-chat-template ................ Passed 0.70 sec
      Start 24: test-json-partial
24/35 Test #24: test-json-partial ................. Passed 0.01 sec
      Start 25: test-log
25/35 Test #25: test-log .......................... Passed 0.02 sec
      Start 26: test-regex-partial
26/35 Test #26: test-regex-partial ................ Passed 0.01 sec
      Start 27: test-thread-safety
27/35 Test #27: test-thread-safety ................ Passed 1.03 sec
      Start 28: test-arg-parser
28/35 Test #28: test-arg-parser ................... Passed 0.25 sec
      Start 29: test-gguf
29/35 Test #29: test-gguf ......................... Passed 0.14 sec
      Start 32: test-barrier
30/35 Test #32: test-barrier ...................... Passed 1.44 sec
      Start 33: test-quantize-fns
31/35 Test #33: test-quantize-fns ................. Passed 16.97 sec
      Start 34: test-quantize-perf
32/35 Test #34: test-quantize-perf ................ Passed 0.22 sec
      Start 35: test-rope
33/35 Test #35: test-rope ......................... Passed 0.07 sec
      Start 36: test-mtmd-c-api
34/35 Test #36: test-mtmd-c-api ................... Passed 0.00 sec
      Start 37: test-alloc
35/35 Test #37: test-alloc ........................ Passed 0.00 sec
100% tests passed, 0 tests failed out of 35
Label Time Summary:
main = 46.75 sec*proc (35 tests)
I'd greatly appreciate any feedback that would help get this patch accepted.
Thanks.