Throughput improvement for small batch sizes #17342
I came across a core-scaling issue while running llama-bench on a large core-count machine with small batch sizes. During the investigation I found cache-line contention that limited scaling; this patch removes the contention.
With this patch I've seen throughput improvements ranging from 2% to 44% when running the Qwen3 30B parameter model.
Results were obtained with the following command:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
where "n" is the number of threads.
Here are the results:
Speedup ratios (patched vs. baseline): 1.03, 1.09, 1.16, 1.20, 1.44, 1.34, 1.27, 1.20, 1.16, 1.11, 1.07, 1.05, 1.02, 1.02
===== Test Results =====
Test project llama.cpp/build-ci-debug
      Start  1: test-tokenizer-0-bert-bge
 1/35 Test  #1: test-tokenizer-0-bert-bge ......... Passed 0.11 sec
      Start  2: test-tokenizer-0-command-r
 2/35 Test  #2: test-tokenizer-0-command-r ........ Passed 1.37 sec
      Start  3: test-tokenizer-0-deepseek-coder
 3/35 Test  #3: test-tokenizer-0-deepseek-coder ... Passed 0.24 sec
      Start  4: test-tokenizer-0-deepseek-llm
 4/35 Test  #4: test-tokenizer-0-deepseek-llm ..... Passed 0.60 sec
      Start  5: test-tokenizer-0-falcon
 5/35 Test  #5: test-tokenizer-0-falcon ........... Passed 0.35 sec
      Start  6: test-tokenizer-0-gpt-2
 6/35 Test  #6: test-tokenizer-0-gpt-2 ............ Passed 0.27 sec
      Start  7: test-tokenizer-0-llama-bpe
 7/35 Test  #7: test-tokenizer-0-llama-bpe ........ Passed 0.90 sec
      Start  8: test-tokenizer-0-llama-spm
 8/35 Test  #8: test-tokenizer-0-llama-spm ........ Passed 0.10 sec
      Start  9: test-tokenizer-0-mpt
 9/35 Test  #9: test-tokenizer-0-mpt .............. Passed 0.27 sec
      Start 10: test-tokenizer-0-phi-3
10/35 Test #10: test-tokenizer-0-phi-3 ............ Passed 0.10 sec
      Start 11: test-tokenizer-0-qwen2
11/35 Test #11: test-tokenizer-0-qwen2 ............ Passed 0.95 sec
      Start 12: test-tokenizer-0-refact
12/35 Test #12: test-tokenizer-0-refact ........... Passed 0.27 sec
      Start 13: test-tokenizer-0-starcoder
13/35 Test #13: test-tokenizer-0-starcoder ........ Passed 0.27 sec
      Start 14: test-tokenizers-ggml-vocabs
14/35 Test #14: test-tokenizers-ggml-vocabs ....... Passed 6.92 sec
      Start 15: test-sampling
15/35 Test #15: test-sampling ..................... Passed 3.66 sec
      Start 16: test-grammar-parser
16/35 Test #16: test-grammar-parser ............... Passed 0.00 sec
      Start 17: test-grammar-integration
17/35 Test #17: test-grammar-integration .......... Passed 0.02 sec
      Start 18: test-llama-grammar
18/35 Test #18: test-llama-grammar ................ Passed 0.00 sec
      Start 19: test-chat
19/35 Test #19: test-chat ......................... Passed 7.52 sec
      Start 20: test-json-schema-to-grammar
20/35 Test #20: test-json-schema-to-grammar ....... Passed 1.50 sec
      Start 21: test-tokenizer-1-llama-spm
21/35 Test #21: test-tokenizer-1-llama-spm ........ Passed 0.44 sec
      Start 22: test-chat-parser
22/35 Test #22: test-chat-parser .................. Passed 0.01 sec
      Start 23: test-chat-template
23/35 Test #23: test-chat-template ................ Passed 0.70 sec
      Start 24: test-json-partial
24/35 Test #24: test-json-partial ................. Passed 0.01 sec
      Start 25: test-log
25/35 Test #25: test-log .......................... Passed 0.02 sec
      Start 26: test-regex-partial
26/35 Test #26: test-regex-partial ................ Passed 0.01 sec
      Start 27: test-thread-safety
27/35 Test #27: test-thread-safety ................ Passed 1.03 sec
      Start 28: test-arg-parser
28/35 Test #28: test-arg-parser ................... Passed 0.25 sec
      Start 29: test-gguf
29/35 Test #29: test-gguf ......................... Passed 0.14 sec
      Start 32: test-barrier
30/35 Test #32: test-barrier ...................... Passed 1.44 sec
      Start 33: test-quantize-fns
31/35 Test #33: test-quantize-fns ................. Passed 16.97 sec
      Start 34: test-quantize-perf
32/35 Test #34: test-quantize-perf ................ Passed 0.22 sec
      Start 35: test-rope
33/35 Test #35: test-rope ......................... Passed 0.07 sec
      Start 36: test-mtmd-c-api
34/35 Test #36: test-mtmd-c-api ................... Passed 0.00 sec
      Start 37: test-alloc
35/35 Test #37: test-alloc ........................ Passed 0.00 sec
100% tests passed, 0 tests failed out of 35
Label Time Summary:
main = 46.75 sec*proc (35 tests)
I'd greatly appreciate any feedback that would help get this patch accepted.
Thanks.