
Conversation

@uttampc1

I came across a core-scaling issue while running llama-bench on a machine with a large core count and small batch sizes. During the investigation I found cache-line contention causing the scaling issue. This patch fixes the contention.
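To illustrate the pattern (a standalone sketch, not the actual ggml code): when many threads claim work by bumping a single shared counter, the cache line holding that counter bounces between cores, and the resulting snoop traffic (HITM) grows with the core count. Thread and chunk counts below are hypothetical.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

static std::atomic<int64_t> g_next_chunk{0};   // single hot cache line shared by all threads

static void worker(int64_t n_chunks) {
    for (;;) {
        // every iteration writes the same cache line from a different core
        const int64_t chunk = g_next_chunk.fetch_add(1, std::memory_order_relaxed);
        if (chunk >= n_chunks) {
            break;
        }
        // ... process chunk ...
    }
}

int main() {
    const int     n_threads = 96;       // hypothetical thread count
    const int64_t n_chunks  = 1 << 20;  // hypothetical amount of work
    std::vector<std::thread> pool;
    for (int i = 0; i < n_threads; ++i) {
        pool.emplace_back(worker, n_chunks);
    }
    for (auto & t : pool) {
        t.join();
    }
    return 0;
}
```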

With this patch I've seen throughput improvements ranging from 2% to 44% while running a Qwen3 30B-parameter model.

Results were obtained with the following command:
$ llama-bench -m Qwen3-Coder-30B-A3B-Instruct-Q4_K_S.gguf -p 500 -n 0 -t n -b 16,32,64,96,128
where "n" is the number of threads.

Here are the results:

1. Constant batch size = 16 with varying number of threads

   | Threads | TPS (patched/baseline) |
   |--------:|-----------------------:|
   |       1 | 1.00 |
   |       2 | 1.00 |
   |       4 | 1.00 |
   |       8 | 1.00 |
   |      16 | 1.03 |
   |      32 | 1.09 |
   |      64 | 1.16 |
   |      96 | 1.20 |
2. Constant number of threads = 96 with varying batch size

   | Batch size | TPS (patched/baseline) |
   |-----------:|-----------------------:|
   |          1 | 1.00 |
   |          2 | 1.44 |
   |          4 | 1.34 |
   |          8 | 1.27 |
   |         16 | 1.20 |
   |         32 | 1.16 |
   |         64 | 1.11 |
   |         96 | 1.07 |
   |        128 | 1.05 |
   |        512 | 1.02 |
   |       1024 | 1.02 |

==== Test Results ====

100% tests passed, 0 tests failed out of 35

Label Time Summary:
main = 46.75 sec*proc (35 tests)

I would greatly appreciate any feedback on getting this patch accepted.
Thanks.

…-line contention (cache HITM)

This improves throughput in cases where threads have to wait due to lack of work, causing the process to spend many cycles in a spin loop. The change replaces the dynamic chunk counter with static stride partitioning, which eliminates the shared counter (see the sketch after the commit list below).

* remove one barrier in sgemm()

* static stride partitioning
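
For reference, here is a minimal sketch of static stride partitioning as described above, assuming the usual ggml thread-index convention (ith = thread index, nth = number of threads). It is illustrative only, not the actual sgemm() change; process_chunk and the counts are hypothetical.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

static void process_chunk(int64_t /*chunk*/) {
    // ... per-chunk tile work would go here ...
}

// ith = thread index, nth = number of threads (the usual ggml convention)
static void worker_static(int ith, int nth, int64_t n_chunks) {
    // thread ith handles chunks ith, ith + nth, ith + 2*nth, ...
    // no shared counter is read or written on this path
    for (int64_t chunk = ith; chunk < n_chunks; chunk += nth) {
        process_chunk(chunk);
    }
}

int main() {
    const int     n_threads = 96;       // hypothetical thread count
    const int64_t n_chunks  = 1 << 20;  // hypothetical amount of work
    std::vector<std::thread> pool;
    for (int ith = 0; ith < n_threads; ++ith) {
        pool.emplace_back(worker_static, ith, n_threads, n_chunks);
    }
    for (auto & t : pool) {
        t.join();
    }
    return 0;
}
```

Each thread derives its own chunk sequence locally, so the hot path has no shared writes. The usual trade-off is that static partitioning gives up the load balancing a dynamic counter provides, which matters less when chunks have similar cost.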
@github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning) on Nov 18, 2025