
concurrency implementation for llama.cpp #14

Merged
victorchall merged 1 commit into main from concurrency
Nov 3, 2025

Conversation

@victorchall
Owner

This adds a simple batching and asyncio.gather loop to enable batch concurrency when using hosts such as llama.cpp that support concurrent generation. Note: LM Studio does not support batch concurrency; users will have to switch to llama.cpp or another host that does in order to use this feature.

Seems like a fairly modest gain using Qwen3 VL 32B on an RTX 6000 Blackwell, but still worth it.

Another enhancement for later would be using the OpenAI batch API spec, which allows batching requests into a JSONL file, but this may or may not be supported by any hosts; it is left for future investigation.
Additionally, I'm not certain llama.cpp and this app will always be guaranteed to route subsequent requests to the slots that hold the same history and KV cache; this needs more investigation.
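The batching and asyncio.gather loop described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `caption_image` is a hypothetical stand-in for a single chat-completion request to the OpenAI-compatible endpoint, stubbed here so the loop is runnable.

```python
import asyncio

async def caption_image(path: str) -> str:
    """Hypothetical stand-in for one captioning request to the server."""
    await asyncio.sleep(0)  # placeholder for the real HTTP request
    return f"caption for {path}"

async def caption_all(paths: list[str], batch_size: int = 4) -> list[str]:
    """Caption images in batches of `batch_size` (match llama.cpp's -np)."""
    results: list[str] = []
    for i in range(0, len(paths), batch_size):
        batch = paths[i : i + batch_size]
        # Requests within a batch are issued concurrently; llama.cpp
        # distributes them across its -np server slots.
        results.extend(await asyncio.gather(*(caption_image(p) for p in batch)))
    return results

if __name__ == "__main__":
    captions = asyncio.run(caption_all([f"img{i}.jpg" for i in range(6)]))
    print(len(captions))
```

Keeping `batch_size` equal to `-np` avoids queuing requests the server cannot serve in parallel.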

Example llama.cpp command to enable concurrency via -np:

```shell
llama-server -np 4 -c 32768 --mmproj "mmproj-Qwen3-VL-32B-Instruct-F16.gguf" --model "Qwen3-VL-32B-Instruct-Q4_K_M.gguf" -dev cuda0 --top-k 30 --top-p 0.95 --min-p 0.05 --temp 0.5
```

Note that the context size -c should be scaled up by the value of -np to make sure each slot has sufficient context, since the total context is divided evenly across slots. E.g. -np 4 -c 32768 gives 4 slots, each with 8192 (32768/4) tokens of context.

Some more info on llama.cpp's -np here: ggml-org/llama.cpp#3677

@victorchall victorchall merged commit d0ef922 into main Nov 3, 2025
3 checks passed