
Conversation

@michaelfeil
Contributor

What does this PR do?

I've wanted to submit this PR for a long time. The tokenizer heuristic is incorrectly tuned, and it's a bad idea to spawn as many tokenizer threads as there are CPU cores.

  • If you only have 4 cores, leave 1 for the candle backend. Otherwise the cores tend to clog up, and we want to keep resources free for launching CUDA kernels.
  • Running an inference benchmark (e.g. with https://huggingface.co/TaylorAI/bge-micro, the smallest model I know of) shows that you rarely need more than 2 tokenizer threads. For https://huggingface.co/TaylorAI/bge-micro with multiple raw C clients over localhost, I see benefits from up to 6 tokenizers on a full H100 GPU. Even 16 is overkill, regardless of how many CPU cores you have.
  • Spawning 208 CPU threads adds roughly ~16s of cold-start time, so I cap the pool at 64.
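To make the intent of the bullets above concrete, here is a minimal sketch of such a capped heuristic. The function name and the cap constant are illustrative assumptions for this sketch, not necessarily the exact names or values used in the PR diff:

```rust
/// Hypothetical sketch: pick a tokenizer worker count from the CPU count.
/// Reserve one core for the inference backend (e.g. for launching CUDA
/// kernels) and cap the pool, since beyond a handful of tokenizer threads
/// there is no throughput benefit and cold-start time grows.
fn default_tokenization_workers(num_cpus: usize) -> usize {
    // Leave at least one core for the backend, keep at least one worker,
    // and never exceed the cap (64 here, as an assumed value).
    num_cpus.saturating_sub(1).clamp(1, 64)
}

fn main() {
    // e.g. on a 208-core machine this yields 64 workers instead of 208,
    // avoiding the multi-second thread-spawn penalty at startup.
    println!("{}", default_tokenization_workers(208));
}
```

The `clamp` keeps the behavior sensible at both extremes: a single-core machine still gets one worker, and a many-core machine stops at the cap.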

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the documentation guidelines.
  • Did you write any new necessary tests? If applicable, did you include or update the insta snapshots?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@michaelfeil michaelfeil changed the title Tokenization workers heuristic. feat: better Tokenization # workers heuristic Nov 25, 2025
@alvarobartt alvarobartt self-requested a review November 25, 2025 12:24
@alvarobartt alvarobartt added this to the v1.9.0 milestone Nov 25, 2025
