Scheduler-level speculative decoding using a draft model for faster single-request decode. When the batch has exactly one request and a draft model is configured, the scheduler generates N candidate tokens with the draft model, then verifies all N+1 in a single main-model forward pass. It automatically falls back to standard continuous batching when multiple requests are active.

Key implementation details:
- Draft model loaded alongside the main model via model settings
- Snapshot/restore approach for hybrid SSM/attention models (e.g. Qwen3-Next), whose ArraysCache layers can't be trimmed
- Proper KVCache array slicing after trim, fixing a bug where only the offset was reset in the simple KVCache
- Per-request cumulative accept-rate logging
- Admin UI with a draft model selector and num_draft_tokens config
- Lazy prefill of the draft cache on the first speculative step
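The draft-then-verify step described above can be sketched as follows. This is a minimal illustration, not the PR's code: the toy "models" are plain callables mapping a token sequence to the next token, and the verify loop stands in for the positions of what is, in the real scheduler, a single batched forward pass.

```python
def draft_tokens(draft_model, prefix, n):
    """Greedily sample n candidate tokens from the draft model."""
    tokens = list(prefix)
    drafted = []
    for _ in range(n):
        nxt = draft_model(tokens)
        drafted.append(nxt)
        tokens.append(nxt)
    return drafted

def verify(main_model, prefix, drafted):
    """Score prefix + drafted with the main model (conceptually one forward
    pass over all N+1 positions). Accept the longest matching prefix of the
    draft, plus one "bonus" token chosen by the main model."""
    tokens = list(prefix)
    accepted = []
    for d in drafted:
        target = main_model(tokens)       # main model's choice at this position
        if target != d:
            accepted.append(target)       # bonus token replaces first mismatch
            return accepted, False
        accepted.append(d)
        tokens.append(d)
    accepted.append(main_model(tokens))   # all drafts accepted: one bonus token
    return accepted, True
```

When the draft and main models agree, each step emits N+1 tokens for one main-model forward; on disagreement, the main model's own token is still emitted, so output is identical to non-speculative decoding.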
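The snapshot/restore point deserves a sketch: for hybrid SSM/attention models, the recurrent state has no per-token structure, so rejected draft tokens can't be trimmed off the cache. One possible shape (names and structure here are assumptions for illustration, not the project's actual classes):

```python
import copy

class ArraysCacheSketch:
    """Stand-in for a cache whose state is opaque per-layer arrays
    (e.g. SSM states) with no per-token axis to trim."""
    def __init__(self):
        self.state = []

    def snapshot(self):
        return copy.deepcopy(self.state)

    def restore(self, snap):
        self.state = copy.deepcopy(snap)

def speculative_step(cache, run_forward, drafted):
    """Snapshot before the speculative forward; if any drafts are rejected,
    roll the cache back and replay only the accepted tokens."""
    snap = cache.snapshot()
    accepted = run_forward(cache, drafted)
    if len(accepted) < len(drafted):
        cache.restore(snap)
        run_forward(cache, accepted)
    return accepted
```

The cost of the rollback replay is why this path only pays off when the accept rate is reasonably high, which the per-request accept-rate logging makes visible.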
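The KVCache slicing fix can also be illustrated. For a plain attention KV cache, rejected tokens are removed by trimming; the bug class being fixed is resetting the offset without also slicing the arrays. A minimal sketch, assuming a (batch, heads, seq_len, head_dim) layout and using NumPy for illustration:

```python
import numpy as np

class SimpleKVCache:
    """Toy KV cache: offset counts valid cached positions along axis 2."""
    def __init__(self, keys, values):
        self.keys = keys
        self.values = values
        self.offset = keys.shape[2]

    def trim(self, n):
        """Discard the last n positions. Slicing the arrays (not just
        moving the offset back) keeps them consistent with the new offset."""
        self.offset -= n
        self.keys = self.keys[:, :, : self.offset, :]
        self.values = self.values[:, :, : self.offset, :]
```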
Summary

Adds scheduler-level speculative decoding for faster single-request generation on Apple Silicon.
Files changed

- scheduler.py (speculative step, draft cache management, snapshot/restore)
- engine/batched.py, engine/vlm.py, engine_pool.py (draft model loading)
- model_settings.py (new fields), model_discovery.py
- routes.py, dashboard.js, modal/settings templates, i18n (ko/en/ja/zh)
- test_model_settings.py

Test plan
pytest tests/test_model_settings.py -v
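For context, the new model-settings fields might look roughly like the sketch below. Only num_draft_tokens is named in the PR; the field names and defaults here are assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModelSettings:
    model_path: str
    # Hypothetical field names: a configured draft model enables the
    # speculative path; None keeps standard continuous batching.
    draft_model_path: Optional[str] = None
    num_draft_tokens: int = 4  # candidate tokens drafted per speculative step
```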