feat: speculative decoding support #91

Draft
jundot wants to merge 1 commit into `main` from `feat/speculative-decoding`

Conversation


jundot (Owner) commented on Mar 6, 2026

summary

adds scheduler-level speculative decoding for faster single-request generation on Apple Silicon.

  • when only one request is in the batch and a draft model is configured, the scheduler drafts N tokens with the small model, then verifies all N+1 in a single main-model forward pass
  • automatically falls back to standard continuous batching when multiple concurrent requests are active
  • supports hybrid SSM/attention models (Qwen3-Next, etc.) via cache snapshot/restore instead of trim
  • fixes a simple-KVCache trim bug where the offset resets on the next update_and_fetch
  • admin dashboard UI for configuring the draft model and num_draft_tokens per model
  • per-request cumulative accept-rate logging
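
The draft-then-verify step above can be sketched as follows. This is a minimal greedy-acceptance illustration; `draft_next_token` and `main_forward` are hypothetical stand-ins for the real model calls, not this repo's API.

```python
# Sketch of one speculative decoding step (greedy acceptance).
# All function names here are illustrative placeholders.

def speculative_step(prefix, draft_next_token, main_forward, num_draft_tokens):
    # 1. draft N candidate tokens autoregressively with the cheap model
    drafted, ctx = [], list(prefix)
    for _ in range(num_draft_tokens):
        tok = draft_next_token(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. one main-model forward pass over prefix + drafts yields N+1
    #    predictions: the main model's next token after each prefix length
    main_preds = main_forward(prefix, drafted)  # len == num_draft_tokens + 1

    # 3. accept the longest prefix of drafts the main model agrees with,
    #    then append the main model's own token at the first mismatch
    accepted = []
    for i, tok in enumerate(drafted):
        if main_preds[i] != tok:
            break
        accepted.append(tok)
    accepted.append(main_preds[len(accepted)])
    return accepted  # between 1 and num_draft_tokens + 1 tokens per step
```

Each call emits at least one main-model-verified token, so even a draft model that never agrees only degrades to ordinary one-token-per-forward decoding.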

files changed

| area | files |
| --- | --- |
| core | `scheduler.py` (speculative step, draft cache management, snapshot/restore) |
| engine | `engine/batched.py`, `engine/vlm.py`, `engine_pool.py` (draft model loading) |
| settings | `model_settings.py` (new fields), `model_discovery.py` |
| admin | `routes.py`, `dashboard.js`, modal/settings templates, i18n (ko/en/ja/zh) |
| tests | `test_model_settings.py` |

test plan

  • single LLM request with speculative decoding enabled — verify accept rate in logs
  • multi-request scenario — verify automatic fallback to standard batching
  • hybrid SSM model (Qwen3-Next) — verify snapshot/restore produces correct output
  • pure attention model — verify trim-based cache rewind works
  • admin UI — toggle speculative decoding, select draft model, save/reload settings
  • VLM model with draft model configured — verify it loads and works
  • `pytest tests/test_model_settings.py -v`
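
The trim-based cache rewind exercised by the pure-attention test can be illustrated with a toy cache. This is a hypothetical sketch, not the repo's KVCache: the point is that rejected draft positions are rewound by moving the write offset back, and fetches must slice by that offset so the stale tail is simply overwritten on the next update.

```python
class ToyKVCache:
    """Toy preallocated cache with a write offset (illustrative only)."""
    def __init__(self, capacity=16):
        self.keys = [0] * capacity
        self.offset = 0

    def update_and_fetch(self, new_keys):
        n = len(new_keys)
        self.keys[self.offset:self.offset + n] = new_keys
        self.offset += n
        # slice by the current offset, not the array length, so trimmed
        # (rejected) positions never leak back into the result
        return self.keys[:self.offset]

    def trim(self, n):
        # rewind n rejected draft positions; the stale tail is overwritten
        # by the next update_and_fetch
        self.offset = max(self.offset - n, 0)
```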

scheduler-level speculative decoding using a draft model for faster
single-request decode. when the batch has exactly one request and a
draft model is configured, the scheduler generates N candidate tokens
with the draft model, then verifies all N+1 in a single main-model
forward pass. automatically falls back to standard continuous batching
when multiple requests are active.
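
The fallback rule amounts to a simple per-step dispatch in the scheduler loop; a sketch with hypothetical names (not the repo's actual attributes):

```python
# Illustrative dispatch: speculate only in the single-request case.

def choose_step_mode(num_active_requests, draft_model_configured):
    """Speculative decoding needs the whole main-model batch for
    verification, so it only pays off with exactly one active request;
    otherwise standard continuous batching is used."""
    if num_active_requests == 1 and draft_model_configured:
        return "speculative"
    return "batched"
```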

key implementation details:
- draft model loaded alongside main model via model settings
- snapshot/restore approach for hybrid SSM/attention models (e.g.
  Qwen3-Next) where ArraysCache layers can't be trimmed
- proper KVCache array slicing after trim to fix simple KVCache
  offset reset bug
- per-request cumulative accept rate logging
- admin UI with draft model selector and num_draft_tokens config
- draft cache lazy prefill on first speculative step
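
The snapshot/restore rewind can be sketched with toy stand-ins (the real cache classes differ): attention caches store one entry per position and so support trim, but a recurrent SSM layer keeps only its latest state, so rejected draft tokens cannot be trimmed back out; the hybrid stack is instead rolled back by restoring a pre-draft deep copy.

```python
import copy

class AttnCache:
    """Toy attention cache: per-position entries, so trim() is possible."""
    def __init__(self):
        self.entries = []
    def update(self, toks):
        self.entries.extend(toks)
    def trim(self, n):
        del self.entries[len(self.entries) - n:]

class SSMState:
    """Toy recurrent state: only the latest state survives, so rejected
    draft tokens cannot be removed by trimming."""
    def __init__(self):
        self.state = 0
    def update(self, toks):
        for t in toks:
            self.state = self.state * 31 + t  # toy recurrence

def snapshot(layer_caches):
    # taken once before drafting begins
    return copy.deepcopy(layer_caches)

def restore(snap):
    # on rejection, roll the whole hybrid stack back to the pre-draft
    # snapshot instead of trimming layer by layer
    return copy.deepcopy(snap)
```

In a real step the accepted tokens would then be re-applied on top of the restored caches; the sketch only shows the rollback itself.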
jundot force-pushed the main branch 5 times, most recently from 7d954e8 to cd10c3d on March 11, 2026 at 15:28
jundot force-pushed the main branch 10 times, most recently from 338f98a to d90bf8a on March 22, 2026 at 10:25
jundot force-pushed the main branch 11 times, most recently from 187e87b to dfc5b20 on March 29, 2026 at 10:44
