feat: speculative decoding support #91

Draft
jundot wants to merge 1 commit into `main` from `feat/speculative-decoding`

Conversation


jundot (Owner) commented on Mar 6, 2026

summary

adds scheduler-level speculative decoding for faster single-request generation on Apple Silicon.

  • when only one request is in the batch and a draft model is configured, the scheduler drafts N tokens with the small model, then verifies all N+1 in a single main-model forward pass
  • automatically falls back to standard continuous batching when multiple concurrent requests are active
  • supports hybrid SSM/attention models (Qwen3-Next, etc.) via cache snapshot/restore instead of trim
  • fixes a simple-KVCache trim bug where the offset resets on the next update_and_fetch
  • admin dashboard UI for configuring the draft model and num_draft_tokens per model
  • per-request cumulative accept-rate logging
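
The draft-then-verify step above can be sketched as follows. This is a minimal greedy-acceptance illustration; `draft_next_token` and `main_forward` are hypothetical stand-ins for the real model calls, not this repo's API.

```python
# Sketch of one speculative decoding step (greedy acceptance).
# All function names here are illustrative placeholders.

def speculative_step(prefix, draft_next_token, main_forward, num_draft_tokens):
    # 1. draft N candidate tokens autoregressively with the cheap model
    drafted, ctx = [], list(prefix)
    for _ in range(num_draft_tokens):
        tok = draft_next_token(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. one main-model forward pass over prefix + drafts yields N+1
    #    predictions: the main model's next token after each prefix length
    main_preds = main_forward(prefix, drafted)  # len == num_draft_tokens + 1

    # 3. accept the longest prefix of drafts the main model agrees with,
    #    then append the main model's own token at the first mismatch
    accepted = []
    for i, tok in enumerate(drafted):
        if main_preds[i] != tok:
            break
        accepted.append(tok)
    accepted.append(main_preds[len(accepted)])
    return accepted  # between 1 and num_draft_tokens + 1 tokens per step
```

Each call emits at least one main-model-verified token, so even a draft model that never agrees only degrades to ordinary one-token-per-forward decoding.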

files changed

| area | files |
| --- | --- |
| core | `scheduler.py` (speculative step, draft cache management, snapshot/restore) |
| engine | `engine/batched.py`, `engine/vlm.py`, `engine_pool.py` (draft model loading) |
| settings | `model_settings.py` (new fields), `model_discovery.py` |
| admin | `routes.py`, `dashboard.js`, modal/settings templates, i18n (ko/en/ja/zh) |
| tests | `test_model_settings.py` |

test plan

  • single LLM request with speculative decoding enabled — verify accept rate in logs
  • multi-request scenario — verify automatic fallback to standard batching
  • hybrid SSM model (Qwen3-Next) — verify snapshot/restore produces correct output
  • pure attention model — verify trim-based cache rewind works
  • admin UI — toggle speculative decoding, select draft model, save/reload settings
  • VLM model with draft model configured — verify it loads and works
  • `pytest tests/test_model_settings.py -v`
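
The trim-based cache rewind exercised by the pure-attention test can be illustrated with a toy cache. This is a hypothetical sketch, not the repo's KVCache: the point is that rejected draft positions are rewound by moving the write offset back, and fetches must slice by that offset so the stale tail is simply overwritten on the next update.

```python
class ToyKVCache:
    """Toy preallocated cache with a write offset (illustrative only)."""
    def __init__(self, capacity=16):
        self.keys = [0] * capacity
        self.offset = 0

    def update_and_fetch(self, new_keys):
        n = len(new_keys)
        self.keys[self.offset:self.offset + n] = new_keys
        self.offset += n
        # slice by the current offset, not the array length, so trimmed
        # (rejected) positions never leak back into the result
        return self.keys[:self.offset]

    def trim(self, n):
        # rewind n rejected draft positions; the stale tail is overwritten
        # by the next update_and_fetch
        self.offset = max(self.offset - n, 0)
```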

scheduler-level speculative decoding using a draft model for faster
single-request decode. when the batch has exactly one request and a
draft model is configured, the scheduler generates N candidate tokens
with the draft model, then verifies all N+1 in a single main-model
forward pass. automatically falls back to standard continuous batching
when multiple requests are active.
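
The fallback rule amounts to a simple per-step dispatch in the scheduler loop; a sketch with hypothetical names (not the repo's actual attributes):

```python
# Illustrative dispatch: speculate only in the single-request case.

def choose_step_mode(num_active_requests, draft_model_configured):
    """Speculative decoding needs the whole main-model batch for
    verification, so it only pays off with exactly one active request;
    otherwise standard continuous batching is used."""
    if num_active_requests == 1 and draft_model_configured:
        return "speculative"
    return "batched"
```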

key implementation details:
- draft model loaded alongside main model via model settings
- snapshot/restore approach for hybrid SSM/attention models (e.g.
  Qwen3-Next) where ArraysCache layers can't be trimmed
- proper KVCache array slicing after trim to fix simple KVCache
  offset reset bug
- per-request cumulative accept rate logging
- admin UI with draft model selector and num_draft_tokens config
- draft cache lazy prefill on first speculative step
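
The snapshot/restore rewind can be sketched with toy stand-ins (the real cache classes differ): attention caches store one entry per position and so support trim, but a recurrent SSM layer keeps only its latest state, so rejected draft tokens cannot be trimmed back out; the hybrid stack is instead rolled back by restoring a pre-draft deep copy.

```python
import copy

class AttnCache:
    """Toy attention cache: per-position entries, so trim() is possible."""
    def __init__(self):
        self.entries = []
    def update(self, toks):
        self.entries.extend(toks)
    def trim(self, n):
        del self.entries[len(self.entries) - n:]

class SSMState:
    """Toy recurrent state: only the latest state survives, so rejected
    draft tokens cannot be removed by trimming."""
    def __init__(self):
        self.state = 0
    def update(self, toks):
        for t in toks:
            self.state = self.state * 31 + t  # toy recurrence

def snapshot(layer_caches):
    # taken once before drafting begins
    return copy.deepcopy(layer_caches)

def restore(snap):
    # on rejection, roll the whole hybrid stack back to the pre-draft
    # snapshot instead of trimming layer by layer
    return copy.deepcopy(snap)
```

In a real step the accepted tokens would then be re-applied on top of the restored caches; the sketch only shows the rollback itself.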
jundot force-pushed the main branch 5 times, most recently from 7d954e8 to cd10c3d on March 11, 2026 at 15:28
jundot force-pushed the main branch 10 times, most recently from 338f98a to d90bf8a on March 22, 2026 at 10:25
jundot force-pushed the main branch 11 times, most recently from 187e87b to dfc5b20 on March 29, 2026 at 10:44
