vLLM 0.19 reaches `AsyncLLM.__init__` through `AsyncLLM.from_vllm_config` for the OpenAI server, skipping `LLMEngine.from_engine_args`. That left `set_flash_head(metadata)` uncalled under `vllm serve`, so the patched `_get_logits` always saw `get_flash_head() is None` and silently fell back to the dense lm_head on every decode step. Add a mirror of `patch_llm` that targets `AsyncLLM.__init__` so the metadata is written on both paths, and stop caching the `None` result in `_get_flash_head` so a late-arriving metadata file is picked up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
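A minimal sketch of the mirrored patch, assuming `AsyncLLM` lives at `vllm.v1.engine.async_llm` (as in current vLLM v1) and reusing whatever loader `patch_llm` already installs; `_load_and_set_flash_head` is a hypothetical stand-in for that loader:

```python
# Sketch only: wraps AsyncLLM.__init__ so the FlashHead metadata load also
# runs under `vllm serve`. `_load_and_set_flash_head` is a hypothetical
# stand-in for the loader patch_llm already uses on the LLMEngine path.
import functools

from vllm.v1.engine.async_llm import AsyncLLM  # import path assumed


def patch_async_llm():
    original_init = AsyncLLM.__init__

    @functools.wraps(original_init)
    def patched_init(self, *args, **kwargs):
        original_init(self, *args, **kwargs)
        # Same metadata hook the LLMEngine patch installs; ends in
        # set_flash_head(metadata) so _get_logits sees a non-None head.
        _load_and_set_flash_head(self)

    AsyncLLM.__init__ = patched_init
```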
JonnaMat requested changes on Apr 22, 2026
```python
logger = logging.getLogger(__name__)


# Sentinel for lazy loading
_FLASH_HEAD_NOT_LOADED = object()
```
This is needed since `get_flash_head()` may be `None` (e.g., when running non-FlashHead models).
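For reference, a minimal sketch of the sentinel pattern being discussed, assuming the metadata lives at the `/tmp/flashhead_metadata.pt` path from the repro below (the path constant name and `torch.load` call are illustrative):

```python
# Sketch: None is a legitimate answer (non-FlashHead models), so a distinct
# sentinel is needed to mean "load not attempted / not found yet".
import os

import torch

_FLASH_HEAD_METADATA_PATH = "/tmp/flashhead_metadata.pt"  # path from the repro
_FLASH_HEAD_NOT_LOADED = object()
_flash_head = _FLASH_HEAD_NOT_LOADED


def _get_flash_head():
    global _flash_head
    if _flash_head is _FLASH_HEAD_NOT_LOADED:
        if os.path.exists(_FLASH_HEAD_METADATA_PATH):
            _flash_head = torch.load(_FLASH_HEAD_METADATA_PATH)
        else:
            # Don't cache the miss: keep the sentinel in place so a metadata
            # file written after startup is found on a later decode step.
            return None
    return _flash_head
```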
```python
    return None


def patch_async_llm():
```
I think we should add a guard for idempotence (similar to what we do in logits_processor.py) [if _flash_head is None:...].
While `AsyncLLM` is constructed only once per engine (not per decode / request), there may be other parts of vLLM that call it. We could add a `_FLASH_HEAD_NOT_LOADED` sentinel.
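A sketch of the requested guard; logits_processor.py reportedly keys off `_flash_head is None`, but for the patch itself a module-level flag (name illustrative) reads cleaner:

```python
# Sketch: guard against double-patching if register() or some other vLLM
# code path triggers patch_async_llm() more than once.
_async_llm_patched = False  # illustrative module-level flag


def patch_async_llm():
    global _async_llm_patched
    if _async_llm_patched:
        return  # AsyncLLM.__init__ gets wrapped at most once
    _async_llm_patched = True
    ...  # install the __init__ wrapper as above
```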
Summary

- Adds `patch_async_llm`, targeting `AsyncLLM.__init__`, so the FlashHead metadata load runs under both the Python `LLM(...)` API and `vllm serve`. The existing `patch_llm` only covers `LLMEngine.from_engine_args`, which `vllm serve` never reaches in vLLM 0.19 (the OpenAI entrypoint goes through `AsyncLLM.from_vllm_config`, then `AsyncLLM.__init__`).
- Stops caching the `None` result in `logits_processor._get_flash_head`, so a metadata file that appears after server startup is still picked up on the next decode step.

What went wrong today (repro)
With `flash-head==0.1.9` installed against `vllm==0.19.1`, running

```
vllm serve embedl/Cosmos-Reason2-2B-W4A16-Edge2-FlashHead \
  --max-model-len 8192 --gpu-memory-utilization 0.75 --max-num-seqs 2
```

starts up and serves correctly, but `/tmp/flashhead_metadata.pt` is never written. `get_flash_head()` returns `None`, and the patched `LogitsProcessor._get_logits` falls straight through to the original dense path on every decode step, so FlashHead is silently disabled under `vllm serve`.

Traced through: `vllm.entrypoints.openai.api_server.build_async_engine_client_from_engine_args` calls `AsyncLLM.from_vllm_config`, which calls `AsyncLLM.__init__`. `LLMEngine.from_engine_args` (the legacy class the current patch targets) is never called. The Python `LLM(...)` API still works because `LLM.__init__` does call `LLMEngine.from_engine_args`.
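Given that call graph, a sketch of how `register()` could wire both patches so each entrypoint hits exactly one of them (`patch_logits` is an assumed name for the `_get_logits` patch):

```python
# Sketch: one patch per entrypoint.
#   LLM(...)   -> LLM.__init__ -> LLMEngine.from_engine_args        (patch_llm)
#   vllm serve -> AsyncLLM.from_vllm_config -> AsyncLLM.__init__    (patch_async_llm)
def register():
    patch_llm()
    patch_async_llm()
    patch_logits()  # assumed name for the LogitsProcessor._get_logits patch
```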
Verification

Before: no `[FlashHead] Loaded lazily...` log from either process after startup, no `/tmp/flashhead_metadata.pt`, dense-head fallback.

After, with this PR: the exact `curl` from the README returns a coherent, detailed video description.

Note (not fixed here)
vLLM's `DEFAULT_LOGGING_CONFIG` only attaches a handler to the `vllm` logger, so every `[FlashHead] ...` INFO line is dropped unless the user sets `VLLM_LOGGING_CONFIG_PATH` to a config that includes a `flash_head` logger. Worth either adding a handler inside `register()`, or mentioning in the README that the activation banner won't appear under `vllm serve` by default.
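A sketch of the `register()`-side fix, using only the stdlib `logging` module (handler format is arbitrary):

```python
# Sketch: vLLM's DEFAULT_LOGGING_CONFIG only configures the "vllm" logger
# tree, so give the flash_head logger its own handler or its INFO lines
# never reach the console under `vllm serve`.
import logging


def _ensure_flash_head_logging():
    logger = logging.getLogger("flash_head")
    if not logger.handlers:  # idempotent across repeated register() calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(levelname)s %(asctime)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
```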