[BugFix][KVCache] Add enc_dec_block_num to prefill_kvcache_block_num check#1
Closed
kevincheng2 wants to merge 2 commits into release/2.4 from
Conversation
…he_block_num check

## Motivation

Cherry-pick from release/2.5: the original assertion only checked `prefill_kvcache_block_num >= max_block_num_per_seq`, but for encoder-decoder models the kvcache must also reserve blocks for the encoder side (`enc_dec_block_num`). Without this check, the service could silently allocate insufficient blocks for enc-dec sequences.

## Modifications

- `CacheConfig.postprocess`: tighten the assertion to `prefill_kvcache_block_num >= max_block_num_per_seq + enc_dec_block_num`; the error message guides the user to reduce `max_model_len` or increase `num_gpu_blocks_override`.
- `CacheConfig.reset`: same tightening; the error message guides the user to reduce `max_model_len` or switch to larger GPU cards (the override is not applicable here).

## Usage or Command

No change to the launch command. If the assertion fires, adjust:

```bash
# Option 1: reduce max_model_len
python -m fastdeploy.entrypoints.openai.api_server \
    --max-model-len <smaller_value> ...

# Option 2 (postprocess only): increase the GPU block count
python -m fastdeploy.entrypoints.openai.api_server \
    --num-gpu-blocks-override <larger_value> ...
```

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
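The tightened check described above can be sketched as follows. The field names (`prefill_kvcache_block_num`, `max_block_num_per_seq`, `enc_dec_block_num`) come from the PR itself, but the surrounding class is a simplified stand-in for FastDeploy's actual `CacheConfig`, not its real implementation:

```python
# Simplified sketch of the assertion added by this PR. The real
# CacheConfig derives these values from model and cache settings;
# here they are plain constructor arguments for illustration.
class CacheConfig:
    def __init__(self, prefill_kvcache_block_num, max_block_num_per_seq, enc_dec_block_num):
        self.prefill_kvcache_block_num = prefill_kvcache_block_num
        self.max_block_num_per_seq = max_block_num_per_seq
        self.enc_dec_block_num = enc_dec_block_num

    def postprocess(self):
        # Before the fix, only prefill_kvcache_block_num >= max_block_num_per_seq
        # was asserted. Encoder-decoder models additionally need
        # enc_dec_block_num blocks reserved for the encoder side.
        required = self.max_block_num_per_seq + self.enc_dec_block_num
        assert self.prefill_kvcache_block_num >= required, (
            f"prefill_kvcache_block_num ({self.prefill_kvcache_block_num}) must be >= "
            f"max_block_num_per_seq + enc_dec_block_num ({required}); "
            "reduce max_model_len or increase num_gpu_blocks_override."
        )
```

With this in place, a config that previously passed the looser check but left no room for the encoder blocks now fails fast at startup instead of under-allocating at runtime.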
Owner
Author
Closing to resubmit with a clean branch.
Checklist

- Run `pre-commit` before commit.
- For a `release` branch, make sure the PR has been submitted to the `develop` branch first, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.