
[BugFix][KVCache] Add enc_dec_block_num to prefill_kvcache_block_num check#1

Closed
kevincheng2 wants to merge 2 commits into release/2.4 from
fix/prefill-kvcache-block-check

Conversation

@kevincheng2
Owner

Motivation

Cherry-pick from release/2.5: the original assertion only checked
`prefill_kvcache_block_num >= max_block_num_per_seq`, but for
encoder-decoder models the kvcache must also reserve blocks for the
encoder side (`enc_dec_block_num`). Without this check, the service
could silently allocate insufficient blocks for enc-dec sequences.

Modifications

  • `CacheConfig.postprocess`: tighten assertion to
    `prefill_kvcache_block_num >= max_block_num_per_seq + enc_dec_block_num`;
    the error message guides the user to reduce `max_model_len` or increase
    `num_gpu_blocks_override`
  • `CacheConfig.reset`: same tightening; the error message guides the user
    to reduce `max_model_len` or switch to larger GPU cards (the override
    is not applicable here)
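The tightened check can be sketched as follows. This is a minimal illustration using the field names from the description above; the class is a simplified stand-in, not FastDeploy's actual `CacheConfig` definition.

```python
# Simplified stand-in for CacheConfig, showing only the tightened assertion.
class CacheConfig:
    def __init__(self, prefill_kvcache_block_num, max_block_num_per_seq,
                 enc_dec_block_num):
        self.prefill_kvcache_block_num = prefill_kvcache_block_num
        self.max_block_num_per_seq = max_block_num_per_seq
        self.enc_dec_block_num = enc_dec_block_num

    def postprocess(self):
        # Old check: prefill_kvcache_block_num >= max_block_num_per_seq.
        # New check also reserves encoder-side blocks for enc-dec models.
        required = self.max_block_num_per_seq + self.enc_dec_block_num
        assert self.prefill_kvcache_block_num >= required, (
            f"prefill_kvcache_block_num ({self.prefill_kvcache_block_num}) "
            f"must be >= max_block_num_per_seq + enc_dec_block_num "
            f"({required}); reduce max_model_len or increase "
            f"num_gpu_blocks_override"
        )
```

With the old check, a config with 8 prefill blocks, `max_block_num_per_seq=6`, and `enc_dec_block_num=3` would pass (8 >= 6) yet leave the encoder side under-provisioned; the new check rejects it at startup instead.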

Usage or Command

No change to launch command. If the assertion fires, adjust:

```bash
# Option 1: reduce max_model_len
python -m fastdeploy.entrypoints.openai.api_server \
  --max-model-len <smaller_value> ...

# Option 2 (postprocess only): increase GPU block count
python -m fastdeploy.entrypoints.openai.api_server \
  --num-gpu-blocks-override <larger_value> ...
```

Checklist

  • Add at least a tag in the PR title.
  • Format your code, run pre-commit before commit.
  • Add unit tests. No new logic introduced, only assertion tightened.
  • Provide accuracy results. Not applicable.
  • If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.

kevincheng2 and others added 2 commits March 30, 2026 20:47
…he_block_num check

## Motivation

Cherry-pick from release/2.5: the original assertion only checked
`prefill_kvcache_block_num >= max_block_num_per_seq`, but for
encoder-decoder models the kvcache must also reserve blocks for the
encoder side (`enc_dec_block_num`). Without this check, the service
could silently allocate insufficient blocks for enc-dec sequences.

## Modifications

- `CacheConfig.postprocess`: tighten assertion to
  `prefill_kvcache_block_num >= max_block_num_per_seq + enc_dec_block_num`
- `CacheConfig.reset`: same tightening
- Improve error message to guide users to reduce `max_model_len` or
  increase `num_gpu_blocks_override`

## Usage or Command

No change to launch command. If the assertion fires, adjust:

```bash
# Option 1: reduce max_model_len
python -m fastdeploy.entrypoints.openai.api_server \
  --max-model-len <smaller_value> ...

# Option 2: increase GPU block count
python -m fastdeploy.entrypoints.openai.api_server \
  --num-gpu-blocks-override <larger_value> ...
```

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…he_block_num check

## Motivation

Cherry-pick from release/2.5: the original assertion only checked
`prefill_kvcache_block_num >= max_block_num_per_seq`, but for
encoder-decoder models the kvcache must also reserve blocks for the
encoder side (`enc_dec_block_num`). Without this check, the service
could silently allocate insufficient blocks for enc-dec sequences.

## Modifications

- `CacheConfig.postprocess`: tighten assertion to
  `prefill_kvcache_block_num >= max_block_num_per_seq + enc_dec_block_num`,
  error message guides user to reduce `max_model_len` or increase
  `num_gpu_blocks_override`
- `CacheConfig.reset`: same tightening, error message guides user to
  reduce `max_model_len` or replace with larger GPU cards (override
  is not applicable here)

## Usage or Command

No change to launch command. If the assertion fires, adjust:

```bash
# Option 1: reduce max_model_len
python -m fastdeploy.entrypoints.openai.api_server \
  --max-model-len <smaller_value> ...

# Option 2 (postprocess only): increase GPU block count
python -m fastdeploy.entrypoints.openai.api_server \
  --num-gpu-blocks-override <larger_value> ...
```

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@kevincheng2 kevincheng2 changed the title [Cherry-Pick][BugFix][KVCache] Add enc_dec_block_num to prefill_kvcache_block_num check [BugFix][KVCache] Add enc_dec_block_num to prefill_kvcache_block_num check Mar 30, 2026
@kevincheng2
Owner Author

Closing to resubmit with a clean branch.
