-
Notifications
You must be signed in to change notification settings - Fork 86
Advanced Prefix Cache Controls #287
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
sjmonson
wants to merge
8
commits into
vllm-project:main
Choose a base branch
from
sjmonson:feat/adv_prefix
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Co-authored-by: Mehul <MEHTMEHUL@GMAIL.COM> Co-authored-by: Samuel Monson <smonson@redhat.com> Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
4 tasks
markurtz
pushed a commit
that referenced
this pull request
Oct 3, 2025
## TODO - Docs - ~CSV arg string support~ CSV arg string now supports single bucket (see last example). Might leave it at that for now. - More validation ## Summary <!-- Include a short paragraph of the changes introduced in this PR. If this PR requires additional context or rationale, explain why the changes are necessary. --> This PR is a port of #287 to the v0.4.0 refactor branch. Adds controls for sharing one or more fixed prefixes between samples. See examples bellow. ## Details <!-- Provide a detailed list of all changes introduced in this pull request. --> Adds a `prefix_buckets` argument to the `SyntheticTextDatasetConfig`, each bucket consists of a prefix count, token count, and bucket weight. Prefix count sets the number of unique prefixes to generate for a given bucket, token count is the length of each prompt in the bucket, and bucket weight is used to calculate the proportion of requests the bucket applies to relative to the sum of all bucket weights. Here are a few examples: Here we have one bucket of 32 prefixes of length 2048. Since there are 1024 total samples each prefix will apply to 32 samples. If there is only one bucket than weight can be omitted as the bucket applies to 100% of samples. ```yaml data: prefix_buckets: - prefix_tokens: 2048 prefix_count: 32 prompt_tokens: 256 output_tokens: 256 samples: 1024 ``` In this modified version of the first example 16 of the prompts have 2048 tokens while the other 16 have 1024 tokens. ```yaml data: prefix_buckets: - prefix_tokens: 2048 prefix_count: 16 bucket_weight: 50 - prefix_tokens: 1024 prefix_count: 16 bucket_weight: 50 prompt_tokens: 256 output_tokens: 256 samples: 1024 ``` The prefix tokens of a bucket can also be 0 to disable prefixes for those samples. Here is an example where 40% of the samples have a prefix of 2048 tokens while the other 60% have no prefix. ```yaml data: prefix_buckets: - prefix_tokens: 2048 bucket_weight: 40 - prefix_tokens: 0 bucket_weight: 60 prompt_tokens: 256 output_tokens: 256 samples: 1000 ``` If only a single bucket is needed, it can be set at the top level. This make the changes backwards compatible with the previous interface and allows the CSV string format to work without parsing nested structures (at least for this use-case). ```yaml data: prefix_tokens: 128 prefix_count: 10 prompt_tokens: 256 output_tokens: 256 samples: 1000 ``` ## Test Plan <!-- List the steps needed to test this PR. --> - PR includes unit tests for all synthetic dataset changes (`pytest tests/unit/dataset`) - Scenearios in the Details section can be used against a model server with prefix caching and the cache rate can be confirmed by inspecting console output. ## Related Issues <!-- Link any relevant issues that this PR addresses. --> - Resolves #232 - Closes #287 --- - [x] "I certify that all code in this PR is my own, except as noted below." ## Use of AI - [x] Includes AI-assisted code completion - [ ] Includes code generated by an AI application - [x] Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes `## WRITTEN BY AI ##`) --------- Signed-off-by: Samuel Monson <smonson@redhat.com>
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
TODO
Summary
Work to allow control of token prefix cache rates with the synthetic data generator. Firstly adds an auto-incrementing single token prefix to ensure we never repeat the same prefix. Secondly adds controls for sharing one or more fixed prefixes between samples.
Details
1. Ensure every prompt is unique
When generating a prompt, the first token is now taken from an iterator over the tokenizer vocab.
2. Add configurable prefixes to simulate system prompts or other common token prefixes
Adds a
prefix_buckets
argument to theSyntheticDatasetConfig
, each bucket consists of a prefix count, token count, and bucket weight. Prefix count sets the number of unique prefixes to generate for a given bucket, token count is the length of each prompt in the bucket, and bucket weight is used to calculate the proportion of requests the bucket applies to relative to the sum of all bucket weights. Here are a few examples:Here we have one bucket of 32 prefixes of length 2048. Since there are 1024 total samples each prefix will apply to 32 samples. If there is only one bucket than weight can be omitted as the bucket applies to 100% of samples.
In this modified version of the first example 16 of the prompts have 2048 tokens while the other 16 have 1024 tokens.
The prefix tokens of a bucket can also be 0 to disable prefixes for those samples. Here is an example where 40% of the samples have a prefix of 2048 tokens while the other 60% have no prefix.
Test Plan
pytest tests/unit/dataset
)Related Issues
Use of AI
## WRITTEN BY AI ##
)