Advanced Prefix Cache Controls #287

sjmonson · 2025-08-20T13:55:53Z

TODO

Docs
CSV arg string support
More validation

Summary

Adds control over token prefix cache rates to the synthetic data generator. First, it adds an auto-incrementing single-token prefix to ensure we never repeat the same prefix. Second, it adds controls for sharing one or more fixed prefixes between samples.

Details

1. Ensure every prompt is unique

When generating a prompt, the first token is now taken from an iterator over the tokenizer vocab.
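A minimal sketch of this idea, not the PR's actual code: draw the first token of each prompt from an iterator over token IDs so that consecutive prompts never start with the same token. The helper name `unique_prefix_ids` and the toy vocabulary are hypothetical, and wrapping around with `cycle` once the vocabulary is exhausted is an assumption made for this sketch.

```python
from itertools import cycle
from typing import Iterable, Iterator


def unique_prefix_ids(vocab_ids: Iterable[int]) -> Iterator[int]:
    """Yield one token ID per prompt, wrapping around when the IDs run out.

    Hypothetical helper: the wrap-around behavior is an assumption.
    """
    return cycle(sorted(set(vocab_ids)))


# Toy vocabulary standing in for a real tokenizer's token-to-ID mapping.
toy_vocab = {"a": 0, "b": 1, "c": 2}
first_tokens = unique_prefix_ids(toy_vocab.values())

# Prepend the next token ID to each prompt's (placeholder) token IDs.
prompts = [[next(first_tokens)] + [7, 8, 9] for _ in range(4)]
```

With a three-token vocabulary the fourth prompt reuses the first token, so uniqueness in this sketch only holds while the number of samples stays below the vocabulary size.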

2. Add configurable prefixes to simulate system prompts or other common token prefixes

Adds a prefix_buckets argument to the SyntheticDatasetConfig. Each bucket consists of a prefix count, a token count, and a bucket weight. Prefix count sets the number of unique prefixes to generate for a given bucket, token count is the length of each prefix in the bucket, and bucket weight sets the proportion of requests the bucket applies to, relative to the sum of all bucket weights. Here are a few examples:

Here we have one bucket of 32 prefixes of length 2048. Since there are 1024 total samples, each prefix will apply to 32 samples. If there is only one bucket, then the weight can be omitted, as the bucket applies to 100% of samples.

data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 32
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024

In this modified version of the first example, 16 of the prefixes have 2048 tokens while the other 16 have 1024 tokens.

data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 16
      bucket_weight: 50
    - prefix_tokens: 1024
      prefix_count: 16
      bucket_weight: 50
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024
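One way a generator could map samples onto buckets like these is weighted random selection. The sketch below is an assumption for illustration only: the PR text does not say whether assignment is random or deterministic, and `pick_prefix` is a hypothetical helper.

```python
import random
from collections import Counter

# (prefix_tokens, prefix_count, bucket_weight) for each bucket in the example.
buckets = [(2048, 16, 50), (1024, 16, 50)]


def pick_prefix(rng: random.Random) -> tuple[int, int]:
    """Return (bucket index, prefix index) for one sample."""
    weights = [weight for _, _, weight in buckets]
    # Choose a bucket in proportion to its weight.
    bucket_idx = rng.choices(range(len(buckets)), weights=weights, k=1)[0]
    # Choose one of that bucket's prefixes uniformly at random.
    prefix_idx = rng.randrange(buckets[bucket_idx][1])
    return bucket_idx, prefix_idx


rng = random.Random(0)  # fixed seed so runs are repeatable
assignments = [pick_prefix(rng) for _ in range(1024)]
per_bucket = Counter(bucket_idx for bucket_idx, _ in assignments)
```

With random selection the per-bucket counts only approximate the 50/50 split. A deterministic scheme would hit the proportions exactly.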

The prefix tokens of a bucket can also be 0 to disable prefixes for those samples. Here is an example where 40% of the samples have a prefix of 2048 tokens while the other 60% have no prefix.

data:
  prefix_buckets:
    - prefix_tokens: 2048
      bucket_weight: 40
    - prefix_tokens: 0
      bucket_weight: 60
  prompt_tokens: 256
  output_tokens: 256
  samples: 1000
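The arithmetic behind the three examples above can be sketched as follows. `PrefixBucket` and `bucket_shares` are hypothetical names, and the defaults for `prefix_count` and `bucket_weight` are assumptions for this sketch rather than documented defaults. The numbers are expected counts; the actual generator may distribute samples differently.

```python
from dataclasses import dataclass


@dataclass
class PrefixBucket:
    prefix_tokens: int
    prefix_count: int = 1       # assumed default for this sketch
    bucket_weight: float = 100  # assumed default for this sketch


def bucket_shares(buckets: list[PrefixBucket], samples: int) -> list[dict]:
    """Compute each bucket's share of samples relative to the total weight."""
    total_weight = sum(b.bucket_weight for b in buckets)
    shares = []
    for b in buckets:
        bucket_samples = samples * b.bucket_weight / total_weight
        shares.append({
            "fraction": b.bucket_weight / total_weight,
            "samples": bucket_samples,
            "samples_per_prefix": bucket_samples / b.prefix_count,
        })
    return shares


one = bucket_shares([PrefixBucket(2048, 32)], 1024)
two = bucket_shares([PrefixBucket(2048, 16, 50), PrefixBucket(1024, 16, 50)], 1024)
three = bucket_shares(
    [PrefixBucket(2048, bucket_weight=40), PrefixBucket(0, bucket_weight=60)], 1000
)
```

For the first example this gives 32 samples per prefix. For the second it gives 512 samples per bucket, and for the third 400 and 600 samples.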

Test Plan

PR includes unit tests for all synthetic dataset changes (pytest tests/unit/dataset)
Scenarios in the Details section can be used against a model server with prefix caching, and the cache rate can be confirmed by inspecting console output.

Related Issues

Resolves [Feature Request] Consider having groups of queries with multiple system prompts #232

"I certify that all code in this PR is my own, except as noted below."

Use of AI

Includes AI-assisted code completion
Includes code generated by an AI application
Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Signed-off-by: Samuel Monson <smonson@redhat.com>

Co-authored-by: Mehul <MEHTMEHUL@GMAIL.COM> Co-authored-by: Samuel Monson <smonson@redhat.com> Signed-off-by: Samuel Monson <smonson@redhat.com>


## TODO

- Docs
- ~CSV arg string support~ CSV arg string now supports a single bucket (see last example). Might leave it at that for now.
- More validation

## Summary

This PR is a port of #287 to the v0.4.0 refactor branch. Adds controls for sharing one or more fixed prefixes between samples. See examples below.

## Details

Adds a `prefix_buckets` argument to the `SyntheticTextDatasetConfig`. Each bucket consists of a prefix count, a token count, and a bucket weight. Prefix count sets the number of unique prefixes to generate for a given bucket, token count is the length of each prefix in the bucket, and bucket weight sets the proportion of requests the bucket applies to, relative to the sum of all bucket weights. Here are a few examples:

Here we have one bucket of 32 prefixes of length 2048. Since there are 1024 total samples, each prefix will apply to 32 samples. If there is only one bucket, then the weight can be omitted, as the bucket applies to 100% of samples.

```yaml
data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 32
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024
```

In this modified version of the first example, 16 of the prefixes have 2048 tokens while the other 16 have 1024 tokens.

```yaml
data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 16
      bucket_weight: 50
    - prefix_tokens: 1024
      prefix_count: 16
      bucket_weight: 50
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024
```

The prefix tokens of a bucket can also be 0 to disable prefixes for those samples. Here is an example where 40% of the samples have a prefix of 2048 tokens while the other 60% have no prefix.

```yaml
data:
  prefix_buckets:
    - prefix_tokens: 2048
      bucket_weight: 40
    - prefix_tokens: 0
      bucket_weight: 60
  prompt_tokens: 256
  output_tokens: 256
  samples: 1000
```

If only a single bucket is needed, it can be set at the top level. This makes the changes backwards compatible with the previous interface and allows the CSV string format to work without parsing nested structures (at least for this use case).

```yaml
data:
  prefix_tokens: 128
  prefix_count: 10
  prompt_tokens: 256
  output_tokens: 256
  samples: 1000
```

## Test Plan

- PR includes unit tests for all synthetic dataset changes (`pytest tests/unit/dataset`)
- Scenarios in the Details section can be used against a model server with prefix caching, and the cache rate can be confirmed by inspecting console output.

## Related Issues

- Resolves #232
- Closes #287

---

- [x] "I certify that all code in this PR is my own, except as noted below."

## Use of AI

- [x] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [x] Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes `## WRITTEN BY AI ##`)

---------

Signed-off-by: Samuel Monson <smonson@redhat.com>
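The top-level single-bucket form described in the ported PR could be reconciled with the `prefix_buckets` list roughly as sketched here. This is an illustrative assumption about how such a normalization might look, not the code in either PR; `normalize_prefix_config` is a hypothetical helper.

```python
from typing import Any


def normalize_prefix_config(data: dict[str, Any]) -> dict[str, Any]:
    """Fold top-level prefix_tokens/prefix_count into a one-item prefix_buckets list."""
    config = dict(data)  # shallow copy so the caller's dict is left untouched
    if "prefix_buckets" in config:
        return config  # already in the nested form
    if "prefix_tokens" in config:
        bucket = {"prefix_tokens": config.pop("prefix_tokens")}
        if "prefix_count" in config:
            bucket["prefix_count"] = config.pop("prefix_count")
        config["prefix_buckets"] = [bucket]
    return config


flat = {
    "prefix_tokens": 128,
    "prefix_count": 10,
    "prompt_tokens": 256,
    "output_tokens": 256,
    "samples": 1000,
}
nested = normalize_prefix_config(flat)
```

Keeping the flat keys valid means a flat `key=value` argument string can describe one bucket without any nested syntax.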

sjmonson and others added 8 commits August 19, 2025 15:50

Add fixed prefix option to synthetic data

f3345f7

Signed-off-by: Samuel Monson <smonson@redhat.com>

Add prefix before decode

e4560e2

Signed-off-by: Samuel Monson <smonson@redhat.com>

Document prefix_tokens arg

c748f00

Signed-off-by: Samuel Monson <smonson@redhat.com>

Add unique single-token prefix to every request

85320d2

Co-authored-by: Mehul <MEHTMEHUL@GMAIL.COM> Co-authored-by: Samuel Monson <smonson@redhat.com> Signed-off-by: Samuel Monson <smonson@redhat.com>

Add unit tests

f254066

Signed-off-by: Samuel Monson <smonson@redhat.com>

Add advenced shared prefix support

c8a847a

Signed-off-by: Samuel Monson <smonson@redhat.com>

Update tests for new prefix patch and reduce the number of mocks

692589c

Signed-off-by: Samuel Monson <smonson@redhat.com>

Add more prefix bucket testcases

558cc78

Signed-off-by: Samuel Monson <smonson@redhat.com>

sjmonson self-assigned this Aug 20, 2025

sjmonson mentioned this pull request Sep 30, 2025

[GuideLLM Refactor] Advanced Prefix Cache Controls #382

Merged

4 tasks
