
Conversation

sjmonson
Collaborator

@sjmonson sjmonson commented Sep 30, 2025

TODO

  • Docs
  • CSV arg string support — the CSV arg string now supports a single bucket (see the last example). Might leave it at that for now.
  • More validation

Summary

This PR is a port of #287 to the v0.4.0 refactor branch.

Adds controls for sharing one or more fixed prefixes between samples. See examples below.

Details

Adds a prefix_buckets argument to the SyntheticTextDatasetConfig. Each bucket consists of a prefix count, a token count, and a bucket weight: prefix count sets the number of unique prefixes to generate for the bucket, token count is the length of each prefix in the bucket, and bucket weight sets the proportion of requests the bucket applies to, relative to the sum of all bucket weights. A minimal sketch of the bucket model and a few examples follow.
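
For reference, here is a minimal sketch (not the exact PR code) of what the bucket model might look like, assuming standard Pydantic field validation; the field names match the YAML keys used in the examples, while the defaults and bounds shown are illustrative assumptions:

from pydantic import BaseModel, Field

class SyntheticTextPrefixBucketConfig(BaseModel):
    # Relative share of samples this bucket receives
    bucket_weight: int = Field(default=100, gt=0)
    # Number of unique prefixes generated for this bucket
    prefix_count: int = Field(default=1, ge=1)
    # Length of each prefix in tokens; 0 disables prefixes for the bucket
    prefix_tokens: int = Field(default=0, ge=0)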

Here we have one bucket of 32 prefixes, each 2048 tokens long. Since there are 1024 total samples, each prefix will apply to 32 samples. If there is only one bucket, then the weight can be omitted since the bucket applies to 100% of the samples.

data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 32
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024
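
As a quick check of the arithmetic above (a sketch, not code from the PR):

samples = 1024
prefix_count = 32
samples_per_prefix = samples // prefix_count  # 32 samples share each 2048-token prefix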

In this modified version of the first example, 16 of the prefixes are 2048 tokens long while the other 16 are 1024 tokens long.

data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 16
      bucket_weight: 50
    - prefix_tokens: 1024
      prefix_count: 16
      bucket_weight: 50
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024

The prefix tokens of a bucket can also be 0 to disable prefixes for those samples. Here is an example where 40% of the samples have a prefix of 2048 tokens while the other 60% have no prefix.

data:
  prefix_buckets:
    - prefix_tokens: 2048
      bucket_weight: 40
    - prefix_tokens: 0
      bucket_weight: 60
  prompt_tokens: 256
  output_tokens: 256
  samples: 1000
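
To illustrate how the 40/60 weights translate into sample counts, here is a small sketch (illustrative only, not the PR implementation):

weights = {"2048-token prefix": 40, "no prefix": 60}
samples = 1000
total_weight = sum(weights.values())
per_bucket = {name: samples * w // total_weight for name, w in weights.items()}
# -> {'2048-token prefix': 400, 'no prefix': 600}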

If only a single bucket is needed, it can be set at the top level. This makes the changes backwards compatible with the previous interface and allows the CSV string format to work without parsing nested structures (at least for this use case).

data:
  prefix_tokens: 128
  prefix_count: 10
  prompt_tokens: 256
  output_tokens: 256
  samples: 1000
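
Assuming the key=value CSV form guidellm already uses for synthetic data, the same single-bucket configuration might be passed on the command line like this (hypothetical invocation; the prefix keys depend on this PR's CSV parsing):

guidellm benchmark \
  --target "http://localhost:8000" \
  --data "prefix_tokens=128,prefix_count=10,prompt_tokens=256,output_tokens=256,samples=1000"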

Test Plan

  • PR includes unit tests for all synthetic dataset changes (pytest tests/unit/dataset)
  • Scenarios in the Details section can be run against a model server with prefix caching, and the cache hit rate can be confirmed by inspecting the console output.

Related Issues


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

@sjmonson sjmonson changed the base branch from main to features/refactor/multiturn September 30, 2025 17:59
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
Signed-off-by: Samuel Monson <smonson@redhat.com>
@sjmonson sjmonson force-pushed the features/refactor/adv_prefix branch from 5ddb73c to 8534ea0 Compare October 2, 2025 19:12
@sjmonson sjmonson changed the base branch from features/refactor/multiturn to features/refactor/multimodal_support October 2, 2025 19:13
Signed-off-by: Samuel Monson <smonson@redhat.com>
@sjmonson sjmonson force-pushed the features/refactor/adv_prefix branch from 8534ea0 to 07da84f Compare October 2, 2025 19:14
Signed-off-by: Samuel Monson <smonson@redhat.com>
@sjmonson sjmonson force-pushed the features/refactor/adv_prefix branch from 90cf0cf to 2588cda Compare October 2, 2025 20:32
@sjmonson sjmonson requested review from markurtz and Copilot October 3, 2025 16:29
Contributor

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces advanced prefix cache controls for synthetic dataset generation, allowing users to configure shared prefixes across samples to optimize model caching behavior. The changes enable specification of multiple prefix buckets with configurable weights, prefix counts, and token lengths.

Key changes:

  • Added SyntheticTextPrefixBucketConfig for granular prefix control with bucket weights, prefix counts, and token lengths
  • Enhanced SyntheticTextDatasetConfig with prefix_buckets field and backward compatibility for single prefix configurations
  • Updated data generation logic to support prefix distribution across samples with proper weighting

Reviewed Changes

Copilot reviewed 8 out of 11 changed files in this pull request and generated 5 comments.

Summary per file:

  • tests/unit/dataset/test_synthetic.py — Removed old synthetic dataset tests (entire file deleted)
  • tests/unit/data/deserializers/test_synthetic.py — Added comprehensive test suite for the new synthetic dataset implementation
  • src/guidellm/data/utils.py — Added prefix column mapping support
  • src/guidellm/data/objects.py — Added prefix_column to data model types
  • src/guidellm/data/formatters/templates.py — Updated Jinja templates to handle prefix columns in request formatting
  • src/guidellm/data/deserializers/synthetic.py — Implemented new prefix bucket configuration and generation logic
  • src/guidellm/data/deserializers/__init__.py — Added export for the new SyntheticTextPrefixBucketConfig class
  • pyproject.toml — Added torch and librosa dependencies


@model_validator(mode="after")
def check_prefix_options(self) -> Self:
    prefix_count = self.__pydantic_extra__.get("prefix_count", None)  # type: ignore[attr-defined]
    prefix_tokens = self.__pydantic_extra__.get("prefix_count", None)  # type: ignore[attr-defined]

Copilot AI Oct 3, 2025


The second call to __pydantic_extra__.get() should retrieve 'prefix_tokens', not 'prefix_count'. This will always return None for prefix_tokens, breaking the backward compatibility feature.

Suggested change
prefix_tokens = self.__pydantic_extra__.get("prefix_count", None) # type: ignore[attr-defined]
prefix_tokens = self.__pydantic_extra__.get("prefix_tokens", None) # type: ignore[attr-defined]


if text_column and text_column|length == 1
else text_column
)
"prompt": prefix_column[0]|default("") + text_column[0]

Copilot AI Oct 3, 2025


Direct string concatenation without proper spacing could result in malformed prompts. Consider adding a space or newline separator between prefix and text content to ensure proper formatting.

Suggested change
"prompt": prefix_column[0]|default("") + text_column[0]
"prompt": prefix_column[0]|default("") ~ " " ~ text_column[0]


Comment on lines +198 to 223

# Increase weights to ensure an integer number of samples per per-prefix
least_common_prefix_count = math.lcm(
    *(bucket.prefix_count for bucket in self.config.prefix_buckets)
)
unnorm_weights = [
    least_common_prefix_count * bucket.bucket_weight // bucket.prefix_count
    for bucket in self.config.prefix_buckets
]
# Use GCD to reduce the weights to smallest integer ratio
common_divisor = math.gcd(*unnorm_weights)

# Create prefix list maintaining the correct distribution
prefixes = []
for bucket, weight in zip(self.config.prefix_buckets, unnorm_weights):
    bucket_prefixes = [
        self._create_prompt(bucket.prefix_tokens, faker)
        for _ in range(bucket.prefix_count)
    ]
    sample_count = weight // common_divisor
    prefixes.extend(bucket_prefixes * sample_count)

while True:
    yield rand.choice(prefixes)



Copilot AI Oct 3, 2025


The function returns early when prefix_buckets is None or empty, but the remaining code is not properly indented as an else block. This creates unreachable code and potential runtime errors.

Suggested change
# Increase weights to ensure an integer number of samples per per-prefix
least_common_prefix_count = math.lcm(
    *(bucket.prefix_count for bucket in self.config.prefix_buckets)
)
unnorm_weights = [
    least_common_prefix_count * bucket.bucket_weight // bucket.prefix_count
    for bucket in self.config.prefix_buckets
]
# Use GCD to reduce the weights to smallest integer ratio
common_divisor = math.gcd(*unnorm_weights)
# Create prefix list maintaining the correct distribution
prefixes = []
for bucket, weight in zip(self.config.prefix_buckets, unnorm_weights):
    bucket_prefixes = [
        self._create_prompt(bucket.prefix_tokens, faker)
        for _ in range(bucket.prefix_count)
    ]
    sample_count = weight // common_divisor
    prefixes.extend(bucket_prefixes * sample_count)
while True:
    yield rand.choice(prefixes)

else:
    # Increase weights to ensure an integer number of samples per per-prefix
    least_common_prefix_count = math.lcm(
        *(bucket.prefix_count for bucket in self.config.prefix_buckets)
    )
    unnorm_weights = [
        least_common_prefix_count * bucket.bucket_weight // bucket.prefix_count
        for bucket in self.config.prefix_buckets
    ]
    # Use GCD to reduce the weights to smallest integer ratio
    common_divisor = math.gcd(*unnorm_weights)
    # Create prefix list maintaining the correct distribution
    prefixes = []
    for bucket, weight in zip(self.config.prefix_buckets, unnorm_weights):
        bucket_prefixes = [
            self._create_prompt(bucket.prefix_tokens, faker)
            for _ in range(bucket.prefix_count)
        ]
        sample_count = weight // common_divisor
        prefixes.extend(bucket_prefixes * sample_count)
    while True:
        yield rand.choice(prefixes)


[[tool.pdm.source]]
name = "torch"
type = "find_links"
#url = "https://download.pytorch.org/whl/cpu/torch_stable.html"

Copilot AI Oct 3, 2025


Commented-out URL should be removed rather than left as a comment. If this is for documentation purposes, consider adding a proper comment explaining why this specific URL is used.

Suggested change
#url = "https://download.pytorch.org/whl/cpu/torch_stable.html"
# Previous URL for torch wheels (replaced with the current one due to updated structure)


    *(bucket.prefix_count for bucket in self.config.prefix_buckets)
)
unnorm_weights = [
    least_common_prefix_count * bucket.bucket_weight // bucket.prefix_count

Copilot AI Oct 3, 2025


This calculation could result in integer division by zero if bucket.prefix_count is 0, despite the field validation. Consider adding a runtime check or ensuring the validation prevents this case.


@markurtz markurtz merged commit dd7a4b8 into features/refactor/multimodal_support Oct 3, 2025
10 of 17 checks passed
@markurtz markurtz deleted the features/refactor/adv_prefix branch October 3, 2025 21:00
Development

Successfully merging this pull request may close these issues.

[Feature Request] Consider having groups of queries with multiple system prompts