
Conversation

@tukwila (Contributor) commented Sep 22, 2025

Summary

fix: #360

Details


In the following two screenshots, all prompts are German or English.

[Two screenshots of generated prompts]

Test Plan

Related Issues


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Signed-off-by: guangli.bao <guangli.bao@daocloud.io>
@sjmonson (Collaborator)

This is an intentional feature to help us avoid prefix cache hits. Generally, the content of a token does not matter for performance. The only time this would matter that I can think of is if speculative decoding is enabled. However, in that case you cannot use synthetic data at all, since you need real-world data with speculative decoding to get good performance.
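
As a rough illustration (hypothetical prompt text and made-up token IDs, not guidellm's actual code): when each request starts with a distinct token, no two prompts share a leading prefix, so a prefix cache has nothing to reuse across requests.

        # Hypothetical sketch: a distinct leading token per request means
        # no two prompts share their first token, so prefix caching misses.
        from itertools import cycle

        base_prompt = "Summarize the following text:"  # identical body everywhere
        unique_prefix_iter = cycle([101, 7592, 2088])  # made-up token IDs

        prompts = [f"<{next(unique_prefix_iter)}> {base_prompt}" for _ in range(3)]
        # -> "<101> Summarize...", "<7592> Summarize...", "<2088> Summarize..."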

That being said, I would accept a change to line src/guidellm/dataset/synthetic.py:172 that uses the token space of the corpus rather than the entire tokenizer. Something like:

        unique_prefix_iter = cycle(set(self.processor.encode(self.text_creator.text)))
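
Expanded into a self-contained form, that suggestion might look like the sketch below (a sketch only: it assumes an HF-style tokenizer with an encode method, and the attribute names follow the snippet above, which may change with the refactor):

        from itertools import cycle

        def make_unique_prefix_iter(processor, corpus_text):
            """Cycle over the token IDs that actually occur in the corpus, so
            unique prompt prefixes stay in-distribution (e.g. German or English
            text) rather than being drawn from the full tokenizer vocabulary.
            Sketch of the suggestion above, not the final implementation."""
            corpus_token_ids = set(processor.encode(corpus_text))
            return cycle(corpus_token_ids)

Each new prompt would then take next(...) from this iterator as its leading token, which still avoids prefix cache hits while keeping the prefix within the corpus's token space.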

I recommend waiting until our refactor (#351) lands before attempting this, due to some upcoming changes.

@tukwila (Contributor, Author) commented Sep 23, 2025

> This is an intentional feature to help us avoid prefix cache hits. Generally, the content of a token does not matter for performance. The only time this would matter that I can think of is if speculative decoding is enabled. However, in that case you cannot use synthetic data at all, since you need real-world data with speculative decoding to get good performance.
>
> That being said, I would accept a change to line src/guidellm/dataset/synthetic.py:172 that uses the token space of the corpus rather than the entire tokenizer. Something like:
>
>         unique_prefix_iter = cycle(set(self.processor.encode(self.text_creator.text)))
>
> I recommend waiting until our refactor (#351) lands before attempting this, due to some upcoming changes.

Got it.

@markurtz (Collaborator) commented Oct 1, 2025

@tukwila could you take a look through the new data pipelines refactor that is up (#384) and see if it fixes this, or adapt this PR on top of that one?

Development

Successfully merging this pull request may close these issues:

  • start_tokens in synthetic prompt is not related to prompt_text (#360)