
Conversation

@tukwila (Contributor) commented Sep 22, 2025

Summary

fix: #360

Details


In the following two screenshots, all prompts are German or English.

[Two screenshots of generated prompts]

Test Plan

Related Issues


  • "I certify that all code in this PR is my own, except as noted below."

Use of AI

  • Includes AI-assisted code completion
  • Includes code generated by an AI application
  • Includes AI-generated tests (NOTE: AI written tests should have a docstring that includes ## WRITTEN BY AI ##)

Signed-off-by: guangli.bao <guangli.bao@daocloud.io>
@sjmonson (Collaborator)

This is an intentional feature to help us avoid prefix cache hits. Generally, the content of a token does not matter for performance. The only time this would matter that I can think of is if speculative decoding is enabled. However, in that case you cannot use synthetic data at all, since you need real-world data with speculative decoding to get good performance.
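
As a rough illustration (hypothetical prompt text and made-up token IDs, not guidellm's actual code): when each request starts with a distinct token, no two prompts share a leading prefix, so a prefix cache has nothing to reuse across requests.

        # Hypothetical sketch: a distinct leading token per request means
        # no two prompts share their first token, so prefix caching misses.
        from itertools import cycle

        base_prompt = "Summarize the following text:"  # identical body everywhere
        unique_prefix_iter = cycle([101, 7592, 2088])  # made-up token IDs

        prompts = [f"<{next(unique_prefix_iter)}> {base_prompt}" for _ in range(3)]
        # -> "<101> Summarize...", "<7592> Summarize...", "<2088> Summarize..."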

That being said, I would accept a change to line src/guidellm/dataset/synthetic.py:172 that uses the token space of the corpus rather than the entire tokenizer. Something like:

        unique_prefix_iter = cycle(set(self.processor.encode(self.text_creator.text)))
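
Expanded into a self-contained form, that suggestion might look like the sketch below (a sketch only: it assumes an HF-style tokenizer with an encode method, and the attribute names follow the snippet above, which may change with the refactor):

        from itertools import cycle

        def make_unique_prefix_iter(processor, corpus_text):
            """Cycle over the token IDs that actually occur in the corpus, so
            unique prompt prefixes stay in-distribution (e.g. German or English
            text) rather than being drawn from the full tokenizer vocabulary.
            Sketch of the suggestion above, not the final implementation."""
            corpus_token_ids = set(processor.encode(corpus_text))
            return cycle(corpus_token_ids)

Each new prompt would then take next(...) from this iterator as its leading token, which still avoids prefix cache hits while keeping the prefix within the corpus's token space.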

I recommend waiting until our refactor (#351) lands before attempting this, due to some upcoming changes.

@tukwila (Contributor, Author) commented Sep 23, 2025

> This is an intentional feature to help us avoid prefix cache hits. Generally, the content of a token does not matter for performance. The only time this would matter that I can think of is if speculative decoding is enabled. However, in that case you cannot use synthetic data at all, since you need real-world data with speculative decoding to get good performance.
>
> That being said, I would accept a change to line src/guidellm/dataset/synthetic.py:172 that uses the token space of the corpus rather than the entire tokenizer. Something like:
>
>         unique_prefix_iter = cycle(set(self.processor.encode(self.text_creator.text)))
>
> I recommend waiting until our refactor (#351) lands before attempting this, due to some upcoming changes.

Got it.

@markurtz (Collaborator) commented Oct 1, 2025

@tukwila could you take a look through the new data pipelines refactor that is up (#384) and see if it fixes this, or adapt this PR on top of that one?

Development

Successfully merging this pull request may close these issues:

  • start_tokens in synthetic prompt is not related to prompt_text (#360)