## TODO
- Docs
- ~CSV arg string support~ CSV arg string now supports a single bucket
(see last example). Might leave it at that for now.
- More validation
## Summary
<!--
Include a short paragraph of the changes introduced in this PR.
If this PR requires additional context or rationale, explain why
the changes are necessary.
-->
This PR is a port of #287 to the v0.4.0 refactor branch.
Adds controls for sharing one or more fixed prefixes between samples.
See examples below.
## Details
<!--
Provide a detailed list of all changes introduced in this pull request.
-->
Adds a `prefix_buckets` argument to `SyntheticTextDatasetConfig`. Each
bucket consists of a prefix count, token count, and bucket weight.
Prefix count sets the number of unique prefixes to generate for a given
bucket, token count is the length of each prefix in the bucket, and
bucket weight determines the proportion of requests the bucket applies
to, relative to the sum of all bucket weights. Here are a few
examples:
Here we have one bucket of 32 prefixes of length 2048. Since there are
1024 total samples, each prefix will apply to 32 samples. If there is
only one bucket, then weight can be omitted, as the bucket applies to
100% of samples.
```yaml
data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 32
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024
```
In this modified version of the first example, 16 of the prefixes have
2048 tokens while the other 16 have 1024 tokens.
```yaml
data:
  prefix_buckets:
    - prefix_tokens: 2048
      prefix_count: 16
      bucket_weight: 50
    - prefix_tokens: 1024
      prefix_count: 16
      bucket_weight: 50
  prompt_tokens: 256
  output_tokens: 256
  samples: 1024
```
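The arithmetic behind the first two examples can be sketched in a few lines of Python. This is only an illustration, not the PR's implementation: the `PrefixBucket` dataclass, the `samples_per_prefix` helper, and the default values for `prefix_count` and `bucket_weight` are assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class PrefixBucket:
    # Hypothetical stand-in for one `prefix_buckets` entry (not the PR's class).
    prefix_tokens: int
    prefix_count: int = 1      # assumed default
    bucket_weight: int = 100   # assumed default


def samples_per_prefix(buckets, samples):
    """Return (samples in bucket, samples per prefix) for each bucket."""
    total_weight = sum(b.bucket_weight for b in buckets)
    result = []
    for b in buckets:
        # A bucket's share is its weight relative to the sum of all weights.
        bucket_samples = samples * b.bucket_weight / total_weight
        result.append((bucket_samples, bucket_samples / b.prefix_count))
    return result


single = samples_per_prefix([PrefixBucket(2048, 32)], samples=1024)
split = samples_per_prefix(
    [PrefixBucket(2048, 16, 50), PrefixBucket(1024, 16, 50)], samples=1024
)
print(single)  # [(1024.0, 32.0)]
print(split)   # [(512.0, 32.0), (512.0, 32.0)]
```

In both configurations each prefix ends up shared by 32 samples; the second one only changes how long half of those prefixes are.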
The prefix tokens of a bucket can also be 0 to disable prefixes for
those samples. Here is an example where 40% of the samples have a prefix
of 2048 tokens while the other 60% have no prefix.
```yaml
data:
  prefix_buckets:
    - prefix_tokens: 2048
      bucket_weight: 40
    - prefix_tokens: 0
      bucket_weight: 60
  prompt_tokens: 256
  output_tokens: 256
  samples: 1000
```
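One way to picture how relative weights map onto samples is a weighted random draw per sample. The description here does not say whether buckets are assigned randomly or deterministically, so treat this as a sketch under that assumption; the `buckets` list of `(prefix_tokens, bucket_weight)` pairs is a hypothetical mirror of the YAML above.

```python
import random
from collections import Counter

# Hypothetical (prefix_tokens, bucket_weight) pairs mirroring the YAML above.
buckets = [(2048, 40), (0, 60)]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
prefix_lengths = [tokens for tokens, _ in buckets]
weights = [weight for _, weight in buckets]

# Draw one bucket per sample; weights are relative, so 40/60 behaves like 2/3.
assigned = rng.choices(prefix_lengths, weights=weights, k=1000)
counts = Counter(assigned)

print(counts[2048] / 1000)  # close to 0.40
print(counts[0] / 1000)     # close to 0.60
```

Samples that land in the `prefix_tokens: 0` bucket simply carry no shared prefix.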
If only a single bucket is needed, it can be set at the top level. This
makes the changes backwards compatible with the previous interface and
allows the CSV string format to work without parsing nested structures
(at least for this use case).
```yaml
data:
  prefix_tokens: 128
  prefix_count: 10
  prompt_tokens: 256
  output_tokens: 256
  samples: 1000
```
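The backwards-compatible form can be thought of as shorthand for a one-item `prefix_buckets` list. A minimal sketch of that folding step follows, assuming plain dictionaries; `normalize_prefix_config` is a hypothetical helper written for illustration and is not taken from the PR.

```python
def normalize_prefix_config(data: dict) -> dict:
    """Fold top-level prefix_tokens/prefix_count into a one-item prefix_buckets list."""
    config = dict(data)  # shallow copy so the caller's dict is untouched
    if "prefix_buckets" not in config and "prefix_tokens" in config:
        bucket = {"prefix_tokens": config.pop("prefix_tokens")}
        if "prefix_count" in config:
            bucket["prefix_count"] = config.pop("prefix_count")
        config["prefix_buckets"] = [bucket]
    return config


flat = {
    "prefix_tokens": 128,
    "prefix_count": 10,
    "prompt_tokens": 256,
    "output_tokens": 256,
    "samples": 1000,
}
normalized = normalize_prefix_config(flat)
print(normalized["prefix_buckets"])  # [{'prefix_tokens': 128, 'prefix_count': 10}]
```

Because the flat form has no nested list, it can be expressed as simple key/value pairs, which is what lets the CSV string format cover the single-bucket case.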
## Test Plan
<!--
List the steps needed to test this PR.
-->
- PR includes unit tests for all synthetic dataset changes
(`pytest tests/unit/dataset`)
- Scenarios in the Details section can be run against a model server
with prefix caching, and the cache rate can be confirmed by inspecting
console output.
## Related Issues
<!--
Link any relevant issues that this PR addresses.
-->
- Resolves #232
- Closes #287
---
- [x] "I certify that all code in this PR is my own, except as noted
below."
## Use of AI
- [x] Includes AI-assisted code completion
- [ ] Includes code generated by an AI application
- [x] Includes AI-generated tests (NOTE: AI written tests should have a
docstring that includes `## WRITTEN BY AI ##`)
---------
Signed-off-by: Samuel Monson <smonson@redhat.com>