Data Generation Second Phase - service.py cache fix and job scripts add (#138)
base: main
Conversation
Pull Request Overview
This pull request implements a cache fix in the data generation service and adds a new job command for dataset cleaning as part of the second-phase data generation effort. Key changes include:
- Updating the cache logic in service.py to avoid raising an error on cache misses.
- Adjusting the task preference inclusion probabilities in the response mapper.
- Adding a new CLI command (clean_dataset) along with accompanying documentation in benchmarks/README.md.
Reviewed Changes
Copilot reviewed 5 out of 8 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| benchmarks/README.md | Updated the dataset filename used in the testing/example command. |
| aif_gen/generate/service.py | Changed cache lookup logic to only process non-None cache output. |
| aif_gen/api/response_mapper/response_mapper.py | Adjusted the preference inclusion probabilities. |
| aif_gen/_cli/main.py | Added the clean_dataset command to the CLI. |
| aif_gen/_cli/commands/clean.py | Introduced a new command to clean datasets by removing specified words. |
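The review table describes `clean.py` only by its effect, removing specified words from a dataset. A minimal sketch of what such word removal might look like; the function name and behavior below are illustrative assumptions, not the actual aif_gen implementation:

```python
import re


def remove_words(text: str, words: list[str]) -> str:
    """Strip each target word (whole-word, case-insensitive) from text,
    collapsing any doubled whitespace left behind."""
    for word in words:
        # \b anchors ensure only whole-word matches are removed.
        text = re.sub(rf'\b{re.escape(word)}\b', '', text, flags=re.IGNORECASE)
    return re.sub(r'\s{2,}', ' ', text).strip()
```

A `clean_dataset` CLI command would presumably apply such a function to every sample's text fields before re-serializing the dataset.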
Files not reviewed (3)
- jobs/generate_all_downsampled.sh: Language not supported
- jobs/generate_all_static.sh: Language not supported
- jobs/validate_all_static.sh: Language not supported
Comments suppressed due to low confidence (1)
aif_gen/generate/service.py:536
- Previously an error was raised for a missing cache entry; please ensure that the updated behavior (doing nothing if `output` is None) is intended, or consider adding logging or alternative handling for cache misses.

```python
if output is not None:
```
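A minimal sketch of the logging alternative this comment suggests; the function and variable names here are illustrative, not taken from service.py:

```python
import logging


def lookup_cache(cache: dict, key: str):
    """Return the cached value for key, or None on a miss.

    Instead of raising on a missing entry, log the miss so the
    silent skip-on-miss behavior remains visible in job logs.
    """
    output = cache.get(key)
    if output is None:
        logging.warning('Cache miss for key %r; falling back to regeneration', key)
        return None
    return output
```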
```python
logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
input_dataset.to_json(output_data_file)
logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')
```
Copilot (AI), Apr 24, 2025
The variable 'dataset' refers to the last iterated dataset rather than the aggregate number of cleaned samples; consider using the total sample count or an appropriately aggregated value to accurately reflect the number of samples written.
Suggested change:

```python
# Before
logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
input_dataset.to_json(output_data_file)
logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')

# After
total_samples = sum(len(dataset.samples) for dataset in input_dataset.datasets)
logging.info(f'Writing {total_samples} samples to {output_data_file}')
input_dataset.to_json(output_data_file)
logging.info(f'Wrote {total_samples} samples to {output_data_file}')
```
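The bug the suggestion addresses can be reproduced with a small example. The `Dataset`/`ContinualDataset` shapes below are assumptions about the aif_gen data model, not its actual classes:

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    samples: list = field(default_factory=list)


@dataclass
class ContinualDataset:
    datasets: list = field(default_factory=list)


collection = ContinualDataset(
    datasets=[Dataset(samples=[1, 2, 3]), Dataset(samples=[4, 5])]
)

# After a loop, the loop variable holds only the LAST sub-dataset,
# so its length understates the number of samples actually written.
for dataset in collection.datasets:
    pass

last_count = len(dataset.samples)       # 2: last sub-dataset only
total_samples = sum(len(d.samples) for d in collection.datasets)  # 5: true total
```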
Copilot is right here
jobs/generate_all_downsampled.sh (Outdated)
```shell
# 2) run all generation jobs sequentially
uv run aif generate \
    --include-preference-axes \
    --max_concurrency 256 \
    --output_file "data/70B_generation/education_qna_direct/data.json" \
    config/static_copy/education_qna_direct.yaml \
    Meta-Llama-3.1-70B-Instruct
```
Is this duplicated?
What's the use case?
To reiterate, is this necessary to merge, or just useful as a pre-processing step for our data? By the way, I would rather not use this in excess (or at all), since it's unclear how removing specific words could alter the latent preference that we are aiming to model.
Assuming we have finalized this approach for our first release and paper, we should move this to config (in a subsequent PR).
It could also be worth fixing #140 while we are here, unless you'd rather do it separately for a clean git history.
This is a branch used for generating the second phase of our data.