
Conversation

@EMZEDI (Collaborator) commented Apr 24, 2025

This is a branch used for generating the second phase of our data.

@EMZEDI requested review from Jacob-Chmura and Copilot, April 24, 2025 14:27
@EMZEDI self-assigned this, Apr 24, 2025
Copilot AI left a comment

Pull Request Overview

This pull request implements a cache fix in the data generation service and adds a new job command for dataset cleaning as part of the second-phase data generation effort. Key changes include:

  • Updating the cache logic in service.py to avoid raising an error on cache misses.
  • Adjusting the task preference inclusion probabilities in the response mapper.
  • Adding a new CLI command (clean_dataset) along with accompanying documentation in benchmarks/README.md.
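
As a hedged illustration of the new command, here is a minimal sketch of what a word-removal cleaning command could look like, assuming a click-style CLI and a flat JSON list of samples; the actual aif_gen/_cli/commands/clean.py and its dataset objects may differ.

    # Hypothetical sketch only: the CLI framework, option names, and dataset
    # layout are assumptions, not the actual aif_gen implementation.
    import json
    import logging

    import click


    @click.command(name='clean_dataset')
    @click.argument('input_data_file', type=click.Path(exists=True))
    @click.argument('output_data_file', type=click.Path())
    @click.option('--remove-word', 'remove_words', multiple=True,
                  help='Word to strip from every sample (repeatable).')
    def clean_dataset(input_data_file, output_data_file, remove_words):
        """Remove the specified words from every sample in a dataset."""
        with open(input_data_file) as f:
            samples = json.load(f)  # assumed: a flat list of dicts with string fields

        for sample in samples:
            for field, text in sample.items():
                if isinstance(text, str):
                    for word in remove_words:
                        text = text.replace(word, '')
                    sample[field] = text

        logging.info(f'Writing {len(samples)} samples to {output_data_file}')
        with open(output_data_file, 'w') as f:
            json.dump(samples, f)

Under these assumptions a typical invocation would look like `aif clean_dataset data/in.json data/out.json --remove-word delve`.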

Reviewed Changes

Copilot reviewed 5 out of 8 changed files in this pull request and generated 1 comment.

Summary per file:
  • benchmarks/README.md: Updated the dataset filename used in the testing/example command.
  • aif_gen/generate/service.py: Changed the cache lookup logic to only process non-None cache output.
  • aif_gen/api/response_mapper/response_mapper.py: Adjusted the preference inclusion probabilities.
  • aif_gen/_cli/main.py: Added the clean_dataset command to the CLI.
  • aif_gen/_cli/commands/clean.py: Introduced a new command to clean datasets by removing specified words.
Files not reviewed (3)
  • jobs/generate_all_downsampled.sh: Language not supported
  • jobs/generate_all_static.sh: Language not supported
  • jobs/validate_all_static.sh: Language not supported
Comments suppressed due to low confidence (1)

aif_gen/generate/service.py:536

  • Previously an error was raised for a missing cache entry; please ensure that the updated behavior (doing nothing if output is None) is intended, or consider adding logging or alternative handling for cache misses.
if output is not None:
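
A minimal sketch of the logging-based alternative suggested above; the cache object and helper names here are assumptions, not the actual identifiers in aif_gen/generate/service.py.

    import logging

    # Hypothetical cache lookup that logs misses instead of silently skipping them.
    def lookup_with_logging(cache: dict, cache_key: str):
        output = cache.get(cache_key)
        if output is not None:
            logging.debug(f'Cache hit for {cache_key!r}')
        else:
            logging.debug(f'Cache miss for {cache_key!r}; falling back to generation')
        return output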

Comment on lines +84 to +86
logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
input_dataset.to_json(output_data_file)
logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')
Copilot AI Apr 24, 2025

The variable 'dataset' still refers to the last dataset iterated over, so len(dataset) does not reflect the aggregate number of cleaned samples; consider using the total sample count (or another appropriately aggregated value) so the log accurately reports the number of samples written.

Suggested change
-logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
-input_dataset.to_json(output_data_file)
-logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')
+total_samples = sum(len(dataset.samples) for dataset in input_dataset.datasets)
+logging.info(f'Writing {total_samples} samples to {output_data_file}')
+input_dataset.to_json(output_data_file)
+logging.info(f'Wrote {total_samples} samples to {output_data_file}')
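
To make the issue concrete, here is a small self-contained example of the loop-variable behavior; the dataclass layout is assumed for illustration and is not aif_gen's actual dataset structure.

    from dataclasses import dataclass, field

    @dataclass
    class Dataset:
        samples: list = field(default_factory=list)

    datasets = [Dataset(samples=[1, 2, 3]), Dataset(samples=[4])]

    for dataset in datasets:
        pass  # per-dataset cleaning would happen here

    # After the loop, `dataset` is only the last element, so its length
    # under-reports what was actually written; the aggregate is correct.
    total_samples = sum(len(d.samples) for d in datasets)
    print(len(dataset.samples), total_samples)  # prints: 1 4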

Member

Copilot is right here

Comment on lines 37 to 43
# 2) run all generation jobs sequentially
uv run aif generate \
--include-preference-axes \
--max_concurrency 256 \
--output_file "data/70B_generation/education_qna_direct/data.json" \
config/static_copy/education_qna_direct.yaml \
Meta-Llama-3.1-70B-Instruct
Member

Is this duplicated?

Member

What's the use case?

Member

To reiterate, is this necessary to merge, or is it just useful as a pre-processing step for our data? By the way, I would rather we not use this in excess (or at all), since it's unclear how removing specific words might alter the latent preference that we are aiming to model.

Member

Assuming we have finalized this approach for our first release and paper, we should move this to config (in a subsequent PR).
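
If the probabilities do move to config, one hedged sketch of how the response mapper might read them, assuming a YAML task config and hypothetical key names rather than aif_gen's actual schema:

    # Hypothetical sketch: key names and config layout are assumptions.
    import yaml

    def load_inclusion_probabilities(config_path: str) -> dict[str, float]:
        with open(config_path) as f:
            config = yaml.safe_load(f)
        # e.g. preference_axes: {inclusion_probabilities: {style: 0.5, tone: 0.25}}
        return config.get('preference_axes', {}).get('inclusion_probabilities', {})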

Member

Also, it could be worth fixing #140 while we are here, unless you'd rather do it separately for a clean git history.
