
Conversation

@EMZEDI (Collaborator) commented Apr 24, 2025

This is a branch used for generating the second phase of our data.

@EMZEDI requested review from Jacob-Chmura and Copilot, April 24, 2025 14:27
@EMZEDI self-assigned this, Apr 24, 2025
Copilot AI left a comment

Pull Request Overview

This pull request implements a cache fix in the data generation service and adds a new job command for dataset cleaning as part of the second-phase data generation effort. Key changes include:

  • Updating the cache logic in service.py to avoid raising an error on cache misses.
  • Adjusting the task preference inclusion probabilities in the response mapper.
  • Adding a new CLI command (clean_dataset) along with accompanying documentation in benchmarks/README.md.
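
As a hedged illustration of the new command, here is a minimal sketch of what a word-removal cleaning command could look like, assuming a click-style CLI and a flat JSON list of samples; the actual aif_gen/_cli/commands/clean.py and its dataset objects may differ.

    # Hypothetical sketch only: the CLI framework, option names, and dataset
    # layout are assumptions, not the actual aif_gen implementation.
    import json
    import logging

    import click


    @click.command(name='clean_dataset')
    @click.argument('input_data_file', type=click.Path(exists=True))
    @click.argument('output_data_file', type=click.Path())
    @click.option('--remove-word', 'remove_words', multiple=True,
                  help='Word to strip from every sample (repeatable).')
    def clean_dataset(input_data_file, output_data_file, remove_words):
        """Remove the specified words from every sample in a dataset."""
        with open(input_data_file) as f:
            samples = json.load(f)  # assumed: a flat list of dicts with string fields

        for sample in samples:
            for field, text in sample.items():
                if isinstance(text, str):
                    for word in remove_words:
                        text = text.replace(word, '')
                    sample[field] = text

        logging.info(f'Writing {len(samples)} samples to {output_data_file}')
        with open(output_data_file, 'w') as f:
            json.dump(samples, f)

Under these assumptions a typical invocation would look like `aif clean_dataset data/in.json data/out.json --remove-word delve`.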

Reviewed Changes

Copilot reviewed 5 out of 8 changed files in this pull request and generated 1 comment.

Summary per file:
  • benchmarks/README.md: Updated the dataset filename used in the testing/example command.
  • aif_gen/generate/service.py: Changed the cache lookup logic to only process non-None cache output.
  • aif_gen/api/response_mapper/response_mapper.py: Adjusted the preference inclusion probabilities.
  • aif_gen/_cli/main.py: Added the clean_dataset command to the CLI.
  • aif_gen/_cli/commands/clean.py: Introduced a new command to clean datasets by removing specified words.
Files not reviewed (3)
  • jobs/generate_all_downsampled.sh: Language not supported
  • jobs/generate_all_static.sh: Language not supported
  • jobs/validate_all_static.sh: Language not supported
Comments suppressed due to low confidence (1)

aif_gen/generate/service.py:536

  • Previously an error was raised for a missing cache entry; please ensure that the updated behavior (doing nothing if output is None) is intended, or consider adding logging or alternative handling for cache misses.
if output is not None:
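
A minimal sketch of the logging-based alternative suggested above; the cache object and helper names here are assumptions, not the actual identifiers in aif_gen/generate/service.py.

    import logging

    # Hypothetical cache lookup that logs misses instead of silently skipping them.
    def lookup_with_logging(cache: dict, cache_key: str):
        output = cache.get(cache_key)
        if output is not None:
            logging.debug(f'Cache hit for {cache_key!r}')
        else:
            logging.debug(f'Cache miss for {cache_key!r}; falling back to generation')
        return output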

Comment on lines +84 to +86
logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
input_dataset.to_json(output_data_file)
logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')
Copilot AI Apr 24, 2025

The variable 'dataset' still refers to the last dataset iterated over, so len(dataset) does not reflect the aggregate number of cleaned samples; consider using the total sample count (or another appropriately aggregated value) so the log accurately reports the number of samples written.

Suggested change
-logging.info(f'Writing {len(dataset)} samples to {output_data_file}')
-input_dataset.to_json(output_data_file)
-logging.info(f'Wrote {len(dataset)} samples to {output_data_file}')
+total_samples = sum(len(dataset.samples) for dataset in input_dataset.datasets)
+logging.info(f'Writing {total_samples} samples to {output_data_file}')
+input_dataset.to_json(output_data_file)
+logging.info(f'Wrote {total_samples} samples to {output_data_file}')
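
To make the issue concrete, here is a small self-contained example of the loop-variable behavior; the dataclass layout is assumed for illustration and is not aif_gen's actual dataset structure.

    from dataclasses import dataclass, field

    @dataclass
    class Dataset:
        samples: list = field(default_factory=list)

    datasets = [Dataset(samples=[1, 2, 3]), Dataset(samples=[4])]

    for dataset in datasets:
        pass  # per-dataset cleaning would happen here

    # After the loop, `dataset` is only the last element, so its length
    # under-reports what was actually written; the aggregate is correct.
    total_samples = sum(len(d.samples) for d in datasets)
    print(len(dataset.samples), total_samples)  # prints: 1 4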

Member

Copilot is right here

Comment on lines 37 to 43
# 2) run all generation jobs sequentially
uv run aif generate \
--include-preference-axes \
--max_concurrency 256 \
--output_file "data/70B_generation/education_qna_direct/data.json" \
config/static_copy/education_qna_direct.yaml \
Meta-Llama-3.1-70B-Instruct
Member

Is this duplicated?

Member

What's the use case?

Member

To reiterate, is this necessary to merge, or is it just useful as a pre-processing step for our data? By the way, I would rather we not use this in excess (or at all), since it's unclear how removing specific words might alter the latent preference that we are aiming to model.

Member

Assuming we have finalized this approach for our first release and paper, we should move this to config (in a subsequent PR).
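
If the probabilities do move to config, one hedged sketch of how the response mapper might read them, assuming a YAML task config and hypothetical key names rather than aif_gen's actual schema:

    # Hypothetical sketch: key names and config layout are assumptions.
    import yaml

    def load_inclusion_probabilities(config_path: str) -> dict[str, float]:
        with open(config_path) as f:
            config = yaml.safe_load(f)
        # e.g. preference_axes: {inclusion_probabilities: {style: 0.5, tone: 0.25}}
        return config.get('preference_axes', {}).get('inclusion_probabilities', {})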

Member

Also, it could be worth fixing #140 while we are here, unless you'd rather do it separately for a clean git history.
