Fix: Resolve dictionary key overwrites and missing pandas import in data generation pipeline #92

Open
kamilansri wants to merge 1 commit into humanai-foundation:main from kamilansri:fix/data-pipeline-runtime-errors
Conversation

@kamilansri
Description

This PR addresses several critical bugs in the data_generation_pipeline script that prevented successful execution and caused silent logic failures.

Changes Made

  • Added Missing Import: Added import pandas as pd at the top of the file to prevent a NameError during the final dataset CSV read.
  • Fixed Dictionary Key Overwrites: Refactored the book_transformations dictionary. Previously, repeated keys in the dictionary literal (e.g., 'denoise_image') silently overwrote one another, because Python keeps only the last value given for a duplicate key. These sequential steps are now bundled into lists (e.g., 'denoise_image': [{'method': 'bilateral'}, {'method': 'nlm'}]) to preserve the pipeline's intended sequence of operations. (Note: Ensure process_multiple_books can parse these list values.)
  • Variable Renaming: Stopped reusing a single df variable, which was assigned and then overwritten three times in sequence. Each DataFrame now has an explicit name (regions_df, dataset_df, final_df) so intermediate results stay available for debugging.
  • Passed Defined Variables: Replaced the hardcoded 0.8 in mapping_bounding_boxes with the initialized similarity_threshold variable.
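
For reviewers, here is a minimal sketch of the duplicate-key behaviour described above, and of one way a consumer could accept the new list values. The helper iter_steps is hypothetical and is not taken from the repository. The real process_multiple_books may handle this differently.

```python
# Duplicate keys in a dict literal raise no error: the last value silently wins.
overwritten = {
    "denoise_image": {"method": "bilateral"},
    "denoise_image": {"method": "nlm"},
}

# Bundling the sequential steps into a list keeps both of them, in order.
book_transformations = {
    "denoise_image": [{"method": "bilateral"}, {"method": "nlm"}],
}


def iter_steps(transformations):
    """Yield (name, params) pairs from a dict whose values are a dict or a list of dicts.

    Hypothetical helper for illustration only.
    """
    for name, params in transformations.items():
        # Wrap a lone dict so single-step and multi-step entries are handled alike.
        steps = params if isinstance(params, list) else [params]
        for step in steps:
            yield name, step


steps = list(iter_steps(book_transformations))
```

Normalizing to a list at the point of consumption would let existing single-dict entries keep working alongside the new list entries.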
