Add directory validation, empty dataset check and move metrics to module level in utils.py #95

Open
agentksimha wants to merge 11 commits into humanai-foundation:main from agentksimha:refactor/utils-validation-and-metrics

Changes made to RenAIssance_Transformer_OCR_Utsav_Rai/code/utils.py:

  1. Added image_dir and text_dir validation in SpanishDocumentsDataset.__init__
    Neither directory was validated before use. An invalid path made os.listdir raise a generic FileNotFoundError that did not indicate which argument was wrong. Added explicit FileNotFoundError checks for both, with descriptive messages.

  2. Added empty dataset check in SpanishDocumentsDataset.__init__
    If image_dir exists but contains no .jpg files, self.filenames is empty and the DataLoader silently produces nothing during training. Added a check that raises ValueError after scanning for .jpg files.

  3. Added image_path validation in generate_text_from_image_segment
    The function did not check that the path exists before calling Image.open. A missing file was caught by the generic except Exception block, which produced a vague error message. Added an explicit FileNotFoundError check before the try block so the failure is immediately actionable.

  4. Moved load_metric calls to module level
    load_metric was called three times inside compute_metrics, so the cer, wer and bleu metrics were reloaded from disk on every evaluation step during training. Moved all three to module-level constants so they are loaded once at import time and reused across all calls.
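
For illustration, the checks described in items 1 to 3 could look roughly like the sketch below. It is not the code from this PR: the helper names list_jpg_filenames and require_image_file and the exact message wording are hypothetical, and only the standard library is used so the sketch runs on its own.

```python
import os


def list_jpg_filenames(image_dir, text_dir):
    """Validate both directories, then return the sorted .jpg names in image_dir.

    Hypothetical helper: stands in for the checks the PR describes adding
    to SpanishDocumentsDataset.__init__.
    """
    # Check each argument separately so the message names the one that is wrong.
    if not os.path.isdir(image_dir):
        raise FileNotFoundError(
            f"image_dir does not exist or is not a directory: {image_dir!r}"
        )
    if not os.path.isdir(text_dir):
        raise FileNotFoundError(
            f"text_dir does not exist or is not a directory: {text_dir!r}"
        )

    filenames = sorted(f for f in os.listdir(image_dir) if f.endswith(".jpg"))

    # Fail early instead of handing an empty dataset to the DataLoader.
    if not filenames:
        raise ValueError(f"No .jpg files found in image_dir: {image_dir!r}")
    return filenames


def require_image_file(image_path):
    """Raise a specific error before any attempt to open the image.

    Hypothetical helper: mirrors the check described for
    generate_text_from_image_segment, placed before its try block.
    """
    if not os.path.isfile(image_path):
        raise FileNotFoundError(f"image_path does not exist: {image_path!r}")
    return image_path
```

Naming the offending argument in the message is the main point: the exception type stays the same as what os.listdir would raise, but the caller can tell immediately whether image_dir or text_dir was wrong.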
