improve robustness of remote dataset retrieval#381

Merged
sophiamaedler merged 4 commits into main from fix_364
Mar 1, 2026

sophiamaedler (Collaborator) commented Mar 1, 2026

Summary

This PR improves robustness of remote dataset retrieval by handling incomplete prior downloads more safely, and adds regression coverage. Fixes #364

What changed

src/scportrait/data/_datasets.py

  • Updated _get_remote_dataset(...) logic to detect incomplete dataset states:
    • If the dataset directory exists but the expected target file (name) is missing, it now re-downloads automatically.
  • Consolidated download trigger logic into explicit flags:
    • force_download
    • missing dataset directory
    • missing expected file
  • When retrying due to missing expected file, download is invoked with overwrite=True to recover from stale/incomplete artifacts.
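
The decision logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in `_datasets.py`: the helper name `plan_download` and its return shape are hypothetical, and treating `overwrite` as true for both forced downloads and the missing-file case is an assumption based on the PR description and review summary.

```python
from pathlib import Path


def plan_download(dataset_dir: Path, name: str, force_download: bool = False) -> tuple[bool, bool]:
    """Hypothetical helper: return (should_download, overwrite)."""
    # Trigger 1: the dataset directory was never created.
    missing_dir = not dataset_dir.exists()
    # Trigger 2: the directory exists but the expected file does not,
    # which is what an interrupted download can leave behind.
    missing_file = dataset_dir.exists() and not (dataset_dir / name).exists()

    should_download = force_download or missing_dir or missing_file
    # Assumption: overwrite stale artifacts when forcing or recovering.
    overwrite = force_download or missing_file
    return should_download, overwrite
```

Under these assumptions, a directory that exists without its expected file yields `(True, True)`, while a complete dataset yields `(False, False)` unless `force_download` is set.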

Tests added

tests/unit_tests/data/test_datasets.py

Added new unit tests for _get_remote_dataset(...):

  1. re-downloads when dataset directory exists but expected file is missing
  2. skips download when expected file already exists
  3. downloads when dataset directory does not exist

These tests mock _download and get_data_dir so behavior is deterministic and isolated from network dependencies.
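
As an illustration of the mocking approach, the three scenarios could be exercised along these lines. This is a self-contained sketch: `fetch_dataset` is a hypothetical stand-in for the loader (the real tests target `_get_remote_dataset(...)` and patch `_download` and `get_data_dir`), and the dataset and file names are made up.

```python
import tempfile
from pathlib import Path
from unittest.mock import Mock


def fetch_dataset(get_data_dir, download, dataset: str, name: str) -> Path:
    """Stand-in for the loader under test (hypothetical, not scPortrait's code)."""
    dataset_dir = Path(get_data_dir()) / dataset
    missing_file = dataset_dir.exists() and not (dataset_dir / name).exists()
    if not dataset_dir.exists() or missing_file:
        download(dataset_dir, overwrite=missing_file)
    return dataset_dir / name


def run_scenarios() -> dict[str, int]:
    """Return how often the mocked download was called in each scenario."""
    calls = {}
    with tempfile.TemporaryDirectory() as tmp:
        get_data_dir = Mock(return_value=tmp)  # no real data directory needed

        # 1. Dataset directory does not exist: a download is expected.
        download = Mock()
        fetch_dataset(get_data_dir, download, "demo", "config.yml")
        calls["missing_dir"] = download.call_count

        # 2. Directory exists but the expected file is missing: re-download.
        (Path(tmp) / "demo").mkdir()
        download = Mock()
        fetch_dataset(get_data_dir, download, "demo", "config.yml")
        download.assert_called_once_with(Path(tmp) / "demo", overwrite=True)
        calls["missing_file"] = download.call_count

        # 3. Expected file already exists: the download is skipped.
        (Path(tmp) / "demo" / "config.yml").write_text("ok")
        download = Mock()
        fetch_dataset(get_data_dir, download, "demo", "config.yml")
        calls["file_present"] = download.call_count
    return calls
```

Because the download is a `Mock`, nothing touches the network, and each scenario reduces to asserting on `call_count` and the call arguments.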

Why

Previously, an interrupted or failed download could leave behind a dataset directory without the expected file. In that state, the loader could incorrectly skip download and later fail with confusing file-not-found errors.
This change makes the retrieval path self-healing for that common partial-download scenario.

Copilot AI review requested due to automatic review settings March 1, 2026 18:42
sophiamaedler changed the title from "Fix 364" to "improve robustness of remote dataset retrieval" Mar 1, 2026
Copilot AI (Contributor) left a comment

Pull request overview

This PR hardens scportrait.data._datasets._get_remote_dataset(...) against incomplete prior downloads by re-triggering downloads when the dataset directory exists but the expected target file is missing, and adds unit tests to prevent regressions.

Changes:

  • Add missing-expected-file detection to _get_remote_dataset(...) and route download decisions through a single should_download flag.
  • Pass an overwrite flag into _download(...) when forcing downloads or recovering from an incomplete dataset state.
  • Add unit tests covering download/redownload/skip behaviors for _get_remote_dataset(...).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File | Description
--- | ---
src/scportrait/data/_datasets.py | Adds expected-file checks and consolidated download/overwrite decision logic to make remote dataset retrieval self-healing.
tests/unit_tests/data/test_datasets.py | Adds regression tests validating download invocation behavior under directory/file-present/missing scenarios.

sophiamaedler merged commit 973647c into main Mar 1, 2026
2 checks passed
sophiamaedler deleted the fix_364 branch March 1, 2026 19:29

Development

Successfully merging this pull request may close these issues.

aborted downloads of external resources (e.g. example config files) leads to cryptic errors

2 participants