Skip to content

Conversation

@chrisiacovella
Copy link
Member

@chrisiacovella chrisiacovella commented Jan 23, 2026

Pull Request Summary

The main purpose of this PR is to do a slight revision on the caching, as discussed in #396 . Briefly, the solution prior to this PR would fetch any datasets not in the cache (if say, the defaults changed), but the updated files would not be saved into a new cache (requiring manual removal to generate a new one). This adds in a new step that write to a text file that contains a list of the dataset names and versions. We then hash that file and add it to the key, so we know if we have a cache miss.

Also, this changes the location of the cache to ~\.cache/modelforge_testing_dataset_cache. For some reason this was not working for me initially (in terms of saving the cache), but there could have been other issues. This was done to remove problems associated with local testing.

Also, this adds in fall backs for downloading files. As discussed in #397, when zenodo was rejecting python.requests connections, I found I could use curl or wget successfully from the command line. I've added in wrappers to those system tools; they will be called after the max number of retries fail (first trying wget if available, then curl if available). I did add in logic to specify the method used in the download function, but that is not exposed at the level of the .toml files as I'm not really sure that is necessary.

Key changes

Notable points that this PR has either accomplished or will accomplish.

  • modify caching scheme
  • add fallback downloading schemes
  • Add 3.13 to the testing matrix

Associated Issue(s)

Pull Request Checklist

  • Issue(s) raised/addressed and linked
  • Includes appropriate unit test(s)
  • Appropriate docstring(s) added/updated
  • Appropriate .rst doc file(s) added/updated
  • PR is ready for review

@codecov-commenter
Copy link

codecov-commenter commented Jan 23, 2026

Codecov Report

❌ Patch coverage is 75.75758% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 90.68%. Comparing base (eaf3817) to head (e85fb34).

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request revises the dataset caching mechanism to better handle cache invalidation and adds fallback download methods (wget, curl) for when the primary requests library fails. The cache location is moved to ~/.cache/modelforge_testing_dataset_cache, and a new tracking file downloaded_datasets.txt is used to generate cache keys based on required datasets and versions. Additionally, a new Aimnet2 dataset curation module is introduced.

Changes:

  • Modified caching scheme to track dataset requirements via hash of downloaded_datasets.txt file for proper cache invalidation
  • Added wget and curl fallback mechanisms for dataset downloads when Python requests library fails
  • Relocated dataset cache from local directory to ~/.cache/modelforge_testing_dataset_cache

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
modelforge/utils/remote.py Added wget_wrapper, curl_wrapper, and requests_wrapper functions for download fallbacks; refactored download_from_url to support multiple download schemes with retry logic
modelforge/tests/list_test_datasets.py New file that generates downloaded_datasets.txt tracking file for cache key hashing
.github/workflows/CI.yaml Updated cache configuration to use hashFiles on downloaded_datasets.txt for cache invalidation; upgraded to actions/cache@v4
modelforge/tests/conftest.py Updated dataset_temp_dir fixture to use new cache location with expanduser
modelforge/tests/fetch_test_datasets.py Updated to use new cache location and added documentation
modelforge/tests/test_remote.py Added parametrized tests for different download schemes (auto, curl, wget)
modelforge/tests/test_aaa_setup.py Removed entire file (dataset download logic moved to fetch_test_datasets.py)
scripts/plot_dataset.py Updated hardcoded paths to use new cache location
modelforge/utils/io.py Added fmt:off comments to preserve message formatting
modelforge-curate/modelforge/curate/datasets/aimnet2_curation.py New curation module for AIMNet2 dataset
modelforge/dataset/yaml_files/ani2x.yaml Updated file names to match correct naming conventions
modelforge/dataset/parameters.py Added commented placeholder for future spin_multiplicity_per_atom property
modelforge/tests/precalculated_values.py Added fmt:off comment for code formatting
README.md Added installation instructions and updated copyright year to 2026

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@chrisiacovella chrisiacovella merged commit fe4932a into choderalab:main Jan 23, 2026
22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants