Rev caching #398
Conversation
Pull request overview
This pull request revises the dataset caching mechanism to better handle cache invalidation and adds fallback download methods (wget, curl) for when the primary requests library fails. The cache location is moved to ~/.cache/modelforge_testing_dataset_cache, and a new tracking file downloaded_datasets.txt is used to generate cache keys based on required datasets and versions. Additionally, a new Aimnet2 dataset curation module is introduced.
Changes:
- Modified the caching scheme to track dataset requirements via a hash of the downloaded_datasets.txt file, enabling proper cache invalidation
- Added wget and curl fallback mechanisms for dataset downloads when the Python requests library fails
- Relocated the dataset cache from a local directory to ~/.cache/modelforge_testing_dataset_cache
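The manifest-plus-hash idea above can be sketched as follows. This is an illustrative sketch only: the function names (write_dataset_manifest, manifest_cache_key) and the manifest line format are assumptions, not the code from this PR; only the file name downloaded_datasets.txt comes from the PR description.

```python
# Sketch of manifest-based cache keying (names and format are assumptions).
import hashlib
from pathlib import Path


def write_dataset_manifest(datasets, path="downloaded_datasets.txt"):
    """Record the required dataset names and versions, one per line.

    Sorting makes the manifest (and thus the hash) independent of the
    order in which datasets were listed.
    """
    lines = sorted(f"{name}=={version}" for name, version in datasets)
    Path(path).write_text("\n".join(lines) + "\n")


def manifest_cache_key(path="downloaded_datasets.txt"):
    """Derive a cache key from the manifest contents.

    Any change to the required datasets or versions changes the hash,
    which forces a cache miss and a fresh cache entry.
    """
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return f"dataset-cache-{digest[:16]}"
```

Because the key is derived purely from the manifest contents, adding, removing, or bumping a dataset version automatically invalidates the old cache without any manual cleanup.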
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| modelforge/utils/remote.py | Added wget_wrapper, curl_wrapper, and requests_wrapper functions for download fallbacks; refactored download_from_url to support multiple download schemes with retry logic |
| modelforge/tests/list_test_datasets.py | New file that generates downloaded_datasets.txt tracking file for cache key hashing |
| .github/workflows/CI.yaml | Updated cache configuration to use hashFiles on downloaded_datasets.txt for cache invalidation; upgraded to actions/cache@v4 |
| modelforge/tests/conftest.py | Updated dataset_temp_dir fixture to use new cache location with expanduser |
| modelforge/tests/fetch_test_datasets.py | Updated to use new cache location and added documentation |
| modelforge/tests/test_remote.py | Added parametrized tests for different download schemes (auto, curl, wget) |
| modelforge/tests/test_aaa_setup.py | Removed entire file (dataset download logic moved to fetch_test_datasets.py) |
| scripts/plot_dataset.py | Updated hardcoded paths to use new cache location |
| modelforge/utils/io.py | Added fmt:off comments to preserve message formatting |
| modelforge-curate/modelforge/curate/datasets/aimnet2_curation.py | New curation module for AIMNet2 dataset |
| modelforge/dataset/yaml_files/ani2x.yaml | Updated file names to match correct naming conventions |
| modelforge/dataset/parameters.py | Added commented placeholder for future spin_multiplicity_per_atom property |
| modelforge/tests/precalculated_values.py | Added fmt:off comment for code formatting |
| README.md | Added installation instructions and updated copyright year to 2026 |
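The CI.yaml change described in the table might look roughly like the workflow fragment below. This is a hedged sketch, not the PR's actual workflow: step names are invented, and only the script path, cache path, manifest file name, and actions/cache@v4 with hashFiles come from the descriptions above.

```yaml
# Illustrative fragment only; step names are assumptions.
- name: Generate dataset manifest
  run: python modelforge/tests/list_test_datasets.py

- name: Cache test datasets
  uses: actions/cache@v4
  with:
    path: ~/.cache/modelforge_testing_dataset_cache
    key: dataset-cache-${{ hashFiles('downloaded_datasets.txt') }}
```

With hashFiles in the key, editing the list of required datasets produces a new key, so the workflow saves a fresh cache entry instead of silently reusing a stale one.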
Pull Request Summary
The main purpose of this PR is a slight revision to the caching, as discussed in #396. Briefly, prior to this PR the code would fetch any datasets not in the cache (if, say, the defaults changed), but the updated files would not be saved into a new cache, requiring manual removal of the old cache to generate a new one. This PR adds a step that writes a text file containing a list of the dataset names and versions. We then hash that file and add the hash to the cache key, so we know when we have a cache miss.
Also, this changes the location of the cache to ~/.cache/modelforge_testing_dataset_cache. For some reason the previous location was not saving the cache for me initially, though there could have been other issues; the move was made to avoid problems with local testing.

This PR also adds fallbacks for downloading files. As discussed in #397, when Zenodo was rejecting Python requests connections, I found I could use curl or wget successfully from the command line. I've added wrappers around those system tools; they are called after the maximum number of retries fails (first trying wget if available, then curl if available). I did add logic to specify the download method in the download function, but that is not exposed at the level of the .toml files, as I'm not sure that is necessary.
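The fallback chain described above can be sketched as follows. The wrapper names mirror the wget_wrapper and curl_wrapper mentioned in the review summary, but the bodies here are assumptions about how such wrappers might work, not the PR's actual implementation.

```python
# Sketch of requests -> wget -> curl fallback (implementation is assumed).
import shutil
import subprocess


def wget_wrapper(url: str, output_path: str) -> bool:
    """Download with the system wget tool; return True on success."""
    if shutil.which("wget") is None:
        return False
    result = subprocess.run(["wget", "-q", "-O", output_path, url])
    return result.returncode == 0


def curl_wrapper(url: str, output_path: str) -> bool:
    """Download with the system curl tool; return True on success."""
    if shutil.which("curl") is None:
        return False
    result = subprocess.run(["curl", "-sSL", "-o", output_path, url])
    return result.returncode == 0


def download_with_fallback(url: str, output_path: str, max_retries: int = 3):
    """Try requests up to max_retries times, then fall back to wget, then curl."""
    import requests  # third-party; imported lazily so the fallbacks work without it

    for _ in range(max_retries):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            with open(output_path, "wb") as f:
                f.write(response.content)
            return
        except requests.RequestException:
            continue
    # Primary method exhausted; hand off to the system tools.
    for tool in (wget_wrapper, curl_wrapper):
        if tool(url, output_path):
            return
    raise RuntimeError(f"All download methods failed for {url}")
```

Each wrapper checks tool availability with shutil.which before invoking it, so the chain degrades gracefully on systems where wget or curl is not installed.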