Closed
Description
Problem
The CI workflow uploads data to policyengine/policyengine-uk-data-private (Hugging Face), but policyengine-uk downloads from the public repo policyengine/policyengine-uk-data.
This means that when new data is built and uploaded via CI (e.g., PR #220 adding salary sacrifice imputation), it goes to the private repo but isn't accessible to users of policyengine-uk.
Details
Upload script (upload_completed_datasets.py:16-18):
    upload_data_files(
        files=dataset_files,
        hf_repo_name="policyengine/policyengine-uk-data-private",  # <-- PRIVATE
        ...
    )
Download source in policyengine-uk (policyengine-uk/simulation.py:109):
    self.build_from_url(
        "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5"  # <-- PUBLIC (no -private suffix)
    )
Impact
- PR #220 (salary sacrifice imputation) was merged and CI ran successfully
- Data was built and uploaded to the private repo (100% upload completed in logs)
- But Microsimulation() downloads from the public repo
- Users don't get the updated data
Verification
After clearing the HF cache and re-downloading, the data is unchanged:
- SS contributors: 1.22m
- Total SS: £6.91bn
This confirms the public repo wasn't updated.
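The aggregate figures above are one signal. A complementary check, sketched below, is to compare file digests of the copy downloaded before clearing the cache and the copy downloaded afterwards. This is a minimal sketch using only the standard library; the helper names `sha256_of_file` and `dataset_changed` are hypothetical and not part of policyengine-uk or policyengine-uk-data, and it assumes you have kept both copies of the .h5 file on disk.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large .h5 files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dataset_changed(old_path, new_path):
    """Return True if the two files differ byte-for-byte (by SHA-256)."""
    return sha256_of_file(old_path) != sha256_of_file(new_path)
```

If `dataset_changed` returns False for the pre- and post-refresh downloads, the public repo is still serving the same file, which matches the unchanged figures reported above.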
Suggested Fix
Either:
- Change upload_completed_datasets.py to upload to policyengine/policyengine-uk-data (the public repo that policyengine-uk downloads from)
- Or change policyengine-uk to download from policyengine/policyengine-uk-data-private (would require HF token authentication)
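Whichever option is chosen, a small guard could stop the two repo names from drifting apart again. The sketch below is illustrative only: `repo_from_hf_url` and `check_repos_match` are hypothetical helpers, not existing functions in either package, and it assumes the `hf://owner/name/filename` URL shape quoted in Details.

```python
def repo_from_hf_url(url):
    """Extract 'owner/name' from a URL shaped like hf://owner/name/path/to/file."""
    prefix = "hf://"
    if not url.startswith(prefix):
        raise ValueError(f"not an hf:// URL: {url!r}")
    parts = url[len(prefix):].split("/")
    # Need at least owner, repo name, and one path segment for the file.
    if len(parts) < 3 or not all(parts[:3]):
        raise ValueError(f"expected hf://owner/name/file, got {url!r}")
    return "/".join(parts[:2])


def check_repos_match(upload_repo, download_url):
    """Raise RuntimeError if the upload target differs from the download repo."""
    download_repo = repo_from_hf_url(download_url)
    if upload_repo != download_repo:
        raise RuntimeError(
            f"upload target {upload_repo!r} does not match "
            f"download source {download_repo!r}"
        )


# Values quoted in this issue; the check below would fail for the current setup.
UPLOAD_REPO = "policyengine/policyengine-uk-data-private"
DOWNLOAD_URL = "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5"
```

Calling `check_repos_match(UPLOAD_REPO, DOWNLOAD_URL)` with the values above raises a `RuntimeError`, which is the mismatch this issue describes. Run as a CI step, a check along these lines would surface the problem before data is uploaded.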