Data uploads to private repo but policyengine-uk downloads from public repo #221

@MaxGhenis

Problem

The CI workflow uploads data to policyengine/policyengine-uk-data-private (Hugging Face), but policyengine-uk downloads from the public repo policyengine/policyengine-uk-data.

This means that when new data is built and uploaded via CI (e.g., PR #220 adding salary sacrifice imputation), it goes to the private repo but isn't accessible to users of policyengine-uk.

Details

Upload script (upload_completed_datasets.py:16-18):

upload_data_files(
    files=dataset_files,
    hf_repo_name="policyengine/policyengine-uk-data-private",  # <-- PRIVATE
    ...
)

Download source in policyengine-uk (policyengine-uk/simulation.py:109):

self.build_from_url(
    "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5"  # <-- PUBLIC (no -private suffix)
)

Impact

  • PR #220 (Add salary sacrifice imputation to dataset pipeline) was merged and CI ran successfully
  • Data was built and uploaded to the private repo (100% upload completed in logs)
  • But Microsimulation() downloads from the public repo
  • Users don't get the updated data

Verification

After clearing the HF cache and re-downloading, the data is unchanged:

  • SS contributors: 1.22m
  • Total SS: £6.91bn

This confirms the public repo wasn't updated.
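A more direct check than comparing aggregates is to look at each repo's commit history on Hugging Face. The sketch below is one way to do that, assuming huggingface_hub is installed and that the data lives in model-type repos (the library default). The helper names fetch_commits and newest_commit are hypothetical and not part of either codebase.

```python
def fetch_commits(repo_id, token=None):
    """Return the commit history of a Hugging Face repo as a list.

    huggingface_hub is imported lazily so that newest_commit() below can be
    used without it. Assumption: the repo is a model-type repo (the default).
    A token is required for the private repo.
    """
    from huggingface_hub import HfApi

    return list(HfApi().list_repo_commits(repo_id, token=token))


def newest_commit(commits):
    """Pick the most recent commit by its created_at timestamp."""
    return max(commits, key=lambda c: c.created_at)
```

Calling newest_commit(fetch_commits("policyengine/policyengine-uk-data")).created_at and comparing it with the time of the CI run for #220 would show whether the public repo received the new files. The same call against the -private repo needs an HF token with read access.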

Suggested Fix

Either:

  1. Change upload_completed_datasets.py to upload to policyengine/policyengine-uk-data (the public repo that policyengine-uk downloads from)
  2. Or change policyengine-uk to download from policyengine/policyengine-uk-data-private (would require HF token authentication)
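Whichever option is chosen, a small guard in CI could catch this kind of drift in future. The following is a minimal sketch, not existing code in either repo: repo_from_hf_url and check_repos_agree are hypothetical helpers, and the two constants simply copy the values quoted in this issue.

```python
HF_PREFIX = "hf://"

# Values quoted in this issue; in practice they would be read from the
# upload script and from policyengine-uk rather than hard-coded here.
UPLOAD_REPO = "policyengine/policyengine-uk-data-private"
DOWNLOAD_URL = "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5"


def repo_from_hf_url(url):
    """Extract 'owner/name' from a URL of the form hf://owner/name/path."""
    if not url.startswith(HF_PREFIX):
        raise ValueError(f"not an hf:// URL: {url!r}")
    parts = url[len(HF_PREFIX):].split("/")
    if len(parts) < 3 or not all(parts[:2]):
        raise ValueError(f"expected hf://owner/name/path, got {url!r}")
    return f"{parts[0]}/{parts[1]}"


def check_repos_agree(upload_repo, download_url):
    """Raise if the upload target differs from the repo the URL points at."""
    download_repo = repo_from_hf_url(download_url)
    if upload_repo != download_repo:
        raise RuntimeError(
            f"upload target {upload_repo!r} does not match "
            f"download source {download_repo!r}"
        )
```

With the current values, check_repos_agree(UPLOAD_REPO, DOWNLOAD_URL) raises, which is the mismatch described above. After either fix is applied, the two names would agree and the check would pass silently.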
