Open
Description
Problem
policyengine-uk downloads the enhanced FRS dataset from HuggingFace without pinning a version:
hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5
hf_hub_download uses an etag check to detect updates, but if the check times out (default 10s) or fails silently, it falls back to the cached copy in ~/.cache/huggingface/hub/, which may be stale. This has caused real bugs: a stale cached file that lacked highest_education led to KeyError crashes in economic_assumptions.py.
Fix
Pin the dataset version in the URL using the existing @version syntax:
hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5@1.40.1
Both HF repos (policyengine-uk-data and policyengine-uk-data-private) already publish version tags (e.g. 1.40.1). The download code in simulation.py already parses @version and passes it as the revision parameter to hf_hub_download.
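For illustration, the parsing step described above might look roughly like the sketch below. This is not the code in simulation.py: the names HFReference, parse_hf_url and download are hypothetical, and the sketch assumes the URL has the shape hf://owner/repo/filename with an optional @revision suffix and that the repo uses hf_hub_download's default repo type.

```python
from typing import NamedTuple, Optional


class HFReference(NamedTuple):
    """Parts of an hf:// URL (hypothetical helper type, not from policyengine-uk)."""

    repo_id: str
    filename: str
    revision: Optional[str]  # None means "no pin": the default branch is used


def parse_hf_url(url: str) -> HFReference:
    """Split hf://owner/repo/path/to/file[@revision] into its parts."""
    prefix = "hf://"
    if not url.startswith(prefix):
        raise ValueError(f"Not an hf:// URL: {url!r}")
    # Everything after the first "@" is treated as the revision (tag, branch or commit).
    path, _, revision = url[len(prefix):].partition("@")
    parts = path.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected hf://owner/repo/filename, got {url!r}")
    owner, repo, filename = parts
    return HFReference(f"{owner}/{repo}", filename, revision or None)


def download(url: str) -> str:
    """Download the file named by an hf:// URL and return its local path."""
    # Imported lazily so the parser can be used without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    ref = parse_hf_url(url)
    return hf_hub_download(
        repo_id=ref.repo_id,
        filename=ref.filename,
        revision=ref.revision,  # e.g. "1.40.1" when the URL is pinned
    )
```

With the pinned URL, parse_hf_url yields revision "1.40.1"; with the unpinned URL it yields None, which is the case that leaves the result dependent on the etag check.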
Benefits
- Reproducible: each policyengine-uk release is tied to a specific data version
- No stale cache: a new version = new cache key, so there's no ambiguity
- Explicit upgrades: when policyengine-uk-data publishes a new version, policyengine-uk bumps the pin deliberately
Implementation
- In simulation.py, append @{version} to the HF URL (currently around line 149)
- When policyengine-uk-data releases a new version, update the pinned version in policyengine-uk
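One possible way to keep the pin in a single place is sketched below. The names DATA_VERSION, DATA_URL_TEMPLATE, pinned_data_url and is_pinned are hypothetical and do not come from policyengine-uk; the version value is the example tag mentioned above.

```python
# Hypothetical constants: the single place to edit when upgrading the data.
DATA_VERSION = "1.40.1"
DATA_URL_TEMPLATE = (
    "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5@{version}"
)


def pinned_data_url(version: str = DATA_VERSION) -> str:
    """Build the dataset URL with an explicit @version suffix."""
    if not version:
        raise ValueError("A dataset version is required; refusing to build an unpinned URL")
    return DATA_URL_TEMPLATE.format(version=version)


def is_pinned(url: str) -> bool:
    """Return True if the URL carries a non-empty @version suffix."""
    _, separator, revision = url.partition("@")
    return bool(separator and revision)
```

A check such as is_pinned(pinned_data_url()) could run in the test suite so that an unpinned default URL is caught before release.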