Pin HuggingFace dataset version to avoid stale cache #1505

@MaxGhenis

Problem

policyengine-uk downloads the enhanced FRS dataset from HuggingFace without pinning a version:

hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5

hf_hub_download uses an ETag check to detect updates, but if the check times out (default 10s) or fails silently, it falls back to whatever copy is already cached in ~/.cache/huggingface/hub/, which may be stale. This has caused real bugs: a stale cache missing highest_education led to KeyError crashes in economic_assumptions.py.

Fix

Pin the dataset version in the URL using the existing @version syntax:

hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5@1.40.1

Both HF repos (policyengine-uk-data and policyengine-uk-data-private) already publish version tags (e.g. 1.40.1). The download code in simulation.py already parses @version and passes it as the revision parameter to hf_hub_download.
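For illustration, here is a minimal sketch of how an hf:// URL with an optional @version suffix could be split and handed to hf_hub_download. It is not the actual code in simulation.py: HFReference, parse_hf_url and download are hypothetical names, and the sketch leaves repo_type at the library default, which may not match how these repos are configured.

```python
from typing import NamedTuple, Optional


class HFReference(NamedTuple):
    repo_id: str
    filename: str
    revision: Optional[str]  # None means no pin was given in the URL


def parse_hf_url(url: str) -> HFReference:
    """Split hf://owner/repo/path/to/file[@revision] into its parts.

    Hypothetical helper for illustration only.
    """
    prefix = "hf://"
    if not url.startswith(prefix):
        raise ValueError(f"Not an hf:// URL: {url!r}")
    # Everything after the first "@" is treated as the revision (tag, branch or commit).
    path, _, revision = url[len(prefix):].partition("@")
    parts = path.split("/", 2)
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"Expected hf://owner/repo/filename, got {url!r}")
    owner, repo, filename = parts
    return HFReference(f"{owner}/{repo}", filename, revision or None)


def download(url: str) -> str:
    """Download the file named by an hf:// URL and return its local path."""
    # Imported lazily so the parser above works without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    ref = parse_hf_url(url)
    return hf_hub_download(
        repo_id=ref.repo_id,
        filename=ref.filename,
        revision=ref.revision,  # a tag such as "1.40.1" when the URL is pinned
    )
```

With the pinned URL above, parse_hf_url yields the repo policyengine/policyengine-uk-data, the file enhanced_frs_2023_24.h5 and the revision 1.40.1; without the suffix, the revision is None and the download resolves against the default branch.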

Benefits

  • Reproducible: each policyengine-uk release is tied to a specific data version
  • No stale cache: a new version = new cache key, so there's no ambiguity
  • Explicit upgrades: when policyengine-uk-data publishes a new version, policyengine-uk bumps the pin deliberately

Implementation

  1. In simulation.py, append @{version} to the HF URL (currently around line 149)
  2. When policyengine-uk-data releases a new version, update the pinned version in policyengine-uk
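
The two steps above could be sketched as a single version constant that is bumped on each data release. The names DATA_VERSION, BASE_URL, pin and is_pinned are hypothetical and do not come from simulation.py; is_pinned is the kind of check a unit test could use to catch an unpinned URL before release.

```python
# Hypothetical names for illustration; not taken from simulation.py.
DATA_VERSION = "1.40.1"  # bump deliberately when policyengine-uk-data releases
BASE_URL = "hf://policyengine/policyengine-uk-data/enhanced_frs_2023_24.h5"


def pin(url: str, version: str = DATA_VERSION) -> str:
    """Return url with exactly one @version suffix, replacing any existing pin."""
    base, _, _ = url.partition("@")
    return f"{base}@{version}"


def is_pinned(url: str) -> bool:
    """True when the URL carries a non-empty @version suffix."""
    _, sep, version = url.partition("@")
    return bool(sep and version)


DEFAULT_DATASET_URL = pin(BASE_URL)
```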
