🐛[BUG]: ARCO (and WB2) data source re-downloads the same Zarr chunk for multi-level pressure variables #778

@negin513

Description

Version

0.13.0 -- c2d99b6

On which installation method(s) does this occur?

source

Describe the issue

ARCO ERA5 stores pressure-level variables as 4D arrays (time, level, lat, lon) and chunks them per timestep, e.g. (1, 37, 721, 1440).
So, all 37 pressure levels for a given timestep and variable are in the same Zarr chunk.

When fetching multiple levels from the same variable (for example, z500, z1000, z250, z300, z700 from the geopotential variable), each requested level triggers an independent download of the same underlying source chunk.

Because all of these level requests map to the exact same zarr chunk, the concurrent asyncio.gather in ARCO.fetch() can race past the cache check before any task finishes writing that chunk to disk, so the same full-pressure-level chunk is re-downloaded once per requested level variable.
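A minimal, self-contained sketch of this race (using a toy in-memory cache and a fake downloader, not the actual earth2studio code) shows why every concurrent task re-downloads: each coroutine passes the cache check before any of them reaches the write.

```python
import asyncio

downloads = 0  # counts how many times the "chunk" is actually fetched
cache: dict[str, bytes] = {}  # toy stand-in for the on-disk cache

async def fetch_chunk(path: str) -> bytes:
    global downloads
    if path in cache:  # cache check happens before any task has written
        return cache[path]
    downloads += 1  # so every task falls through to the download path
    await asyncio.sleep(0.01)  # simulated network transfer
    cache[path] = b"chunk-bytes"
    return cache[path]

async def main() -> None:
    # Five pressure levels of one variable map to the same underlying chunk
    await asyncio.gather(
        *(fetch_chunk("geopotential/chunk-0") for _ in range(5))
    )

asyncio.run(main())
print(downloads)  # 5: one download per requested level, not 1
```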

Steps to reproduce

Clear the cache and run with verbose=True, you'll see the same chunk path downloaded multiple times in the logs:

import torch
from earth2studio.data import ARCO, fetch_data
from earth2studio.models.px import GraphCastOperational
from earth2studio.utils.time import to_time_array

package = GraphCastOperational.load_default_package()
model = GraphCastOperational.load_model(package)

data = ARCO(cache=True, verbose=True)
prognostic_ic = model.input_coords()

x, coords = fetch_data(
    source=data,
    time=to_time_array(["2021-06-15T00:00:00"]),
    variable=prognostic_ic["variable"],
    lead_time=prognostic_ic["lead_time"],
    device=torch.device("cuda"),
)

In the logs, you'll see the same chunk path copied to cache multiple times. Variables like geopotential and temperature are downloaded 13 times each, once per pressure level required to initialize the model:

.... DEBUG    | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache
.... DEBUG    | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache
.... DEBUG    | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache

Copying .../temperature/1064634.0.0.0 to local cache...
....
....

Surface variables (t2m, u10m, v10m, msl) download correctly, once each.

Expected behavior

Each unique Zarr chunk should be downloaded only once, regardless of how many pressure levels are requested from it. In addition, the cache should be re-checked during the download phase, so that concurrent requests for the same chunk resolve as cache hits.

Suggested fix

Option 1: Add a per-path asyncio.Lock in AsyncCachingFileSystem._cat_file so that only one coroutine downloads a given chunk while the others wait:

# In AsyncCachingFileSystem.__init__
self._download_locks: dict[str, asyncio.Lock] = {}

async def _cat_file(self, path, *args, **kwargs):
    # setdefault avoids a check-then-set gap; safe on a single event loop
    lock = self._download_locks.setdefault(path, asyncio.Lock())

    async with lock:
        # Re-check the cache under the lock, then download only on a miss
        ...
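As a self-contained illustration of the effect (again with a toy cache and downloader, not the real AsyncCachingFileSystem), the per-path lock collapses five concurrent requests for the same chunk into a single download:

```python
import asyncio

downloads = 0
cache: dict[str, bytes] = {}
locks: dict[str, asyncio.Lock] = {}  # one lock per chunk path

async def fetch_chunk(path: str) -> bytes:
    global downloads
    # No await between lookup and insert, so setdefault is race-free here
    lock = locks.setdefault(path, asyncio.Lock())
    async with lock:
        if path in cache:  # re-check under the lock: later tasks hit the cache
            return cache[path]
        downloads += 1
        await asyncio.sleep(0.01)  # simulated network transfer
        cache[path] = b"chunk-bytes"
        return cache[path]

async def main() -> None:
    await asyncio.gather(
        *(fetch_chunk("geopotential/chunk-0") for _ in range(5))
    )

asyncio.run(main())
print(downloads)  # 1: the other four tasks wait, then hit the cache
```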

Option 2: Deduplicate at the ARCO.fetch level by grouping level variables that map to the same underlying chunk (full pressure variable) before dispatching tasks.
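Option 2 could look roughly like the sketch below. The short-name-to-source mapping and the helper name are hypothetical (the real mapping lives in earth2studio's lexicon); the point is only to show grouping level variables by their underlying 4D source variable before dispatching tasks.

```python
import re
from collections import defaultdict

def group_by_source_variable(variables: list[str]) -> dict[str, list[int]]:
    """Group level variables like 'z500' by their underlying 4D source
    variable, so one task per source chunk is dispatched instead of one
    per level. Illustrative mapping only, not earth2studio's lexicon."""
    short_to_source = {"z": "geopotential", "t": "temperature"}  # assumed subset
    groups: dict[str, list[int]] = defaultdict(list)
    for v in variables:
        m = re.fullmatch(r"([a-z]+)(\d+)", v)
        if m and m.group(1) in short_to_source:
            # Pressure-level variable: collect its levels under one source key
            groups[short_to_source[m.group(1)]].append(int(m.group(2)))
        else:
            # Surface variable (e.g. 'u10m'): no level dimension to merge
            groups[v] = []
    return dict(groups)

print(group_by_source_variable(["z500", "z1000", "t850", "u10m"]))
# {'geopotential': [500, 1000], 'temperature': [850], 'u10m': []}
```

With this grouping, ARCO.fetch would issue one download task per source variable and slice out the requested levels after the chunk is local.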

Option 1 is more general, easier to implement, and benefits all data sources that use AsyncCachingFileSystem. Implementing either will significantly speed up data downloads and full inference workflows. Please let me know your thoughts.

cc: @djgagne and @kafitzgerald
