🐛[BUG]: ARCO (and WB2) data source re-downloads the same Zarr chunk for multi-level pressure variables #778
Description
Version
0.13.0 -- c2d99b6
On which installation method(s) does this occur?
source
Describe the issue
ARCO ERA5 stores pressure-level variables as 4D arrays (time, level, lat, lon) and chunks them per timestep, e.g. (1, 37, 721, 1440).
So all 37 pressure levels for a given timestep and variable live in the same Zarr chunk.
When fetching multiple levels from the same variable (for example, z500, z1000, z250, z300, z700 from the geopotential variable), each requested level triggers an independent download of the same underlying source chunk.
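The level-to-chunk mapping can be illustrated with a small standalone sketch (the level indices below are illustrative, not the actual ERA5 indices for those levels): a Zarr chunk key is each array index floor-divided by the chunk size along that axis, so any level index in 0..36 collapses to chunk coordinate 0 on the level axis.

```python
# Why every pressure level lands in the same Zarr chunk: with chunk shape
# (1, 37, 721, 1440), the chunk key is each index // chunk size per axis.
chunks = (1, 37, 721, 1440)  # (time, level, lat, lon)

def chunk_key(index: tuple[int, ...]) -> str:
    # Zarr v2-style dot-separated chunk key
    return ".".join(str(i // c) for i, c in zip(index, chunks))

time_idx = 1064634  # the timestep index seen in the logs below
# Five different (illustrative) level indices all map to the same chunk key
keys = {chunk_key((time_idx, level, 0, 0)) for level in (7, 9, 14, 18, 30)}
print(keys)  # {'1064634.0.0.0'}
```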
Because all of these level requests map to the exact same Zarr chunk, the concurrent asyncio.gather in ARCO.fetch() can race past the cache check before any task finishes writing that chunk to disk, so the same full-pressure-level chunk is re-downloaded once per requested level variable.
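The race can be reproduced in isolation with a minimal simulation (hypothetical names, not the real earth2studio internals): every coroutine passes the cache check before the first download completes, so the "chunk" is fetched once per request.

```python
# Minimal simulation of the check-then-download race under asyncio.gather
import asyncio

cache: dict[str, bytes] = {}  # stands in for the on-disk cache
downloads = 0                 # how many times the chunk was actually fetched

async def fetch_chunk(path: str) -> bytes:
    global downloads
    if path not in cache:          # cache check passes for everyone...
        await asyncio.sleep(0.01)  # ...because the download takes time,
        downloads += 1             # so each waiting coroutine downloads
        cache[path] = b"chunk-bytes"
    return cache[path]

async def main() -> None:
    # five pressure levels, all mapping to the same source chunk
    await asyncio.gather(
        *(fetch_chunk("geopotential/1064634.0.0.0") for _ in range(5))
    )

asyncio.run(main())
print(downloads)  # 5: one download per requested level
```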
Steps to reproduce
Clear the cache and run with verbose=True; you'll see the same chunk path downloaded multiple times in the logs:
```python
import torch

from earth2studio.data import ARCO, fetch_data
from earth2studio.models.px import GraphCastOperational
from earth2studio.utils.time import to_time_array

package = GraphCastOperational.load_default_package()
model = GraphCastOperational.load_model(package)
data = ARCO(cache=True, verbose=True)

prognostic_ic = model.input_coords()
x, coords = fetch_data(
    source=data,
    time=to_time_array(["2021-06-15T00:00:00"]),
    variable=prognostic_ic["variable"],
    lead_time=prognostic_ic["lead_time"],
    device=torch.device("cuda"),
)
```

In the logs, you'll see the same chunk path copied to the cache multiple times. Variables like geopotential and temperature are downloaded 13× (once per pressure level required to initialize the model):
```
.... DEBUG | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache
.... DEBUG | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache
.... DEBUG | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache
Copying .../temperature/1064634.0.0.0 to local cache...
....
....
```
Surface variables (t2m, u10m, v10m, msl) are downloaded correctly, only once each.
Expected behavior
Each unique Zarr chunk should be downloaded only once, regardless of how many pressure levels are requested from it. The cache should also be re-checked during the download phase, so that concurrent requests for the same chunk register as cache hits once the first download completes.
Suggested fix
Option 1: Add a per-path asyncio.Lock in AsyncCachingFileSystem._cat_file so only one coroutine downloads a given chunk while the others wait:
```python
# In AsyncCachingFileSystem
self._download_locks: dict[str, asyncio.Lock] = {}

async def _cat_file(self, path, ...):
    if path not in self._download_locks:
        self._download_locks[path] = asyncio.Lock()
    async with self._download_locks[path]:
        # Re-check cache, then download if needed
        ...
```

Option 2: Deduplicate at the ARCO.fetch level by grouping level variables that map to the same underlying chunk (i.e. the full pressure variable) before dispatching tasks.
Option 1 is more general, easier to implement, and benefits all data sources using AsyncCachingFileSystem. Either fix would significantly speed up data downloads and full inference workflows. Please let me know your thoughts.
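For reference, here is Option 1 as a standalone simulation (same hypothetical names as the race sketch above, not the real earth2studio code): a per-path asyncio.Lock serializes access, so the first coroutine downloads the chunk and the rest see the cached copy on the re-check.

```python
# Per-path lock: only one coroutine downloads a given chunk at a time
import asyncio
from collections import defaultdict

cache: dict[str, bytes] = {}
downloads = 0
# defaultdict lazily creates one asyncio.Lock per chunk path
locks: defaultdict[str, asyncio.Lock] = defaultdict(asyncio.Lock)

async def fetch_chunk(path: str) -> bytes:
    global downloads
    async with locks[path]:            # serialize access per path
        if path not in cache:          # re-check the cache under the lock
            await asyncio.sleep(0.01)  # simulated network transfer
            downloads += 1
            cache[path] = b"chunk-bytes"
    return cache[path]

async def main() -> None:
    # same five concurrent requests as before, now deduplicated
    await asyncio.gather(
        *(fetch_chunk("geopotential/1064634.0.0.0") for _ in range(5))
    )

asyncio.run(main())
print(downloads)  # 1: the chunk is downloaded exactly once
```

The essential detail is re-checking the cache *inside* the lock: coroutines that were queued behind the first download must observe its result rather than repeat it.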
cc: @djgagne and @kafitzgerald