🐛[BUG]: ARCO (and WB2) data source re-downloads the same Zarr chunk for multi-level pressure variables #778

@negin513

Description

Version

0.13.0 -- c2d99b6

On which installation method(s) does this occur?

source

Describe the issue

ARCO ERA5 stores pressure-level variables as 4D arrays (time, level, lat, lon) and chunks them per timestep, e.g. (1, 37, 721, 1440).
So, all 37 pressure levels for a given timestep and variable are in the same Zarr chunk.

When fetching multiple levels from the same variable (for example, z500, z1000, z250, z300, z700 from the geopotential variable), each requested level triggers an independent download of the same underlying source chunk.

Because all of these level requests map to the exact same zarr chunk, the concurrent asyncio.gather in ARCO.fetch() can race past the cache check before any task finishes writing that chunk to disk, so the same full-pressure-level chunk is re-downloaded once per requested level variable.
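A minimal, self-contained sketch of this race (using a toy in-memory cache and a fake downloader, not the actual earth2studio code) shows why every concurrent task re-downloads: each coroutine passes the cache check before any of them reaches the write.

```python
import asyncio

downloads = 0  # counts how many times the "chunk" is actually fetched
cache: dict[str, bytes] = {}  # toy stand-in for the on-disk cache

async def fetch_chunk(path: str) -> bytes:
    global downloads
    if path in cache:  # cache check happens before any task has written
        return cache[path]
    downloads += 1  # so every task falls through to the download path
    await asyncio.sleep(0.01)  # simulated network transfer
    cache[path] = b"chunk-bytes"
    return cache[path]

async def main() -> None:
    # Five pressure levels of one variable map to the same underlying chunk
    await asyncio.gather(
        *(fetch_chunk("geopotential/chunk-0") for _ in range(5))
    )

asyncio.run(main())
print(downloads)  # 5: one download per requested level, not 1
```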

Steps to reproduce

Clear the cache and run with verbose=True, you'll see the same chunk path downloaded multiple times in the logs:

import torch
from earth2studio.data import ARCO, fetch_data
from earth2studio.models.px import GraphCastOperational
from earth2studio.utils.time import to_time_array

package = GraphCastOperational.load_default_package()
model = GraphCastOperational.load_model(package)

data = ARCO(cache=True, verbose=True)
prognostic_ic = model.input_coords()

x, coords = fetch_data(
    source=data,
    time=to_time_array(["2021-06-15T00:00:00"]),
    variable=prognostic_ic["variable"],
    lead_time=prognostic_ic["lead_time"],
    device=torch.device("cuda"),
)

In the logs, you'll see the same chunk path copied to cache multiple times. Variables like geopotential and temperature are downloaded 13 times each, once per pressure level required to initialize the model:

.... DEBUG    | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache
.... DEBUG    | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache
.... DEBUG    | earth2studio.data.utils:_make_local_details:774 - Copying /gcp-public-data-arco-era5/ar/full_37-1h-0p25deg-chunk-1.zarr-v3/geopotential/1064634.0.0.0 to local cache

Copying .../temperature/1064634.0.0.0 to local cache...
....
....

Surface variables (t2m, u10m, v10m, msl) download correctly, once each.

Expected behavior

Each unique Zarr chunk should be downloaded only once, regardless of how many pressure levels are requested from it. In addition, the cache should be re-checked during the download phase, so that concurrent requests for the same chunk resolve as cache hits.

Suggested fix

Option 1: Add a per-path asyncio.Lock in AsyncCachingFileSystem._cat_file so that only one coroutine downloads a given chunk while the others wait:

# In AsyncCachingFileSystem.__init__
self._download_locks: dict[str, asyncio.Lock] = {}

async def _cat_file(self, path, *args, **kwargs):
    # setdefault avoids a check-then-set gap; safe on a single event loop
    lock = self._download_locks.setdefault(path, asyncio.Lock())

    async with lock:
        # Re-check the cache under the lock, then download only on a miss
        ...
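As a self-contained illustration of the effect (again with a toy cache and downloader, not the real AsyncCachingFileSystem), the per-path lock collapses five concurrent requests for the same chunk into a single download:

```python
import asyncio

downloads = 0
cache: dict[str, bytes] = {}
locks: dict[str, asyncio.Lock] = {}  # one lock per chunk path

async def fetch_chunk(path: str) -> bytes:
    global downloads
    # No await between lookup and insert, so setdefault is race-free here
    lock = locks.setdefault(path, asyncio.Lock())
    async with lock:
        if path in cache:  # re-check under the lock: later tasks hit the cache
            return cache[path]
        downloads += 1
        await asyncio.sleep(0.01)  # simulated network transfer
        cache[path] = b"chunk-bytes"
        return cache[path]

async def main() -> None:
    await asyncio.gather(
        *(fetch_chunk("geopotential/chunk-0") for _ in range(5))
    )

asyncio.run(main())
print(downloads)  # 1: the other four tasks wait, then hit the cache
```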

Option 2: Deduplicate at the ARCO.fetch level by grouping level variables that map to the same underlying chunk (full pressure variable) before dispatching tasks.
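Option 2 could look roughly like the sketch below. The short-name-to-source mapping and the helper name are hypothetical (the real mapping lives in earth2studio's lexicon); the point is only to show grouping level variables by their underlying 4D source variable before dispatching tasks.

```python
import re
from collections import defaultdict

def group_by_source_variable(variables: list[str]) -> dict[str, list[int]]:
    """Group level variables like 'z500' by their underlying 4D source
    variable, so one task per source chunk is dispatched instead of one
    per level. Illustrative mapping only, not earth2studio's lexicon."""
    short_to_source = {"z": "geopotential", "t": "temperature"}  # assumed subset
    groups: dict[str, list[int]] = defaultdict(list)
    for v in variables:
        m = re.fullmatch(r"([a-z]+)(\d+)", v)
        if m and m.group(1) in short_to_source:
            # Pressure-level variable: collect its levels under one source key
            groups[short_to_source[m.group(1)]].append(int(m.group(2)))
        else:
            # Surface variable (e.g. 'u10m'): no level dimension to merge
            groups[v] = []
    return dict(groups)

print(group_by_source_variable(["z500", "z1000", "t850", "u10m"]))
# {'geopotential': [500, 1000], 'temperature': [850], 'u10m': []}
```

With this grouping, ARCO.fetch would issue one download task per source variable and slice out the requested levels after the chunk is local.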

Option 1 is more general, easier to implement, and benefits all data sources that use AsyncCachingFileSystem. Implementing either will significantly speed up data downloads and full inference workflows. Please let me know your thoughts.

cc: @djgagne and @kafitzgerald
