
Fetch Argo data massively #565

Draft
gmaze wants to merge 12 commits into master from naive-arco

Conversation

@gmaze (Member) commented Dec 19, 2025

This PR implements a naive approach to fetch core Argo data massively, i.e. to fetch a very large selection of measurements as fast as possible.

This class should work as long as the data fit in memory, so keep an eye on your RAM usage.

For reasonable performance, this is intended to be used with a local copy of the GDAC.

This class uses a distributed.client.Client to scale Argo data fetching over a large collection of floats.

This class returns data in the argopy "Standard" user mode (a rough xarray sketch of these filters follows the list), i.e. with:

  • QC=[1,5,8] on position and date
  • QC=1 on pres/temp/psal
  • Real-time, adjusted or delayed values, selected according to data_mode
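
For reference, this "Standard" mode filtering is roughly equivalent to the following point-wise xarray operations. This is a minimal sketch, not the actual MassFetcher implementation: it assumes the usual Argo variable names (POSITION_QC, TIME_QC, PRES_QC, TEMP_QC, PSAL_QC) and integer QC flags.

        import xarray as xr

        def standard_mode_sketch(ds: xr.Dataset) -> xr.Dataset:
            # Illustrative point-wise filtering, not the actual MassFetcher code.
            # Keep points with QC in [1, 5, 8] on position and date:
            good_loc = ds['POSITION_QC'].isin([1, 5, 8]) & ds['TIME_QC'].isin([1, 5, 8])
            # Keep points with QC=1 on pressure, temperature and salinity:
            good_val = (ds['PRES_QC'] == 1) & (ds['TEMP_QC'] == 1) & (ds['PSAL_QC'] == 1)
            # Picking real-time vs adjusted/delayed values according to DATA_MODE
            # (<PARAM> vs <PARAM>_ADJUSTED) is omitted here for brevity.
            return ds.where(good_loc & good_val, drop=True)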

Note
One month of global Argo data in standard user mode is about 45 MB on disk as a zarr archive, so there is a lot of room for improvement in producing ARCO (Analysis-Ready and Cloud-Optimized) Argo data with such an approach.

Installation to play with this new feature

You need to install argopy from this branch.
This can be done like this:

conda create -n argopy-arco python=3.11
conda activate argopy-arco
conda install -c conda-forge aiohttp=3.12.14 decorator=5.2.1 erddapy=2.2.4 fsspec=2025.5.1 h5netcdf=1.6.3 netCDF4=1.7.2 packaging=25.0 requests=2.32.4 scipy=1.16.0 toolz=1.0.0 xarray=2025.7.1 gsw=3.6.19 pyco2sys=1.8.3 tqdm=4.67.1 boto3=1.38.27 jsonschema=4.25.1 kerchunk=0.2.8 numcodecs=0.16.1 s3fs=2025.5.1 zarr=3.0.10 dask=2025.5.1 distributed=2025.5.1 flox=0.10.8 joblib=1.5.2 numba=0.62.1 pyarrow=20.0.0 coiled=1.120.0 IPython=9.4.0 cartopy=0.24.0 matplotlib=3.10.3 pyproj=3.7.1 seaborn=0.13.2 jupyterlab=4.4.7 ipykernel=6.29.5 ipywidgets=8.1.7 aiofiles=24.1.0 black=25.1.0 bottleneck=1.5.0 cfgrib=0.9.15.0 cftime=1.6.4 codespell=2.4.1 flake8=7.3.0 numpy=2.3.1 pandas=2.3.1 pip=25.1.1 pytest=8.4.1 setuptools=80.9.0 -y
pip install git+http://github.com/euroargodev/argopy.git@naive-arco
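
You can then check that argopy is importable from the new environment, e.g.:

python -c "import argopy; print(argopy.__version__)"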

Examples

Fetch and load the global Argo dataset for a given month

        from dask.distributed import Client
        from argopy import ArgoIndex
        from argopy.utils.arco import MassFetcher

        # Create the Dask cluster to work with:
        client = Client(processes=True)
        print(client)

        # Select October 2025 global data:
        # This is about 8_184_913 points (12_287 profiles, 3_578 floats)
        idx = ArgoIndex(index_file='core')
        idx.query.box([-180, 180, -90, 90, '20251001', '20251101'])

        # Fetch and load data in memory:
        # (~6 min on a laptop with 4 cores and 32 GB of RAM)
        dsp = MassFetcher(idx).to_xarray()
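
Once loaded, the result can be inspected like any xarray dataset, for instance:

        # Inspect the result (the N_POINTS dimension name is an assumption,
        # consistent with the zarr archive names used further below):
        print(dsp)
        print(f"In-memory size: {dsp.nbytes / 1e9:.2f} GB")
        print(f"Number of points: {dsp.sizes['N_POINTS']:_}")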

Fetch and save the global Argo dataset for a given month

        from dask.distributed import Client
        from argopy import ArgoIndex
        from argopy.utils.arco import MassFetcher

        # Create the Dask cluster to work with:
        client = Client(processes=True)
        print(client)

        # Select October 2025 global data:
        # This is about 8_184_913 points (12_287 profiles, 3_578 floats)
        idx = ArgoIndex(index_file='core')
        idx.query.box([-180, 180, -90, 90, '20251001', '20251101'])

        # Fetch and save data to zarr:
        # (~6 min on a laptop with 4 cores and 32 GB of RAM)
        zarr_archive = f"zarr/{idx.search_type['BOX'][-2]}_ARGO_POINTS.zarr"
        MassFetcher(idx).to_zarr(zarr_archive)
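
The resulting archive can then be re-opened lazily with plain xarray (standard zarr reading, independent of argopy):

        import xarray as xr

        # Lazily re-open the archive written above:
        ds = xr.open_zarr(zarr_archive)
        print(ds)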

Work with interpolated data

        from dask.distributed import Client
        import numpy as np
        from argopy import ArgoIndex
        from argopy.utils.arco import MassFetcher

        # Create the Dask cluster to work with:
        client = Client(processes=True)
        print(client)

        # Select October 2025 global data:
        # This is about 8_184_913 points (12_287 profiles, 3_578 floats)
        idx = ArgoIndex(index_file='core')
        idx.query.box([-180, 180, -90, 90, '20251001', '20251101'])

        # Define standard pressure levels:
        sdl = np.arange(0, 2005., 5)

        # Fetch, interpolate and load data in memory:
        dsp = MassFetcher(idx, sdl=sdl).to_xarray()

        # Fetch, interpolate and save data to zarr:
        zarr_archive = f"zarr/{idx.search_type['BOX'][-2]}_ARGO_STANDARD.zarr"
        MassFetcher(idx, sdl=sdl).to_zarr(zarr_archive)
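
As an illustration of what the interpolated output enables, here is how one could reduce it to a global mean profile. The TEMP variable name is an assumption based on standard Argo naming, not a confirmed MassFetcher output; the N_PROF dimension matches the concat_dim used further below.

        # Reduce to a global mean temperature profile on the standard levels
        # ('TEMP' over the 'N_PROF' dimension is assumed for illustration):
        mean_profile = dsp['TEMP'].mean(dim='N_PROF')
        print(mean_profile)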

Load time series of global Argo dataset as points

        # Considering the above examples, where monthly data have been saved
        # in several zarr archives, one can load everything back like this:

        import xarray as xr

        bigds = xr.open_mfdataset(["./zarr/20250901_ARGO_POINTS.zarr",
                                   "./zarr/20251001_ARGO_POINTS.zarr",
                                   "./zarr/20251101_ARGO_POINTS.zarr"],
                                   combine='nested', concat_dim='N_POINTS')
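
From there, usual xarray reductions apply lazily, e.g. a per-month point count (the TIME coordinate name is a standard Argo name but an assumption for this output):

        # Count points per calendar month ('TIME' coordinate name assumed):
        counts = bigds['TIME'].groupby('TIME.month').count()
        print(counts.compute())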

Load time series of global Argo dataset as interpolated profiles

        # Considering the above examples, where monthly data have been saved
        # in several zarr archives, one can load everything back like this:

        import xarray as xr

        bigds = xr.open_mfdataset(["./zarr/20250901_ARGO_STANDARD.zarr",
                                   "./zarr/20251001_ARGO_STANDARD.zarr",
                                   "./zarr/20251101_ARGO_STANDARD.zarr"],
                                   combine='nested', concat_dim='N_PROF')
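
And similarly on the profile view, e.g. monthly mean profiles, yielding a (month, vertical level) array (same TEMP/TIME naming assumptions as above):

        # Monthly mean profiles ('TEMP' and 'TIME' names assumed for illustration):
        monthly_profiles = bigds['TEMP'].groupby('TIME.month').mean()
        print(monthly_profiles.compute())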

- fix bug whereby read_files on an index not loaded raised an error
- add domain filter
- add to_zarr method
- fix bug whereby coordinates were wrong when data interpolated on sdl
@gmaze gmaze marked this pull request as draft December 19, 2025 14:53
@gmaze gmaze added performance Should make argopy faster or better designed argo-core About core variables (P, T, S) labels Dec 19, 2025
@gmaze gmaze moved this from Queued to In Progress in Argopy Management Dashboard Dec 19, 2025
gmaze added 8 commits January 3, 2026 23:28
- renamed 'id' to 'ARCOID'
- shapes are saved internally to avoid computation on every call
- shapes is made compatible with a MassFetcher output and the peculiar 'ARCOID' which is used as _dummy_argo_uid
@euroargodev euroargodev deleted a comment from sonarqubecloud bot Jan 25, 2026
@gmaze gmaze moved this from In Progress to Stalled in Argopy Management Dashboard Feb 13, 2026
@gmaze gmaze self-assigned this Feb 13, 2026