
Fetch Argo data massively #565

Draft
gmaze wants to merge 12 commits into master from naive-arco

Conversation

@gmaze (Member) commented Dec 19, 2025

This PR implements a naive approach to fetch core Argo data massively, i.e. to fetch a very large selection of measurements as fast as possible.

This class should work as long as the data fit in memory, so keep an eye on your RAM usage.

For reasonable performance, this is intended to be used with a local copy of the GDAC.

This class uses a distributed.client.Client to scale Argo data fetching over a large collection of floats.

This class returns data in the argopy "Standard" user mode (a rough xarray sketch of these filters follows the list), i.e. with:

  • QC=[1,5,8] on position and date
  • QC=1 on pres/temp/psal
  • Real-time, adjusted or delayed values, selected according to data_mode
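
For reference, this "Standard" mode filtering is roughly equivalent to the following point-wise xarray operations. This is a minimal sketch, not the actual MassFetcher implementation: it assumes the usual Argo variable names (POSITION_QC, TIME_QC, PRES_QC, TEMP_QC, PSAL_QC) and integer QC flags.

        import xarray as xr

        def standard_mode_sketch(ds: xr.Dataset) -> xr.Dataset:
            # Illustrative point-wise filtering, not the actual MassFetcher code.
            # Keep points with QC in [1, 5, 8] on position and date:
            good_loc = ds['POSITION_QC'].isin([1, 5, 8]) & ds['TIME_QC'].isin([1, 5, 8])
            # Keep points with QC=1 on pressure, temperature and salinity:
            good_val = (ds['PRES_QC'] == 1) & (ds['TEMP_QC'] == 1) & (ds['PSAL_QC'] == 1)
            # Picking real-time vs adjusted/delayed values according to DATA_MODE
            # (<PARAM> vs <PARAM>_ADJUSTED) is omitted here for brevity.
            return ds.where(good_loc & good_val, drop=True)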

Note
One month of global Argo data in standard user mode is about 45 MB on disk as a zarr archive, so there is a lot of room for improvement in producing ARCO (Analysis-Ready and Cloud-Optimized) Argo data with such an approach.

Installation to play with this new feature

You need to install argopy from this branch.
This can be done like this:

conda create -n argopy-arco python=3.11
conda activate argopy-arco
conda install -c conda-forge aiohttp=3.12.14 decorator=5.2.1 erddapy=2.2.4 fsspec=2025.5.1 h5netcdf=1.6.3 netCDF4=1.7.2 packaging=25.0 requests=2.32.4 scipy=1.16.0 toolz=1.0.0 xarray=2025.7.1 gsw=3.6.19 pyco2sys=1.8.3 tqdm=4.67.1 boto3=1.38.27 jsonschema=4.25.1 kerchunk=0.2.8 numcodecs=0.16.1 s3fs=2025.5.1 zarr=3.0.10 dask=2025.5.1 distributed=2025.5.1 flox=0.10.8 joblib=1.5.2 numba=0.62.1 pyarrow=20.0.0 coiled=1.120.0 IPython=9.4.0 cartopy=0.24.0 matplotlib=3.10.3 pyproj=3.7.1 seaborn=0.13.2 jupyterlab=4.4.7 ipykernel=6.29.5 ipywidgets=8.1.7 aiofiles=24.1.0 black=25.1.0 bottleneck=1.5.0 cfgrib=0.9.15.0 cftime=1.6.4 codespell=2.4.1 flake8=7.3.0 numpy=2.3.1 pandas=2.3.1 pip=25.1.1 pytest=8.4.1 setuptools=80.9.0 -y
pip install git+http://github.com/euroargodev/argopy.git@naive-arco
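
You can then check that argopy is importable from the new environment, e.g.:

python -c "import argopy; print(argopy.__version__)"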

Examples

Fetch and load the global Argo dataset for a given month

        from dask.distributed import Client
        from argopy import ArgoIndex
        from argopy.utils.arco import MassFetcher

        # Create the Dask cluster to work with:
        client = Client(processes=True)
        print(client)

        # Select October 2025 global data:
        # This is about 8_184_913 points (12_287 profiles, 3_578 floats)
        idx = ArgoIndex(index_file='core')
        idx.query.box([-180, 180, -90, 90, '20251001', '20251101'])

        # Fetch and load data in memory:
        # (~6 min on a laptop with 4 cores and 32 GB of RAM)
        dsp = MassFetcher(idx).to_xarray()
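
Once loaded, the result can be inspected like any xarray dataset, for instance:

        # Inspect the result (the N_POINTS dimension name is an assumption,
        # consistent with the zarr archive names used further below):
        print(dsp)
        print(f"In-memory size: {dsp.nbytes / 1e9:.2f} GB")
        print(f"Number of points: {dsp.sizes['N_POINTS']:_}")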

Fetch and save the global Argo dataset for a given month

        from dask.distributed import Client
        from argopy import ArgoIndex
        from argopy.utils.arco import MassFetcher

        # Create the Dask cluster to work with:
        client = Client(processes=True)
        print(client)

        # Select October 2025 global data:
        # This is about 8_184_913 points (12_287 profiles, 3_578 floats)
        idx = ArgoIndex(index_file='core')
        idx.query.box([-180, 180, -90, 90, '20251001', '20251101'])

        # Fetch and save data to zarr:
        # (~6 min on a laptop with 4 cores and 32 GB of RAM)
        zarr_archive = f"zarr/{idx.search_type['BOX'][-2]}_ARGO_POINTS.zarr"
        MassFetcher(idx).to_zarr(zarr_archive)
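
The resulting archive can then be re-opened lazily with plain xarray (standard zarr reading, independent of argopy):

        import xarray as xr

        # Lazily re-open the archive written above:
        ds = xr.open_zarr(zarr_archive)
        print(ds)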

Work with interpolated data

        from dask.distributed import Client
        import numpy as np
        from argopy import ArgoIndex
        from argopy.utils.arco import MassFetcher

        # Create the Dask cluster to work with:
        client = Client(processes=True)
        print(client)

        # Select October 2025 global data:
        # This is about 8_184_913 points (12_287 profiles, 3_578 floats)
        idx = ArgoIndex(index_file='core')
        idx.query.box([-180, 180, -90, 90, '20251001', '20251101'])

        # Define standard pressure levels:
        sdl = np.arange(0, 2005., 5)

        # Fetch, interpolate and load data in memory:
        dsp = MassFetcher(idx, sdl=sdl).to_xarray()

        # Fetch, interpolate and save data to zarr:
        zarr_archive = f"zarr/{idx.search_type['BOX'][-2]}_ARGO_STANDARD.zarr"
        MassFetcher(idx, sdl=sdl).to_zarr(zarr_archive)
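
As an illustration of what the interpolated output enables, here is how one could reduce it to a global mean profile. The TEMP variable name is an assumption based on standard Argo naming, not a confirmed MassFetcher output; the N_PROF dimension matches the concat_dim used further below.

        # Reduce to a global mean temperature profile on the standard levels
        # ('TEMP' over the 'N_PROF' dimension is assumed for illustration):
        mean_profile = dsp['TEMP'].mean(dim='N_PROF')
        print(mean_profile)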

Load time series of global Argo dataset as points

        # Considering the above examples, where monthly data have been saved
        # in several zarr archives, one can load everything back like this:

        import xarray as xr

        bigds = xr.open_mfdataset(["./zarr/20250901_ARGO_POINTS.zarr",
                                   "./zarr/20251001_ARGO_POINTS.zarr",
                                   "./zarr/20251101_ARGO_POINTS.zarr"],
                                   combine='nested', concat_dim='N_POINTS')
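
From there, usual xarray reductions apply lazily, e.g. a per-month point count (the TIME coordinate name is a standard Argo name but an assumption for this output):

        # Count points per calendar month ('TIME' coordinate name assumed):
        counts = bigds['TIME'].groupby('TIME.month').count()
        print(counts.compute())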

Load time series of global Argo dataset as interpolated profiles

        # Considering the above examples, where monthly data have been saved
        # in several zarr archives, one can load everything back like this:

        import xarray as xr

        bigds = xr.open_mfdataset(["./zarr/20250901_ARGO_STANDARD.zarr",
                                   "./zarr/20251001_ARGO_STANDARD.zarr",
                                   "./zarr/20251101_ARGO_STANDARD.zarr"],
                                   combine='nested', concat_dim='N_PROF')
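
And similarly on the profile view, e.g. monthly mean profiles, yielding a (month, vertical level) array (same TEMP/TIME naming assumptions as above):

        # Monthly mean profiles ('TEMP' and 'TIME' names assumed for illustration):
        monthly_profiles = bigds['TEMP'].groupby('TIME.month').mean()
        print(monthly_profiles.compute())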

- fix bug whereby read_files on an index not loaded raised an error
- add domain filter
- add to_zarr method
- fix bug whereby coordinates were wrong when data interpolated on sdl
@gmaze gmaze marked this pull request as draft December 19, 2025 14:53
@gmaze gmaze added performance Should make argopy faster or better designed argo-core About core variables (P, T, S) labels Dec 19, 2025
@gmaze gmaze moved this from Queued to In Progress in Argopy Management Dashboard Dec 19, 2025
gmaze added 8 commits January 3, 2026 23:28
- renamed 'id' to 'ARCOID'
- shapes are saved internally to avoid computation on every call
- shapes is made compatible with a MassFetcher output and the peculiar 'ARCOID' which is used as _dummy_argo_uid
@euroargodev euroargodev deleted a comment from sonarqubecloud bot Jan 25, 2026
@gmaze gmaze moved this from In Progress to Stalled in Argopy Management Dashboard Feb 13, 2026
@gmaze gmaze self-assigned this Feb 13, 2026