
Update open_virtual_mfdataset to use virtualizarr v2.x. #1074

Merged
betolink merged 13 commits into earthaccess-dev:main from owenlittlejohns:update-to-virtualizarr-2.x
Sep 1, 2025

Conversation

@owenlittlejohns (Collaborator) commented Aug 22, 2025

This PR updates earthaccess.open_virtual_mfdataset and earthaccess.open_virtual_dataset to use VirtualiZarr v2, which requires using obstore instead of fsspec. This addresses #1071.

Manual verification of the new code (for indirect access links):

import earthaccess

auth = earthaccess.login()
results = earthaccess.search_data(count=5, temporal=("2024", "2024"), short_name="MUR-JPL-L4-GLOB-v4.1")
vds = earthaccess.open_virtual_mfdataset(
    results,
    access="indirect",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)
vds
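The `combine_attrs="drop_conflicts"` keyword above is passed through to xarray's combine machinery. As a toy illustration of its semantics only (a simplified sketch, not xarray's implementation):

```python
def drop_conflicts(*attrs_dicts):
    """Merge attribute dicts, keeping only keys whose values agree everywhere.

    Toy sketch of what combine_attrs="drop_conflicts" means; xarray's real
    implementation lives in its combine/merge machinery.
    """
    merged = {}
    dropped = set()
    for attrs in attrs_dicts:
        for key, value in attrs.items():
            if key in dropped:
                continue
            if key in merged and merged[key] != value:
                # Conflicting values across granules: drop the key entirely.
                del merged[key]
                dropped.add(key)
            else:
                merged[key] = value
    return merged

# Per-granule attrs that differ (e.g. per-file creation dates) are dropped;
# attrs that agree across all files are kept.
print(drop_conflicts(
    {"title": "MUR SST", "date_created": "2024-01-01"},
    {"title": "MUR SST", "date_created": "2024-01-02"},
))  # → {'title': 'MUR SST'}
```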
Pull Request (PR) draft checklist - click to expand
  • Please review our
    contributing documentation
    before getting started.
  • Populate a descriptive title. For example, instead of "Updated README.md", use a
    title such as "Add testing details to the contributor section of the README".
    Example PRs: #763
  • Populate the body of the pull request with:
  • Update CHANGELOG.md with details about your change in a section titled
    ## Unreleased. If such a section does not exist, please create one. Follow
    Common Changelog for your additions.
    Example PRs: #763
  • Update the documentation and/or the README.md with details of changes to the
    earthaccess interface, if any. Consider new environment variables, function names,
    decorators, etc.

Click the "Ready for review" button at the bottom of the "Conversation" tab in GitHub
once these requirements are fulfilled. Don't worry if you see any test failures in
GitHub at this point!

Pull Request (PR) merge checklist - click to expand

Please do your best to complete these requirements! If you need help with any of these
requirements, you can ping the @nsidc/earthaccess-support team in a comment and we
will help you out!

  • Add unit tests for any new features.
  • Apply formatting and linting autofixes. You can add a GitHub comment in this Pull
    Request containing "pre-commit.ci autofix" to automate this.
  • Ensure all automated PR checks (seen at the bottom of the "conversation" tab) pass.
  • Get at least one approving review.

📚 Documentation preview 📚: https://earthaccess--1074.org.readthedocs.build/en/1074/

@github-actions bot commented Aug 22, 2025

Binder 👈 Launch a binder notebook on this branch for commit 07e9af0

I will automatically update this comment whenever this PR is modified


@owenlittlejohns (Collaborator, Author)

@betolink - here's the draft PR. I still need to do some more checking of the docstrings, and definitely need to look at unit test coverage. But the code is here now so you can take a look at the changes while I do that.

@owenlittlejohns (Collaborator, Author) commented Aug 22, 2025

I'm also working through a minor tweak with the S3Store. I'll be pushing up changes for that hopefully very soon!

Update: Added in 003ab19

@owenlittlejohns owenlittlejohns marked this pull request as ready for review August 27, 2025 15:43
@owenlittlejohns (Collaborator, Author)

I'll look into the unit tests failing now - looks like maybe some package incompatibilities?

Comment on lines +223 to +224
if credentials_endpoint is None:
raise ValueError("The collection did not provide an S3CredentialsAPIEndpoint")
@owenlittlejohns (Collaborator, Author)

I wasn't sure of the best exception to raise here, but looking elsewhere in the code, UMM records that lack needed fields lead to ValueErrors in other places, so this seemed like a good first attempt.

@owenlittlejohns (Collaborator, Author)

Okay - apologies for the several commits with failing tests. I think the only things failing now are integration tests unrelated to this PR.

@betolink (Contributor)

I think the only things now failing are integration tests unrelated to this PR.

It's not this PR directly... I forgot that this obstore integration comes with Zarr v3 (fsspec, kerchunk, etc. now need it with xarray). I think we should merge #967 first, although it will reintroduce some code that you already removed. If we do, Zarr v3 should work and the integration tests should pass.

@weiji14, do you think this PR could be merged soon-ish? (crossing fingers 😆)

@betolink (Contributor)

I wonder if there is a compatibility matrix for packages affected by this Zarr V2 to V3 migration. 🤔

@weiji14 (Contributor) commented Aug 28, 2025

@weiji14, do you think this PR could be merged soon-ish? (crossing fingers 😆)

Haven't got time to review as I'm a little occupied this week attending a conference, plus a huge backlog of stuff from too many weeks of travel+conferences, so... go ahead if it looks ok to you! Happy for my PR at #967 to be superseded by this PR if it works, I haven't had a chance to dive into how virtualizarr v2 works yet.

@betolink (Contributor)

Thanks for the heads up @weiji14! What do you think, @owenlittlejohns? Maybe we can try just updating the deps for kerchunk and fsspec in this PR. From Wei Ji's PR:

"fsspec >=2024.12.0",
"kerchunk >=0.2.8",

and he's using an upper bound of <4.0 for zarr.

@betolink (Contributor)

ok I'll test now if this works...

@betolink (Contributor) commented Aug 28, 2025

Ok, I think the best thing we can do for now is to skip the kerchunk integration test; that is the code that first automated the creation of virtual stores using kerchunk directly. Now that virtualizarr is a thing, we need to re-implement the earthaccess.consolidate_metadata() method to use virtualizarr's own kerchunk parser. I think there is value in knowing how to open these stores using fsspec, as it is still used in the majority of Pangeo-related projects.

I opened this on the Kerchunk repo: fsspec/kerchunk#574

@betolink (Contributor)

The PR looks great! Just a few thoughts:

  • We are assuming the DMR++ is next to the data, and this may change; it would potentially require mounting different buckets, since the DMR++ files may be in a bucket with different permissions.
  • This is a somewhat unreliable access pattern and it depends on checking some boxes; we need to create a tutorial on the requirements to virtualize data from NASA, something @danielfromearth and @DeanHenze started doing already.
  • There are potential improvements to this access pattern that may be worth working on with the OPeNDAP team, like simplifying and serializing the DMR++ using Parquet, e.g. the dimensions and groups would be columns, and when we need to load data from them we won't need to parse the whole file. We also talked about including bboxes at the chunk level to implement better subsetting when we load data into xarray. cc @Mikejmnez
  • Right now we have a consolidate_metadata(granules) method that uses kerchunk directly; we'll have to migrate it to use virtualizarr's own KerchunkParser class, but that's for the next release.
  • When a provider generates a consolidated virtual store, it usually ends up in a different location from the data, so we may also need to create a tutorial on how to mount these consolidated stores, e.g. the steps from Danny's notebook should be in earthaccess: https://github.com/nasa/ASDC_Data_and_User_Services/blob/prefire-virtually/PREFIRE/additional_drafts/read-references-for-PREFIRE-from-s3.ipynb. This overlaps with Dean's PR to load Parquet references from CMR: Add get_virtual_reference(), in testing #964
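The Parquet idea in the third bullet can be pictured as flattening the per-variable chunk manifest into columnar records, so that fetching references for one variable becomes a row/column filter rather than a full-document parse. A minimal stdlib-only sketch (all field names and values here are hypothetical, not a real DMR++ schema):

```python
# Hypothetical columnar chunk manifest, as it might look after flattening a
# DMR++ document: one record per chunk, with group/variable as filterable columns.
manifest_rows = [
    {"group": "/", "variable": "analysed_sst", "chunk": "0.0",
     "path": "s3://bucket/f1.nc", "offset": 4096, "length": 1048576},
    {"group": "/", "variable": "analysed_sst", "chunk": "0.1",
     "path": "s3://bucket/f1.nc", "offset": 1052672, "length": 1048576},
    {"group": "/", "variable": "lat", "chunk": "0",
     "path": "s3://bucket/f1.nc", "offset": 512, "length": 2048},
]

def references_for(rows, variable):
    """Select the byte ranges for one variable without parsing anything else."""
    return [(r["path"], r["offset"], r["length"])
            for r in rows if r["variable"] == variable]

# Loading only "lat" touches one record, not the whole manifest.
print(references_for(manifest_rows, "lat"))  # → [('s3://bucket/f1.nc', 512, 2048)]
```

In an actual Parquet file, predicate pushdown on the `variable` column would let a reader skip the row groups for every other variable entirely.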

Can you guys take a look as well?
@chuckwondo @jhkennedy

@betolink betolink requested a review from jhkennedy August 28, 2025 17:14
@betolink (Contributor) left a review comment

The only thing I might add is that we may need to include lithops (even though we are not using it yet) as a parallel executor, but other than that this PR is amazing!

@owenlittlejohns (Collaborator, Author) commented Aug 28, 2025

Thanks for the heads up @weiji14! what do you think @owenlittlejohns? maybe we can try just updating the deps for kerchunk and fsspec on this PR. From Wei Ji's PR:

"fsspec >=2024.12.0",
"kerchunk >=0.2.8",

and he's using the upper bound for zarr to <4.0

@betolink - just to check (given the later comments here): did you still think we should bump the dependencies, too, or hold off?

@betolink (Contributor)

@betolink - just to check (given the later comments #1074 (comment)): did you still think we should bump the dependencies, too, or hold off?

I went ahead and updated the dependencies, now everything is working as expected for Zarr V3.

@betolink (Contributor) left a review comment
Approving this and will merge before Tuesday unless we hear otherwise from @chuckwondo or @jhkennedy

@betolink betolink merged commit 249c99d into earthaccess-dev:main Sep 1, 2025
13 of 14 checks passed
@betolink (Contributor) commented Sep 2, 2025

Ok, I think I missed a critical thing when I merged this: it looks like indexing on in-memory virtual datasets is still not supported, and we still need a round trip to kerchunk/icechunk. I tried using it with sel() and isel() and got exceptions in both cases. I don't know if load_variables is the only factor here; if I'm reading this correctly there is still no workaround, and Tom even mentioned this is a misunderstanding of the library in zarr-developers/VirtualiZarr#647. IMO, using indexes on an in-memory virtualized dataset is a killer feature! If we know what we are doing and are not loading petabytes of data, the worst that can happen is the same thing that would happen if we tried to do it directly with xarray...

We can try to re-implement the round-trip serialization and offer the "icechunk" option if the library is installed. I'm not sure this will be ready for tomorrow, when we wanted to do a release during the hacking hour; I will CC people in the morning.
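The round trip described here is essentially "persist the in-memory references to a concrete store format, then reopen that store with a reader that supports normal lazy indexing." A shape-only sketch with the stdlib (the kerchunk-style reference dict below is illustrative, not real data):

```python
import json
import os
import tempfile

# Illustrative kerchunk-style reference set: structure only, not real offsets.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "sst/0.0": ["s3://bucket/f1.nc", 4096, 1048576],
    },
}

# Step 1: serialize the virtual dataset's references to disk.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(refs, f)
    path = f.name

# Step 2: reload them. In practice a reader such as fsspec's "reference://"
# filesystem (or an icechunk store) would open this file, and the resulting
# dataset supports sel()/isel() like any other lazily-indexed xarray dataset.
with open(path) as f:
    reloaded = json.load(f)
os.remove(path)

assert reloaded == refs
```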

@TomNicholas commented Sep 5, 2025

indexing on virtual datasets in memory

What do you mean exactly? Loading virtual variables? Using .sel along coordinates with loadable variables? Something else?

@betolink (Contributor) commented Sep 5, 2025

Using .sel along coordinates with loadable variables? Something else?

Yes, exactly that, but with the DMR++ parser; it may be that the DMR++ parser itself is the culprit here. Say we load 2 files with:

vds = vz.open_virtual_mfdataset(
    urls=granule_dmrpp_urls,
    registry=obstore_registry,
    parser=DMRPPParser(group=group),
    preprocess=preprocess,
    parallel=parallel,
    combine="nested",
    **xr_combine_nested_kwargs,
)

Then if we try to do an isel() or sel(), we get an exception:

NotImplementedError: Unsupported indexer. Indexing within a ManifestArray using ints or slices is not yet supported (see GitHub issue #51), but received 0

With the explicit serialization and reloading this went away, but maybe it's unnecessary if there is a more explicit fix. @TomNicholas

@TomNicholas

Can you help me understand the context of why you're trying to index into a virtual dataset? Indexing into a ManifestArray is inherently very limited because one cannot slice inside of a compressed chunk, which is why only indexing no-ops are supported right now.

We could add support for some other limited types of indexing (i.e. chunk-aligned indexing, or arbitrary indexing into uncompressed data), but why do you even want to do this?
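Tom's point can be illustrated with a toy array-like class: an array of chunk references can honestly answer "give me everything" (a no-op), but not "give me element 0", because that would require slicing inside a compressed chunk. This is a toy sketch, not VirtualiZarr's actual ManifestArray:

```python
class ToyManifestArray:
    """Toy stand-in for an array of chunk references (not VirtualiZarr's class)."""

    def __init__(self, shape):
        self.shape = shape

    def __getitem__(self, key):
        # Normalize a single indexer into a tuple of per-axis indexers.
        if not isinstance(key, tuple):
            key = (key,)
        for indexer in key:
            # Only "no-op" indexers are allowed: Ellipsis or a full slice.
            if indexer is Ellipsis:
                continue
            if isinstance(indexer, slice) and indexer == slice(None):
                continue
            raise NotImplementedError(
                "Indexing within a manifest using ints or slices is not "
                f"supported, but received {indexer!r}"
            )
        return self

arr = ToyManifestArray((10, 20))
arr[...]        # fine: selects everything, no chunk is touched
arr[:, :]       # fine: also a no-op
try:
    arr[0]      # this is roughly what vds.isel(time=0) boils down to
except NotImplementedError as e:
    print(e)
```

Chunk-aligned indexing (dropping whole chunks from the manifest) would be expressible in this model; sub-chunk indexing would not, without reading and decompressing the referenced bytes.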

@betolink (Contributor) commented Sep 8, 2025

@TomNicholas one of the cool things about virtualized datasets is that we could lazy load them into memory and then work on a subset of the logical cube, either to compute something on that slice or to persist it as a virtualized store for later use.

A concrete case is to replicate what @bilts did here: https://github.com/nasa/zarr-eosdis-store/blob/main/presentation/example.ipynb Now imagine this for all files along the time dimension: we could generate a virtual datacube just for the Great Lakes, for example.

I'm not sure, however, if this is a VirtualiZarr bug; when I tried it with the native HDF5 parser I'm almost sure it worked fine (I need to double-check that I'm using the latest version, etc.):

vds = vz.open_virtual_mfdataset(
    urls=granule_dmrpp_urls,
    registry=obstore_registry,
    parser=HDFParser(group=group),
    parallel=parallel,
    combine="nested",
    **xr_combine_nested_kwargs,
)
subsetted_ds = vds.isel(time=0).sel(
    {
        "longitude": slice(lon_bounds[0], lon_bounds[1]),
        "latitude": slice(lat_bounds[0], lat_bounds[1]),
    }
)

@TomNicholas

work on a subset of the logical cube

What's the advantage of keeping the virtual dataset around so long, relative to simply creating a bigger virtual store that contains all the data you might want up front? You can still subset that lazily, and are not restricted to subsetting only along chunk boundaries. That's how the library was intended to be used.

I'm not sure however if this is a Virtualizarr bug, when I tried to do it with the native HDF5 parser I'm almost sure it worked fine

To do this indexing you need the values for that coordinate in memory, rather than as a ManifestArray. Do the DMRPP files actually store these values "inlined"? If not then what you're seeing is intended behaviour, intended to protect you from touching the referenced legacy files when only manipulating a chunk references format (e.g. DMRPP/Kerchunk). It's similar to opening a pre-existing Kerchunk reference file that does not contain any inlined data. The HDFParser always has to look at the original HDF file, so has slightly different defaults.
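The distinction Tom draws is whether the coordinate values are materialized in memory. Label-based selection has to compare labels against actual values; if a coordinate is only a chunk reference, the lookup is impossible without reading the source file. A toy illustration (not VirtualiZarr's API; `None` here stands in for a reference-only coordinate):

```python
def sel_index(coord_values, label):
    """Resolve a label to a positional index, as .sel() must do internally.

    Toy illustration: coord_values is a plain list when the coordinate is
    inlined/loaded, or None to mimic a coordinate that exists only as a
    chunk reference (as with a DMR++-backed virtual dataset).
    """
    if coord_values is None:
        raise NotImplementedError(
            "Coordinate values are not loaded; open with loadable variables "
            "(or a parser that loads coordinates) to use label-based selection"
        )
    # With values in memory, the label resolves to a positional index.
    return coord_values.index(label)

lons = [-90.0, -89.75, -89.5]   # inlined/loaded coordinate values
print(sel_index(lons, -89.75))  # → 1

try:
    sel_index(None, -89.75)     # reference-only coordinate: cannot resolve
except NotImplementedError as e:
    print(e)
```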

@betolink (Contributor) commented Sep 9, 2025

Do the DMRPP files actually store these values "inlined"? If not then what you're seeing is intended behavior, intended to protect you from touching the referenced legacy files when only manipulating a chunk references format (e.g. DMRPP/Kerchunk). It's similar to opening a pre-existing Kerchunk reference file that does not contain any inlined data. The HDFParser always has to look at the original HDF file, so has slightly different defaults.

Yes, I assume this is the intended behavior, as you described in zarr-developers/VirtualiZarr#647. In our case we would like to give our users a shortcut to sel() and isel() operations; they can still opt for load=False and we won't materialize anything. I guess the HDF parser loads the coordinates by default.

Eventually DAACs will produce these consolidated stores, but in the meantime I see value in letting users manipulate what they'd like to persist. Now that we have VirtualiZarr 2.x, we'll finally be able to start playing with Icechunk + lithops!
