
Update open_virtual_mfdataset to use virtualizarr v2.x. #1074

Merged
betolink merged 13 commits into earthaccess-dev:main from owenlittlejohns:update-to-virtualizarr-2.x
Sep 1, 2025

Conversation

@owenlittlejohns (Collaborator) commented Aug 22, 2025

This PR updates earthaccess.open_virtual_mfdataset and earthaccess.open_virtual_dataset to use VirtualiZarr v2, which requires using obstore instead of fsspec. This addresses #1071.

Manual verification of the new code (for indirect access links):

import earthaccess

auth = earthaccess.login()
results = earthaccess.search_data(count=5, temporal=("2024", "2024"), short_name="MUR-JPL-L4-GLOB-v4.1")
vds = earthaccess.open_virtual_mfdataset(
    results,
    access="indirect",
    coords="minimal",
    compat="override",
    combine_attrs="drop_conflicts",
)
vds
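The `combine_attrs="drop_conflicts"` keyword above is passed through to xarray's combine machinery. As a toy illustration of its semantics only (a simplified sketch, not xarray's implementation):

```python
def drop_conflicts(*attrs_dicts):
    """Merge attribute dicts, keeping only keys whose values agree everywhere.

    Toy sketch of what combine_attrs="drop_conflicts" means; xarray's real
    implementation lives in its combine/merge machinery.
    """
    merged = {}
    dropped = set()
    for attrs in attrs_dicts:
        for key, value in attrs.items():
            if key in dropped:
                continue
            if key in merged and merged[key] != value:
                # Conflicting values across granules: drop the key entirely.
                del merged[key]
                dropped.add(key)
            else:
                merged[key] = value
    return merged

# Per-granule attrs that differ (e.g. per-file creation dates) are dropped;
# attrs that agree across all files are kept.
print(drop_conflicts(
    {"title": "MUR SST", "date_created": "2024-01-01"},
    {"title": "MUR SST", "date_created": "2024-01-02"},
))  # → {'title': 'MUR SST'}
```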
Pull Request (PR) draft checklist - click to expand
  • Please review our
    contributing documentation
    before getting started.
  • Populate a descriptive title. For example, instead of "Updated README.md", use a
    title such as "Add testing details to the contributor section of the README".
    Example PRs: #763
  • Populate the body of the pull request with:
  • Update CHANGELOG.md with details about your change in a section titled
    ## Unreleased. If such a section does not exist, please create one. Follow
    Common Changelog for your additions.
    Example PRs: #763
  • Update the documentation and/or the README.md with details of changes to the
    earthaccess interface, if any. Consider new environment variables, function names,
    decorators, etc.

Click the "Ready for review" button at the bottom of the "Conversation" tab in GitHub
once these requirements are fulfilled. Don't worry if you see any test failures in
GitHub at this point!

Pull Request (PR) merge checklist - click to expand

Please do your best to complete these requirements! If you need help with any of these
requirements, you can ping the @nsidc/earthaccess-support team in a comment and we
will help you out!

  • Add unit tests for any new features.
  • Apply formatting and linting autofixes. You can add a GitHub comment in this Pull
    Request containing "pre-commit.ci autofix" to automate this.
  • Ensure all automated PR checks (seen at the bottom of the "conversation" tab) pass.
  • Get at least one approving review.

📚 Documentation preview 📚: https://earthaccess--1074.org.readthedocs.build/en/1074/

@github-actions bot commented Aug 22, 2025

Binder 👈 Launch a binder notebook on this branch for commit 07e9af0

I will automatically update this comment whenever this PR is modified


@owenlittlejohns (Collaborator, Author)

@betolink - here's the draft PR. I still need to do some more checking of the docstrings, and definitely need to look at unit test coverage. But the code is here now so you can take a look at the changes while I do that.

@owenlittlejohns (Collaborator, Author) commented Aug 22, 2025

I'm also working through a minor tweak with the S3Store. I'll be pushing up changes for that hopefully very soon!

Update: Added in 003ab19

@owenlittlejohns owenlittlejohns marked this pull request as ready for review August 27, 2025 15:43
@owenlittlejohns (Collaborator, Author)

I'll look into the unit tests failing now - looks like maybe some package incompatibilities?

Comment on lines +223 to +224
if credentials_endpoint is None:
raise ValueError("The collection did not provide an S3CredentialsAPIEndpoint")
@owenlittlejohns (Collaborator, Author)

I wasn't sure of the best exception to raise here, but looking elsewhere in the code, UMM records that lack needed fields lead to ValueErrors in other places, so this seemed like a good first attempt.

@owenlittlejohns (Collaborator, Author)

Okay - apologies for the several commits with failing tests. I think the only things failing now are integration tests unrelated to this PR.

@betolink (Contributor)

I think the only things now failing are integration tests unrelated to this PR.

It's not this PR directly... I forgot that this obstore integration comes with Zarr v3 (fsspec, kerchunk, etc. now need it with xarray). I think we should merge #967 first, although it will reintroduce some code that you already removed. If we do, Zarr v3 should work and the integration tests should pass.

@weiji14, do you think this PR could be merged soon-ish? (crossing fingers 😆)

@betolink (Contributor)

I wonder if there is a compatibility matrix for packages affected by this Zarr V2 to V3 migration. 🤔

@weiji14 (Contributor) commented Aug 28, 2025

@weiji14, do you think this PR could be merged soon-ish? (crossing fingers 😆)

Haven't got time to review as I'm a little occupied this week attending a conference, plus a huge backlog of stuff from too many weeks of travel+conferences, so... go ahead if it looks ok to you! Happy for my PR at #967 to be superseded by this PR if it works, I haven't had a chance to dive into how virtualizarr v2 works yet.

@betolink (Contributor)

Thanks for the heads up @weiji14! What do you think, @owenlittlejohns? Maybe we can try just updating the deps for kerchunk and fsspec in this PR. From Wei Ji's PR:

"fsspec >=2024.12.0",
"kerchunk >=0.2.8",

and he's using an upper bound of <4.0 for zarr.

@betolink (Contributor)

ok I'll test now if this works...

@betolink (Contributor) commented Aug 28, 2025

Ok, I think the best thing we can do for now is to skip the kerchunk integration test; that is the code that first automated the creation of virtual stores using kerchunk directly. Now that virtualizarr is a thing, we need to re-implement the earthaccess.consolidate_metadata() method to use virtualizarr's own kerchunk parser. I think there is value in knowing how to open these stores using fsspec, as it is still used in the majority of Pangeo-related projects.

I opened this on the Kerchunk repo: fsspec/kerchunk#574

@betolink (Contributor)

The PR looks great! Just a few thoughts:

  • We are assuming the DMR++ is next to the data, and this may change; it would potentially require mounting different buckets, since the DMR++ files may be in a bucket with different permissions.
  • This is a somewhat unreliable access pattern and it depends on checking some boxes; we need to create a tutorial on the requirements to virtualize data from NASA, something @danielfromearth and @DeanHenze started doing already.
  • There are potential improvements to this access pattern that may be worth working on with the OPeNDAP team, like simplifying and serializing the DMR++ using Parquet, e.g. the dimensions and groups would be columns, and when we need to load data from them we won't need to parse the whole file. We also talked about including bboxes at the chunk level to implement better subsetting when we load data into xarray. cc @Mikejmnez
  • Right now we have a consolidate_metadata(granules) method that uses kerchunk directly; we'll have to migrate it to use virtualizarr's own KerchunkParser class, but that's for the next release.
  • When a provider generates a consolidated virtual store, it usually ends up in a different location from the data, so we may also need to create a tutorial on how to mount these consolidated stores, e.g. the steps from Danny's notebook should be in earthaccess: https://github.com/nasa/ASDC_Data_and_User_Services/blob/prefire-virtually/PREFIRE/additional_drafts/read-references-for-PREFIRE-from-s3.ipynb. This overlaps with Dean's PR to load Parquet references from CMR: Add get_virtual_reference(), in testing #964
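The Parquet idea in the third bullet can be pictured as flattening the per-variable chunk manifest into columnar records, so that fetching references for one variable becomes a row/column filter rather than a full-document parse. A minimal stdlib-only sketch (all field names and values here are hypothetical, not a real DMR++ schema):

```python
# Hypothetical columnar chunk manifest, as it might look after flattening a
# DMR++ document: one record per chunk, with group/variable as filterable columns.
manifest_rows = [
    {"group": "/", "variable": "analysed_sst", "chunk": "0.0",
     "path": "s3://bucket/f1.nc", "offset": 4096, "length": 1048576},
    {"group": "/", "variable": "analysed_sst", "chunk": "0.1",
     "path": "s3://bucket/f1.nc", "offset": 1052672, "length": 1048576},
    {"group": "/", "variable": "lat", "chunk": "0",
     "path": "s3://bucket/f1.nc", "offset": 512, "length": 2048},
]

def references_for(rows, variable):
    """Select the byte ranges for one variable without parsing anything else."""
    return [(r["path"], r["offset"], r["length"])
            for r in rows if r["variable"] == variable]

# Loading only "lat" touches one record, not the whole manifest.
print(references_for(manifest_rows, "lat"))  # → [('s3://bucket/f1.nc', 512, 2048)]
```

In an actual Parquet file, predicate pushdown on the `variable` column would let a reader skip the row groups for every other variable entirely.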

Can you guys take a look as well?
@chuckwondo @jhkennedy

@betolink betolink requested a review from jhkennedy August 28, 2025 17:14
@betolink (Contributor) left a review comment

The only thing I might add is that we may need to include lithops (even though we are not using it yet) as a parallel executor, but other than that this PR is amazing!

@owenlittlejohns (Collaborator, Author) commented Aug 28, 2025

Thanks for the heads up @weiji14! what do you think @owenlittlejohns? maybe we can try just updating the deps for kerchunk and fsspec on this PR. From Wei Ji's PR:

"fsspec >=2024.12.0",
"kerchunk >=0.2.8",

and he's using the upper bound for zarr to <4.0

@betolink - just to check (given the later comments here): did you still think we should bump the dependencies, too, or hold off?

@betolink (Contributor)

@betolink - just to check (given the later comments #1074 (comment)): did you still think we should bump the dependencies, too, or hold off?

I went ahead and updated the dependencies, now everything is working as expected for Zarr V3.

@betolink (Contributor) left a review comment
Approving this and will merge before Tuesday unless we hear otherwise from @chuckwondo or @jhkennedy

@betolink betolink merged commit 249c99d into earthaccess-dev:main Sep 1, 2025
13 of 14 checks passed
@betolink (Contributor) commented Sep 2, 2025

Ok, I think I missed a critical thing when I merged this: it looks like indexing on in-memory virtual datasets is still not supported, and we still need a round trip to kerchunk/icechunk. I tried using it with sel() and isel() and got exceptions in both cases. I don't know if load_variables is the only factor here; if I'm reading this correctly there is still no workaround, and Tom even mentioned this is a misunderstanding of the library in zarr-developers/VirtualiZarr#647. IMO, using indexes on an in-memory virtualized dataset is a killer feature! If we know what we are doing and are not loading petabytes of data, the worst that can happen is the same thing that would happen if we tried to do it directly with xarray...

We can try to re-implement the round-trip serialization and offer the "icechunk" option if the library is installed. I'm not sure this will be ready for tomorrow, when we wanted to do a release during the hacking hour; I will CC people in the morning.
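The round trip described here is essentially "persist the in-memory references to a concrete store format, then reopen that store with a reader that supports normal lazy indexing." A shape-only sketch with the stdlib (the kerchunk-style reference dict below is illustrative, not real data):

```python
import json
import os
import tempfile

# Illustrative kerchunk-style reference set: structure only, not real offsets.
refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "sst/0.0": ["s3://bucket/f1.nc", 4096, 1048576],
    },
}

# Step 1: serialize the virtual dataset's references to disk.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(refs, f)
    path = f.name

# Step 2: reload them. In practice a reader such as fsspec's "reference://"
# filesystem (or an icechunk store) would open this file, and the resulting
# dataset supports sel()/isel() like any other lazily-indexed xarray dataset.
with open(path) as f:
    reloaded = json.load(f)
os.remove(path)

assert reloaded == refs
```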

@TomNicholas commented Sep 5, 2025

indexing on virtual datasets in memory

What do you mean exactly? Loading virtual variables? Using .sel along coordinates with loadable variables? Something else?

@betolink (Contributor) commented Sep 5, 2025

Using .sel along coordinates with loadable variables? Something else?

Yes, exactly that, but with the DMR++ parser; it may be that the DMR++ parser itself is the culprit here. Say we load 2 files with:

vds = vz.open_virtual_mfdataset(
    urls=granule_dmrpp_urls,
    registry=obstore_registry,
    parser=DMRPPParser(group=group),
    preprocess=preprocess,
    parallel=parallel,
    combine="nested",
    **xr_combine_nested_kwargs,
)

Then if we try to do an isel() or sel(), we get an exception:

NotImplementedError: Unsupported indexer. Indexing within a ManifestArray using ints or slices is not yet supported (see GitHub issue #51), but received 0

With the explicit serialization and reloading this went away, but maybe it's unnecessary if there is a more explicit fix. @TomNicholas

@TomNicholas

Can you help me understand the context of why you're trying to index into a virtual dataset? Indexing into a ManifestArray is inherently very limited because one cannot slice inside of a compressed chunk, which is why only indexing no-ops are supported right now.

We could add support for some other limited types of indexing (i.e. chunk-aligned indexing, or arbitrary indexing into uncompressed data), but why do you even want to do this?
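Tom's point can be illustrated with a toy array-like class: an array of chunk references can honestly answer "give me everything" (a no-op), but not "give me element 0", because that would require slicing inside a compressed chunk. This is a toy sketch, not VirtualiZarr's actual ManifestArray:

```python
class ToyManifestArray:
    """Toy stand-in for an array of chunk references (not VirtualiZarr's class)."""

    def __init__(self, shape):
        self.shape = shape

    def __getitem__(self, key):
        # Normalize a single indexer into a tuple of per-axis indexers.
        if not isinstance(key, tuple):
            key = (key,)
        for indexer in key:
            # Only "no-op" indexers are allowed: Ellipsis or a full slice.
            if indexer is Ellipsis:
                continue
            if isinstance(indexer, slice) and indexer == slice(None):
                continue
            raise NotImplementedError(
                "Indexing within a manifest using ints or slices is not "
                f"supported, but received {indexer!r}"
            )
        return self

arr = ToyManifestArray((10, 20))
arr[...]        # fine: selects everything, no chunk is touched
arr[:, :]       # fine: also a no-op
try:
    arr[0]      # this is roughly what vds.isel(time=0) boils down to
except NotImplementedError as e:
    print(e)
```

Chunk-aligned indexing (dropping whole chunks from the manifest) would be expressible in this model; sub-chunk indexing would not, without reading and decompressing the referenced bytes.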

@betolink (Contributor) commented Sep 8, 2025

@TomNicholas one of the cool things about virtualized datasets is that we could lazy load them into memory and then work on a subset of the logical cube, either to compute something on that slice or to persist it as a virtualized store for later use.

A concrete case is to replicate what @bilts did here: https://github.com/nasa/zarr-eosdis-store/blob/main/presentation/example.ipynb Now imagine this for all files along the time dimension: we could generate a virtual datacube just for the Great Lakes, for example.

I'm not sure, however, if this is a VirtualiZarr bug; when I tried it with the native HDF5 parser I'm almost sure it worked fine (I need to double-check that I'm using the latest version, etc.):

vds = vz.open_virtual_mfdataset(
    urls=granule_dmrpp_urls,
    registry=obstore_registry,
    parser=HDFParser(group=group),
    parallel=parallel,
    combine="nested",
    **xr_combine_nested_kwargs,
)
subsetted_ds = vds.isel(time=0).sel(
    {
        "longitude": slice(lon_bounds[0], lon_bounds[1]),
        "latitude": slice(lat_bounds[0], lat_bounds[1]),
    }
)

@TomNicholas

work on a subset of the logical cube

What's the advantage of keeping the virtual dataset around so long, relative to simply creating a bigger virtual store that contains all the data you might want up front? You can still subset that lazily, and are not restricted to subsetting only along chunk boundaries. That's how the library was intended to be used.

I'm not sure however if this is a Virtualizarr bug, when I tried to do it with the native HDF5 parser I'm almost sure it worked fine

To do this indexing you need the values for that coordinate in memory, rather than as a ManifestArray. Do the DMRPP files actually store these values "inlined"? If not then what you're seeing is intended behaviour, intended to protect you from touching the referenced legacy files when only manipulating a chunk references format (e.g. DMRPP/Kerchunk). It's similar to opening a pre-existing Kerchunk reference file that does not contain any inlined data. The HDFParser always has to look at the original HDF file, so has slightly different defaults.
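The distinction Tom draws is whether the coordinate values are materialized in memory. Label-based selection has to compare labels against actual values; if a coordinate is only a chunk reference, the lookup is impossible without reading the source file. A toy illustration (not VirtualiZarr's API; `None` here stands in for a reference-only coordinate):

```python
def sel_index(coord_values, label):
    """Resolve a label to a positional index, as .sel() must do internally.

    Toy illustration: coord_values is a plain list when the coordinate is
    inlined/loaded, or None to mimic a coordinate that exists only as a
    chunk reference (as with a DMR++-backed virtual dataset).
    """
    if coord_values is None:
        raise NotImplementedError(
            "Coordinate values are not loaded; open with loadable variables "
            "(or a parser that loads coordinates) to use label-based selection"
        )
    # With values in memory, the label resolves to a positional index.
    return coord_values.index(label)

lons = [-90.0, -89.75, -89.5]   # inlined/loaded coordinate values
print(sel_index(lons, -89.75))  # → 1

try:
    sel_index(None, -89.75)     # reference-only coordinate: cannot resolve
except NotImplementedError as e:
    print(e)
```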

@betolink (Contributor) commented Sep 9, 2025

Do the DMRPP files actually store these values "inlined"? If not then what you're seeing is intended behavior, intended to protect you from touching the referenced legacy files when only manipulating a chunk references format (e.g. DMRPP/Kerchunk). It's similar to opening a pre-existing Kerchunk reference file that does not contain any inlined data. The HDFParser always has to look at the original HDF file, so has slightly different defaults.

Yes, I assume this is the intended behavior, as you described in zarr-developers/VirtualiZarr#647. In our case we would like to give our users a shortcut to sel() and isel() operations; they can still opt for load=False and we won't materialize anything. I guess the HDF parser loads the coordinates by default.

Eventually DAACs will produce these consolidated stores, but in the meantime I see value in letting users manipulate what they'd like to persist. Now that we have VirtualiZarr 2.x, we'll finally be able to start playing with Icechunk + lithops!
