Update open_virtual_mfdataset to use virtualizarr v2.x. #1074
betolink merged 13 commits into earthaccess-dev:main from
Conversation
@betolink - here's the draft PR. I still need to do some more checking over docstrings, and definitely need to look at unit test coverage. But the code is here now so you can take a look at the changes while I do that.
Update: Added in 003ab19
I'll look into the unit tests failing now - looks like maybe some package incompatibilities?
```python
if credentials_endpoint is None:
    raise ValueError("The collection did not provide an S3CredentialsAPIEndpoint")
```
I wasn't sure of the best exception to raise here, but looking at other places in the code, UMM records that don't have what is needed lead to ValueErrors elsewhere, so this seemed like a good first attempt.
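For illustration, a hypothetical caller-side sketch of how this check might surface to users (the `get_credentials_endpoint` helper and the record shape are made up here, not the earthaccess API; only the ValueError message comes from the diff above):

```python
# Hypothetical sketch: a UMM record that lacks an S3 credentials endpoint
# triggers the ValueError from the code under review.
def get_credentials_endpoint(umm_record: dict) -> str:
    endpoint = umm_record.get("S3CredentialsAPIEndpoint")
    if endpoint is None:
        raise ValueError("The collection did not provide an S3CredentialsAPIEndpoint")
    return endpoint

try:
    get_credentials_endpoint({"ShortName": "EXAMPLE"})
except ValueError as err:
    print(err)
```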
Okay - apologies for the several commits with failing tests. I think the only things now failing are integration tests unrelated to this PR.
It's not this PR directly... I forgot that this Obstore integration comes with Zarr v3 (fsspec, kerchunk, etc. need it now with xarray). I think we should merge #967 first, although it will introduce some code that you already removed; if we do, Zarr V3 should work and the integration tests should pass. @weiji14, do you think this PR could be merged soon-ish? (crossing fingers 😆)
I wonder if there is a compatibility matrix for packages affected by this Zarr V2 to V3 migration. 🤔 |
Haven't got time to review as I'm a little occupied this week attending a conference, plus a huge backlog of stuff from too many weeks of travel + conferences, so... go ahead if it looks ok to you! Happy for my PR at #967 to be superseded by this one if it works; I haven't had a chance to dive into how VirtualiZarr v2 works yet.
Thanks for the heads up @weiji14! What do you think @owenlittlejohns? Maybe we can try just updating the deps for kerchunk and fsspec on this PR. From Wei Ji's PR: and he's using the upper bound for zarr to
Ok, I'll test now if this works...
Ok, I think the best thing we can do for now is to skip the kerchunk integration test; this is the code that started to automate the creation of virtual stores using kerchunk directly. Now that virtualizarr is a thing, we need to re-implement the . I opened this on the Kerchunk repo: fsspec/kerchunk#574
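For context, the kerchunk v1 reference format that the skipped integration test exercises is just JSON mapping zarr keys to either inlined metadata or `[url, offset, length]` triples. A hand-written sketch (the bucket, variable name, and byte ranges below are invented for illustration):

```python
import json

# Minimal hand-written sketch of a kerchunk v1 reference set.
refs = {
    "version": 1,
    "refs": {
        # zarr metadata keys hold inlined JSON strings...
        ".zgroup": json.dumps({"zarr_format": 2}),
        # ...while chunk keys map to [source URL, byte offset, byte length]
        "temperature/0.0": ["s3://example-bucket/granule.h5", 4096, 1048576],
    },
}

# A reference file is nothing more than this dict serialized as JSON.
print(json.dumps(refs)[:24])
```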
The PR looks great! Just a few thoughts:
Can you guys take a look as well?
betolink
left a comment
The only thing I may add is that we may need to include lithops (even though we are not using it yet) as a parallel executor, but other than that this PR is amazing!
@betolink - just to check (given the later comments here): did you still think we should bump the dependencies, too, or hold off? |
I went ahead and updated the dependencies, now everything is working as expected for Zarr V3. |
betolink
left a comment
Approving this and will merge before Tuesday unless we hear otherwise from @chuckwondo or @jhkennedy
Ok, I think I missed a critical thing when I merged this: it looks like indexing on virtual datasets in memory is still not supported, and we still need a round trip to kerchunk/icechunk. I tried using it with . We can try to re-implement the round-trip serialization and give the "icechunk" option if the library is installed. I'm not sure this would be ready for tomorrow when we wanted to do a release during the hacking hour; will CC people in the morning.
What do you mean exactly? Loading virtual variables? Using .
Yes, exactly that but for the dmrpp parser; it may be that the actual dmrpp parser is the culprit here. Say we load 2 files with:

```python
vds = vz.open_virtual_mfdataset(
    urls=granule_dmrpp_urls,
    registry=obstore_registry,
    parser=DMRPPParser(group=group),
    preprocess=preprocess,
    parallel=parallel,
    combine="nested",
    **xr_combine_nested_kwargs,
)
```

Then if we tried to do an . With the explicit serialization and reloading this went away, but maybe it is unnecessary if there is a more explicit fix. @TomNicholas
Can you help me understand the context of why you're trying to index into a virtual dataset? Indexing into a . We could add support for some other limited types of indexing (i.e. chunk-aligned indexing, or arbitrary indexing into uncompressed data), but why do you even want to do this?
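To illustrate the "chunk-aligned indexing" idea mentioned here: a slice whose bounds fall exactly on chunk boundaries can be satisfied by dropping whole chunk references, without reading any bytes from the referenced files. A small standalone sketch (plain Python, not VirtualiZarr code):

```python
def is_chunk_aligned(sl: slice, dim_len: int, chunk_len: int) -> bool:
    """True if the slice can be served by keeping/dropping whole chunks only."""
    start = 0 if sl.start is None else sl.start
    stop = dim_len if sl.stop is None else sl.stop
    return start % chunk_len == 0 and (stop % chunk_len == 0 or stop == dim_len)

# A 1000-element dimension stored in chunks of 100:
print(is_chunk_aligned(slice(0, 200), 1000, 100))   # True: whole chunks 0 and 1
print(is_chunk_aligned(slice(50, 150), 1000, 100))  # False: cuts through chunks
```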
@TomNicholas one of the cool things about virtualized datasets is that we could lazy load them into memory and then work on a subset of the logical cube, either to compute something on that slice or to persist it as a virtualized store for later use. A concrete case is to replicate what @bilts did here: https://github.com/nasa/zarr-eosdis-store/blob/main/presentation/example.ipynb

Now imagine this for all files along the time dimension; we could generate a virtual datacube only for the Great Lakes, for example. I'm not sure, however, if this is a VirtualiZarr bug; when I tried to do it with the native HDF5 parser I'm almost sure it worked fine (need to double-check I'm using the latest version, etc.):

```python
vds = vz.open_virtual_mfdataset(
    urls=granule_dmrpp_urls,
    registry=obstore_registry,
    parser=HDFParser(group=group),
    parallel=parallel,
    combine="nested",
    **xr_combine_nested_kwargs,
)
subsetted_ds = vds.isel(time=0).sel(
    {
        "longitude": slice(lon_bounds[0], lon_bounds[1]),
        "latitude": slice(lat_bounds[0], lat_bounds[1]),
    }
)
```
What's the advantage of keeping the virtual dataset around so long, relative to simply creating a bigger virtual store that contains all the data you might want up front? You can still subset that lazily, and are not restricted to subsetting only along chunk boundaries. That's how the library was intended to be used.
To do this indexing you need the values for that coordinate in memory, rather than as a ManifestArray. Do the DMRPP files actually store these values "inlined"? If not, then what you're seeing is intended behaviour, meant to protect you from touching the referenced legacy files when only manipulating a chunk references format (e.g. DMRPP/Kerchunk). It's similar to opening a pre-existing Kerchunk reference file that does not contain any inlined data. The HDFParser always has to look at the original HDF file, so it has slightly different defaults.
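To make the "values in memory" point concrete: label-based selection such as `.sel(latitude=slice(...))` must translate labels into integer positions, which requires the actual coordinate values, not just byte-range references. A standalone sketch of that translation (plain Python with hypothetical latitude values, not xarray internals):

```python
import bisect

# Hypothetical coordinate values. With only a ManifestArray (byte ranges in
# remote files), these numbers are unknown and the lookup below is impossible.
latitude = [40.0, 40.25, 40.5, 40.75, 41.0]

def sel_range(coord, lo, hi):
    """Translate a label range into integer positions, like xarray's .sel."""
    i = bisect.bisect_left(coord, lo)
    j = bisect.bisect_right(coord, hi)
    return slice(i, j)

print(sel_range(latitude, 40.2, 40.8))  # -> slice(1, 4)
```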
Yes, I assume this is the intended behavior, as you described in zarr-developers/VirtualiZarr#647. In our case we would like to give our users a shortcut to . Eventually DAACs will produce these consolidated stores, but in the meantime I see value in letting the users manipulate what they'd like to persist. Now that we have VirtualiZarr 2.x we'll be able to finally start playing with Icechunk+lithops!
This PR updates earthaccess.open_virtual_mfdataset and earthaccess.open_virtual_dataset to use VirtualiZarr v2, which requires using obstore instead of fsspec. This addresses #1071.

Manual verification of the new code (for indirect access links):
Pull Request (PR) draft checklist
contributing documentation before getting started.
title such as "Add testing details to the contributor section of the README".
Example PRs: #763
example
closes #1. See GitHub docs - Linking a pull request to an issue.
CHANGELOG.md with details about your change in a section titled ## Unreleased. If such a section does not exist, please create one. Follow Common Changelog for your additions.
Example PRs: #763
README.md with details of changes to the earthaccess interface, if any. Consider new environment variables, function names,
decorators, etc.
Click the "Ready for review" button at the bottom of the "Conversation" tab in GitHub
once these requirements are fulfilled. Don't worry if you see any test failures in
GitHub at this point!
Pull Request (PR) merge checklist
Please do your best to complete these requirements! If you need help with any of these
requirements, you can ping the
@nsidc/earthaccess-support team in a comment and we will help you out!
Request containing "pre-commit.ci autofix" to automate this.
📚 Documentation preview 📚: https://earthaccess--1074.org.readthedocs.build/en/1074/