
H5Coro testing and integration#678

Draft
zachghiaccio wants to merge 10 commits into icesat2py:development from zachghiaccio:development

Conversation

@zachghiaccio
Contributor

Opening this to start discussion and development on the h5coro integration piece. Based on discussions with Rachel and Eric, we have several options for implementing h5coro in icepyx. However, it might require significant restructuring of the Read module to support the h5coro backend and (potentially) Xarray DataTrees.

JessicaS11 and others added 5 commits September 25, 2023 11:45
* Adding argo search and download script

* Create get_argo.py

Download the 'classic' argo data with physical variables only

* begin implementing argo dataset

* 1st draft implementing argo dataset

* implement search_data for physical argo

* doctests and general cleanup for physical argo query

* beginning of BGC Argo download

* parse BGC profiles into DF

* plan to query BGC profiles

* validate BGC param input function

* order BGC params in order in which they should be queried

* fix bug in parse_into_df() - init blank df to take in union of params from all profiles

* identify profiles from initial API request containing all required params

* creates df with only profiles that contain all user specified params
Need to dload additional params

* modified to populate prof df by querying individual profiles

* finished up BGC argo download!

* assert bounding box type in Argo init, begin framework for unit tests


* need to confirm spatial extent is bbox

* begin test case for available profiles

* add tests for argo.py

* add typing, add example json, and use it to test parsing

* update argo to submit successful api request (update keys and values submitted)

* first pass at porting argo over to metadata+per profile download (WIP)

* basic working argo script

* simplify parameter validation (ordered list no longer needed)

* add option to delete existing data before new download

* continue cleaning up argo.py

* fix download_by_profile to properly store all downloaded data

* remove old get_argo.py script

* remove _filter_profiles function in favor of submitting data kwarg in request

* start filling in docstrings

* clean up nearly duplicate functions

* add more docstrings

* get a few minimal argo tests working

* add bgc argo params. begin adding merge for second download runs

* some changes

* WIP test commit to see if can push to GH

* WIP handling argo merge issue

* update profile to df to return df and move merging to get_dataframe

* merge profiles with existing df

* clean up docstrings and code

* add test_argo.py

* add prelim test case for adding to Argo df

* remove sandbox files

* remove bgc argo test file

* update variables notebook from development

* simplify import statements

* quickfix for granules error

* draft subpage on available QUEST datasets

* small reference fix in text

* add reference to top of .rst file

* test argo df merge

* add functionality to Quest class to pass search criteria to all datasets

* add functionality to Quest class to pass search criteria to all datasets

* update dataset docstrings; reorder argo.py to match

* implement quest search+download for IS2

* move spatial and temporal properties from query to genquery

* add query docstring test for cycles,tracks to test file

* add quest test module

* standardize print outputs for quest search and download; is2 download needs auth updates

* remove extra files from this branch

* comment out argo portions of quest for PR

* remove argo-branch-only init file

* remove argo script from branch

* remove argo test file from branch

* comment out another line of argo stuff

* Update quest.py

Added Docstrings to functions within quest.py and edited the primary docstring for the QUEST class here.

Note I did not add Docstrings to the implicit __self__ function.

* Update test_quest.py

Added comments (not Docstrings) to test functions

* Update dataset.py

Minor edits to the doc strings

* Update quest.py

Edited docstrings

* catch error with downloading datasets in Quest; template test case for multi dataset query

---------

Co-authored-by: Kelsey Bisson <48059682+kelseybisson@users.noreply.github.com>
Co-authored-by: Romina <romina.piunno@gmail.com>
Co-authored-by: zachghiaccio <zhfair@umich.edu>
Co-authored-by: Zach Fair <48361714+zachghiaccio@users.noreply.github.com>
@github-actions

github-actions bot commented Apr 30, 2025

Binder 👈 Launch a binder notebook on this branch for commit 245fcac

I will automatically update this comment whenever this PR is modified


@rwegener2
Contributor

Exciting to see the progress, thanks for posting @zachghiaccio!

One question - does the Quest module have to change in order to use h5coro? I see several modified Quest files, which surprised me because I wouldn't have thought they'd need to change.

I am noticing that there is one commit on this branch from September 2023. Maybe that is the source of the Quest file modifications? (and perhaps also the merge conflicts?)

@zachghiaccio
Contributor Author

Yeah I'm a bit confused on those Quest changes as well, because I don't think there's anything in Quest right now that requires h5coro. The changes in said commit are already implemented into icepyx, so I must have forgotten to pull those changes.

@rwegener2
Contributor

rwegener2 commented May 7, 2025

Hey everyone! I played around with this a bit this afternoon. @zachghiaccio I removed everything from this commit that was related to quest, since it sounds like the h5coro work isn't related to the quest activity.

In terms of h5coro I'm starting to develop a short list of things that need to be done to allow for h5coro to be added to icepyx:

  1. specify `engine="h5coro"` in the `xr.open_dataset()` calls (@zachghiaccio did this one already)
  2. h5coro doesn't want to read an fsspec object; it wants a string s3 filepath (without the `s3://` at the beginning)
  3. because of the previous point, h5coro also needs the s3 access token to be specified using the `credentials` argument
  4. implement the `phony_dims` kwarg in h5coro
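A rough sketch of what points 2 and 3 might look like in code. The helper name `to_h5coro_path` and the shape of the credentials dict are assumptions for illustration, not icepyx or h5coro API; h5coro's actual argument names may differ.

```python
def to_h5coro_path(url: str) -> str:
    """Strip the 's3://' scheme, since h5coro wants a bare bucket/key path."""
    prefix = "s3://"
    return url[len(prefix):] if url.startswith(prefix) else url

# Temporary S3 credentials (the key names here are an assumption; check
# what h5coro's `credentials` argument actually expects).
credentials = {
    "aws_access_key_id": "...",
    "aws_secret_access_key": "...",
    "aws_session_token": "...",
}

# The open call might then look like (not executed here):
# ds = xr.open_dataset(to_h5coro_path(url), engine="h5coro",
#                      credentials=credentials)

print(to_h5coro_path("s3://some-bucket/ATL03/granule.h5"))
```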

The first 3 pieces above can be completed for the moment pretty easily (I made these changes in a fast way in the most recent commit). The bigger bump is point no. 4. Icepyx relies on the `phony_dims="access"` kwarg of the `xr.open_dataset()` function, which isn't implemented in h5coro. I don't understand well what this kwarg does, but it appears to be related to combining datasets of differing dimensions. If you try to read data in icepyx without that kwarg, an error is returned:

ValueError: conflicting sizes for dimension 'crossing_time': length 1 on 'cycle_number' and length 15 on {'crossing_time': 'bounding_polygon_lat1'}
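To make the error above concrete, here is a stdlib-only illustration (not xarray internals) of the check that fails: two variables claim the same dimension name with different lengths. Backends that support `phony_dims` (e.g. h5netcdf) assign placeholder names to unlabeled HDF5 dimensions so clashes like this can be avoided.

```python
def check_dim_sizes(variables: dict) -> dict:
    """Raise ValueError on conflicting dimension sizes, mirroring the
    kind of validation xarray performs when building a Dataset.

    `variables` maps variable name -> {dimension name: length}.
    """
    sizes = {}
    for var_name, dims in variables.items():
        for dim, length in dims.items():
            if dim in sizes and sizes[dim] != length:
                raise ValueError(
                    f"conflicting sizes for dimension {dim!r}: "
                    f"length {length} on {var_name!r} "
                    f"and length {sizes[dim]} elsewhere"
                )
            sizes[dim] = length
    return sizes

# Mirrors the clash from the traceback above (variable names taken from it):
vars_ = {
    "cycle_number": {"crossing_time": 1},
    "bounding_polygon_lat1": {"crossing_time": 15},
}
try:
    check_dim_sizes(vars_)
except ValueError as err:
    print(err)
```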

While I didn't get all the way to the end of implementing h5coro in icepyx, I did notice a recurring theme:

the amount that h5coro can speed up data reading in icepyx is limited by the number of times that icepyx splits up the reads

#676 goes into more depth on some profiling, which found that reading a single beam of data creates three separate read requests (one per group). As a parallelization library, h5coro would provide the most benefit if we could consolidate our reads into one and let h5coro do the work of splitting that up. These are all interesting points to consider in the broader conversation about what the Read module in icepyx should look like.
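One hypothetical way to consolidate the reads described above: collect every dataset path for a granule up front and hand the whole list to the reader in a single call, letting a parallel reader like h5coro split the work itself. The names here (`build_read_list`, the beam/group/variable lists) are illustrative, not icepyx API.

```python
def build_read_list(beams, groups, variables):
    """Flatten beams x groups x variables into one list of HDF5 dataset
    paths, so a single reader call can fetch them all."""
    return [
        f"{beam}/{group}/{var}"
        for beam in beams
        for group in groups
        for var in variables
    ]

paths = build_read_list(
    beams=["gt1l"],
    groups=["heights", "geolocation", "geophys_corr"],
    variables=["h_ph", "delta_time"],
)
print(len(paths))  # 1 beam x 3 groups x 2 variables = 6 datasets, one request
```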

@zachghiaccio
Contributor Author

Sorry for the delayed response, but thanks for looking into all of this, @rwegener2! I'm juggling two science team meetings today, but I'll finally have a bit of time this afternoon to check out your additions, and maybe look into the `phony_dims` kwarg, since I want to understand that better as well.
