Skip to content

playing with xcdat#558

Open
remicousin wants to merge 1 commit intomasterfrom
xcdata
Open

playing with xcdat#558
remicousin wants to merge 1 commit intomasterfrom
xcdata

Conversation

@remicousin
Copy link
Contributor

This is not meant to be merged: this is an experiment about using xcdat to represent Time as intervals as opposed to points only, a prerequisite to make functionality that creates yearly seasonal time series.

For the installation, I ran it only for the linux platform since it seems that the environment for windows goes bananas. You can create the environment following the instructions here

The datasets I tested are ENACTS ones in my shared drive so you'll get access if you work on mako.

I had to copy some functions from enactstozarr to calc to get rid of the CONFIG needs.

You can experiment by running the first test function of test_calc
python -m pytest tests/test_calc.py::test_xcdat
Set the final assess to False and uncomment the prints to see what objects look like.
I left some comments to explain the different things I played with.

Let me know what this inspires you.

@remicousin remicousin self-assigned this Jul 31, 2025
"/data/remic/mydatafiles/Malawi/test/ALL_20200917/MERGED_precip_daily/"
).glob("*.nc")))
data = xr.open_mfdataset(
netcdf,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn’t it be opened using xc? Did I miss something, or does xcdat act on xr with some parameter?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

xc.openmfdataset doesn't work, or rather doesn't bring anything in this case, because the individual ENACTS files don't have time dimension. So I need to reuse what we had written from zarrifcation codes to read the files and get time from file names. Then having xc module available, the DataArray created is considered "xcdat" compatible (there is some sort of boolean attribute showing when printing the DataArray) and then the bounds related functions (that are also xr I believe) or the specific xcdat methods are applicable.

Copy link
Collaborator

@aaron-kaplan aaron-kaplan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like there's some useful stuff here, but significant costs too.

Running a single test on mako takes 2.5s in the conda environment on master, but 10s in the one on this branch. The test itself runs instantaneously, it's just the startup time (probably imports) that increases. There's some pretty heavyweight stuff in its dependencies, notably xesmf. (Which might be useful in its own right, but I don't think we need it right now.)

I don't like how xcdat reaches into the internal implementation of another library (xarray) to change the behavior of pre-existing functions (e.g. open_mfdataset). That's sure to be a maintenance nightmare (for the xcdat maintainers) in the long run.

It looks like xcdat gets some (but not all) of its bounds-related functionality from cf_xarray. We should consider using cf_xarray even if we decide not to use xcdat.

Not sure if this is a problem with cf_xarray or xcdat, but I notice it adds bounds as a data variable, whereas I think it should be a coordinate variable.

Do you have an opinion on how useful this is? I won't come down strongly against it, but I wouldn't adopt it just to get some minor conveniences that we could easily reimplement.



def filename2datetime64(file, time_res="daily"):
"""Return time associated with an ENACTS filename in datetime
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are ENACTS filenames actually standardized? It seems like I've seen different naming conventions in different countries, but maybe it was only directory names that changed.

Same question for dimension names Lon, Lat.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a copy-and-paste from enactstozarr which is upstream so I can't import it here. I guess ENACTS have been standard for long enough so that this has worked so far.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feel free to factor out common ENACTS functionality into pingrid if that seems appropriate to you.

data_dkd["T"].attrs["axis"] = "T"
# Creates T_bnds
# calc.set_up_dims sets T at the beginning of the dekad
# However, add_time_bounds' freq understands only yearly, monthly, daily, hourly
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we decide to use xcdat, we should submit a PR to add the dekadal option. I imagine it wouldn't be hard, given that monthly and yearly options (other intervals whose length varies) already exist.

# So here I trick it by using midpoint
# of which ending edge of T_bnds are the centers of the dekads
# which I re assign as T and then can apply add_time_bounds again
# to get dekadal data as we would present it
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it would have been simpler to add the right bounds in the preprocess method than to do this dance with the midpoints.

@remicousin
Copy link
Contributor Author

Do you have an opinion on how useful this is? I won't come down strongly against it, but I wouldn't adopt it just to get some minor conveniences that we could easily reimplement.

I think there are 2 considerations to have here. One is that I don't really have time to explore further before I deliver some first app on the 15th, and this brief experiment demonstrates that more exploration is needed. So for this, I will probably just use what's already implemented, which is the groupby method that also outputs start/end seasons, which in turn are going to be useful to print the seasons in the app, or possibly to make bar plots with corresponding center and width.

Then the other consideration to get back to it after the 15th is whether any of that has a future. A future means that these things are handed over to someone. At the moment, this someone could be PepsiCo, ENACTS, or another version of ourselves. I guess it's plausible that PepsiCo code is handed over to CCSR or PepsiCo themselves to install and run the apps, even though CCSR seems to have paid little to no attention so far to how their PepsiCo project will transition past us, even though it seems to be a life sustaining part of their current funding landscape. Even if handed over, it seems unlikely that CCSR or PepsiCo will dive into the code for further development themselves, meaning they will need to hire someone (staff, outsourced), which is not impossible but push the odds of an actual further use of those things to deeper uncertainty.

For ENACTS, even though it concerns a much bigger crowd than CCSR-PepsiCo, we are not aware of any effort to guarantee a future for any of it. ICPAC (maybe also AGRHYMET) may be interested in being handed over Python code, but this is only used currently for the water balance application which feels like a drop in the ocean of their potential concerns -- even though a well designed time library will then be used elsewhere, all that represents much more effort by us or them before being meaningfully fruitful.

Lastly, another version of ourselves remain extremely hypothetical as of now. The Announcement and work on Hyrax might move the needle, and good time libraries could come up quickly next in line in the prospect of developing a new system, but all that is highly uncertain and distant in the future.

So in summary, I would table that work until something happens that deserves to put it back up into priorities.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments