Skip to content

Prerequisites for supporting domain detection classes#564

Merged
aaraney merged 16 commits intoNOAA-OWP:masterfrom
robertbartel:f/dataset_domain_autodetect/0/prereqs
Apr 4, 2024
Merged

Prerequisites for supporting domain detection classes#564
aaraney merged 16 commits intoNOAA-OWP:masterfrom
robertbartel:f/dataset_domain_autodetect/0/prereqs

Conversation

@robertbartel
Copy link
Copy Markdown
Contributor

A few prereqs to keep the commits cleaner. Notably:

  • Extending Reader to support repeatable reads
  • Making ResultIndicator able to behave as a bool
  • Fixing and improving geopackage hydrofabric implementation and test data
  • Adding some data formats

Adding extended interface for reader that can reset for repeated data
reads.
- Modify GeoPackageHydrofabric to reflect current format for NextGen
  hydrofabrics, which uses 'divide_id' instead of 'id' for catchment id
  within divides layer.
- Modify hydrofabric test data to have consistent catchment id column.
- Modify GeoPackageHydrofabric in type hinting for from_file class
  method to also accept a bytes object, which the function is already
  functionally compatible with.
- Remove GeoPackageHydrofabric use of fiona and replacing with pyogrio.
- Add NGEN_GEOPACKAGE_HYDROFABRIC_V2 format for hydrofabrics.
- Add HYDROFABRIC_VERSION and HYDROFABRIC_REGION indices for use with
  NGEN_GEOPACKAGE_HYDROFABRIC_V2.
- Add EMPTY format to represent an empty dataset.
- Add GENERIC format to represent dataset for more easily dealing with
  datasets for which the true domain/format has not yet been determined.
- Tweaking capitalization in indices mapping and fields for AORC_CSV to
  match format produced by current ngen-forcing regridding tool
- Add docstring for member attributes.
- Remove duplicate __hash__ implementation.
- Add __eq__ implementation.
- Expand class docstring.
- Update validators for restrictions to allow for no restrictions when
  the data foramt is EMPTY or GENERIC.
Fixing function to actually have return statement.  Also removing
outdated TODO within the class.
Copy link
Copy Markdown
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to get this started, im about to step out of the office, but I think we can start some of the dialog.

"""EOF if empty b''."""


class RepeatableReader(Reader):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seek(offset, whence) is strictly more powerful, but far more conventional. Unless there is a really good reason not to, I would suggest that we make a Seeker with a seek method and then make a composite ReadSeeker.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the convention impose on us to also implement seekable() (and, from that, tell() and truncate())?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, if seek is an available method, it follows that the methods you listed are also available. But, I don't think that is a strict requirement. I try to keep Protocol interfaces as thin and narrowly scoped as necessary. So, in this case I would not include those methods.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want to get too carried away with this, but I find the underlying design philosophy question genuinely interesting: if the strength of adhering to convention is encoded expectations, (when) is it better avoid the convention if expectations aren't going to be fulfilled?

We can potentially continue the discussion, here or offline, but for the sake of practicality I'll plan to make the changes as you've suggested.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah! I think it's a really interesting question too!

if the strength of adhering to convention is encoded expectations, (when) is it better avoid the convention if expectations aren't going to be fulfilled?

I would argue that a function or method that supports a polymorphic argument should constrain the requirements of that argument to only what the function or method needs. So in this context, an interface (python Protocol) is simply a contract that some type T supports a given set of methods. It divorces the requirement of a T to be a certain type or subtype, but instead focuses only on the behaviors of the T. This enables constructing expectations (methods) for only what is required and often reduces coupling.

With this in mind, we can consider a convention as a composition of expectations (methods) rather than a single set of expectations (traditional abc). So, the convention is not lost, we simply only specify our minimal set of requirements.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I largely agree, but a question follows from this on the strength of transitive expectations/requirements: does seek() imply seekable(), or merely suggest it?

As you point out, we could answer that by noting that the utilizing method doesn't strictly need to check whether the object is seek-able. It does need something that seekable() would provide (assurance that the data object can be read more than once), but not all of it, and what the method would need of it can be provided in a different way. So, we don't really need the object to include seekable().

The problem is that we could apply the same rationale with seek(). The method does not need an object with full seek capabilities: only an object that can do something that could be accomplished using seek(). And we could provide that functionality to the object in a different, more focused way. So we don't really need seek() either.

In other words, we would not be adopting that approach because it was the strictest possible constraint fulfillment, but rather because it's familiar: this will probably let people (including us in the future) work with the object more quickly and easily. But that fundamentally relies on expectations and assumptions for how things work. And in that context, I'm not sure we can get a firm answer on the implication versus suggestion question.

So the problem here to me is to approximating the balance point between convenience at the cost of risking reasonable-but-incorrect assumptions, versus avoiding ambiguity at the cost of familiarity. I'm still not sure where that is with this. Your option does feel more "Pythonic," though, so absent a more definitive mechanism, we can split the tie that way.

)
""" GeoPackage hydrofabric format v2 used by NextGen (id is catchment id). """

EMPTY = (15, {}, None, False)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do these interact with the Dataset's DataCategory? At least, EMPTY and GENERIC seem related to that concept.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm ... I don't think there's any coupling, strictly speaking, between categories and either formats or domains. For this in particular, I think that's fine: it's reasonable to have a dataset in any category for which the data coverage is defined to be either empty or unknown.

More generally, I can already see things with categories that need tweaking (the notion of an OUTPUT category is probably not valid, if we start to think of evaluation task needing modeling "output" as input, or forcing regridding jobs producing as "output" something belonging to FORCING category). And we may not be using category sufficiently and/or consistently throughout our data modeling.

Did you have something specific in mind when you noted this, or just a general concern?

Copy link
Copy Markdown
Member

@aaraney aaraney Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've always thought there was an implied unidirectional relationship from a DataCategory to a set of DataFormat variants. So, for a given DataCategory variant there is an implied set of valid DataFormats. If that is the case, let's say a dataset is created with a DataCategory of HYDROFABRIC and data is uploaded to the dataset without specifying its DataFormat, thus its initial DataFormat is Generic. The uploaded data is detected to be AORC_CSV. Assuming AORC_CSV is not in the set of HYDROFABRICs valid DataFormats, how should DMOD handle this case in accounting for dataset metadata?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There definitely is conceptually, just not one that's explicitly defined somewhere.

I haven't tested that scenario, but I expect it would allow the HYDROFABRIC category dataset to have a defined domain with the AORC_CSV format. I'm pretty sure the same thing would happen if the user tried to provided the domain manually when creating a dataset, though. So this is definitely a problem, but outside the scope of the automatic detection work. I've opened #567 specifically for it.

@aaraney
Copy link
Copy Markdown
Member

aaraney commented Apr 2, 2024

Curious if the failures are related to removing Fiona as a dependency? I'm AFK, so I've not pulled the branch and verified locally.

""" Index for DATA_ID values of source dataset(s) when dataset is composite format and derives from others. """
HYDROFABRIC_VERSION = (10, str, "HYDROFABRIC_VERSION")
""" Version string for version of the hydrofabric to use (e.g., 2.0.1). """
HYDROFABRIC_REGION = (11, str, "HYDROFABRIC_REGION")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Out of scope, but related, we will need something for the hydrofabric model attributes as well. We should probably chat with the HF team about how they intend to version those.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #569 for tracking this in the future.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening something to track this!

""" Index for the name of a data file within a dataset. """
COMPOSITE_SOURCE_ID = (9, str, "COMPOSITE_SOURCE_ID")
""" Index for DATA_ID values of source dataset(s) when dataset is composite format and derives from others. """
HYDROFABRIC_VERSION = (10, str, "HYDROFABRIC_VERSION")
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thoughts on making this a generic VERSION?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should keep it separate, at least for now (does any other data format use any versioning at the moment?).

A lot of - maybe all - the other data DMOD works with is only fully valid if we assume its applied within some hydrofabric. I suspect before too long we will want or need a hydrofabric version index in constraints defining the domain of regridded forcings or BMI init config datasets, to be able to tell if the cat-1156 involved is actually the cat-1156 we are interested in. And that would be a more flexible way to constrain things than aligning the specific hydrofabric id.

"""EOF if empty b''."""


class RepeatableReader(Reader):
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, if seek is an available method, it follows that the methods you listed are also available. But, I don't think that is a strict requirement. I try to keep Protocol interfaces as thin and narrowly scoped as necessary. So, in this case I would not include those methods.

@robertbartel
Copy link
Copy Markdown
Contributor Author

I'm working to see what's going on with the tests.

@robertbartel robertbartel requested a review from aaraney April 4, 2024 12:47
@robertbartel
Copy link
Copy Markdown
Contributor Author

@aaraney, I think all the review comments have been properly addressed either in conversation, separate issues, or additional commits, and I have the tests passing now also, but let me know if I've missed anything.

Copy link
Copy Markdown
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really solid! Thanks for the dialog and addressing the comments. I think we landed in a really good spot! Merge at will.

@aaraney aaraney self-requested a review April 4, 2024 14:29
Copy link
Copy Markdown
Member

@aaraney aaraney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed that ReadSeeker also needs to inherit Protocol. @christophertubbs, can you confirm this?

@aaraney
Copy link
Copy Markdown
Member

aaraney commented Apr 4, 2024

Sorry, I didn't catch this until after i'd approved. I went to take a look at #565 and in passing noticed my mistake.

Comment thread python/lib/core/dmod/core/common/reader.py Outdated
Co-authored-by: Austin Raney <austin.raney@noaa.gov>
@robertbartel robertbartel requested a review from aaraney April 4, 2024 15:23
@aaraney aaraney merged commit 6119d6c into NOAA-OWP:master Apr 4, 2024
@robertbartel robertbartel deleted the f/dataset_domain_autodetect/0/prereqs branch April 4, 2024 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

maas MaaS Workstream

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants