Initial package scaffolding #1
Take all of my comments with a huge grain of salt...I'm interested in wiring this up so that I can effectively provide components to the system and minimize duplication with R bindings; however, I don't have much of a grasp of the existing APIs (notably, shapely and GeoPandas).
Lays out a simple poetry package, where geoarrow.core is defined as an implicit namespace package. So import geoarrow does nothing; import geoarrow.core imports this package.
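To make the namespace-package behaviour concrete, here is a minimal sketch of how an implicit namespace package (PEP 420) lets separate distributions contribute subpackages under one top-level name. The name `nsdemo` and the `dist_core`/`dist_c` directories are made up for illustration so the example does not clash with an installed geoarrow; this is not the actual layout of this PR.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical layout: two separately installed "distributions", each shipping
# one subpackage of the same top-level namespace. Note that there is no
# nsdemo/__init__.py anywhere, which is what makes it an implicit namespace.
root = Path(tempfile.mkdtemp())
for dist, sub in [("dist_core", "core"), ("dist_c", "c")]:
    pkg_dir = root / dist / "nsdemo" / sub
    pkg_dir.mkdir(parents=True)
    (pkg_dir / "__init__.py").write_text(f"NAME = 'nsdemo.{sub}'\n")
    sys.path.insert(0, str(root / dist))

importlib.invalidate_caches()

import nsdemo        # succeeds, but runs no package code of its own
import nsdemo.core   # runs dist_core/nsdemo/core/__init__.py
import nsdemo.c      # runs dist_c/nsdemo/c/__init__.py

# The namespace has no file of its own; its __path__ spans both distributions.
namespace_file = getattr(nsdemo, "__file__", None)
namespace_portions = list(nsdemo.__path__)
```

With this layout, each subpackage can be released and installed independently, which is presumably the appeal for a family of geoarrow packages.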
Cool! I like the idea of having a geoarrow-verse. Bindings to the C implementation would be geoarrow.c (right?), which might or might not be used by other components.
Defines a simple PointArray dataclass that wraps pyarrow. Maybe in the future we can remove pyarrow as a dependency, but for now it's simplest to have.
Even with a pyarrow dependency, I still think we want an abstract "ArrayStorage" class. For pyarrow, this might be an Array or a ChunkedArray. I would personally put all pyarrow-related implementations in import geoarrow.pyarrow (which would take care of other pyarrow-specific details like registering the extension types). If you have to declare exactly one, I'd pick ChunkedArray because Array -> ChunkedArray is always zero copy (but often not the other way around).
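A rough sketch of what such a storage abstraction could look like, using only the standard library so it runs without pyarrow. `ArrayStorage`, `ContiguousStorage`, and `ChunkedStorage` are hypothetical names standing in for whatever would wrap a pyarrow Array or ChunkedArray; the point is only to show why going from one array to a chunked container can reuse the buffer while the reverse generally has to copy.

```python
from abc import ABC, abstractmethod
from array import array
from typing import List


class ArrayStorage(ABC):
    """Hypothetical abstract storage; not an existing geoarrow or pyarrow API."""

    @abstractmethod
    def chunks(self) -> List[array]:
        """Return the underlying buffers without copying."""

    def __len__(self) -> int:
        return sum(len(chunk) for chunk in self.chunks())


class ContiguousStorage(ArrayStorage):
    """Stand-in for a single contiguous array."""

    def __init__(self, values: array) -> None:
        self._values = values

    def chunks(self) -> List[array]:
        return [self._values]

    def to_chunked(self) -> "ChunkedStorage":
        # Zero copy: the chunked container just references the same buffer.
        return ChunkedStorage([self._values])


class ChunkedStorage(ArrayStorage):
    """Stand-in for a chunked array made of several buffers."""

    def __init__(self, chunks: List[array]) -> None:
        self._chunks = list(chunks)

    def chunks(self) -> List[array]:
        return self._chunks

    def to_contiguous(self) -> ContiguousStorage:
        if len(self._chunks) == 1:
            return ContiguousStorage(self._chunks[0])  # still zero copy
        combined = array("d")
        for chunk in self._chunks:
            combined.extend(chunk)  # copies every element into a new buffer
        return ContiguousStorage(combined)


single = ContiguousStorage(array("d", [1.0, 2.0]))
as_chunked = single.to_chunked()
two_chunks = ChunkedStorage([array("d", [1.0, 2.0]), array("d", [3.0])])
merged = two_chunks.to_contiguous()
```

Declaring the chunked form as the one required storage means every producer can satisfy it cheaply, at the cost of consumers having to loop over chunks.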
I would personally use the terminology Series for what you have here (with Array as a wrapper around a pyarrow Array/ChunkedArray/maybe something else in the future). This distinction is roughly what both Pandas and Polars do (Arrow C++ also separates this but calls it ArrayData and Array).
"known" accessors for geo, geos, and proj, which should give typing autocompletions in an IDE as long as those namespace packages are installed.
I like this, which is similar to what pandas does with str and similar accessors (and what cuDF does for type-specific operations). I would maybe call geo georust instead, but 🤷.
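For readers unfamiliar with the accessor pattern, a minimal sketch under stated assumptions: the `PointArray` here is a plain-Python stand-in (no pyarrow), `GeoAccessor.centroid` is an invented operation, and `nsdemo_not_installed.proj` is a deliberately fake module name used to show what happens when the optional package behind an accessor is missing. None of this is the PR's actual code.

```python
import importlib
from dataclasses import dataclass
from typing import List, Tuple


class GeoAccessor:
    """Hypothetical accessor grouping the operations of one backend."""

    def __init__(self, parent: "PointArray") -> None:
        self._parent = parent

    def centroid(self) -> Tuple[float, float]:
        # Mean of the coordinates; an illustrative operation only.
        n = len(self._parent.coords)
        sum_x = sum(x for x, _ in self._parent.coords)
        sum_y = sum(y for _, y in self._parent.coords)
        return (sum_x / n, sum_y / n)


@dataclass
class PointArray:
    coords: List[Tuple[float, float]]

    @property
    def geo(self) -> GeoAccessor:
        # The return annotation is what gives an IDE its autocompletions.
        return GeoAccessor(self)

    @property
    def proj(self):
        # Lazily import an optional backend; the module name is made up.
        try:
            module = importlib.import_module("nsdemo_not_installed.proj")
        except ImportError as exc:
            raise ImportError(
                "the 'proj' accessor needs its optional package installed"
            ) from exc
        return module.Accessor(self)


points = PointArray([(0.0, 0.0), (2.0, 4.0)])
center = points.geo.centroid()
```

Calling `points.proj` in this sketch raises `ImportError`, since the backing module does not exist.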
Thinking about it again, I think the biggest problem with the approach in this PR is that the returned object from the submodule is not necessarily the same class as the core class. I.e. are we going to require that the submodules depend on Maybe it would be better to use structural subtyping and have the core package focus on protocols? Then each package could have its own implementation of a point array if desired, which implements the
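A sketch of what the structural-subtyping idea could look like with `typing.Protocol`. The protocol name `PointArrayLike`, its `xy` method, and the two implementing classes are all hypothetical; the only point is that neither implementation inherits from, or imports, a shared base class.

```python
from typing import List, Protocol, Sequence, Tuple, runtime_checkable


@runtime_checkable
class PointArrayLike(Protocol):
    """Hypothetical protocol a core package might publish."""

    def __len__(self) -> int: ...

    def xy(self) -> Sequence[Tuple[float, float]]: ...


class TuplePoints:
    """One package's implementation: stores (x, y) tuples."""

    def __init__(self, coords: List[Tuple[float, float]]) -> None:
        self._coords = coords

    def __len__(self) -> int:
        return len(self._coords)

    def xy(self) -> Sequence[Tuple[float, float]]:
        return self._coords


class ColumnarPoints:
    """Another package's implementation: stores separate x and y columns."""

    def __init__(self, xs: List[float], ys: List[float]) -> None:
        self._xs, self._ys = xs, ys

    def __len__(self) -> int:
        return len(self._xs)

    def xy(self) -> Sequence[Tuple[float, float]]:
        return list(zip(self._xs, self._ys))


def bounds(points: PointArrayLike) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for any conforming point array."""
    xs = [x for x, _ in points.xy()]
    ys = [y for _, y in points.xy()]
    return (min(xs), min(ys), max(xs), max(ys))


a = TuplePoints([(0.0, 1.0), (2.0, 3.0)])
b = ColumnarPoints([0.0, 2.0], [1.0, 3.0])
```

Note that `isinstance` checks against a `runtime_checkable` protocol only verify that the methods exist, not their signatures, so static type checking is where most of the benefit would come from.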
Yeah it could be named anything
Is your goal to separate pyarrow because of bundle size? Because it's a large dependency that some projects won't want? These are valid concerns, but I'm not sure what an I think it's important to have
I think here I have a different idea of the integration point between this package and other packages in the ecosystem. I wouldn't use the terminology
Maybe
I wonder if it's too early to scaffold an object-oriented approach to the Array here. At its heart, there are a lot of functions that accept something array-like and return something array-like (e.g.,
Also feel free to push forward and execute your vision here...a new array interface isn't something I'm all that passionate about but that's not to say it isn't valuable!
Yeah, I agree it's early and there are a lot of unknowns. The reason I reach for an object-oriented approach is that not all operations are implemented on every geometry type. E.g. linestring simplification might not be implemented for points, or a clustering algorithm might be implemented only for points. And it's nicer to have some IDE hinting for what operations can be used for which data type, especially since in Arrow we have known strict typing. Using Arrow objects directly without any wrapping classes loses all typing support. It might be too early to do any work on a core library. I'll push along my Python bindings to experiment with some different approaches.
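To illustrate the typing argument, a small sketch in which each wrapper class exposes only the operations that make sense for its geometry type. Both classes and both methods are toy stand-ins: `grid_cluster` just buckets points into square cells and `simplify_keep_endpoints` keeps the first and last vertex, neither of which is a real algorithm from any of the libraries discussed.

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]


@dataclass
class PointArray:
    coords: List[Coord]

    def grid_cluster(self, cell_size: float) -> List[Tuple[int, int]]:
        """Toy clustering: label each point with the grid cell it falls in."""
        return [(int(x // cell_size), int(y // cell_size)) for x, y in self.coords]


@dataclass
class LineStringArray:
    lines: List[List[Coord]]

    def simplify_keep_endpoints(self) -> "LineStringArray":
        """Toy simplification: keep only the first and last vertex of each line."""
        return LineStringArray([[line[0], line[-1]] for line in self.lines])


points = PointArray([(0.5, 0.5), (0.7, 0.2), (3.1, 0.0)])
labels = points.grid_cluster(1.0)

lines = LineStringArray([[(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]])
simplified = lines.simplify_keep_endpoints()
```

A type checker or IDE would flag `points.simplify_keep_endpoints()` before the code runs, which is the kind of support that is lost when every function takes and returns an untyped Arrow array.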
In geoarrow/geoarrow-rs#140 @paleolimbot and I were talking about how to lay out python geoarrow.
Lays out a simple poetry package, where geoarrow.core is defined as an implicit namespace package. So import geoarrow does nothing; import geoarrow.core imports this package.
Defines a simple PointArray dataclass that wraps pyarrow. Maybe in the future we can remove pyarrow as a dependency, but for now it's simplest to have.
"known" accessors for geo, geos, and proj, which should give typing autocompletions in an IDE as long as those namespace packages are installed.