Initial package scaffolding #1

kylebarron · 2023-08-09T16:24:32Z

In geoarrow/geoarrow-rs#140 @paleolimbot and I were talking about how to lay out python geoarrow.

Lays out a simple poetry package, where geoarrow.core is defined as an implicit namespace package. So import geoarrow does nothing; import geoarrow.core imports this package.
Defines a simple PointArray dataclass that wraps pyarrow. Maybe in the future we can remove pyarrow as a dependency, but for now it's simplest to have.
"known" accessors for geo, geos, and proj, which should give typing autocompletions in an IDE as long as those namespace packages are installed.
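For reference, a minimal sketch of how an implicit namespace package behaves, assuming a throwaway directory tree built at runtime (the `PACKAGE_NAME` constant is made up for illustration and is not part of this PR):

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a temporary source tree that mirrors the layout described above.
# There is deliberately no geoarrow/__init__.py, only geoarrow/core/__init__.py.
root = Path(tempfile.mkdtemp())
core_dir = root / "geoarrow" / "core"
core_dir.mkdir(parents=True)
# PACKAGE_NAME is a hypothetical marker so we can see that the module code ran.
(core_dir / "__init__.py").write_text("PACKAGE_NAME = 'geoarrow.core'\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import geoarrow        # succeeds, but executes no package code
import geoarrow.core   # executes geoarrow/core/__init__.py

# A namespace package has no __init__.py, so it has no file of its own.
namespace_file = getattr(geoarrow, "__file__", None)
print(namespace_file)
print(geoarrow.core.PACKAGE_NAME)
```

Because `geoarrow` itself owns no code, separately installed distributions can each contribute their own `geoarrow.[name]` subpackage.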

paleolimbot

Take all of my comments with a huge grain of salt...I'm interested in wiring this up so that I can effectively provide components to the system and minimize duplication with R bindings; however, I don't have much of a grasp of the existing APIs (notably, shapely and GeoPandas).

> Lays out a simple poetry package, where geoarrow.core is defined as an implicit namespace package. So import geoarrow does nothing; import geoarrow.core imports this package.

Cool! I like the idea of having a geoarrow-verse. Bindings to the C implementation would be geoarrow.c (right?), which might or might not be used by other components.

> Defines a simple PointArray dataclass that wraps pyarrow. Maybe in the future we can remove pyarrow as a dependency, but for now it's simplest to have.

Even with a pyarrow dependency, I still think we want an abstract "ArrayStorage" class. For pyarrow, this might be an Array or a ChunkedArray. I would personally put all pyarrow-related implementations in import geoarrow.pyarrow (which would take care of other pyarrow-specific details like registering the extension types). If you have to declare exactly one, I'd pick ChunkedArray because Array -> ChunkedArray is always zero copy (but often not the other way around).

I would personally use the terminology Series for what you have here (with Array as a wrapper around a pyarrow Array/ChunkedArray/maybe something else in the future). This distinction is roughly what both Pandas and Polars do (Arrow C++ also separates this but calls it ArrayData and Array).

> "known" accessors for geo, geos, and proj, which should give typing autocompletions in an IDE as long as those namespace packages are installed.

I like this, which is similar to what pandas does with str and similar accessors (and what cuDF does for type-specific operations). I would maybe call geo georust but 🤷 .
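A rough sketch of the accessor pattern under discussion, loosely modelled on how pandas exposes `.str`. The `GeosAccessor` class and its `count` method are hypothetical stand-ins defined inline, not the API in this PR:

```python
from dataclasses import dataclass
from typing import List, Tuple


class GeosAccessor:
    """Hypothetical stand-in for an accessor a geoarrow.geos package might ship."""

    def __init__(self, array: "PointArray") -> None:
        self._array = array

    def count(self) -> int:
        # Placeholder operation; a real accessor would call into GEOS.
        return len(self._array.coords)


@dataclass
class PointArray:
    coords: List[Tuple[float, float]]

    @property
    def geos(self) -> GeosAccessor:
        # Assumption: in the real layout this would import the accessor from
        # the optional namespace package and raise a clear error when that
        # package is not installed. Here it is defined inline to stay runnable.
        return GeosAccessor(self)


points = PointArray([(0.0, 1.0), (2.0, 3.0)])
print(points.geos.count())
```

Because the property has a declared return type, an IDE can offer completions for `points.geos.` without the core class implementing any of the operations itself.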

kylebarron · 2023-08-09T18:58:53Z

Thinking about it again, I think the biggest problem with the approach in this PR is that the returned object from the submodule is not necessarily the same class as the core class. I.e. are we going to require that the submodules depend on geoarrow.core and always return core classes?

Maybe it would be better to use structural subtyping and have the core package focus on protocols? Then each package could have its own implementation of a point array if desired, which implements the geoarrow.core.PointArray protocol.
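To make the structural-subtyping idea concrete, here is a minimal sketch using `typing.Protocol`. The names `PointArrayProtocol`, `RustPointArray`, `to_xy`, and `bounding_box` are invented for illustration and are not part of any geoarrow package:

```python
from typing import List, Protocol, Tuple, runtime_checkable


@runtime_checkable
class PointArrayProtocol(Protocol):
    """Hypothetical protocol that a core package could define."""

    def __len__(self) -> int: ...

    def to_xy(self) -> List[Tuple[float, float]]: ...


class RustPointArray:
    """Stand-in for one package's own class; it does not inherit from the protocol."""

    def __init__(self, xy: List[Tuple[float, float]]) -> None:
        self._xy = xy

    def __len__(self) -> int:
        return len(self._xy)

    def to_xy(self) -> List[Tuple[float, float]]:
        return list(self._xy)


def bounding_box(points: PointArrayProtocol) -> Tuple[float, float, float, float]:
    # Accepts any object with the right methods, regardless of its class.
    xs, ys = zip(*points.to_xy())
    return (min(xs), min(ys), max(xs), max(ys))


points = RustPointArray([(0.0, 1.0), (2.0, -1.0)])
print(isinstance(points, PointArrayProtocol))
print(bounding_box(points))
```

With this shape, a submodule would not need to depend on a concrete core class; it only has to provide the methods the protocol lists.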

> Cool! I like the idea of having a geoarrow-verse. Bindings to the C implementation would be geoarrow.c (right?), which might or might not be used by other components.

Yeah it could be named anything geoarrow.[name].

> Even with a pyarrow dependency, I still think we want an abstract "ArrayStorage" class. For pyarrow, this might be an Array or a ChunkedArray. I would personally put all pyarrow-related implementations in import geoarrow.pyarrow (which would take care of other pyarrow-specific details like registering the extension types). If you have to declare exactly one, I'd pick ChunkedArray because Array -> ChunkedArray is always zero copy (but often not the other way around).

Is your goal to separate pyarrow because of bundle size? Because it's a large dependency that some projects won't want?

These are valid concerns, but I'm not sure what an ArrayStorage class would hold? Or you're saying that's an ABC and you'd have pyarrow storage and nanoarrow storage on top of that?

I think it's important to have Array storage and not just ChunkedArray, because that guarantees to the user (developer) that all geometries in this array are in contiguous memory.
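One possible reading of the "ArrayStorage as an ABC" idea, sketched with plain Python lists standing in for Arrow buffers. All class and method names here are hypothetical; a real version would wrap pyarrow or nanoarrow objects instead:

```python
from abc import ABC, abstractmethod
from typing import List, Sequence


class ArrayStorage(ABC):
    """Hypothetical abstract storage; subclasses decide how chunks are held."""

    @abstractmethod
    def chunks(self) -> List[Sequence[float]]:
        """Return the underlying buffers without copying them."""

    def __len__(self) -> int:
        return sum(len(chunk) for chunk in self.chunks())


class ContiguousStorage(ArrayStorage):
    """One buffer: every value is guaranteed to sit in a single sequence."""

    def __init__(self, values: Sequence[float]) -> None:
        self.values = values

    def chunks(self) -> List[Sequence[float]]:
        return [self.values]

    def to_chunked(self) -> "ChunkedStorage":
        # Wrapping one buffer in a list never copies the data.
        return ChunkedStorage([self.values])


class ChunkedStorage(ArrayStorage):
    """Several buffers: no contiguity guarantee."""

    def __init__(self, parts: List[Sequence[float]]) -> None:
        self.parts = parts

    def chunks(self) -> List[Sequence[float]]:
        return self.parts

    def to_contiguous(self) -> ContiguousStorage:
        if len(self.parts) == 1:
            return ContiguousStorage(self.parts[0])  # still zero copy
        # More than one chunk has to be concatenated into a new buffer.
        merged = [value for part in self.parts for value in part]
        return ContiguousStorage(merged)


values = [1.0, 2.0, 3.0]
chunked = ContiguousStorage(values).to_chunked()
print(chunked.chunks()[0] is values)
```

The sketch shows both points raised in the thread: contiguous to chunked is free, while the reverse may copy, and only the contiguous class lets a caller rely on a single buffer.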

> I would personally use the terminology Series for what you have here (with Array as a wrapper around a pyarrow Array/ChunkedArray/maybe something else in the future). This distinction is roughly what both Pandas and Polars do (Arrow C++ also separates this but calls it ArrayData and Array).

I think here I have a different idea of the integration point between this package and other packages in the ecosystem. I wouldn't use the terminology Series because that implies extra levels of abstraction above a contiguous array of geometries. I see this as lower level than Pandas (and I don't think Polars should be considered here at all, because Polars will be much more efficient with a Rust binding).

> I like this, which is similar to what pandas does with str and similar accessors (and what cuDF does for type-specific operations). I would maybe call geo georust but 🤷 .

Maybe rust is better. I'd prefer fewer characters.

paleolimbot · 2023-08-09T19:59:33Z

I wonder if it's too early to scaffold an object-oriented approach to the Array here. At its heart, there are a lot of functions that accept something array-like and return something array-like (e.g., geoarrow.geos.buffer() or geoarrow.geos.length()). It may be that we don't need our own Array subclass here...pyarrow has its own system for dealing with this (the ExtensionArray), as does pandas (the ExtensionArray/accessors) and datafusion and polars and any other dataframe APIs that may pop up.

paleolimbot · 2023-08-09T20:21:20Z

Also feel free to push forward and execute your vision here...a new array interface isn't something I'm all that passionate about but that's not to say it isn't valuable!

kylebarron · 2023-08-09T20:23:10Z

Yeah, I agree it's early and there are a lot of unknowns.

The reason I reach for an object oriented approach is that not all operations are implemented on every geometry type. E.g. linestring simplification might not be implemented for points, or a clustering algorithm might be implemented only for points. And it's nicer to have some IDE hinting for what operations can be used for which data type, especially since in arrow we have known strict typing.

Using arrow objects directly without any wrapping classes loses all typing support.
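As an illustration of the typing argument, a small sketch in which each geometry type only exposes the operations that make sense for it. The classes and the toy `cluster` and `simplify` bodies are hypothetical placeholders, not real algorithms or the API in this PR:

```python
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]


@dataclass
class PointArray:
    coords: List[Coord]

    def cluster(self, cell_size: float) -> List[Tuple[int, int]]:
        # Toy stand-in for a clustering algorithm: label points by grid cell.
        return [(int(x // cell_size), int(y // cell_size)) for x, y in self.coords]


@dataclass
class LineStringArray:
    lines: List[List[Coord]]

    def simplify(self) -> "LineStringArray":
        # Toy stand-in for simplification: keep only each line's endpoints.
        return LineStringArray([[line[0], line[-1]] for line in self.lines])


points = PointArray([(0.5, 0.5), (2.5, 0.1)])
lines = LineStringArray([[(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]])

print(points.cluster(1.0))
print(lines.simplify().lines)
# points.simplify() does not exist, so a type checker or IDE can flag it
# before the code runs.
```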

It might be too early to do any work on a core library. I'll push along my Python bindings to experiment with some different approaches.

kylebarron added 2 commits August 9, 2023 12:19

Initial PointArray class

c4b88a6

use from_pyarrow

f3bfac6

kylebarron changed the title ~~Initial package scaffolding &~~ Initial package scaffolding Aug 9, 2023

kylebarron requested review from jorisvandenbossche and paleolimbot August 9, 2023 16:24

paleolimbot reviewed Aug 9, 2023
