Initial package scaffolding #1

Draft: @kylebarron wants to merge 2 commits into main

Conversation

kylebarron
Member

In geoarrow/geoarrow-rs#140 @paleolimbot and I were talking about how to lay out python geoarrow.

  • Lays out a simple poetry package, where geoarrow.core is defined as an implicit namespace package. So import geoarrow does nothing; import geoarrow.core imports this package.
  • Defines a simple PointArray dataclass that wraps pyarrow. Maybe in the future we can remove pyarrow as a dependency, but for now it's simplest to have.
  • "known" accessors for geo, geos, and proj, which should give typing autocompletions in an IDE as long as those namespace packages are installed.
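
A minimal sketch of the accessor idea described in the list above, not the code in this PR. To stay dependency-free it stores a plain Python sequence where the PR wraps a pyarrow array, and it assumes a hypothetical convention in which each optional namespace package (geoarrow.geos, geoarrow.proj) exposes a class named `Accessor`. The `load_accessor` helper is also hypothetical.

```python
from __future__ import annotations

import importlib
from dataclasses import dataclass
from typing import Any, Sequence, Tuple


def load_accessor(array: Any, module_name: str) -> Any:
    """Import an optional namespace package and bind its accessor to `array`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"The '{module_name}' package is required for this accessor"
        ) from exc
    # Assumption: each subpackage exposes a class named `Accessor`.
    return module.Accessor(array)


@dataclass(frozen=True)
class PointArray:
    # Stand-in for the wrapped pyarrow array; a plain sequence keeps the
    # sketch free of third-party dependencies.
    storage: Sequence[Tuple[float, float]]

    @property
    def geos(self) -> Any:
        # Imported lazily, so geoarrow.core never hard-depends on geoarrow.geos.
        return load_accessor(self, "geoarrow.geos")

    @property
    def proj(self) -> Any:
        return load_accessor(self, "geoarrow.proj")
```

In a real package the `Any` return annotations would presumably be replaced by the accessor classes, imported under `typing.TYPE_CHECKING`, which is what would let an IDE offer completions when the namespace package is installed.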

@kylebarron changed the title from "Initial package scaffolding &" to "Initial package scaffolding" on Aug 9, 2023
Contributor

@paleolimbot left a comment

Take all of my comments with a huge grain of salt...I'm interested in wiring this up so that I can effectively provide components to the system and minimize duplication with R bindings; however, I don't have much of a grasp of the existing APIs (notably, shapely and GeoPandas).

> Lays out a simple poetry package, where geoarrow.core is defined as an implicit namespace package. So import geoarrow does nothing; import geoarrow.core imports this package.

Cool! I like the idea of having a geoarrow-verse. Bindings to the C implementation would be geoarrow.c (right?), which might or might not be used by other components.

> Defines a simple PointArray dataclass that wraps pyarrow. Maybe in the future we can remove pyarrow as a dependency, but for now it's simplest to have.

Even with a pyarrow dependency, I still think we want an abstract "ArrayStorage" class. For pyarrow, this might be an Array or a ChunkedArray. I would personally put all pyarrow-related implementations in import geoarrow.pyarrow (which would take care of other pyarrow-specific details like registering the extension types). If you have to declare exactly one, I'd pick ChunkedArray because Array -> ChunkedArray is always zero copy (but often not the other way around).
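
A rough sketch of what an abstract "ArrayStorage" could look like, under the assumption that a Python list stands in for an Arrow buffer. The class and method names (`ContiguousStorage`, `ChunkedStorage`, `to_chunked`, `to_contiguous`) are hypothetical and are not pyarrow API. It illustrates why Array -> ChunkedArray can always be zero copy while the reverse often is not.

```python
from abc import ABC, abstractmethod
from typing import List


class ArrayStorage(ABC):
    """Abstract storage; backends would subclass this."""

    @abstractmethod
    def chunks(self) -> List[List[float]]:
        """Return the underlying buffers without copying them."""


class ContiguousStorage(ArrayStorage):
    def __init__(self, values: List[float]) -> None:
        self.values = values

    def chunks(self) -> List[List[float]]:
        return [self.values]

    def to_chunked(self) -> "ChunkedStorage":
        # Wrapping the existing buffer in a one-element list copies nothing.
        return ChunkedStorage([self.values])


class ChunkedStorage(ArrayStorage):
    def __init__(self, parts: List[List[float]]) -> None:
        self.parts = parts

    def chunks(self) -> List[List[float]]:
        return self.parts

    def to_contiguous(self) -> ContiguousStorage:
        if len(self.parts) == 1:
            # A single chunk can be reused as-is.
            return ContiguousStorage(self.parts[0])
        # Several chunks have to be concatenated into a new buffer (a copy).
        merged = [value for part in self.parts for value in part]
        return ContiguousStorage(merged)
```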

I would personally use the terminology Series for what you have here (with Array as a wrapper around a pyarrow Array/ChunkedArray/maybe something else in the future). This distinction is roughly what both Pandas and Polars do (Arrow C++ also separates this but calls it ArrayData and Array).

> "known" accessors for geo, geos, and proj, which should give typing autocompletions in an IDE as long as those namespace packages are installed.

I like this, which is similar to what pandas does with str and similar accessors (and what cuDF does for type-specific operations). I would maybe call geo georust but 🤷 .

@kylebarron
Member Author

Thinking about it again, I think the biggest problem with the approach in this PR is that the returned object from the submodule is not necessarily the same class as the core class. I.e. are we going to require that the submodules depend on geoarrow.core and always return core classes?

Maybe it would be better to use structural subtyping and have the core package focus on protocols? Then each package could have its own implementation of a point array if desired, which implements the geoarrow.core.PointArray protocol.
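
A minimal sketch of the structural-subtyping idea using `typing.Protocol`. The protocol name `PointArrayProtocol`, its `coords` method, and the two implementing classes are illustrative assumptions, not an agreed geoarrow.core interface. Neither implementation inherits from the protocol or imports a shared base class.

```python
from typing import List, Protocol, Tuple, runtime_checkable

Coord = Tuple[float, float]


@runtime_checkable
class PointArrayProtocol(Protocol):
    """What a core package could define: behaviour only, no storage."""

    def __len__(self) -> int: ...

    def coords(self) -> List[Coord]: ...


class GeosPointArray:
    """Hypothetical implementation that stores interleaved (x, y) pairs."""

    def __init__(self, xy: List[Coord]) -> None:
        self._xy = list(xy)

    def __len__(self) -> int:
        return len(self._xy)

    def coords(self) -> List[Coord]:
        return list(self._xy)


class RustPointArray:
    """Hypothetical implementation that stores x and y separately."""

    def __init__(self, xs: List[float], ys: List[float]) -> None:
        self._xs, self._ys = list(xs), list(ys)

    def __len__(self) -> int:
        return len(self._xs)

    def coords(self) -> List[Coord]:
        return list(zip(self._xs, self._ys))


def bounds(points: PointArrayProtocol) -> Tuple[float, float, float, float]:
    """Works on any object that satisfies the protocol (non-empty input)."""
    xs, ys = zip(*points.coords())
    return min(xs), min(ys), max(xs), max(ys)
```

Note that `isinstance` checks against a `runtime_checkable` protocol only confirm that the methods exist; signatures are verified by a static type checker, not at runtime.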

> Cool! I like the idea of having a geoarrow-verse. Bindings to the C implementation would be geoarrow.c (right?), which might or might not be used by other components.

Yeah it could be named anything geoarrow.[name].

> Even with a pyarrow dependency, I still think we want an abstract "ArrayStorage" class. For pyarrow, this might be an Array or a ChunkedArray. I would personally put all pyarrow-related implementations in import geoarrow.pyarrow (which would take care of other pyarrow-specific details like registering the extension types). If you have to declare exactly one, I'd pick ChunkedArray because Array -> ChunkedArray is always zero copy (but often not the other way around).

Is your goal to separate pyarrow because of bundle size? Because it's a large dependency that some projects won't want?

These are valid concerns, but I'm not sure what an ArrayStorage class would hold? Or you're saying that's an ABC and you'd have pyarrow storage and nanoarrow storage on top of that?

I think it's important to have Array storage and not just ChunkedArray, because that ensures to the user (developer) that all geometries in this array are in contiguous memory.

> I would personally use the terminology Series for what you have here (with Array as a wrapper around a pyarrow Array/ChunkedArray/maybe something else in the future). This distinction is roughly what both Pandas and Polars do (Arrow C++ also separates this but calls it ArrayData and Array).

I think here I have a different idea of the integration point between this package and other packages in the ecosystem. I wouldn't use the terminology Series because that confers extra levels of abstraction above a contiguous array of geometries. I see this as lower level than Pandas (and I don't think Polars should be considered here at all, because Polars will be much more efficient with a Rust binding).

> I like this, which is similar to what pandas does with str and similar accessors (and what cuDF does for type-specific operations). I would maybe call geo georust but 🤷 .

Maybe rust is better. I'd prefer fewer characters.

@paleolimbot
Contributor

I wonder if it's too early to scaffold an object-oriented approach to the Array here. At its heart, there are a lot of functions that accept something array-like and return something array-like (e.g., geoarrow.geos.buffer() or geoarrow.geos.length()). It may be that we don't need our own Array subclass here...pyarrow has its own system for dealing with this (the ExtensionArray), as does pandas (the ExtensionArray/accessors) and datafusion and polars and any other dataframe APIs that may pop up.
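
To make the function-based alternative concrete, here is a small sketch of a free function that accepts something array-like and returns something array-like, with no wrapper class involved. The `length` function below is a stand-in written for illustration: it computes planar lengths over plain Python sequences and is not the geoarrow.geos.length() mentioned above.

```python
import math
from typing import Iterable, List, Sequence, Tuple

Coord = Tuple[float, float]


def length(linestrings: Iterable[Sequence[Coord]]) -> List[float]:
    """Return the planar length of each linestring in the input."""
    result = []
    for line in linestrings:
        total = 0.0
        # Sum the distance between each pair of consecutive vertices.
        for (x0, y0), (x1, y1) in zip(line, line[1:]):
            total += math.hypot(x1 - x0, y1 - y0)
        result.append(total)
    return result
```

The caller decides how the result is stored afterwards, which is the part that pyarrow, pandas, or another dataframe library would handle in this style of API.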

@paleolimbot
Contributor

Also feel free to push forward and execute your vision here...a new array interface isn't something I'm all that passionate about but that's not to say it isn't valuable!

@kylebarron
Member Author

Yeah, I agree it's early and there are a lot of unknowns.

The reason I reach for an object oriented approach is that not all operations are implemented on every geometry type. E.g. linestring simplification might not be implemented for points, or a clustering algorithm might be implemented only for points. And it's nicer to have some IDE hinting for what operations can be used for which data type, especially since in arrow we have known strict typing.

Using arrow objects directly without any wrapping classes loses all typing support.
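
A sketch of the typing argument, with hypothetical classes and deliberately trivial placeholder algorithms: grid binning stands in for a clustering algorithm, and dropping repeated vertices stands in for linestring simplification. The point is only that each wrapper class exposes the operations defined for its geometry type, so an IDE or type checker can flag `points.simplify()` before the code runs.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]


@dataclass(frozen=True)
class PointArray:
    coords: List[Coord]

    def cluster(self, cell_size: float) -> List[Tuple[int, int]]:
        """Placeholder clustering: label each point with its grid cell."""
        return [
            (math.floor(x / cell_size), math.floor(y / cell_size))
            for x, y in self.coords
        ]


@dataclass(frozen=True)
class LineStringArray:
    lines: List[List[Coord]]

    def simplify(self) -> "LineStringArray":
        """Placeholder simplification: drop consecutive duplicate vertices."""
        simplified = []
        for line in self.lines:
            kept = [
                point
                for i, point in enumerate(line)
                if i == 0 or point != line[i - 1]
            ]
            simplified.append(kept)
        return LineStringArray(simplified)
```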

It might be too early to do any work on a core library. I'll push along my Python bindings to experiment with some different approaches.

2 participants