Hello @xando! Thanks for creating this project. As one of the original developers of the google-cloud-bigquery-storage package, this caught my eye. I'm interested in contributing as I find capacity, and wanted to ask whether you're open to some larger changes to this project.
Interface
I'm curious if you've explored the pyarrow Tabular Dataset interface? It provides some really handy features that seem like they'd be a good fit for the BigQuery Storage Read API (see the sketch after this list):
- A unified interface that supports different sources, file formats, and file systems (local and cloud).
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization).
- Optimized reading with predicate pushdown (filtering rows), projection (selecting and deriving columns), and optionally parallel reading.
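For context, this is roughly how those features look today for file-based sources (the path and column names below are hypothetical). A BigQuery Storage Read API source could plausibly plug into the same `to_table(columns=..., filter=...)` surface:

```python
import pyarrow.dataset as ds

# Open a directory of Parquet files as one logical dataset
# ("/data/events" is a made-up local path for illustration).
dataset = ds.dataset("/data/events", format="parquet")

# Projection and predicate pushdown: only the selected columns are read,
# and file/row-group metadata is used to skip data that can't match.
table = dataset.to_table(
    columns=["user_id", "country"],
    filter=ds.field("country") == "US",
)
```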
That said, I'm not sure how much of this could be done in this package versus what would have to happen in the Arrow C++ library, since I'm not seeing extension documentation for Dataset sources in the official pyarrow docs.
Performance
I ran some code using pyarrow-bigquery through the py-spy sampling profiler and found that roughly a third of the time (estimated) is spent writing feather data to disk for worker communication.
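For reproducibility, this is roughly the setup I used. The script body below is a stand-in for my actual benchmark, assuming the package's top-level `read_table` entry point (if I'm reading the README right) and a hypothetical table name; `--subprocesses` is needed so the worker processes show up in the flame graph:

```python
# Profiled with:
#   py-spy record --subprocesses -o profile.svg -- python bench.py

# bench.py -- minimal stand-in for the code I profiled.
import pyarrow.bigquery as bq

table = bq.read_table("my-project.my_dataset.my_table")
print(table.num_rows)
```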
I also see a lot of time in "grpc spin". I think that just means waiting around for data, which might imply some other bottleneck is preventing gRPC from reading faster. Some of this may be the GIL, since the "process" worker does show a speedup in my tests.
Any thoughts on creating a Cython version (wrapping the C++ client) or a PyO3 version (wrapping one of the community Rust clients) that could avoid some of the overhead Python introduces, especially around sharing data across the workers?
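On that last point: even before any native rewrite, it might be possible to avoid the feather-to-disk round-trip at the Python level. Here's a rough sketch (the helper names are mine, not the package's) of handing Arrow IPC bytes between workers via `multiprocessing.shared_memory` instead of temp files:

```python
from multiprocessing import shared_memory

import pyarrow as pa


def table_to_shm(table: pa.Table, name: str) -> int:
    """Serialize a table to the Arrow IPC stream format inside a named
    shared memory segment; returns the number of bytes written."""
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()
    shm = shared_memory.SharedMemory(name=name, create=True, size=buf.size)
    shm.buf[: buf.size] = buf.to_pybytes()
    shm.close()
    return buf.size


def table_from_shm(name: str, size: int) -> pa.Table:
    """Read the IPC stream back out of shared memory in the consumer
    process, without the data ever touching disk."""
    shm = shared_memory.SharedMemory(name=name)
    # py_buffer wraps the memoryview without copying; the resulting table
    # references this memory, so only unlink the segment once you're done.
    reader = pa.ipc.open_stream(pa.py_buffer(shm.buf[:size]))
    return reader.read_all()
```

The same idea could be applied per read stream, with each worker writing its batches into its own segment rather than a feather file.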
Thanks for your consideration,
Tim
