Hello @xando! Thanks for creating this project. As one of the original developers of the google-cloud-bigquery-storage package, this caught my eye. I'm interested in contributing as I find capacity, and wanted to ask whether you're open to some larger changes to this project.
Interface
I'm curious if you've explored the pyarrow Tabular Dataset interface? It provides some really handy features that seem like they'd be a good fit for the BigQuery Storage Read API (see the sketch after this list):
- A unified interface that supports different sources, file formats, and file systems (local and cloud).
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization).
- Optimized reading with predicate pushdown (filtering rows), projection (selecting and deriving columns), and optionally parallel reading.
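For context, this is roughly how those features look today for file-based sources (the path and column names below are hypothetical). A BigQuery Storage Read API source could plausibly plug into the same `to_table(columns=..., filter=...)` surface:

```python
import pyarrow.dataset as ds

# Open a directory of Parquet files as one logical dataset
# ("/data/events" is a made-up local path for illustration).
dataset = ds.dataset("/data/events", format="parquet")

# Projection and predicate pushdown: only the selected columns are read,
# and file/row-group metadata is used to skip data that can't match.
table = dataset.to_table(
    columns=["user_id", "country"],
    filter=ds.field("country") == "US",
)
```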
That said, I'm not sure how much of this could be done in this package versus what would have to happen in the Arrow C++ library, since I'm not seeing extension documentation for Dataset sources in the official pyarrow docs.
Performance
I ran some code using pyarrow-bigquery through the py-spy sampling profiler and found that roughly a third of the time (estimated) is spent writing feather data to disk for worker communication.
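For reproducibility, this is roughly the setup I used. The script body below is a stand-in for my actual benchmark, assuming the package's top-level `read_table` entry point (if I'm reading the README right) and a hypothetical table name; `--subprocesses` is needed so the worker processes show up in the flame graph:

```python
# Profiled with:
#   py-spy record --subprocesses -o profile.svg -- python bench.py

# bench.py -- minimal stand-in for the code I profiled.
import pyarrow.bigquery as bq

table = bq.read_table("my-project.my_dataset.my_table")
print(table.num_rows)
```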
I also see a lot of time in "grpc spin". I think that just means waiting around for data, which might imply some other bottleneck is preventing gRPC from reading faster. Some of this may be the GIL, since the "process" worker does show a speedup in my tests.
Any thoughts on creating a Cython version (wrapping the C++ client) or a PyO3 version (wrapping one of the community Rust clients) that could avoid some of the overhead Python introduces, especially around sharing data across the workers?
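On that last point: even before any native rewrite, it might be possible to avoid the feather-to-disk round-trip at the Python level. Here's a rough sketch (the helper names are mine, not the package's) of handing Arrow IPC bytes between workers via `multiprocessing.shared_memory` instead of temp files:

```python
from multiprocessing import shared_memory

import pyarrow as pa


def table_to_shm(table: pa.Table, name: str) -> int:
    """Serialize a table to the Arrow IPC stream format inside a named
    shared memory segment; returns the number of bytes written."""
    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, table.schema) as writer:
        writer.write_table(table)
    buf = sink.getvalue()
    shm = shared_memory.SharedMemory(name=name, create=True, size=buf.size)
    shm.buf[: buf.size] = buf.to_pybytes()
    shm.close()
    return buf.size


def table_from_shm(name: str, size: int) -> pa.Table:
    """Read the IPC stream back out of shared memory in the consumer
    process, without the data ever touching disk."""
    shm = shared_memory.SharedMemory(name=name)
    # py_buffer wraps the memoryview without copying; the resulting table
    # references this memory, so only unlink the segment once you're done.
    reader = pa.ipc.open_stream(pa.py_buffer(shm.buf[:size]))
    return reader.read_all()
```

The same idea could be applied per read stream, with each worker writing its batches into its own segment rather than a feather file.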
Thanks for your consideration,
Tim
