questions on the longterm roadmap #2

@tswast

Hello @xando! Thanks for creating this project. As one of the original developers of the google-cloud-bigquery-storage package, this caught my eye. I'm interested in contributing as I find capacity, and had a few questions about whether you're open to some larger changes to this project.

Interface

I'm curious if you've explored the pyarrow Tabular Dataset interface? It provides some really handy features that seem like they'd be a good fit with the BigQuery Storage Read API interface:

- A unified interface that supports different sources and file formats and different file systems (local, cloud).
- Discovery of sources (crawling directories, handling directory-based partitioned datasets, basic schema normalization).
- Optimized reading with predicate pushdown (filtering rows), projection (selecting and deriving columns), and optionally parallel reading.

That said, I'm not sure how much could be done in this package versus what would have to be done in the Arrow C++ package, as I'm not seeing any information about extending Dataset in the official pyarrow docs.

Performance

I ran some code using pyarrow-bigquery through the py-spy sampling profiler and found that an estimated 1/3 of the time is spent writing Feather data to disk for worker communication:

[Image: sampling profiler output]

I also see a lot of "grpc spin". I think that just means waiting around for data, which might imply some other bottleneck is preventing gRPC from reading data. Some of this may be the GIL, since the "process" worker does show a speedup in my tests.

Any thoughts on creating a Cython (wrapping the C++ client) or PyO3 (wrapping one of the community Rust clients) version that could potentially avoid some of the slowness Python introduces, especially around sharing data across the workers?

Thanks for your consideration,

Tim
