Parquet Reboot #10602
Status: Open
Labels: discussion, needs attention, parquet
Our Parquet performance is bad. I get 20 MB/s in real-world use cases on the cloud where I would expect 500 MB/s. This accounts for ~80% of our runtime in complex dataframe queries in TPC-H. Systems like P2P, the GIL, Pandas copy-on-write, PyArrow strings, etc., are all inconsequential relative to this performance bottleneck.
Unfortunately, improving this situation is difficult because our current Parquet implementation has many layers and many systems all interwoven with each other. What should be a relatively simple system is today somewhat opaque. I would like for us to consider a rewrite.
There are many things to consider here. I'll follow up with some personal thoughts, and I welcome thoughts from others.