Parquet Reboot #10602
Status: Open
Labels: discussion, needs attention, parquet
Our Parquet performance is bad. I get 20 MB/s in real-world use cases on the cloud where I would expect 500 MB/s. This accounts for ~80% of our runtime in complex dataframe queries in TPC-H. Systems like P2P, the GIL, Pandas copy-on-write, PyArrow strings, etc., are all inconsequential relative to this performance bottleneck.
Unfortunately, improving this situation is difficult because our current Parquet implementation has many layers and many systems all interwoven with each other. What should be a relatively simple system is today somewhat opaque. I would like for us to consider a rewrite.
There are many things to consider here. I'll follow up with some personal thoughts, and I welcome thoughts from others.