Parquet Reboot #10602

@mrocklin

Our Parquet performance is bad. I get 20 MB/s in real-world use cases on the cloud, where I would expect 500 MB/s. This accounts for ~80% of our runtime in complex dataframe queries in TPC-H. Systems like P2P, the GIL, Pandas copy-on-write, PyArrow strings, etc., are all inconsequential relative to this performance bottleneck.

Unfortunately, improving this situation is difficult because our current Parquet implementation has many layers and many systems all interwoven with each other. What should be a relatively simple system is today somewhat opaque. I would like for us to consider a rewrite.

There are many things to consider here. I'll follow up with some personal thoughts, and I welcome thoughts from others.

Labels: discussion, needs attention, parquet
