Introduce read_parquet_uniform node #732
Draft
rjzamora wants to merge 11 commits into rapidsai:main from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
rjzamora (Member, Author) commented:

/ok to test
Adds a new streaming node for reading Parquet data with uniform chunk distribution. Here, "uniform" does not mean that all chunks will be a uniform size. Rather, it means that every chunk will correspond to the same file count or file fraction (see the sketch below). This approach assumes the files and row-groups in the dataset have a relatively uniform size distribution.
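For intuition, here is a minimal sketch of the "uniform" planning idea, assuming each chunk is described by `(path, start_fraction, end_fraction)` triples. The helper name and structure are illustrative, not the PR's actual implementation:

```python
# Illustrative sketch only -- `plan_uniform_chunks` is a hypothetical helper,
# not the PR's implementation. Every chunk covers the same number of files
# (or the same fraction of a file), regardless of file/row-group sizes.
from collections.abc import Iterator


def plan_uniform_chunks(
    paths: list[str], num_chunks: int
) -> Iterator[list[tuple[str, float, float]]]:
    """Yield chunks as (path, start_fraction, end_fraction) triples."""
    files_per_chunk = len(paths) / num_chunks
    for i in range(num_chunks):
        start, end = i * files_per_chunk, (i + 1) * files_per_chunk
        chunk = []
        j = int(start)
        while j < end and j < len(paths):
            lo = max(start - j, 0.0)  # fraction of file j to skip
            hi = min(end - j, 1.0)    # fraction of file j to include
            chunk.append((paths[j], lo, hi))
            j += 1
        yield chunk


# With 3 files and 2 chunks, each chunk covers 1.5 files:
# chunk 0 -> all of "a.parquet" plus the first half of "b.parquet"
# chunk 1 -> the second half of "b.parquet" plus all of "c.parquet"
for chunk in plan_uniform_chunks(["a.parquet", "b.parquet", "c.parquet"], 2):
    print(chunk)
```

Note that this planning uses only file counts, which is why the approach depends on the dataset having a roughly uniform size distribution.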
Motivation:
- The `read_parquet` node doesn't support this yet, though it certainly can in the future (see #736: Support large-file splitting between ranks in `read_parquet`).

Other considerations:
- The `read_parquet` node may be safer for datasets with non-uniform file and/or row-group sizes.
- The `estimate_target_num_chunks` utility cannot account for the effects of filters. Therefore, each chunk may be significantly smaller than the corresponding `num_rows_per_chunk` argument when a filter is applied at IO time (even if the dataset is "uniform"). This may also be the case for row-count-based chunks (?). See the arithmetic sketch after this list.
- `read_parquet_uniform` is unlikely to provide a measurable performance improvement (more testing is needed to say for sure).
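To make the filter caveat concrete, here is a rough arithmetic sketch. The signature of `estimate_target_num_chunks` below is an assumption for illustration, not the utility's actual API:

```python
# Rough sketch: a metadata-based chunk-count estimate cannot see IO-time
# filters, so realized chunks can be much smaller than requested.
# `estimate_target_num_chunks` is assumed to look roughly like this.
import math


def estimate_target_num_chunks(total_rows: int, num_rows_per_chunk: int) -> int:
    # Uses only pre-filter row counts from Parquet metadata.
    return max(1, math.ceil(total_rows / num_rows_per_chunk))


total_rows = 10_000_000          # rows reported by Parquet metadata
num_rows_per_chunk = 1_000_000   # requested chunk size
num_chunks = estimate_target_num_chunks(total_rows, num_rows_per_chunk)  # 10

# If an IO-time filter keeps only 20% of rows, each chunk ends up with
# roughly 200_000 rows -- well below the requested 1_000_000.
selectivity = 0.2
print(total_rows * selectivity / num_chunks)  # 200000.0
```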