Introduce read_parquet_uniform node #732
Draft
rjzamora wants to merge 11 commits into rapidsai:main from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
rjzamora (Member, Author) commented:

/ok to test
Adds a new streaming node for reading Parquet data with uniform chunk distribution. Here, "uniform" does not mean that all chunks will be a uniform size. Rather, it means that every chunk will correspond to the same file count or file fraction (see the sketch below). This approach assumes the files and row-groups in the dataset have a relatively uniform size distribution.
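For intuition, here is a minimal sketch of the "uniform" planning idea, assuming each chunk is described by `(path, start_fraction, end_fraction)` triples. The helper name and structure are illustrative, not the PR's actual implementation:

```python
# Illustrative sketch only -- `plan_uniform_chunks` is a hypothetical helper,
# not the PR's implementation. Every chunk covers the same number of files
# (or the same fraction of a file), regardless of file/row-group sizes.
from collections.abc import Iterator


def plan_uniform_chunks(
    paths: list[str], num_chunks: int
) -> Iterator[list[tuple[str, float, float]]]:
    """Yield chunks as (path, start_fraction, end_fraction) triples."""
    files_per_chunk = len(paths) / num_chunks
    for i in range(num_chunks):
        start, end = i * files_per_chunk, (i + 1) * files_per_chunk
        chunk = []
        j = int(start)
        while j < end and j < len(paths):
            lo = max(start - j, 0.0)  # fraction of file j to skip
            hi = min(end - j, 1.0)    # fraction of file j to include
            chunk.append((paths[j], lo, hi))
            j += 1
        yield chunk


# With 3 files and 2 chunks, each chunk covers 1.5 files:
# chunk 0 -> all of "a.parquet" plus the first half of "b.parquet"
# chunk 1 -> the second half of "b.parquet" plus all of "c.parquet"
for chunk in plan_uniform_chunks(["a.parquet", "b.parquet", "c.parquet"], 2):
    print(chunk)
```

Note that this planning uses only file counts, which is why the approach depends on the dataset having a roughly uniform size distribution.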
Motivation:
- The `read_parquet` node doesn't support this yet, though it certainly can in the future (see #736: Support large-file splitting between ranks in `read_parquet`).

Other considerations:
- The `read_parquet` node may be safer for datasets with non-uniform file and/or row-group sizes.
- The `estimate_target_num_chunks` utility cannot account for the effects of filters. Therefore, each chunk may be significantly smaller than the corresponding `num_rows_per_chunk` argument when a filter is applied at IO time (even if the dataset is "uniform"). This may also be the case for row-count-based chunks (?). See the arithmetic sketch after this list.
- `read_parquet_uniform` is unlikely to provide a measurable performance improvement (more testing is needed to say for sure).
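To make the filter caveat concrete, here is a rough arithmetic sketch. The signature of `estimate_target_num_chunks` below is an assumption for illustration, not the utility's actual API:

```python
# Rough sketch: a metadata-based chunk-count estimate cannot see IO-time
# filters, so realized chunks can be much smaller than requested.
# `estimate_target_num_chunks` is assumed to look roughly like this.
import math


def estimate_target_num_chunks(total_rows: int, num_rows_per_chunk: int) -> int:
    # Uses only pre-filter row counts from Parquet metadata.
    return max(1, math.ceil(total_rows / num_rows_per_chunk))


total_rows = 10_000_000          # rows reported by Parquet metadata
num_rows_per_chunk = 1_000_000   # requested chunk size
num_chunks = estimate_target_num_chunks(total_rows, num_rows_per_chunk)  # 10

# If an IO-time filter keeps only 20% of rows, each chunk ends up with
# roughly 200_000 rows -- well below the requested 1_000_000.
selectivity = 0.2
print(total_rows * selectivity / num_chunks)  # 200000.0
```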