Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder #8340
Conversation
I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅
Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.
Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.
I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀
}
}

pub(crate) fn parse_column_index(
One option to consider is to move the column and offset index handling to `file/metadata/page_index/index_reader.rs`. Or I can do that later as part of the thrift remodel. That would keep all the page index parsing in one place.
That would be great -- I don't know why it is here, but I am quite happy to put it elsewhere if that makes sense
Hmm, there are details (the `parse_xxx_index` methods need access to private fields in the metadata). I guess leave it here for now.
@etseidl -- I spent a bit more time on this one, and I encountered a challenge with encryption -- namely there are now 2 parallel codepaths -- one for encrypted data and one for non encrypted data.
The only way I can come up with to avoid replicating the same pattern (🤮) is to change the free functions into some sort of struct that can pass the state along with self.
Something like, instead of
pub(crate) fn parse_column_index(
    metadata: &mut ParquetMetaData,
    column_index_policy: PageIndexPolicy,
    bytes: &Bytes,
    start_offset: u64,
) -> crate::errors::Result<()> {
like
struct MetadataParser {
    // page index policy:
    page_index_policy: PageIndexPolicy,
    // ... other state fields like crypto policies
}
...
impl MetadataParser {
    fn parse_column_index(
        &self,
        metadata: &mut ParquetMetaData,
        bytes: &Bytes,
        start_offset: u64,
    ) -> crate::errors::Result<()> {
However, I want to coordinate to avoid a massive merge conflict.
For now, I plan to just keep the same pattern
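To make the proposal above concrete, here is a minimal, self-contained sketch of the "stateful parser struct" idea. Everything in it is an assumption for illustration: `PageIndexPolicy` and `ParquetMetaData` are simplified stand-ins rather than the crate's real types, the error type is a plain `String`, and the "parsing" just copies bytes. The only point is that the policy travels with `&self` instead of being passed as an argument to every function.

```rust
/// Hypothetical stand-in for the crate's `PageIndexPolicy`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum PageIndexPolicy {
    Skip,
    Optional,
    Required,
}

/// Hypothetical stand-in for `ParquetMetaData`; it only records the raw index bytes.
#[derive(Debug, Default)]
struct ParquetMetaData {
    column_index: Option<Vec<u8>>,
}

/// Holds decoding options that would otherwise be threaded through every free function.
struct MetadataParser {
    page_index_policy: PageIndexPolicy,
    // ... other state fields, e.g. crypto policies, could live here
}

impl MetadataParser {
    /// "Parses" the column index found at `start_offset` (here: copies the bytes).
    fn parse_column_index(
        &self,
        metadata: &mut ParquetMetaData,
        bytes: &[u8],
        start_offset: u64,
    ) -> Result<(), String> {
        // The policy comes from `self`, not from an extra argument.
        if self.page_index_policy == PageIndexPolicy::Skip {
            return Ok(());
        }
        match bytes.get(start_offset as usize..) {
            Some(slice) if !slice.is_empty() => {
                metadata.column_index = Some(slice.to_vec());
                Ok(())
            }
            // A missing index is only an error when the policy demands one.
            _ if self.page_index_policy == PageIndexPolicy::Required => {
                Err(format!("missing column index at offset {start_offset}"))
            }
            _ => Ok(()),
        }
    }
}
```

With this shape, adding another option (for example a decryption policy) would mean adding a field to `MetadataParser` rather than changing the signature of every parse function.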
Somewhat parallel, but they for the most part converge eventually (except for the main `ParquetMetaData` parser).
Do you think encryption support will always be feature gated? Or will there eventually be just the one path? I wouldn't want to get too fancy if the latter.
I think part of it will always be feature gated, because of the additional dependencies encryption brings in. However, we could potentially feature gate less of the actual code (maybe only feature gate the actual decryption calls 🤔 )
I was also thinking having some sort of stateful struct while decoding metadata could be useful for other reasons (for example, specifying for which columns to decode the statistics)
return Ok(DecodeResult::NeedsData(vec![file_len - 8..file_len]));
let footer_len = FOOTER_SIZE as u64;
loop {
    match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
I am quite pleased with how this decoder state machine is looking
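For readers unfamiliar with the `std::mem::replace(&mut self.state, DecodeState::Intermediate)` pattern in the diff above, the following is a simplified sketch of how such a push-style state machine can work. It is not the PR's implementation: `SketchPushDecoder`, its states, and the `String` error type are assumptions, and the "decoded metadata" is just the raw bytes. It does rely on one fact about the format: a Parquet file ends with a 4-byte little-endian metadata length followed by the magic `PAR1`.

```rust
use std::ops::Range;

/// 4-byte little-endian metadata length + 4-byte magic "PAR1".
const FOOTER_SIZE: u64 = 8;

#[derive(Debug, PartialEq)]
enum DecodeResult {
    /// The caller must push these byte ranges, then call `try_decode` again.
    NeedsData(Vec<Range<u64>>),
    /// Decoding finished; in this sketch the "metadata" is just the raw bytes.
    Data(Vec<u8>),
}

#[derive(Debug)]
enum DecodeState {
    ReadingFooter,
    ReadingMetadata { metadata_len: u64 },
    Finished,
    /// Placeholder left in `self.state` while a transition is in progress.
    Intermediate,
}

struct SketchPushDecoder {
    file_len: u64,
    /// (file offset of the first byte, bytes) for every buffer pushed so far.
    buffers: Vec<(u64, Vec<u8>)>,
    state: DecodeState,
}

impl SketchPushDecoder {
    fn new(file_len: u64) -> Self {
        Self { file_len, buffers: Vec::new(), state: DecodeState::ReadingFooter }
    }

    /// Hands bytes for `range` to the decoder; no IO happens inside the decoder.
    fn push_range(&mut self, range: Range<u64>, data: Vec<u8>) {
        debug_assert_eq!(range.end - range.start, data.len() as u64);
        self.buffers.push((range.start, data));
    }

    /// Returns a copy of `range` if a single pushed buffer fully covers it.
    fn get(&self, range: &Range<u64>) -> Option<Vec<u8>> {
        self.buffers.iter().find_map(|(start, data)| {
            let end = start + data.len() as u64;
            (*start <= range.start && range.end <= end).then(|| {
                data[(range.start - start) as usize..(range.end - start) as usize].to_vec()
            })
        })
    }

    fn try_decode(&mut self) -> Result<DecodeResult, String> {
        if self.file_len < FOOTER_SIZE {
            return Err("file too small for a footer".to_string());
        }
        let footer_start = self.file_len - FOOTER_SIZE;
        loop {
            // Take ownership of the current state; each arm must store the next one.
            match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
                DecodeState::ReadingFooter => {
                    let range = footer_start..self.file_len;
                    let Some(footer) = self.get(&range) else {
                        self.state = DecodeState::ReadingFooter;
                        return Ok(DecodeResult::NeedsData(vec![range]));
                    };
                    if footer[4..] != b"PAR1"[..] {
                        return Err("bad footer magic".to_string());
                    }
                    let len = u32::from_le_bytes([footer[0], footer[1], footer[2], footer[3]]);
                    self.state = DecodeState::ReadingMetadata { metadata_len: len as u64 };
                }
                DecodeState::ReadingMetadata { metadata_len } => {
                    if metadata_len > footer_start {
                        return Err("metadata length exceeds file size".to_string());
                    }
                    let range = footer_start - metadata_len..footer_start;
                    let Some(bytes) = self.get(&range) else {
                        self.state = DecodeState::ReadingMetadata { metadata_len };
                        return Ok(DecodeResult::NeedsData(vec![range]));
                    };
                    self.state = DecodeState::Finished;
                    return Ok(DecodeResult::Data(bytes));
                }
                DecodeState::Finished => {
                    self.state = DecodeState::Finished;
                    return Err("decoder already finished".to_string());
                }
                DecodeState::Intermediate => {
                    return Err("decoder is in an invalid state after an earlier error".to_string());
                }
            }
        }
    }
}
```

One consequence of this design is that any early `return Err(..)` leaves the decoder in `Intermediate`, so a decoder that has failed once keeps reporting an error instead of silently resuming from a half-updated state.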
Ok, I am now pretty happy with this PR and how it looks. I broke it up into a few PRs to make reviews easier.
You can see the results in this PR as the last commit. If/when those PRs are merged, I'll rebase this one and mark it as ready for review.
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it easier to review

Note while this is a large (in line count) code change, it should be relatively easy to review as it is just moving code around

# Rationale for this change

In #8340 I am trying to split the "IO" from the "where is the metadata in the file" from the "decode thrift into Rust structures" logic. The first part of this is simply to move the code that handles the "decode thrift into Rust structures" into its own module.

# What changes are included in this PR?

1. Move most of the "parse thrift bytes into rust structure" code from `parquet/src/file/metadata/mod.rs` to `parquet/src/file/metadata/parser.rs`

# Are these changes tested?

yes, by CI

# Are there any user-facing changes?

No, this is entirely internal reorganization

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it easier to review

# Rationale for this change

In #8340 I am trying to split the "IO" from the "where is the metadata in the file" from the "decode thrift into Rust structures" logic. I want to make it as easy as possible to review so I split it into pieces, but you can see #8340 for how it all fits together

# What changes are included in this PR?

This PR moves the code that handles parsing the 8 byte parquet file footer, `FooterTail`, into its own module and constructor

# Are these changes tested?

yes, by CI

# Are there any user-facing changes?

No, this is entirely internal reorganization and I left a `pub use`

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
Which issue does this PR close?
ParquetMetadataReader into IO/decoder state machine and thrift parsing #8439
Builds on
Rationale for this change
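The current ParquetMetadataDecoder intermixes three things: the IO, the logic for where the metadata is in the file, and the decoding of thrift into Rust structures.
This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other potentially advanced use cases.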
Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract out the actual work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder
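The layering described above, where an IO-performing reader simply drives a push decoder, can be sketched as a small loop. The `PushDecode` trait, `decode_from_slice`, and the toy `TailDecoder` below are all hypothetical names introduced for illustration, and an in-memory slice stands in for real IO; the crate's actual reader and decoder types differ.

```rust
use std::ops::Range;

/// Outcome of one decode attempt (simplified; the error type is a plain `String`).
enum DecodeResult<T> {
    NeedsData(Vec<Range<u64>>),
    Data(T),
}

/// Hypothetical trait capturing the push-style contract: the decoder never does IO itself.
trait PushDecode {
    type Output;
    fn push_ranges(&mut self, ranges: Vec<Range<u64>>, buffers: Vec<Vec<u8>>);
    fn try_decode(&mut self) -> Result<DecodeResult<Self::Output>, String>;
}

/// The "reader" side: owns the IO (here, slicing an in-memory file) and feeds the
/// decoder whatever ranges it asks for until decoding completes.
fn decode_from_slice<D: PushDecode>(decoder: &mut D, file: &[u8]) -> Result<D::Output, String> {
    loop {
        match decoder.try_decode()? {
            DecodeResult::Data(output) => return Ok(output),
            DecodeResult::NeedsData(ranges) => {
                let mut buffers = Vec::with_capacity(ranges.len());
                for r in &ranges {
                    let slice = file
                        .get(r.start as usize..r.end as usize)
                        .ok_or_else(|| format!("range {r:?} is out of bounds"))?;
                    buffers.push(slice.to_vec());
                }
                decoder.push_ranges(ranges, buffers);
            }
        }
    }
}

/// Toy decoder used only to exercise the loop: it asks for the final `n` bytes
/// of the file and returns them unchanged.
struct TailDecoder {
    file_len: u64,
    n: u64,
    tail: Option<Vec<u8>>,
}

impl TailDecoder {
    fn wanted(&self) -> Range<u64> {
        self.file_len.saturating_sub(self.n)..self.file_len
    }
}

impl PushDecode for TailDecoder {
    type Output = Vec<u8>;

    fn push_ranges(&mut self, ranges: Vec<Range<u64>>, buffers: Vec<Vec<u8>>) {
        let wanted = self.wanted();
        for (range, buffer) in ranges.into_iter().zip(buffers) {
            if range == wanted {
                self.tail = Some(buffer);
            }
        }
    }

    fn try_decode(&mut self) -> Result<DecodeResult<Vec<u8>>, String> {
        match self.tail.take() {
            Some(tail) => Ok(DecodeResult::Data(tail)),
            None => Ok(DecodeResult::NeedsData(vec![self.wanted()])),
        }
    }
}
```

Because the loop only depends on the trait, the same decoder could in principle be driven by a synchronous file reader, an async object-store client, or a test harness, which is the flexibility the description is aiming for.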
What changes are included in this PR?
parser module
This almost certainly will conflict with @etseidl's plans in thrift-remodel.
Are these changes tested?
by existing tests
Are there any user-facing changes?
Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.