
Conversation

alamb
Contributor

@alamb alamb commented Sep 12, 2025

Which issue does this PR close?

Builds on

Rationale for this change

The current ParquetMetadataDecoder intermixes three things:

  1. The state machine for decoding parquet metadata (footer, then metadata, then (optional) indexes)
  2. Orchestrating IO (i.e. calling read, etc.)
  3. Decoding thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other advanced use cases.

Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract the actual work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder.
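
As a rough sketch of the shape this is heading toward (the decoder type, the method names `try_new`/`try_decode`/`push_range`, and the `DecodeResult::Data` variant are assumptions for illustration here, not necessarily the crate's final API):

```rust
use std::ops::Range;
use bytes::Bytes;

// Hypothetical caller-side loop around the push decoder. The decoder performs
// no IO itself: it reports which byte ranges it still needs, the caller
// fetches and pushes those bytes, and then asks it to decode again.
//
// `fetch` stands in for whatever IO the caller has (file read, object store
// GET, ...); crate imports for the parquet types are elided.
fn decode_metadata(
    file_len: u64,
    mut fetch: impl FnMut(Range<u64>) -> Result<Bytes, ParquetError>,
) -> Result<ParquetMetaData, ParquetError> {
    let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len)?;
    loop {
        match decoder.try_decode()? {
            // The decoder needs more bytes before it can make progress
            DecodeResult::NeedsData(ranges) => {
                for range in ranges {
                    let data = fetch(range.clone())?;
                    decoder.push_range(range, data)?;
                }
            }
            // Everything needed has been pushed; metadata is fully decoded
            DecodeResult::Data(metadata) => return Ok(metadata),
        }
    }
}
```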

What changes are included in this PR?

  1. Extract decoding state machine into PushMetadataDecoder
  2. Extract thrift parsing into its own parser module
  3. Update ParquetMetadataDecoder to use the PushMetadataDecoder
  4. Extract the bytes --> object code into its own module

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label on Sep 12, 2025
@alamb alamb changed the title from "Alamb/refactor push decoder" to "Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder" on Sep 12, 2025
@etseidl
Contributor

etseidl commented Sep 12, 2025

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

@alamb
Contributor Author

alamb commented Sep 12, 2025

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

@etseidl
Contributor

etseidl commented Sep 12, 2025

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.

@etseidl
Contributor

etseidl commented Sep 15, 2025

I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀

}
}

pub(crate) fn parse_column_index(
Contributor

One option to consider is to move the column and offset index handling to file/metadata/page_index/index_reader.rs. Or I can do that later as part of the thrift remodel. That would keep all the page index parsing in one place.

Contributor Author

That would be great -- I don't know why it is here, but I am quite happy to put it elsewhere if that makes sense

Contributor

Hmm, there are details (the parse_xxx_index methods need access to private fields in the metadata). I guess leave it here for now.

Contributor Author

@etseidl -- I spent a bit more time on this one, and I encountered a challenge with encryption -- namely, there are now two parallel code paths: one for encrypted data and one for non-encrypted data.

The only way I can come up with to avoid replicating the same pattern (🤮) is to change the free functions into methods on some sort of struct that can pass the state along via self.

Something like, instead of

pub(crate) fn parse_column_index(
    metadata: &mut ParquetMetaData,
    column_index_policy: PageIndexPolicy,
    bytes: &Bytes,
    start_offset: u64,
) -> crate::errors::Result<()> {

like

struct MetadataParser {
    // page index policy
    page_index_policy: PageIndexPolicy,
    // ... other state fields like crypto policies
}

impl MetadataParser {
    fn parse_column_index(
        &self,
        metadata: &mut ParquetMetaData,
        bytes: &Bytes,
        start_offset: u64,
    ) -> crate::errors::Result<()> {

However, I want to coordinate to avoid a massive merge conflict.

For now, I plan to just keep the same pattern

Contributor

Somewhat parallel, but they for the most part converge eventually (except for the main ParquetMetaData parser).

Do you think encryption support will always be feature gated? Or will there eventually be just the one path? I wouldn't want to get too fancy if the latter.

Contributor Author

I think part of it will always be feature gated, because of the additional dependencies encryption brings in. However, we could potentially feature gate less of the actual code (maybe only feature gate the actual decryption calls 🤔 )

I was also thinking that having some sort of stateful struct while decoding metadata could be useful for other reasons (for example, specifying which columns to decode statistics for).
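
To make the "only feature gate the actual decryption calls" idea a bit more concrete, here is a hedged sketch assuming the `MetadataParser`-style struct discussed above; `FileDecryptor` and its `decrypt` method are placeholders rather than the crate's actual encryption API, and crate imports are elided:

```rust
use bytes::Bytes;

struct MetadataParser {
    // existing crate type for controlling page index decoding
    page_index_policy: PageIndexPolicy,
    // only the crypto state itself is behind the feature flag
    #[cfg(feature = "encryption")]
    file_decryptor: Option<FileDecryptor>,
}

impl MetadataParser {
    // Decrypt page index bytes when a decryptor is configured; otherwise
    // (or when the `encryption` feature is disabled) pass them through, so
    // the surrounding parsing code path stays the same for both cases.
    fn maybe_decrypt(&self, bytes: &Bytes) -> crate::errors::Result<Bytes> {
        #[cfg(feature = "encryption")]
        if let Some(decryptor) = &self.file_decryptor {
            return decryptor.decrypt(bytes); // placeholder call
        }
        Ok(bytes.clone())
    }
}
```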

return Ok(DecodeResult::NeedsData(vec![file_len - 8..file_len]));
let footer_len = FOOTER_SIZE as u64;
loop {
match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
Contributor Author

I am quite pleased with how this decoder state machine is looking
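
For readers following along, the snippet above is the usual "take the state out, decide, put the next state back" loop. A stripped-down, self-contained illustration of that shape (the states, results, and transitions here are simplified stand-ins, not the PR's actual logic):

```rust
use std::ops::Range;

enum DecodeState {
    // Waiting for the last 8 bytes of the file (metadata length + magic)
    ReadingFooter,
    // Waiting for the thrift-encoded metadata bytes
    ReadingMetadata { len: u64 },
    // Placeholder left behind while a match arm owns the real state by value
    Intermediate,
    Finished,
}

enum DecodeResult {
    // Caller must push these byte ranges and call `try_decode` again
    NeedsData(Vec<Range<u64>>),
    Done,
}

struct PushDecoder {
    state: DecodeState,
    file_len: u64,
}

impl PushDecoder {
    fn try_decode(&mut self) -> DecodeResult {
        loop {
            // Take ownership of the current state, leaving a cheap placeholder,
            // so each arm can consume the state by value and pick the next one.
            match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
                DecodeState::ReadingFooter => {
                    // The real decoder parses the footer once its bytes have
                    // been pushed; this sketch just requests them and advances.
                    self.state = DecodeState::ReadingMetadata { len: 0 };
                    return DecodeResult::NeedsData(vec![self.file_len - 8..self.file_len]);
                }
                DecodeState::ReadingMetadata { len } => {
                    // Real decoder: thrift-decode `len` bytes of metadata here.
                    let _ = len;
                    self.state = DecodeState::Finished;
                    // No return: loop again so the Finished arm reports completion.
                }
                DecodeState::Finished => {
                    self.state = DecodeState::Finished;
                    return DecodeResult::Done;
                }
                DecodeState::Intermediate => unreachable!("decoder left in intermediate state"),
            }
        }
    }
}
```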

@alamb alamb force-pushed the alamb/refactor_push_decoder branch from 86cdf90 to c9ba4e0 on September 23, 2025 19:27
@alamb alamb force-pushed the alamb/refactor_push_decoder branch from 0b49853 to e8ff5cb on September 24, 2025 18:00
@alamb alamb force-pushed the alamb/refactor_push_decoder branch from e8ff5cb to fc2fd81 on September 24, 2025 18:23
@alamb
Contributor Author

alamb commented Sep 24, 2025

Ok, I am now pretty happy with this PR and how it looks. I broke it up into a few PRs to make reviews easier

You can see the results in this PR as the last commit

If/when those PRs are merged I'll rebase this one and mark it as ready for review

@alamb
Contributor Author

alamb commented Sep 25, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (fc2fd81) to 3027dbc diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

alamb added a commit that referenced this pull request Sep 25, 2025
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it
easier to review

Note: while this is a large (in line count) code change, it should be
relatively easy to review as it is just moving code around.

# Rationale for this change

In #8340 I am trying to split the
"IO" from the "where is the metadata in the file" from the "decode
thrift into Rust structures" logic. The first part of this is simply to
move the code that handles the "decode thrift into Rust structures" into
its own module.


# What changes are included in this PR?

1. Move most of the "parse thrift bytes into Rust structures" code from
`parquet/src/file/metadata/mod.rs` to
`parquet/src/file/metadata/parser.rs`

# Are these changes tested?

yes, by CI


# Are there any user-facing changes?

No, this is entirely internal reorganization

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
alamb added a commit that referenced this pull request Sep 25, 2025
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it
easier to review

# Rationale for this change

In #8340 I am trying to split the
"IO" from the "where is the metadata in the file" from the "decode
thrift into Rust structures" logic.

I want to make it as easy as possible to review so I split it into
pieces, but you can see #8340 for
how it all fits together

# What changes are included in this PR?

This PR moves the code that handles parsing the 8-byte parquet file
footer, `FooterTail`, into its own module and constructor.
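
For context, the trailing 8 bytes of a plaintext-footer parquet file are a 4-byte little-endian metadata length followed by the 4-byte magic `PAR1`; a minimal sketch of parsing them (field and method names here are illustrative, not necessarily the crate's actual `FooterTail` API, and the separate magic used by encrypted footers is not handled):

```rust
struct FooterTail {
    metadata_len: u32,
}

impl FooterTail {
    // Parse the last 8 bytes of a parquet file: 4-byte little-endian
    // metadata length followed by the 4-byte magic "PAR1".
    fn try_new(tail: &[u8; 8]) -> Result<Self, String> {
        if &tail[4..] != b"PAR1" {
            return Err("not a parquet file: bad footer magic".to_string());
        }
        let metadata_len = u32::from_le_bytes(tail[..4].try_into().unwrap());
        Ok(Self { metadata_len })
    }
}
```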

# Are these changes tested?

yes, by CI


# Are there any user-facing changes?

No, this is entirely internal reorganization and I left a `pub use`

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
Labels
parquet (Changes to the parquet crate)
Development

Successfully merging this pull request may close these issues.

[Parquet] Split ParquetMetadataReader into IO/decoder state machine and thrift parsing
2 participants