
Conversation

alamb
Contributor

@alamb alamb commented Sep 12, 2025

Which issue does this PR close?

Builds on

Rationale for this change

The current ParquetMetadataDecoder intermixes three things:

  1. The state machine for decoding parquet metadata (footer, then metadata, then (optional) indexes)
  2. Orchestrating IO (i.e. calling read, etc.)
  3. Decoding thrift-encoded bytes into objects

This makes it almost impossible to add features like "only decode a subset of the columns in the ColumnIndex" and other advanced use cases.

Now that we have a "push" style API for metadata decoding that avoids IO, the next step is to extract the actual work into this API so that the existing ParquetMetadataDecoder just calls into the PushDecoder.
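
As a rough sketch of the shape this is heading toward (the decoder type, the method names `try_new`/`try_decode`/`push_range`, and the `DecodeResult::Data` variant are assumptions for illustration here, not necessarily the crate's final API):

```rust
use std::ops::Range;
use bytes::Bytes;

// Hypothetical caller-side loop around the push decoder. The decoder performs
// no IO itself: it reports which byte ranges it still needs, the caller
// fetches and pushes those bytes, and then asks it to decode again.
//
// `fetch` stands in for whatever IO the caller has (file read, object store
// GET, ...); crate imports for the parquet types are elided.
fn decode_metadata(
    file_len: u64,
    mut fetch: impl FnMut(Range<u64>) -> Result<Bytes, ParquetError>,
) -> Result<ParquetMetaData, ParquetError> {
    let mut decoder = ParquetMetaDataPushDecoder::try_new(file_len)?;
    loop {
        match decoder.try_decode()? {
            // The decoder needs more bytes before it can make progress
            DecodeResult::NeedsData(ranges) => {
                for range in ranges {
                    let data = fetch(range.clone())?;
                    decoder.push_range(range, data)?;
                }
            }
            // Everything needed has been pushed; metadata is fully decoded
            DecodeResult::Data(metadata) => return Ok(metadata),
        }
    }
}
```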

What changes are included in this PR?

  1. Extract decoding state machine into PushMetadataDecoder
  2. Extract thrift parsing into its own parser module
  3. Update ParquetMetadataDecoder to use the PushMetadataDecoder
  4. Extract the bytes --> object code into its own module

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

Are these changes tested?

Yes, by existing tests.

Are there any user-facing changes?

Not really -- this is an internal change that will make it easier to add features like "only decode a subset of the columns in the ColumnIndex", for example.

@github-actions github-actions bot added the parquet (Changes to the parquet crate) label on Sep 12, 2025
@alamb alamb changed the title from "Alamb/refactor push decoder" to "Move ParquetMetadata decoder state machine into ParquetMetadataPushDecoder" on Sep 12, 2025
@etseidl
Contributor

etseidl commented Sep 12, 2025

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

@alamb
Contributor Author

alamb commented Sep 12, 2025

This almost certainly will conflict with @etseidl 's plans in thrift-remodel.

I took a quick peek and I think it won't be too hard to merge this into what I'm doing. It will likely help if I can merge this without any other changes from main to contend with. We can coordinate that pas de deux when the time comes 😅

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

@etseidl
Contributor

etseidl commented Sep 12, 2025

Cool -- I likewise don't think it will conflict too badly logically, though I think it may generate lots of textual diffs that we'll have to be careful with.

Agreed. Looking forward to this one. I'm hoping for a much more flexible metadata parsing regime after the dust settles.

@etseidl
Contributor

etseidl commented Sep 15, 2025

I just did a test merge of this branch with the head of my remodel branch and it went pretty smoothly. The few conflicts were easily resolved. 🚀

}
}

pub(crate) fn parse_column_index(
Contributor

One option to consider is to move the column and offset index handling to file/metadata/page_index/index_reader.rs. Or I can do that later as part of the thrift remodel. That would keep all the page index parsing in one place.

Contributor Author

That would be great -- I don't know why it is here, but I am quite happy to put it elsewhere if that makes sense

Contributor

Hmm, there are details (the parse_xxx_index methods need access to private fields in the metadata). I guess leave it here for now.

Contributor Author

@etseidl -- I spent a bit more time on this one, and I encountered a challenge with encryption -- namely, there are now two parallel code paths: one for encrypted data and one for non-encrypted data.

The only way I can come up with to avoid replicating the same pattern (🤮) is to change the free functions into methods on some sort of struct that can pass the state along via self.

Something like, instead of

pub(crate) fn parse_column_index(
    metadata: &mut ParquetMetaData,
    column_index_policy: PageIndexPolicy,
    bytes: &Bytes,
    start_offset: u64,
) -> crate::errors::Result<()> {

like

struct MetadataParser {
    // page index policy
    page_index_policy: PageIndexPolicy,
    // ... other state fields like crypto policies
}

impl MetadataParser {
    fn parse_column_index(
        &self,
        metadata: &mut ParquetMetaData,
        bytes: &Bytes,
        start_offset: u64,
    ) -> crate::errors::Result<()> {

However, I want to coordinate to avoid a massive merge conflict.

For now, I plan to just keep the same pattern

Contributor

Somewhat parallel, but they for the most part converge eventually (except for the main ParquetMetaData parser).

Do you think encryption support will always be feature gated? Or will there eventually be just the one path? I wouldn't want to get too fancy if the latter.

Contributor Author

I think part of it will always be feature gated, because of the additional dependencies encryption brings in. However, we could potentially feature gate less of the actual code (maybe only feature gate the actual decryption calls 🤔 )

I was also thinking that having some sort of stateful struct while decoding metadata could be useful for other reasons (for example, specifying which columns to decode statistics for).
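
To make the "only feature gate the actual decryption calls" idea a bit more concrete, here is a hedged sketch assuming the `MetadataParser`-style struct discussed above; `FileDecryptor` and its `decrypt` method are placeholders rather than the crate's actual encryption API, and crate imports are elided:

```rust
use bytes::Bytes;

struct MetadataParser {
    // existing crate type for controlling page index decoding
    page_index_policy: PageIndexPolicy,
    // only the crypto state itself is behind the feature flag
    #[cfg(feature = "encryption")]
    file_decryptor: Option<FileDecryptor>,
}

impl MetadataParser {
    // Decrypt page index bytes when a decryptor is configured; otherwise
    // (or when the `encryption` feature is disabled) pass them through, so
    // the surrounding parsing code path stays the same for both cases.
    fn maybe_decrypt(&self, bytes: &Bytes) -> crate::errors::Result<Bytes> {
        #[cfg(feature = "encryption")]
        if let Some(decryptor) = &self.file_decryptor {
            return decryptor.decrypt(bytes); // placeholder call
        }
        Ok(bytes.clone())
    }
}
```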

return Ok(DecodeResult::NeedsData(vec![file_len - 8..file_len]));
let footer_len = FOOTER_SIZE as u64;
loop {
match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
Contributor Author

I am quite pleased with how this decoder state machine is looking
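
For readers following along, the snippet above is the usual "take the state out, decide, put the next state back" loop. A stripped-down, self-contained illustration of that shape (the states, results, and transitions here are simplified stand-ins, not the PR's actual logic):

```rust
use std::ops::Range;

enum DecodeState {
    // Waiting for the last 8 bytes of the file (metadata length + magic)
    ReadingFooter,
    // Waiting for the thrift-encoded metadata bytes
    ReadingMetadata { len: u64 },
    // Placeholder left behind while a match arm owns the real state by value
    Intermediate,
    Finished,
}

enum DecodeResult {
    // Caller must push these byte ranges and call `try_decode` again
    NeedsData(Vec<Range<u64>>),
    Done,
}

struct PushDecoder {
    state: DecodeState,
    file_len: u64,
}

impl PushDecoder {
    fn try_decode(&mut self) -> DecodeResult {
        loop {
            // Take ownership of the current state, leaving a cheap placeholder,
            // so each arm can consume the state by value and pick the next one.
            match std::mem::replace(&mut self.state, DecodeState::Intermediate) {
                DecodeState::ReadingFooter => {
                    // The real decoder parses the footer once its bytes have
                    // been pushed; this sketch just requests them and advances.
                    self.state = DecodeState::ReadingMetadata { len: 0 };
                    return DecodeResult::NeedsData(vec![self.file_len - 8..self.file_len]);
                }
                DecodeState::ReadingMetadata { len } => {
                    // Real decoder: thrift-decode `len` bytes of metadata here.
                    let _ = len;
                    self.state = DecodeState::Finished;
                    // No return: loop again so the Finished arm reports completion.
                }
                DecodeState::Finished => {
                    self.state = DecodeState::Finished;
                    return DecodeResult::Done;
                }
                DecodeState::Intermediate => unreachable!("decoder left in intermediate state"),
            }
        }
    }
}
```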

@alamb alamb force-pushed the alamb/refactor_push_decoder branch from 86cdf90 to c9ba4e0 on September 23, 2025 19:27
@alamb alamb force-pushed the alamb/refactor_push_decoder branch from 0b49853 to e8ff5cb on September 24, 2025 18:00
@alamb alamb force-pushed the alamb/refactor_push_decoder branch from e8ff5cb to fc2fd81 on September 24, 2025 18:23
@alamb
Contributor Author

alamb commented Sep 24, 2025

Ok, I am now pretty happy with this PR and how it looks. I broke it up into a few PRs to make reviews easier

You can see the results in this PR as the last commit

If/when those PRs are merged I'll rebase this one and mark it as ready for review

@alamb
Contributor Author

alamb commented Sep 25, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1016-gcp #17~24.04.1-Ubuntu SMP Wed Sep 3 01:55:36 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing alamb/refactor_push_decoder (fc2fd81) to 3027dbc diff
BENCH_NAME=metadata
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench metadata
BENCH_FILTER=
BENCH_BRANCH_NAME=alamb_refactor_push_decoder
Results will be posted here when complete

alamb added a commit that referenced this pull request Sep 25, 2025
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it
easier to review

Note: while this is a large (in line count) code change, it should be
relatively easy to review as it is just moving code around.

# Rationale for this change

In #8340 I am trying to split the
"IO" from the "where is the metadata in the file" from the "decode
thrift into Rust structures" logic. The first part of this is simply to
move the code that handles the "decode thrift into Rust structures" into
its own module.


# What changes are included in this PR?

1. Move most of the "parse thrift bytes into Rust structures" code from
`parquet/src/file/metadata/mod.rs` to
`parquet/src/file/metadata/parser.rs`

# Are these changes tested?

yes, by CI


# Are there any user-facing changes?

No, this is entirely internal reorganization

---------

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
alamb added a commit that referenced this pull request Sep 25, 2025
# Which issue does this PR close?

- Part of #8000
- Prep PR for #8340, to make it
easier to review

# Rationale for this change

In #8340 I am trying to split the
"IO" from the "where is the metadata in the file" from the "decode
thrift into Rust structures" logic.

I want to make it as easy as possible to review so I split it into
pieces, but you can see #8340 for
how it all fits together

# What changes are included in this PR?

This PR moves the code that handles parsing the 8-byte parquet file
footer, `FooterTail`, into its own module and constructor.
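
For context, the trailing 8 bytes of a plaintext-footer parquet file are a 4-byte little-endian metadata length followed by the 4-byte magic `PAR1`; a minimal sketch of parsing them (field and method names here are illustrative, not necessarily the crate's actual `FooterTail` API, and the separate magic used by encrypted footers is not handled):

```rust
struct FooterTail {
    metadata_len: u32,
}

impl FooterTail {
    // Parse the last 8 bytes of a parquet file: 4-byte little-endian
    // metadata length followed by the 4-byte magic "PAR1".
    fn try_new(tail: &[u8; 8]) -> Result<Self, String> {
        if &tail[4..] != b"PAR1" {
            return Err("not a parquet file: bad footer magic".to_string());
        }
        let metadata_len = u32::from_le_bytes(tail[..4].try_into().unwrap());
        Ok(Self { metadata_len })
    }
}
```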

# Are these changes tested?

yes, by CI


# Are there any user-facing changes?

No, this is entirely internal reorganization and I left a `pub use`

---------

Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
Labels
parquet (Changes to the parquet crate)
Development

Successfully merging this pull request may close these issues.

[Parquet] Split ParquetMetadataReader into IO/decoder state machine and thrift parsing
2 participants