Skip to content

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Sep 24, 2025

Which issue does this PR close?

Note while this is a large (in line count) code change, it should be relatively easy to review as it is just moving code around

Rationale for this change

In #8340 I am trying to split the "IO" from the "where is the metadata in the file" from the "decode thrift into Rust structures" logic. The first part of this is simply to move the code that handles the "decode thrift into Rust structures" into its own module.

What changes are included in this PR?

  1. Move most of the "parse thrift bytes into rust structure" code from parquet/src/file/metadata/mod.rs to parquet/src/file/metadata/parser.rs

Are these changes tested?

yes, by CI

Are there any user-facing changes?

No, this is entirely internal reorganization

// specific language governing permissions and limitations
// under the License.

//! Internal metadata parsing routines
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is just moved, with minimal changes, from the reader.rs module


self.parse_column_index(&bytes, offset)?;
self.parse_offset_index(&bytes, offset)?;
parse_column_index(metadata, self.column_index, &bytes, offset)?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these functions need access to some state on self, so I passed them as explicit arguments instead

@alamb alamb marked this pull request as ready for review September 24, 2025 17:51
@alamb alamb requested review from rok and removed request for rok September 24, 2025 18:23
Copy link
Contributor

@etseidl etseidl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looks good. I'll note that most of this code is going to move again as part of the thrift remodel as it needs to be closer to the private thrift structs.

Copy link
Member

@mbrobbel mbrobbel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @alamb

Copy link
Member

@rok rok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good @alamb, just a minor nit.

Co-authored-by: Matthijs Brobbel <m1brobbel@gmail.com>
@alamb alamb merged commit b540248 into apache:main Sep 25, 2025
16 checks passed
@alamb
Copy link
Contributor Author

alamb commented Sep 25, 2025

Thank you @etseidl @mbrobbel and @rok for the reviews

@alamb alamb deleted the alamb/extract_parser_from_reader branch September 25, 2025 20:58
@etseidl
Copy link
Contributor

etseidl commented Sep 25, 2025

Ouch time 😅

@alamb
Copy link
Contributor Author

alamb commented Sep 26, 2025

Ouch time 😅

As in this caused merge conflicts with the thrift-remodel branch?

@etseidl
Copy link
Contributor

etseidl commented Sep 26, 2025

Ouch time 😅

As in this caused merge conflicts with the thrift-remodel branch?

Precisely 😁

Took me two tries, but I got it merged. But that's what I signed up for. Actually I've been amazed so far by the lack of conflicts.

@alamb
Copy link
Contributor Author

alamb commented Sep 26, 2025

Ouch time 😅

As in this caused merge conflicts with the thrift-remodel branch?

Precisely 😁

Took me two tries, but I got it merged. But that's what I signed up for. Actually I've been amazed so far by the lack of conflicts.

Ideally the code to parse thrift should be pretty isolated (though I realize that is not the current state of main 😆 )

I feel like we are finally getting to the point where parsing / decoding and IO are all separated nicely, which I think sets us up very nicely for additional performance improvements

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants