This repository was archived by the owner on Feb 18, 2024. It is now read-only.

Reduced re-alloc in parquet #1337

Open
jorgecarleitao wants to merge 2 commits into main from improve_capacity

@jorgecarleitao
Owner

Closes #1324

@codecov

codecov bot commented Dec 18, 2022

Codecov Report

Base: 83.63% // Head: 83.78% // Increases project coverage by +0.14% 🎉

Coverage data is based on head (ef34171) compared to base (3a8da98).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1337      +/-   ##
==========================================
+ Coverage   83.63%   83.78%   +0.14%     
==========================================
  Files         373      373              
  Lines       40284    40392     +108     
==========================================
+ Hits        33692    33841     +149     
+ Misses       6592     6551      -41     
| Impacted Files | Coverage Δ | |
|---|---|---|
| src/io/parquet/read/deserialize/binary/basic.rs | 80.72% <100.00%> (+1.04%) | ⬆️ |
| ...c/io/parquet/read/deserialize/binary/dictionary.rs | 89.87% <100.00%> (ø) | |
| src/io/parquet/read/deserialize/binary/nested.rs | 80.70% <100.00%> (ø) | |
| src/io/parquet/read/deserialize/binary/utils.rs | 67.92% <100.00%> (+2.61%) | ⬆️ |
| src/io/parquet/read/deserialize/boolean/basic.rs | 92.91% <100.00%> (+0.11%) | ⬆️ |
| src/io/parquet/read/deserialize/dictionary/mod.rs | 76.75% <100.00%> (+0.25%) | ⬆️ |
| ...arquet/read/deserialize/fixed_size_binary/basic.rs | 95.05% <100.00%> (+0.13%) | ⬆️ |
| src/io/parquet/read/deserialize/primitive/basic.rs | 95.58% <100.00%> (+0.04%) | ⬆️ |
| ...c/io/parquet/read/deserialize/primitive/integer.rs | 85.92% <100.00%> (+0.65%) | ⬆️ |
| src/io/parquet/read/deserialize/utils.rs | 82.23% <100.00%> (ø) | |

... and 5 more

@ritchie46
Collaborator

I noticed a significant performance regression on reading utf8 data. I think it is related to now defaulting to a values_capacity of 0 instead of 24 * capacity as a reasonable default.
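
To make the cost concrete, here is a minimal sketch that counts how often a values buffer re-allocates while items are appended, given its starting capacity. It uses a plain `Vec<u8>` rather than arrow2's `Binary<O>`, and `count_reallocs` is a hypothetical helper written only for illustration.

```rust
/// Appends every item to a byte buffer that starts with `initial_capacity`
/// and returns how many times the buffer's capacity changed, i.e. how many
/// re-allocations happened. Hypothetical helper, not part of arrow2.
fn count_reallocs(items: &[&[u8]], initial_capacity: usize) -> usize {
    let mut values: Vec<u8> = Vec::with_capacity(initial_capacity);
    let mut reallocs = 0;
    for item in items {
        let before = values.capacity();
        values.extend_from_slice(item);
        // A capacity change means the buffer was moved to a larger allocation.
        if values.capacity() != before {
            reallocs += 1;
        }
    }
    reallocs
}
```

Starting from 0, the buffer has to grow at least once and typically several times as data arrives; starting from a capacity that already covers the total byte length, it never re-allocates.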

| State::OptionalDictionary(_, _)
| State::OptionalDelta(_, _)
| State::FilteredOptionalDelta(_, _) => (
Binary::<O>::with_capacity(capacity, 0),
Collaborator

This value capacity of 0 is very costly in most cases.

| State::RequiredDictionary(_)
| State::Delta(_)
| State::FilteredDelta(_) => (
Binary::<O>::with_capacity(capacity, 0),
Collaborator

This value capacity of 0 is very costly in most cases.

let values = SizedBinaryIter::new(&dict.buffer, dict.num_values);

- let mut data = Binary::<O>::with_capacity(dict.num_values);
+ let mut data = Binary::<O>::with_capacity(dict.num_values, 0);
Collaborator

This value capacity of 0 is very costly in most cases.

impl<O: Offset> Binary<O> {
#[inline]
- pub fn with_capacity(capacity: usize) -> Self {
+ pub fn with_capacity(capacity: usize, values_capacity: usize) -> Self {
Collaborator

Maybe we should have this signature:

pub fn with_capacity(capacity: usize, values_capacity: Option<usize>) -> Self {

and then

let values_capacity = values_capacity.unwrap_or(capacity.min(100) * 24);
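
Put together, the suggestion might look roughly like the sketch below. The `Binary` struct here is a simplified, hypothetical stand-in (fixed `i64` offsets, no `Offset` generic), not the actual arrow2 type; only the signature and the `unwrap_or` line come from the comment above.

```rust
/// Simplified, hypothetical stand-in for the `Binary<O>` builder in this PR.
struct Binary {
    offsets: Vec<i64>,
    values: Vec<u8>,
}

impl Binary {
    /// `capacity` is the expected number of items. `values_capacity` is the
    /// expected total number of bytes; `None` falls back to a heuristic.
    fn with_capacity(capacity: usize, values_capacity: Option<usize>) -> Self {
        // Heuristic from the suggestion: assume 24 bytes per item, capped at
        // 100 items so that the default reservation never exceeds 2400 bytes.
        let values_capacity = values_capacity.unwrap_or(capacity.min(100) * 24);
        // One more offset than items, starting at 0.
        let mut offsets = Vec::with_capacity(capacity + 1);
        offsets.push(0);
        Self {
            offsets,
            values: Vec::with_capacity(values_capacity),
        }
    }
}
```

Callers that know the byte length up front would pass `Some(n)`; everyone else passes `None` and gets a small but non-zero reservation.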

Collaborator

If the values_capacity is zero by default, maybe we could use a need_estimated to reserve the space.

We use this in databend

@jorgecarleitao
Owner Author

Sorry for the delay on this one - if anyone would like to take an extra pass go ahead, otherwise I will merge it in.

Development

Successfully merging this pull request may close these issues.

Arrow2 read parquet file did not reuse the page decoder buffer to array

3 participants