feat: support nested STRUCT and ARRAY data display in anywidget mode#2359
f583833 to
60785f3
Compare
bigframes/display/_flatten.py
Outdated
def flatten_nested_data(
    dataframe: pd.DataFrame,
) -> tuple[pd.DataFrame, dict[str, list[int]], list[str], set[str]]:
There was a problem hiding this comment.
Tuple is hard to understand. Can we use a frozen dataclass, instead?
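For illustration, a frozen dataclass in place of the 4-tuple could look roughly like the sketch below. Only `dataframe` and `cleared_on_continuation` appear as names later in this thread; `array_row_groups` and `nested_columns`, and the mapping of tuple positions to fields, are guesses made for the example.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class FlattenResult:
    """Named replacement for the 4-tuple return value (sketch only)."""

    # Would be a pd.DataFrame; kept as a string annotation so the sketch
    # runs without pandas installed.
    dataframe: "pd.DataFrame"
    # Hypothetical name for the dict[str, list[int]] element of the tuple.
    array_row_groups: dict[str, list[int]]
    # Assumed to correspond to the list[str] element of the tuple.
    cleared_on_continuation: list[str]
    # Hypothetical name for the set[str] element of the tuple.
    nested_columns: set[str]
```

Callers then read `result.cleared_on_continuation` instead of `result[2]`, and `frozen=True` rejects accidental reassignment of fields.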
2bb97d3 to
3944249
Compare
bigframes/display/_flatten.py
Outdated
)

new_cols_to_add[new_col_name] = pd.Series(
    new_list_array.to_pylist(),
There was a problem hiding this comment.
to_pylist() can be quite expensive to call. If we already have a pyarrow array, I don't think it's necessary to convert it.
There was a problem hiding this comment.
Done. I've removed the .to_pylist() calls and now pass the Arrow arrays directly to pandas for better performance.
bigframes/display/_flatten.py
Outdated
new_cols_to_add[new_col_name] = pd.Series(
    new_list_array.to_pylist(),
    dtype=pd.ArrowDtype(pa.list_(field.type)),
There was a problem hiding this comment.
I'm confused. Why are we creating a list type here? Could you explain in comments what the purpose is? I thought we were flattening based on the function name.
There was a problem hiding this comment.
Good point. I've added a comment to clarify that the function is transforming an array<struct<...>> into separate array columns.
bigframes/display/_flatten.py
Outdated
for orig_idx in dataframe.index:
    non_array_data = non_array_df.loc[orig_idx].to_dict()
    array_values = {}
    max_len_in_row = 0
    non_na_array_found = False

    for col_name in array_columns:
        val = dataframe.loc[orig_idx, col_name]
There was a problem hiding this comment.
This is looping through each value in Python, which is going to be very slow. Please use native code such as https://arrow.apache.org/docs/python/generated/pyarrow.compute.list_flatten.html to avoid such loops.
There was a problem hiding this comment.
Thanks for the suggestion. I've refactored the array explosion logic to use a much faster vectorized approach with pandas.explode and merge, which removes the Python loops entirely.
bigframes/display/_flatten.py
Outdated
continue

# Create one row per array element, up to max_len_in_row
for array_idx in range(max_len_in_row):
There was a problem hiding this comment.
This is looping through each element of each array in Python, which is going to be even slower.
There was a problem hiding this comment.
I have completely refactored _explode_array_columns to use a vectorized approach with pandas.explode and merge. This eliminated all Python loops, including the slow inner loop you pointed out, significantly improving performance.
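A rough sketch of the vectorized explosion, assuming pandas 1.3 or later (multi-column `DataFrame.explode`) and arrays of equal length within each row. The column names are invented, and the merge step mentioned above is not shown because the thread does not describe it.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "names": [["a", "b"], ["c"]],
        "scores": [[10, 20], [30]],
    }
)

# Explode both array columns together; scalar columns are replicated and the
# original index label is repeated for every element of a row.
exploded = df.explode(["names", "scores"])
```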
bigframes/display/_flatten.py
Outdated
    return "struct"
if pa.types.is_list(pa_type):
    return (
        "array_of_struct"
        if pa.types.is_struct(pa_type.value_type)
        else "array"
    )
return "clear"
There was a problem hiding this comment.
These magic strings worry me. Could you create an enum for category, instead?
There was a problem hiding this comment.
Done. I've replaced the strings with a private _ColumnCategory Enum.
bigframes/display/_flatten.py
Outdated
continuation_rows: A set of row indices that are continuation rows.
cleared_on_continuation: A list of column names that should be cleared on continuation rows.
There was a problem hiding this comment.
It's not 100% clear to me what is meant by "continuation". I assume that it means rows post-flattening that correspond to the second element of an array and beyond? Please expand these docstrings further.
There was a problem hiding this comment.
You are right. I've updated the docstrings in FlattenResult to explicitly clarify that "continuation rows" refer to the 2nd element onwards of an exploded array, and "cleared" columns are those (typically scalars) that are replicated but shouldn't be visually repeated.
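The two terms can be illustrated with a small pandas sketch. The data and variable names are made up, and the PR's actual bookkeeping may differ.

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "vals": [[10, 20, 30], [40]]})
exploded = df.explode("vals")

# Position of each row within its original row group; anything after the
# first position is a continuation row.
is_continuation = exploded.groupby(level=0).cumcount().to_numpy() > 0
continuation_rows = {i for i, flag in enumerate(is_continuation) if flag}

# Scalar columns are replicated by explode; blank them on continuation rows
# so each value is shown once per group. Cast to object first so an empty
# string can sit next to the integers.
cleared_on_continuation = ["id"]
display = exploded.astype(object)
display.loc[is_continuation, cleared_on_continuation] = ""
```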
bigframes/display/_flatten.py
Outdated
"""The result of flattening a DataFrame.

Attributes:
    dataframe: The flattened DataFrame.
There was a problem hiding this comment.
Please add some comments about what happens to the original index columns. Based on the description of the other fields, I assume that a unique index is created post-flatten?
There was a problem hiding this comment.
I've updated the docstrings and the implementation. The original index (including named Index and MultiIndex) is preserved and duplicated across the exploded rows. This serves as the visual grouping key for the table display.
bigframes/display/_flatten.py
Outdated
@dataclasses.dataclass(frozen=True)
class ColumnClassification:
There was a problem hiding this comment.
Please put a leading _ in front of class names that aren't intended to be used outside of this module.
continuation_rows: set[int] | None,
clear_on_continuation: list[str],
There was a problem hiding this comment.
Same here, add some more explanation to the docstrings. To keep it shorter, you could reference bigframes/display/_flatten.py so that folks can look there for the complete explanation.
There was a problem hiding this comment.
Done. I updated the docstrings to reference bigframes.display._flatten.FlattenResult for the detailed definitions.
There was a problem hiding this comment.
Please create a test_flatten.py file with a few tests that check some of the flattening logic directly without the HTML rendering part. Specifically, let's focus on what happens to index/multiindex columns, as that's my main worry / question.
There was a problem hiding this comment.
Done. I created tests/unit/display/test_flatten.py. I moved the logic-specific tests there and added dedicated test cases (test_flatten_preserves_original_index, test_flatten_preserves_multiindex) to verify that indices are correctly preserved and duplicated during the flattening process.
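As a rough idea of what such index-focused tests could check, here is a sketch. The test names come from the reply above, but the bodies are assumptions, and `_explode_stand_in` is a hypothetical placeholder built on `DataFrame.explode`, not the module's real entry point.

```python
import pandas as pd


def _explode_stand_in(df: pd.DataFrame, column: str) -> pd.DataFrame:
    # Hypothetical stand-in for the flattening function under test.
    return df.explode(column)


def test_flatten_preserves_original_index():
    df = pd.DataFrame(
        {"vals": [[1, 2], [3]]}, index=pd.Index(["x", "y"], name="key")
    )
    result = _explode_stand_in(df, "vals")
    assert result.index.name == "key"
    assert result.index.tolist() == ["x", "x", "y"]


def test_flatten_preserves_multiindex():
    index = pd.MultiIndex.from_tuples([("a", 1), ("b", 2)], names=["k1", "k2"])
    df = pd.DataFrame({"vals": [[1, 2], [3]]}, index=index)
    result = _explode_stand_in(df, "vals")
    assert list(result.index.names) == ["k1", "k2"]
    assert result.index.tolist() == [("a", 1), ("a", 1), ("b", 2)]
```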
8eb7211 to
ca19957
Compare
classification = _classify_columns(result_df)

# Process ARRAY-of-STRUCT columns into multiple ARRAY columns (one per struct field).
There was a problem hiding this comment.
Why do we need special logic for array of struct? Why can't we achieve this by just applying the array logic and then the struct logic? Also, might we want to keep recursively unpacking until there are no more arrays/structs left?
There was a problem hiding this comment.
You are correct that we could achieve this by applying array logic (explode) first and then struct logic, but that would require a second pass (loop) because the explosion would produce new STRUCT columns that need flattening.
The current approach (Transpose Array -> Flatten Structs -> Explode Arrays) allows us to:
- Keep the pipeline linear: we resolve the nesting structure in a single pass without needing recursion or re-classification loops.
- Optimize performance: we flatten the struct fields column-wise before expanding the row count via explosion.
For recursion, I agree that a recursive visitor is the correct long-term solution for arbitrary nesting depths (e.g., ARRAY<STRUCT>). For this PR, I aimed to support the common BQ ARRAY pattern within the current architecture, but we should definitely refactor to full recursion if we need to support deeper/arbitrary nesting.
continuation_rows: A set of row indices in the flattened DataFrame that are
    "continuation rows". These are additional rows created to display the
    2nd to Nth elements of an array. The first row (index i-1) contains
    the 1st element, while these rows contain subsequent elements.
cleared_on_continuation: A list of column names that should be "cleared"
    (displayed as empty) on continuation rows. Typically, these are
    scalar columns (non-array) that were replicated during the explosion
    process but should only be visually displayed once per original row group.
There was a problem hiding this comment.
Might need to individually mark continuation rows rather than take the intersection of a row set and column set
There was a problem hiding this comment.
Thanks for the suggestion. Currently, we enforce synchronous explosion (all arrays align), so the "continuation" status effectively applies to the whole row. When we support independent array explosions, we will definitely need to track this individually.
c405008 to
d2710c2
Compare
Implements flattening and expansion for complex data types in the interactive display for anywidget mode.
Key Features:
verified at:
Fixes #<438181139> 🦕