Skip to content

Conversation

emilk
Copy link
Contributor

@emilk emilk commented Sep 7, 2025

Which issue does this PR close?

Rationale for this change

We would previously format structs like so:

Struct(name1 type1, name2 nullable type2)

This will break badly whenever the field name is anything but a simple identifier. In other words: it allows string injection if the field name contains an end-paranthesis.

Except for that, it is also difficult to debug mistakingly bad field names like " " or "\n".

What changes are included in this PR?

We change the Display and Debug formatting of Struct

Before: Struct(name1 type1, name2 nullable type2)
After: Struct("name1": type1, "name2": nullable type2)

Are these changes tested?

Yes - I've updated the existing tests.

Are there any user-facing changes?

Yes, changing the Display formatting is a breaking change

@github-actions github-actions bot added parquet Changes to the parquet crate arrow Changes to the arrow crate labels Sep 7, 2025
@emilk emilk changed the title https://github.com/apache/arrow-rs/pull/8290 Quote struct field names in Display/Debug formatting Sep 7, 2025
@mbrobbel mbrobbel added api-change Changes to the arrow API next-major-release the PR has API changes and it waiting on the next major version labels Sep 16, 2025
@mbrobbel mbrobbel added this to the 57.0.0 milestone Sep 16, 2025
mbrobbel pushed a commit that referenced this pull request Sep 23, 2025
This is part of an attempt to improve the error reporting of `arrow-rs`,
`datafusion`, and any other 3rd party crates.

I believe that error messages should be as readable as possible. Aim for
`rustc` more than `gcc`.

Here's an example of how this PR improves some existing error messages:

Before:
> Casting from Map(Field { name: "entries", data_type: Struct([Field {
name: "key", data_type: Utf8, nullable: false, dict_id: 0,
dict_is_ordered: false, metadata: {} }, Field { name: "value",
data_type: Interval(DayTime), nullable: true, dict_id: 0,
dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0,
dict_is_ordered: false, metadata: {} }, false) to Map(Field { name:
"entries", data_type: Struct([Field { name: "key", data_type: Utf8,
nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} },
Field { name: "value", data_type: Duration(Second), nullable: false,
dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false,
dict_id: 0, dict_is_ordered: false, metadata: {} }, true) not supported

After:
> Casting from Map(Field { "entries": Struct(key Utf8, value nullable
Interval(DayTime)) }, false) to Map(Field { "entries": Struct(key Utf8,
value Duration(Second)) }, true) not supported

# Which issue does this PR close?
- Closes #7048
- Continues and closes #7051
- Continues #7469
- More improvements coming in
#8291
- Sibling PR: apache/datafusion#17565
- Part of #8351

# Rationale for this change
DataType:s are often shown in error messages. Making these error
messages readable is _very important_.

# What changes are included in this PR?
## ~Unify `Debug` and `Display`~
~The `Display` and `Debug` of `DataType` are now the SAME.~

~Why? Both are frequently used in error messages (both in arrow, and
`datafusion`), and both benefit from being readable yet reversible.~

Reverted based on PR feedback. I will try to improve the `Debug`
formatting in a future PR, with clever use of
https://doc.rust-lang.org/std/fmt/struct.Formatter.html#method.debug_struct

## Improve `Display` of lists
Improve the `Display` formatting of
* `DataType::List`
* `DataType::LargeList`
* `DataType::FixedSizeList`

**Before**: `List(Field { name: \"item\", data_type: Int32, nullable:
true, dict_id: 0, dict_is_ordered: false, metadata: {} })`
**After**: `List(nullable Int32)`

**Before**: `FixedSizeList(Field { name: \"item\", data_type: Int32,
nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 5)`
**After**: `FixedSizeList(5 x Int32)`

### Better formatting of `DataType::Struct`
The formatting of `Struct` is now reversible, including nullability and
metadata.

- Continues #7469

### ~Improve `Debug` format of `Field`~
~Best understood with this diff for an existing test:~

<img width="1140" height="499" alt="Screenshot 2025-09-07 at 18 30 44"
src="https://github.com/user-attachments/assets/794b4de9-8459-4ee7-82d2-8f5ae248614c"
/>

**EDIT**: reverted

# Are these changes tested?
Yes - new tests cover them

# Are there any user-facing changes?
`Display/to_string` has changed, and so this is a **BREAKING CHANGE**.

Care has been taken that the formatting contains all necessary
information (i.e. is reversible), though the actual `FromStr`
implementation is still not written (it is missing on `main`, and
missing in this PR - so no change).


----

Let me know if I went to far… or not far enough 😆

---------

Co-authored-by: irenjj <renj.jiang@gmail.com>
@emilk emilk force-pushed the emilk/quote-struct-fields branch from 55aaf82 to 39646c4 Compare September 23, 2025 14:39
@emilk emilk marked this pull request as ready for review September 23, 2025 14:52
@emilk emilk changed the title Quote struct field names in Display/Debug formatting Quote DataType::Struct field names in Display formatting Sep 23, 2025
@alamb
Copy link
Contributor

alamb commented Sep 24, 2025

github says this PR has conflicts which is preventing me from merging it

Screenshot 2025-09-24 at 3 49 28 PM

Thanks @emilk and @mbrobbel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants