@emilk emilk commented Sep 7, 2025

This is part of an attempt to improve the error reporting of arrow-rs, datafusion, and any other 3rd party crates.

I believe that error messages should be as readable as possible. Aim for rustc more than gcc.

Here's an example of how this PR improves some existing error messages:

Before:

Casting from Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Interval(DayTime), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false) to Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Utf8, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Duration(Second), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, true) not supported

After:

Casting from Map(Field { "entries": Struct(key Utf8, value nullable Interval(DayTime)) }, false) to Map(Field { "entries": Struct(key Utf8, value Duration(Second)) }, true) not supported

Which issue does this PR close?

Rationale for this change

DataTypes are often shown in error messages. Making these error messages readable is very important.

What changes are included in this PR?

Unify Debug and Display

The Display and Debug of DataType are now the SAME.

Why? Both are frequently used in error messages (both in arrow, and datafusion), and both benefit from being readable yet reversible.

Reverted based on PR feedback. I will try to improve the Debug formatting in a future PR, with clever use of https://doc.rust-lang.org/std/fmt/struct.Formatter.html#method.debug_struct

Improve Display of lists

Improve the Display formatting of

  • DataType::List
  • DataType::LargeList
  • DataType::FixedSizeList

Before: List(Field { name: "item", data_type: Int32, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} })
After: List(nullable Int32)

Before: FixedSizeList(Field { name: "item", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, 5)
After: FixedSizeList(5 x Int32)
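
To make the new list formatting concrete, here is a minimal sketch of a `Display` implementation that produces the "After" strings above. The `DataType` and `Field` types below are hypothetical, heavily simplified stand-ins (no field name, no metadata), not the actual arrow-schema types or the code in this PR.

```rust
use std::fmt;

// Hypothetical, simplified stand-ins for arrow's `DataType` and `Field`.
// The real `Field` also carries a name and metadata, which are omitted here.
#[derive(Debug)]
enum DataType {
    Int32,
    List(Box<Field>),
    FixedSizeList(Box<Field>, i32),
}

#[derive(Debug)]
struct Field {
    data_type: DataType,
    nullable: bool,
}

// Writes the element type, prefixed with "nullable " when the field is nullable.
fn fmt_element(field: &Field, f: &mut fmt::Formatter<'_>) -> fmt::Result {
    if field.nullable {
        write!(f, "nullable ")?;
    }
    write!(f, "{}", field.data_type)
}

impl fmt::Display for DataType {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            DataType::Int32 => write!(f, "Int32"),
            DataType::List(field) => {
                write!(f, "List(")?;
                fmt_element(field, f)?;
                write!(f, ")")
            }
            DataType::FixedSizeList(field, size) => {
                write!(f, "FixedSizeList({size} x ")?;
                fmt_element(field, f)?;
                write!(f, ")")
            }
        }
    }
}

// Hypothetical helper that builds a list element field.
fn item(data_type: DataType, nullable: bool) -> Field {
    Field { data_type, nullable }
}

fn main() {
    let list = DataType::List(Box::new(item(DataType::Int32, true)));
    let fixed = DataType::FixedSizeList(Box::new(item(DataType::Int32, false)), 5);
    println!("{list}"); // List(nullable Int32)
    println!("{fixed}"); // FixedSizeList(5 x Int32)
}
```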

Better formatting of DataType::Struct

The formatting of Struct is now reversible, including nullability and metadata.

Improve Debug format of Field

Best understood with this diff for an existing test:

[Screenshot: diff of an existing test]

EDIT: reverted

Are these changes tested?

Yes - new tests cover them

Are there any user-facing changes?

Display/to_string has changed, and so this is a BREAKING CHANGE.

Care has been taken that the formatting contains all necessary information (i.e. is reversible), though the actual FromStr implementation is still not written (it is missing on main, and missing in this PR - so no change).


Let me know if I went too far… or not far enough 😆

@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 7, 2025
@emilk emilk changed the title Improve Display and Debug for DataType Improve Display and Debug for DataType and Field Sep 7, 2025
@github-actions github-actions bot added the parquet Changes to the parquet crate label Sep 7, 2025
@emilk emilk marked this pull request as ready for review September 7, 2025 17:28
@mbrobbel mbrobbel added the next-major-release the PR has API changes and it waiting on the next major version label Sep 8, 2025

impl fmt::Display for DataType {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        // NOTE: `Display` and `Debug` formatting are ALWAYS the same,
Contributor

I think it is more common in Rust code to have the Debug implementation try and print out something close to the underlying representation, and Display is for human consumption

Specifically https://doc.rust-lang.org/std/fmt/trait.Debug.html

Debug should format the output in a programmer-facing, debugging context.

Generally speaking, you should just derive a Debug implementation.

vs https://doc.rust-lang.org/std/fmt/trait.Display.html
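
The distinction drawn in this comment can be illustrated with a small sketch. `Decimal` here is a hypothetical type made up for the example, not an arrow type: `Debug` is derived and mirrors the struct layout, while `Display` is written by hand for human readers.

```rust
use std::fmt;

// Hypothetical type, used only to contrast the two traits.
#[derive(Debug)]
struct Decimal {
    precision: u8,
    scale: i8,
}

// Hand-written, compact, human-facing form.
impl fmt::Display for Decimal {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "Decimal({}, {})", self.precision, self.scale)
    }
}

fn main() {
    let d = Decimal { precision: 10, scale: 2 };
    println!("{d:?}"); // Decimal { precision: 10, scale: 2 }
    println!("{d}"); // Decimal(10, 2)
}
```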

Contributor Author

@emilk emilk Sep 11, 2025

I know this is uncommon, but unfortunately so many error messages in datafusion and arrow use the Debug formatting of DataType instead of Display, which means we end up with huge difficult-to-read error messages.

There are three solutions to this, afaict:

A) Use Display=Debug, like this PR.
It's still programmer-facing, because it contains ALL the info (metadata etc)

B) Replace all uses of {:?} with {} when printing datatypes in datafusion, arrow, and other third party crates.
This is VERY hard to do, as I know of no automated tool to find all these places.

C) Improve Debug formatting by omitting empty/default fields. This will help, but the Debug format for DataType::List will still be very ugly, since it wraps a Field.

(or maybe I mistakenly think a lot of places use Debug instead of Display because the old Display implementation for List used the Debug formatting…)
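
As a rough illustration of option C, the sketch below uses `Formatter::debug_struct` to skip fields that hold their default value. The `Field` and `DataType` types are hypothetical simplifications, and treating `nullable: false` and empty metadata as the "defaults" is an assumption made for this example only.

```rust
use std::collections::HashMap;
use std::fmt;

// Hypothetical, simplified stand-ins for arrow's `DataType` and `Field`.
#[derive(Debug)]
enum DataType {
    Int32,
}

struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
    metadata: HashMap<String, String>,
}

impl fmt::Debug for Field {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let mut s = f.debug_struct("Field");
        s.field("name", &self.name);
        s.field("data_type", &self.data_type);
        // Assumption for this sketch: `false` and an empty map count as defaults.
        if self.nullable {
            s.field("nullable", &self.nullable);
        }
        if !self.metadata.is_empty() {
            s.field("metadata", &self.metadata);
        }
        s.finish()
    }
}

fn main() {
    let field = Field {
        name: "item".to_string(),
        data_type: DataType::Int32,
        nullable: false,
        metadata: HashMap::new(),
    };
    println!("{field:?}"); // Field { name: "item", data_type: Int32 }
}
```

Because this goes through `debug_struct`, the pretty-printing flag (`{:#?}`) keeps working without extra code.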

Contributor Author

I added a comment to the code to motivate this choice

Contributor

> B) Replace all uses of {:?} with {} when printing datatypes in datafusion, arrow, and other third party crates.
> This is VERY hard to do, as I know of no automated tool to find all these places.

I think grepping for type:?} and {:?} would catch most of them. Maybe a good thing to ask some AI tool to do

(venv) andrewlamb@Andrews-MacBook-Pro-3:~/Software/arrow-rs$ grep -r 'type:?' `find . -name '*.rs'`
./arrow-schema/src/datatype_parse.rs:        println!("Input '{data_type_string}' ({data_type:?})");
./arrow-schema/src/datatype_parse.rs:            println!("Parsing '{data_type_string}', expecting '{expected_data_type:?}'");
./arrow-data/src/transform/run.rs:        _ => panic!("Invalid run end type for RunEndEncoded array: {run_end_type:?}"),
./arrow-data/src/transform/run.rs:                _ => panic!("Invalid run end type for RunEndEncoded array: {dest_run_end_type:?}",),
./arrow-ipc/src/compression.rs:                "compression type {other_type:?} not supported "
./arrow-string/src/like.rs:                        "{value_type:?} «{value}» like {pattern_type:?} «{pattern}»"
./arrow-string/src/like.rs:                        "{value_type:?} «{value}» ilike {pattern_type:?} «{pattern}»"
./arrow-string/src/like.rs:                        "{value_type:?} «{value}» nlike {pattern_type:?} «{pattern}»"
./arrow-string/src/like.rs:                        "{value_type:?} «{value}» nilike {pattern_type:?} «{pattern}»"
./arrow-csv/src/reader/mod.rs:                            "Unsupported dictionary key type {key_type:?}"
./arrow-row/src/list.rs:                "Expected FixedSizeListArray, found: {list_type:?}",
./arrow-array/src/array/fixed_size_list_array.rs:                panic!("FixedSizeListArray data should contain a FixedSizeList data type, got {data_type:?}")
./arrow-array/src/array/primitive_array.rs:        write!(f, "PrimitiveArray<{data_type:?}>\n[\n")?;
./arrow-array/src/array/primitive_array.rs:                            "Cast error: Failed to convert {v} to temporal for {data_type:?}"
./arrow-array/src/array/primitive_array.rs:                            "Cast error: Failed to convert {v} to temporal for {data_type:?}"
./arrow-array/src/record_batch.rs:                "column types must match schema types, expected {field_type:?} but found {col_type:?} at column index {i}")));
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" doesn't expect buffer at index 0. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 2 buffers, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 2 buffers, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 2 buffers, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 2 buffers, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 3 buffers, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 3 buffers, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 1 buffer, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" expects 2 buffer, but requested {i}. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" doesn't expect buffer at index 0. Please verify that the C data interface is correctly implemented."
./arrow-array/src/ffi.rs:                "The datatype \"{data_type:?}\" is still not supported in Rust implementation"
./arrow-array/src/builder/mod.rs:                    panic!("Data type {t:?} with key type {key_type:?} is not currently supported")
./arrow-cast/src/cast/mod.rs:                "Casting from dictionary type {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from type {from_type:?} to dictionary type {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:            "Casting from {from_type:?} to {to_type:?} not supported"
./arrow-cast/src/cast/mod.rs:            "Casting from {from_type:?} to {to_type:?} not supported"
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:                "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:            "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:            "Casting from {from_type:?} to {to_type:?} not supported",
./arrow-cast/src/cast/mod.rs:            "Casting from {from_type:?} to {to_type:?} not supported"
./arrow-cast/src/cast/mod.rs:            "Casting from {from_type:?} to {to_type:?} not supported"
./arrow-cast/src/cast/dictionary.rs:                        "Unsupported type {to_index_type:?} for dictionary index"
./arrow-cast/src/cast/dictionary.rs:            "Unsupported output type for dictionary packing: {dict_value_type:?}"
./parquet/benches/arrow_reader_row_filter.rs:            let benchmark_name = format!("{filter_type:?}/{proj_case}",);
./parquet/src/record/reader.rs:                        "Map key type is expected to be a primitive type, but found {key_type:?}"
./parquet/src/arrow/arrow_reader/mod.rs:                    "data type: {data_type:?}, expected: {expected_err}, got: {err}"
./parquet/src/arrow/arrow_reader/mod.rs:                    "data type: {data_type:?}, expected: {expected_err}, got: {err}"
./parquet/src/arrow/arrow_writer/mod.rs:                    "Attempting to write an Arrow type {data_type:?} to parquet that is not yet implemented"
./parquet/src/arrow/buffer/view_buffer.rs:            _ => panic!("Unsupported data type: {data_type:?}"),
./parquet/src/schema/visitor.rs:                panic!("{list_type:?} is a list type and must be a group type")
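
The same search can be sketched in Rust for cases where it needs to run inside a test or tool. The function name `find_debug_placeholders` is hypothetical. Like the grep above, it is a plain text match: it cannot tell whether the formatted value is actually a `DataType`, and it does not catch the pretty-printing form `{:#?}`.

```rust
// Returns (1-based line number, trimmed line) for every line that contains a
// Debug placeholder such as `{:?}` or `{data_type:?}`.
fn find_debug_placeholders(source: &str) -> Vec<(usize, &str)> {
    source
        .lines()
        .enumerate()
        .filter(|(_, line)| line.contains(":?}"))
        .map(|(i, line)| (i + 1, line.trim()))
        .collect()
}

fn main() {
    let source = r#"
let a = format!("Casting from {from_type:?} to {to_type:?} not supported");
let b = format!("Casting from {from_type} to {to_type} not supported");
"#;
    // Only the line that uses `:?` is reported.
    for (line_no, line) in find_debug_placeholders(source) {
        println!("{line_no}: {line}");
    }
}
```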

@emilk emilk changed the title Improve Display and Debug for DataType and Field Improve Display for DataType and Field Sep 15, 2025
@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Sep 15, 2025
@mbrobbel mbrobbel added the api-change Changes to the arrow API label Sep 15, 2025
Member

@mbrobbel mbrobbel left a comment

Thanks @emilk

Contributor

@alamb alamb left a comment

Thank you @emilk and @mbrobbel -- I think this looks great now. Once we get the CI to pass let's merge it in

};
assert_eq!(
    t,
    r#"Casting from Map(Field { "entries": Struct(key Utf8, value nullable Utf8) }, false) to Map(Field { "entries": Struct(key Utf8, value Utf8) }, true) not supported"#
Contributor

that is certainly much nicer!

@emilk
Contributor Author

emilk commented Sep 15, 2025

Green!

@mbrobbel
Member

We can merge after #7836.

@emilk
Contributor Author

emilk commented Sep 23, 2025

#7836 is merged!

@mbrobbel mbrobbel merged commit cdbbbf7 into apache:main Sep 23, 2025
32 checks passed
@mbrobbel
Member

Looks like we should've synced this branch with main before merging: https://github.com/apache/arrow-rs/actions/runs/17949653507/job/51045091164

@kylebarron
Contributor

Maybe a merge queue would make sense?

@emilk
Contributor Author

emilk commented Sep 23, 2025

Oops!

Fix in:

We should be able to just cherry-pick 6de27e6 to its own PR

@mbrobbel
Member

We could also try to cherry-pick 6de27e6 to its own PR

+1

@emilk
Contributor Author

emilk commented Sep 23, 2025

Quick fix:

mbrobbel pushed a commit that referenced this pull request Sep 23, 2025
* Fixes merge-race induced problem from
#8290
alamb pushed a commit that referenced this pull request Sep 24, 2025
# Rationale for this change
Despite us having `Display` implementations for `DataType`, a lot of
error messages still use `Debug`. See for instance:

* apache/datafusion#17565
* #8290

Therefore I want to make sure the `Debug` formatting of `Field` (and, by
extension, `DataType`) is not _utterly horrible_. This PR makes things…
slightly better.

# What changes are included in this PR?
Omits fields of `Field` that have their "default" values.

# Are these changes tested?
Yes, there are new tests.

# Are there any user-facing changes?
Though this changes the `Debug` formatting, I would NOT consider this a
breaking change, because nobody should rely on consistent `Debug`
formatting. See for instance
https://doc.rust-lang.org/std/fmt/trait.Debug.html#stability
mbrobbel pushed a commit that referenced this pull request Sep 25, 2025
# Which issue does this PR close?
* Follows #8290 (merge that
first, and the diff of this PR will drop)
* #7469
* Part of #8351

# Rationale for this change
We would previously format structs like so:

`Struct(name1 type1, name2 nullable type2)`

This will break badly whenever the field name is anything but a simple
identifier. In other words: it allows [string
injection](https://xkcd.com/327/) if the field name contains a
closing parenthesis.

Apart from that, it is also difficult to debug mistakenly bad field
names like " " or "\n".

# What changes are included in this PR?
We change the `Display` and `Debug` formatting of `Struct`

**Before**: `Struct(name1 type1, name2 nullable type2)`
**After**: `Struct("name1": type1, "name2": nullable type2)`
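
A minimal sketch of why quoting helps, assuming the name is written with the `Debug` formatting of `str` (which quotes and escapes it). The helper `format_struct` and its tuple-based field representation are hypothetical and only mimic the "After" format shown above; this is not the code in the PR.

```rust
// Hypothetical helper: each field is (name, type name, nullable).
fn format_struct(fields: &[(&str, &str, bool)]) -> String {
    let parts: Vec<String> = fields
        .iter()
        .map(|(name, ty, nullable)| {
            let null = if *nullable { "nullable " } else { "" };
            // `{name:?}` uses the `Debug` impl of `str`: quoted and escaped.
            format!("{name:?}: {null}{ty}")
        })
        .collect();
    format!("Struct({})", parts.join(", "))
}

fn main() {
    // Struct("name1": type1, "name2": nullable type2)
    println!("{}", format_struct(&[("name1", "type1", false), ("name2", "type2", true)]));
    // A newline in the name is printed as an escaped "\n" instead of breaking the line.
    println!("{}", format_struct(&[("\n", "Int32", false)]));
}
```

With this scheme a name containing `)` stays inside its quotes, so the end of the `Struct(...)` list remains unambiguous.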

# Are these changes tested?
Yes - I've updated the existing tests.

# Are there any user-facing changes?
Yes, changing the `Display` formatting is a **breaking change**
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate next-major-release the PR has API changes and it waiting on the next major version parquet Changes to the parquet crate parquet-variant parquet-variant* crates
Successfully merging this pull request may close these issues.

Improve human readable display for DataType::List