Description
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
I wish that when testing.assert_frame_equal failed in one of my unit tests, it would tell me all of the columns which have diffs.
The documentation for this function states:
Check that left and right DataFrame are equal.
This function is intended to compare two DataFrames and output any differences. It is mostly intended for use in unit tests. Additional parameters allow varying the strictness of the equality checks performed.
However, it does not output all differences. It goes through the columns and stops at the first diff it encounters. For someone unfamiliar with this behavior, it can be very confusing: you think only the one reported column has a diff, you fix it, then you re-run and now it seems that in fixing the first column you've somehow broken another column!?
Once you understand that this is how it behaves, you can work around it by setting a breakpoint in your code and running DataFrame.compare() (e.g. df1.compare(df2)), but until you understand this, it's quite perplexing.
Example:
import pandas as pd

df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["a", 1], ["B", 3]], columns=["letter", "number"])
pd.testing.assert_frame_equal(df1, df2)
raises
AssertionError: DataFrame.iloc[:, 0] (column name="letter") are different
DataFrame.iloc[:, 0] (column name="letter") values are different (50.0 %)
[index]: [0, 1]
[left]: [A, B]
[right]: [a, B]
At positional index 0, first diff: A != a
with no indication of the errors in the number column.
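For reference, the workaround mentioned above can be sketched as follows for these two frames. This is only an illustration of using DataFrame.compare() to list every differing column; it assumes both frames share the same index and column labels, since compare() raises a ValueError otherwise.

```python
import pandas as pd

df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["a", 1], ["B", 3]], columns=["letter", "number"])

# compare() keeps only the rows and columns that contain a difference.
# The result has two column levels: the original column name and
# "self"/"other" for the left and right values.
diff = df1.compare(df2)

# The first column level therefore names every column with at least one diff.
differing_columns = diff.columns.get_level_values(0).unique().tolist()
print(differing_columns)
print(diff)
```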
Feature Description
I propose enhancing the error message to something like the following:
AssertionError: DataFrames are different
The following columns contain diffs: ["letter","number"]
First diff: DataFrame.iloc[:, 0] (column name="letter") values are different (50.0 %)
[index]: [0, 1]
[left]: [A, B]
[right]: [a, B]
At positional index 0, first diff: A != a
or
AssertionError: DataFrame.iloc[:, [0,1]] (column name=["letter","number"]) are different
First diff: DataFrame.iloc[:, 0] (column name="letter") values are different (50.0 %)
[index]: [0, 1]
[left]: [A, B]
[right]: [a, B]
At positional index 0, first diff: A != a
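Until something like this exists in pandas, the first proposed message can be approximated in user code. The sketch below is a hypothetical helper (assert_frame_equal_all_columns is not a pandas function) that checks each column with pd.testing.assert_series_equal and collects every failure before raising. It assumes the default strictness and does not accept the extra keyword arguments of assert_frame_equal; the "First diff" text comes from assert_series_equal, so its wording differs from the real DataFrame message.

```python
import pandas as pd


def assert_frame_equal_all_columns(left, right):
    """Hypothetical helper: report every differing column, not just the first."""
    # Column labels must match before a column-by-column check makes sense.
    pd.testing.assert_index_equal(left.columns, right.columns, obj="DataFrame.columns")

    failures = []  # (column name, message) pairs, in column order
    for pos, name in enumerate(left.columns):
        try:
            # Positional access so duplicate column names are handled too.
            pd.testing.assert_series_equal(left.iloc[:, pos], right.iloc[:, pos])
        except AssertionError as exc:
            failures.append((name, str(exc)))

    if failures:
        names = [name for name, _ in failures]
        raise AssertionError(
            "DataFrames are different\n"
            f"The following columns contain diffs: {names}\n"
            f"First diff: {failures[0][1]}"
        )


df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["a", 1], ["B", 3]], columns=["letter", "number"])

try:
    assert_frame_equal_all_columns(df1, df2)
except AssertionError as exc:
    print(exc)
```

Collecting failures in a list keeps the columns in their original order, so the "First diff" shown is the same column that assert_frame_equal would have reported.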
Alternative Solutions
Alternatively, if the community strongly prefers to keep the existing behavior, I would advocate that we should update the docs to make this behavior more explicitly clear to the user.
Additional Context
If there are both Index and column differences, the Index differences are flagged first. For example:
df3 = df1.set_index('letter')
df4 = df2.set_index('letter')
pd.testing.assert_frame_equal(df3,df4)
raises:
AssertionError: DataFrame.index are different
DataFrame.index values are different (50.0 %)
[left]: Index(['A', 'B'], dtype='object', name='letter')
[right]: Index(['a', 'B'], dtype='object', name='letter')
At positional index 0, first diff: A != a
in which case I would recommend something like:
AssertionError: DataFrames are different
- DataFrame.index are different
- The following columns contain diffs: ["number"]
First diff: DataFrame.index values are different (50.0 %)
[left]: Index(['A', 'B'], dtype='object', name='letter')
[right]: Index(['a', 'B'], dtype='object', name='letter')
At positional index 0, first diff: A != a
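A rough sketch of how such a combined summary could be gathered in user code is shown below. summarize_frame_diffs is a hypothetical name, not part of pandas. It assumes that column values should be compared positionally (the index is reset first) so that an index mismatch does not hide column diffs, and it relies on Index.equals and Series.equals, which compare values and dtype but ignore index names.

```python
import pandas as pd


def summarize_frame_diffs(left, right):
    """Hypothetical helper: list which parts of two frames differ."""
    problems = []
    if not left.index.equals(right.index):
        problems.append("DataFrame.index are different")
    if not left.columns.equals(right.columns):
        # Without matching column labels a per-column check is not meaningful.
        problems.append("DataFrame.columns are different")
        return problems

    # Drop the index so values are compared by position only.
    left_vals = left.reset_index(drop=True)
    right_vals = right.reset_index(drop=True)
    differing = [
        name
        for pos, name in enumerate(left_vals.columns)
        if not left_vals.iloc[:, pos].equals(right_vals.iloc[:, pos])
    ]
    if differing:
        problems.append(f"The following columns contain diffs: {differing}")
    return problems


df1 = pd.DataFrame([["A", 1], ["B", 2]], columns=["letter", "number"])
df2 = pd.DataFrame([["a", 1], ["B", 3]], columns=["letter", "number"])
df3 = df1.set_index("letter")
df4 = df2.set_index("letter")

for line in summarize_frame_diffs(df3, df4):
    print("-", line)
```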