-
-
Notifications
You must be signed in to change notification settings - Fork 18.9k
PARTIAL FIX: Improve leading zeros preservation with dtype=str for dict-based dtypes #62242
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
b766c15
ea08043
206cb69
247d514
3e9f04e
c86f33f
81b50db
5990400
d9f6983
0ebca38
22bd4e3
22e4129
74f01a7
9d68f91
a324f4a
8cda906
df9e96d
d2925bc
d6f3a5c
5df2f20
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,71 @@ | ||
from io import StringIO | ||
|
||
import pytest | ||
|
||
|
||
def test_leading_zeros_preserved_with_dtype_str(all_parsers, request): | ||
# GH#57666: pyarrow engine strips leading zeros when dtype=str is passed | ||
# GH#61618: further discussion on ensuring string dtype preservation across engines | ||
|
||
parser = all_parsers | ||
engine_name = getattr(parser, "engine", "unknown") | ||
|
||
data = """col1,col2,col3,col4 | ||
AB,000388907,abc,0150 | ||
CD,101044572,def,0150 | ||
EF,000023607,ghi,0205 | ||
GH,100102040,jkl,0205""" | ||
|
||
result = parser.read_csv( | ||
StringIO(data), | ||
dtype=str, | ||
) | ||
|
||
try: | ||
assert result.shape == (4, 4) | ||
assert list(result.columns) == ["col1", "col2", "col3", "col4"] | ||
|
||
assert result.loc[0, "col2"] == "000388907", "lost zeros in col2 row 0" | ||
assert result.loc[2, "col2"] == "000023607", "lost zeros in col2 row 2" | ||
assert result.loc[0, "col4"] == "0150", "lost zeros in col4 row 0" | ||
assert result.loc[2, "col4"] == "0205", "lost zeros in col4 row 2" | ||
|
||
except AssertionError as exc: | ||
if engine_name == "pyarrow": | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. take a look at how we handle xfails elsewhere. we check and add the marker before the meat of the test There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @jbrockmendel I considered that option, but it doesn't seem appropriate for this case. The tests only fail for the pyarrow engine, and only because there is an underlying flaw in the pyarrow read logic. Is there another preferred way to handle this? |
||
# Temporary workaround for GH#57666 | ||
# Remove once type preservation is fixed in pyarrow engine. | ||
request.node.add_marker( | ||
pytest.mark.xfail(reason=f"failed assertions: {exc}", strict=False) | ||
) | ||
raise | ||
|
||
|
||
def test_leading_zeros_preserved_with_dtype_dict(all_parsers): | ||
# GH#57666: pyarrow engine strips leading zeros when dtype=str is passed | ||
# GH#61618: further discussion on ensuring string dtype preservation across engines | ||
|
||
parser = all_parsers | ||
|
||
data = """col1,col2,col3,col4 | ||
AB,000388907,199,0150 | ||
CD,101044572,200,0150 | ||
EF,000023607,201,0205 | ||
GH,100102040,202,0205""" | ||
|
||
result = parser.read_csv( | ||
StringIO(data), | ||
dtype={"col2": str, "col3": int, "col4": str}, | ||
) | ||
|
||
assert result.shape == (4, 4) | ||
assert list(result.columns) == ["col1", "col2", "col3", "col4"] | ||
|
||
assert result.loc[0, "col2"] == "000388907", "lost zeros in col2 row 0" | ||
assert result.loc[2, "col2"] == "000023607", "lost zeros in col2 row 2" | ||
assert result.loc[0, "col4"] == "0150", "lost zeros in col4 row 0" | ||
assert result.loc[2, "col4"] == "0205", "lost zeros in col4 row 2" | ||
|
||
assert result.loc[0, "col3"] == 199 | ||
assert result.loc[1, "col3"] == 200 | ||
assert result.loc[2, "col3"] == 201 | ||
assert result.loc[3, "col3"] == 202 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't merit its own file. try to find plausibly-related tests to put it with
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seemsl ike the
tests/io/parser
is the right place, but I don't see any other files there that seem appropriate. could you suggest another place? happy to move it.