
Conversation


@dxdc dxdc commented Sep 2, 2025

Summary

This PR partially addresses issue #57666 by improving leading-zero preservation when dtype=str is used in dictionary-based dtype specifications. While the global dtype=str issue with the pyarrow engine remains unfixed, this PR resolves the problem for more targeted, per-column dtype specifications.

Problem

Issue #57666 reports that the pyarrow engine does not preserve leading zeros in numeric-looking strings when dtype=str is specified, while other engines correctly preserve them.
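For reference, here is a minimal reproduction (a sketch only: the column names, values, and the data variable are illustrative, chosen to match the snippets below rather than copied from the test suite):

from io import StringIO

import pandas as pd

data = "col1,col2,col3,col4\n1,0123,10,0030\n2,4567,20,0405\n3,8901,30,0205"

# The C and python engines keep the leading zeros when dtype=str is given
result = pd.read_csv(StringIO(data), dtype=str, engine="python")
print(result.loc[0, "col2"])  # "0123"

# With engine="pyarrow" the same call has been reported to return "123",
# i.e. the leading zero is stripped (issue #57666)
result_pa = pd.read_csv(StringIO(data), dtype=str, engine="pyarrow")
print(result_pa.loc[0, "col2"])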

Solution

  • Fixed: Dictionary-based dtype specifications (dtype={'col': str}) now properly preserve leading zeros across all engines
  • Partial: Global dtype=str still fails with pyarrow engine (marked with xfail for now)
  • Added: Test coverage for dtype specification patterns

What's Fixed vs Still Broken

✅ Now Working:

# This now preserves leading zeros correctly across all engines:
pd.read_csv(data, dtype={'col2': str, 'col3': int, 'col4': str})

⚠️ Still Broken (pyarrow only):

# This still strips leading zeros with pyarrow engine:
pd.read_csv(data, dtype=str)  # global string dtype

Next Steps

This PR provides a foundation for the complete fix. The remaining work involves:

  1. Fully resolving the pyarrow engine's global dtype handling
  2. Removing the xfail marker once completely resolved
  3. Improving the pyarrow engine's dtype enforcement during parsing rather than post-processing conversion


Files Changed

  • pandas/io/parsers/arrow_parser_wrapper.py - Fix for dict-based dtypes
  • pandas/tests/io/parser/test_preserve_leading_zeros.py - Comprehensive test suite

Test Output

  • C engine: ✅ All tests pass
  • Python engine: ✅ All tests pass
  • PyArrow engine:
    • Dict-based dtypes with str columns now pass
    • ⚠️ Global dtype=str marked as xfail (temporary)

@jbrockmendel
Member

Looks like AI

@dxdc
Author

dxdc commented Sep 2, 2025

@jbrockmendel

I did use AI to help draft this. I tried setting up a pandas development environment (both via pip and the Docker image) to create a reproducible test case, but running pytest from the CLI kept failing with a message about pandas._libs.

There seems to be a significant issue with the pyarrow implementation. Specifically, pyarrow does not enforce dtypes during load - it applies them afterward. As a result, integer-to-string conversions lose leading zeros. I wanted to at least contribute a working test that highlights this problem.
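To illustrate the mechanism with a simplified sketch (this is not the actual pyarrow code path): if the column is first inferred as integer and the requested dtype is only applied afterward, the zeros are gone before the cast to str ever happens.

import pandas as pd

# What the CSV contains for one column
raw = ["0123", "0205"]

# If dtype enforcement happened at parse time, the text survives
parsed_as_str = pd.Series(raw, dtype=str)
print(list(parsed_as_str))  # ['0123', '0205']

# If the engine first infers integers and casts to str afterward,
# the leading zeros are already lost
inferred_as_int = pd.Series([int(x) for x in raw])
print(list(inferred_as_int.astype(str)))  # ['123', '205']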

@dxdc
Author

dxdc commented Sep 3, 2025

@jbrockmendel test is passing now. I have it marked as xfail for pyarrow only, but you can clearly see the issue. Once the issue is remedied we can remove the try/except block.

@jbrockmendel
Member

We discourage AI-generated PRs since they take more time and effort to review than they do to write. I'll take a look since you took the time to get the CI passing, but in the future please avoid it.

        assert result.loc[2, "col4"] == "0205", "lost zeros in col4 row 2"

    except AssertionError as exc:
        if engine_name == "pyarrow":
Member

take a look at how we handle xfails elsewhere. we check and add the marker before the meat of the test
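For reference, the pattern used elsewhere in the parser tests looks roughly like this (a sketch; the reason text is a placeholder):

import pytest


def test_leading_zeros_preserved_with_dtype_str(all_parsers, request):
    parser = all_parsers
    if parser.engine == "pyarrow":
        # mark the expected failure up front instead of catching AssertionError later
        mark = pytest.mark.xfail(reason="pyarrow strips leading zeros with dtype=str")
        request.applymarker(mark)
    ...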

Author

@jbrockmendel I considered that option, but it doesn't seem appropriate for this case. The tests only fail for the pyarrow engine, and only because there is an underlying flaw in the pyarrow read logic. Is there another preferred way to handle this?

import pytest


def test_leading_zeros_preserved_with_dtype_str(all_parsers, request):
Member

this doesn't merit its own file. try to find plausibly-related tests to put it with

Author

It seems like tests/io/parser is the right place, but I don't see any other files there that seem appropriate. Could you suggest another place? Happy to move it.

@jbrockmendel
Member

big picture adding a test isn't wrong, but not a good use of time. if you'd like to actually fix the issue, i think there are some good comments in the original thread

@dxdc
Author

dxdc commented Sep 3, 2025

> big picture adding a test isn't wrong, but not a good use of time. if you'd like to actually fix the issue, i think there are some good comments in the original thread

I agree, but the fix doesn't appear to be very straightforward. Happy to work on it if there is some guidance on where to find the relevant pieces. It requires proper mapping of pandas dtypes to pyarrow types, and also handling other logic that pandas supports but pyarrow doesn't (e.g., col index-based dtypes, global dtypes, etc.).
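One possible starting point for that mapping (a sketch under the assumption that per-column dtypes are available as a dict; to_pyarrow_type is a hypothetical helper, while pandas_dtype and pa.from_numpy_dtype are existing APIs):

import pyarrow as pa
from pandas.api.types import pandas_dtype


def to_pyarrow_type(dtype):
    """Best-effort mapping of a user-supplied dtype to a pyarrow DataType."""
    dtype = pandas_dtype(dtype)
    if dtype == object:
        # dtype=str arrives as object; force a string column so "0123" stays text
        return pa.string()
    try:
        return pa.from_numpy_dtype(dtype)
    except (pa.ArrowNotImplementedError, TypeError):
        # e.g. CategoricalDtype has no direct pyarrow mapping here; skip (or warn)
        return None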

@dxdc changed the title from "TST: Add test for leading zeros preservation with dtype=str across parser engines" to "PARTIAL FIX: Improve leading zeros preservation with dtype=str for dict-based dtypes" on Sep 3, 2025
@dxdc
Author

dxdc commented Sep 4, 2025

@jbrockmendel I dug into the pandas and PyArrow APIs and landed on a more general fix for the issue. I wasn't sure how to run pytest locally against the dev branch - my attempts with the Docker container didn't work - so I relied on pandas' test suite.

This patch improves dtype handling in the pyarrow read path by converting supported column-specific dtypes into pyarrow types and passing them via convert_options["column_types"] (a standalone sketch of this idea follows the list below). However, a few limitations remain:

Known Issues / Remaining Work

  1. Global dtype support is not implemented
    The current logic handles only column-specific dtype dictionaries. Global dtypes (e.g., dtype=str) are ignored, which could lead to inconsistent behavior across engines, especially with things like leading-zero string preservation. Supporting this would require column name/index context, which doesn't seem to be readily available here. I couldn't find a safe and clean way to retrieve it without broader architectural changes.

    EDIT: After review, I think this feature would be best handled with a change to the PyArrow API, which we could adapt here quite easily. I've posted that issue here: apache/arrow#47502

  2. Unsupported dtypes are silently skipped
    If a dtype (e.g., "category") is not mappable to a PyArrow type, we currently drop it from column_types. This fallback behavior avoids breaking the pipeline, but it may lead to silent mismatches when PyArrow falls back to its default inference. We may want to revisit this to either emit warnings or fail explicitly for better visibility.

  3. Possible redundancy in _finalize_dtype()
    Now that dtypes are mapped earlier during the pyarrow conversion, we may no longer need the final call to self._finalize_dtype(). However, it might still be necessary for preserving native pandas types (e.g., CategoricalDtype) in cases that do not have native pyarrow support.
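A standalone sketch of the approach outside of the pandas internals (the csv payload and column names are made up for illustration; the actual change lives in arrow_parser_wrapper.py):

from io import BytesIO

import pyarrow as pa
from pyarrow import csv

payload = b"col1,col2,col3,col4\n1,0123,10,0030\n2,4567,20,0405\n3,8901,30,0205\n"

# Declaring the column type up front makes pyarrow parse these values as text,
# so the leading zeros never pass through an integer representation
convert_options = csv.ConvertOptions(
    column_types={"col2": pa.string(), "col4": pa.string()}
)
table = csv.read_csv(BytesIO(payload), convert_options=convert_options)
print(table.column("col4").to_pylist())  # ['0030', '0405', '0205']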

Development

Successfully merging this pull request may close these issues.

BUG: pyarrow stripping leading zeros with dtype=str