Skip to content

BUG: Fix Series.str.contains with compiled regex on Arrow string dtype #61946

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

Aniketsy
Copy link
Contributor

@Aniketsy Aniketsy commented Jul 25, 2025

closes #61942

This PR fixes an issue in Series.str.contains() where passing a compiled regex object failed when the underlying string data is backed by PyArrow.

Please, provide feedback if my approach is not correct , I would love to improve and contribute in this.

@Aniketsy Aniketsy force-pushed the fix-arrow-contains-regex-v2 branch from 0b16375 to 838b1c5 Compare July 25, 2025 13:15
@Aniketsy
Copy link
Contributor Author

Hi @mroeschke
I've worked on the issue
BUG: Fix Series.str.contains with compiled regex on Arrow string dtype ([#61942])
and have opened a pull request for it.

I'd appreciate it if you could take a look and share your feedback.
Please let me know if anything needs to be improved or clarified.

Thanks!

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fix should go into _str_contains of ArrowExtensionArray

@Aniketsy
Copy link
Contributor Author

Thankyou for the feedback!
I will update that.

@jorisvandenbossche
Copy link
Member

Additionally, if this is something that is not implemented by pyarrow, we should not raise a NotImplementedError, but fall back on the python object implementation (you can see a similar pattern in some other str methods, like ArrowStringArray._str_replace)

@jorisvandenbossche jorisvandenbossche added Bug Strings String extension data type and string data Arrow pyarrow functionality labels Jul 26, 2025
@jorisvandenbossche jorisvandenbossche added this to the 2.3.2 milestone Jul 26, 2025
@Aniketsy
Copy link
Contributor Author

@jorisvandenbossche Thank you for the feedback! I will update the PR accordingly.

Would you mind letting me know the reason behind the one failing check (pre-commit.ci)?
Thanks again!

@jorisvandenbossche
Copy link
Member

Would you mind letting me know the reason behind the one failing check (pre-commit.ci)?

ruff is failing, which is used for auto formatting. I would recommend to install the pre-commit locally to avoid having this fail on CI: https://pandas.pydata.org/docs/dev/development/contributing_codebase.html#pre-commit

@Aniketsy
Copy link
Contributor Author

hi @jorisvandenbossche
Please review this PR, and if area needs changes please suggest.
Also I want to know if i would need to write unit test for this .

Thankyou!

Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a whatsnew note in v2.3.2 and tests for this?

@@ -344,6 +344,9 @@ def _str_contains(
na=lib.no_default,
regex: bool = True,
):
if isinstance(pat, re.Pattern) and regex:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you combine this case with the one below?

Comment on lines 286 to 287
if any_string_dtype == "string[pyarrow]":
pytest.importorskip("pyarrow")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if any_string_dtype == "string[pyarrow]":
pytest.importorskip("pyarrow")

That should already happen via the fixture, I think?

"str": bool,
}.get(any_string_dtype, object)
expected = Series([False, True, True], dtype=expected_dtype)
assert str(result.dtype) == str(expected.dtype)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert str(result.dtype) == str(expected.dtype)

This check will also be done by the assert_series_equal called on the next line

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now , I have applied changes as per your suggestion .

@jorisvandenbossche
Copy link
Member

Can you try to run the test you added locally? Then you can make sure to get it working correctly. Right now it is still failing according to CI

@Aniketsy
Copy link
Contributor Author

Sure, I will try to run tests locally and update this PR .
Thankyou !

pat = re.compile("ba.")
result = ser.str.contains(pat)

expected = Series([False, True, True], dtype=bool)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
expected = Series([False, True, True], dtype=bool)
expected_dtype = (
np.bool_ if is_object_or_nan_string_dtype(any_string_dtype) else "boolean"
)
expected = Series([False, True, True], dtype=expected_dtype)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for suggesting these changes! I tried running the tests locally to verify, but couldn’t get them working on my setup.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests are passing now on CI!

Aniketsy added a commit to Aniketsy/pandas that referenced this pull request Aug 13, 2025
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good now!

@jorisvandenbossche jorisvandenbossche dismissed mroeschke’s stale review August 14, 2025 08:26

test and whatsnew have been added

@jorisvandenbossche jorisvandenbossche changed the title BUG: Fix Series.str.contains with compiled regex on Arrow string dtype (#61942) BUG: Fix Series.str.contains with compiled regex on Arrow string dtype Aug 14, 2025
@jorisvandenbossche jorisvandenbossche merged commit 1d22331 into pandas-dev:main Aug 14, 2025
38 checks passed
@jorisvandenbossche
Copy link
Member

Thanks @Aniketsy

Copy link

lumberbot-app bot commented Aug 14, 2025

Owee, I'm MrMeeseeks, Look at me.

There seem to be a conflict, please backport manually. Here are approximate instructions:

  1. Checkout backport branch and update it.
git checkout 2.3.x
git pull
  1. Cherry pick the first parent branch of the this PR on top of the older branch:
git cherry-pick -x -m1 1d2233185083423b8ecb27986f11175b2d6e8fa6
  1. You will likely have some merge/cherry-pick conflict here, fix them and commit:
git commit -am 'Backport PR #61946: BUG: Fix Series.str.contains with compiled regex on Arrow string dtype'
  1. Push to a named branch:
git push YOURFORK 2.3.x:auto-backport-of-pr-61946-on-2.3.x
  1. Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #61946 on branch 2.3.x (BUG: Fix Series.str.contains with compiled regex on Arrow string dtype)"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

@Aniketsy
Copy link
Contributor Author

@jorisvandenbossche
Big thanks for patiently guiding me at every step and helping me get this right. I learned a lot from this, and I’m glad the PR is now merged!
Thankyou .

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Aug 15, 2025
pandas-dev#61946)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
@jorisvandenbossche
Copy link
Member

@Aniketsy you're welcome!

@jorisvandenbossche
Copy link
Member

Manual backport -> #62116

jorisvandenbossche added a commit that referenced this pull request Aug 15, 2025
…Arrow string dtype (#61946) (#62116)

Co-authored-by: Aniket <148300120+Aniketsy@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality Bug Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Using Series.str.contains() with a compiled regex and arrow strings fails
3 participants