-
-
Notifications
You must be signed in to change notification settings - Fork 19.2k
BUG: creating Categorical from pandas Index/Series with "object" dtype infers string #62080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
slack link update issue pandas-dev#61690
|
@jbrockmendel Regarding this bug, the change to always preserve object dtype for categories when constructing a Categorical from a pandas Series or Index with dtype="object" is a behavioral change that affects a wide range of pandas internals and user-facing APIs. Hence I am seeing a lot of failures. I see two ways to resolve without changing overall behavior.
Let me know your thoughts. |
| codes, categories = factorize(values, sort=False) | ||
| if dtype.ordered: | ||
| # raise, as we don't have a sortable data structure and so | ||
| # the user should give us one by specifying categories |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt the comments were redundant as TypeError already explain it clearly and also new logic is added to detect if the input values is a pandas Series or Index with "object" dtype, and then force the categories to use object dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Although I do not have any strong preference, I am happy to add it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add it back to keep the diff focused
@jbrockmendel Thank you for your comments. I have addressed them. Although I doubt it would help address all the failing tests. Please let me know your thoughts. Any preference from above two options. |
I tried out the 1st way and all the test cases are passing. Let me know your thoughts. |
| # If we should prserve object dtype, force categories to object dtype | ||
| if preserve_object_dtpe: | ||
| # Only preserve object dtype if not all elements are strings | ||
| if not all(isinstance(x, str) for x in categories): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this check necessary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the change to always preserve object dtype for categories when constructing a Categorical from a pandas Series or Index with dtype="object" is a behavioral change that affects a wide range of pandas internals and user-facing APIs.
To make sure other functionality doesn't break, I am preserving object datatype when all elements are not string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i dont think that is necessary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the more i look at it, the more i think this misses the point of the motivating issue. i mean, wouldn't the new test this PR adds pass on main?
|
@niruta25 - are you interested in continuing here? If not, I can pick this up. |
Hey I would like to continue working in this if it's ok. Should resolve comments by this weekend. |
@rhshadrach @jbrockmendel addressed the comments. Please review. |
| dtype = CategoricalDtype(categories, values.dtype.pyarrow_dtype.ordered) | ||
| else: | ||
| # Check for pandas Series/ Index with object dtye | ||
| preserve_object_dtpe = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo dtpe -> dtype
| - Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`) | ||
| - Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`) | ||
| - | ||
| - Bug in :class:`Categorical` where constructing from a pandas :class:`Series` or :class:`Index` with ``dtype='object'`` did not preserve the categories' dtype as ``object``; now the dtype is preserved as ``object`` for these cases, while numpy arrays and Python sequences with ``dtype='object'`` continue to infer the most specific dtype (for example, ``str`` if all elements are strings). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
issue ref at the end
|
|
||
| # Series/Index with object dtype: infer string | ||
| # dtype if all elements are strings | ||
| assert cat_from_ser.categories.inferred_type == "string" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checking inferred_type isn't going to give us what we want. check the categories.dtype directly
|
closed by #62757 |
doc/source/whatsnew/vX.X.X.rstfile if fixing a bug or adding a new feature.Preserve object dtype for categories when constructing Categorical from pandas objects
This PR fixes an inconsistency in how pandas infers the dtype of categories when constructing a Categorical from different input types:
When constructing a Categorical from a pandas Series or Index with dtype="object", the categories' dtype is now preserved as object.
When constructing from a NumPy array with dtype="object" or a raw Python sequence, pandas continues to infer the most specific dtype for the categories (e.g., str if all elements are strings).
This change brings the behavior of Categorical in line with how Series and Index handle dtype preservation, making the API more consistent and predictable.
Example
Documentation and release notes have been updated.
Closes: #61778