BUG: creating Categorical from pandas Index/Series with "object" dtype infers string #62080

niruta25 · 2025-08-09T03:19:14Z

closes BUG?: creating Categorical from pandas Index/Series with "object" dtype infers string #61778
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Preserve object dtype for categories when constructing Categorical from pandas objects

This PR fixes an inconsistency in how pandas infers the dtype of categories when constructing a Categorical from different input types:

When constructing a Categorical from a pandas Series or Index with dtype="object", the categories' dtype is now preserved as object.
When constructing from a NumPy array with dtype="object" or a raw Python sequence, pandas continues to infer the most specific dtype for the categories (e.g., str if all elements are strings).
This change brings the behavior of Categorical in line with how Series and Index handle dtype preservation, making the API more consistent and predictable.

Example

import numpy as np
import pandas as pd

pd.options.future.infer_string = True

ser = pd.Series(["foo", "bar", "baz"], dtype="object")
idx = pd.Index(["foo", "bar", "baz"], dtype="object")
arr = np.array(["foo", "bar", "baz"], dtype="object")
pylist = ["foo", "bar", "baz"]

cat_from_ser = pd.Categorical(ser)
cat_from_idx = pd.Categorical(idx)
cat_from_arr = pd.Categorical(arr)
cat_from_list = pd.Categorical(pylist)

# Series/Index with object dtype: preserve object dtype
assert cat_from_ser.categories.dtype == "object"
assert cat_from_idx.categories.dtype == "object"

# Numpy array or list: infer string dtype
assert cat_from_arr.categories.dtype == "str"
assert cat_from_list.categories.dtype == "str"

Documentation and release notes have been updated.
Closes: #61778

niruta25 · 2025-08-09T05:27:16Z

@jbrockmendel Always preserving object dtype for categories when constructing a Categorical from a pandas Series or Index with dtype="object" is a behavioral change that affects a wide range of pandas internals and user-facing APIs, so I am seeing a lot of test failures.

I see two ways to resolve this without changing the overall behavior.

Preserve object dtype only when not all elements are strings

If the input is a pandas Series/Index with dtype="object", preserve object dtype for categories only if at least one element is not a string.
If all elements are strings, allow inference to str (the current behavior).

Add a Keyword Argument to Categorical (e.g., preserve_object_dtype=False)

Add an explicit option to the Categorical constructor to preserve the object dtype for categories.
Default to the current behavior, but allow users to opt in to preservation.

Let me know your thoughts.
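
For illustration, the first option above could be sketched roughly as follows. This is only a sketch under stated assumptions: `should_preserve_object` is a hypothetical helper, not pandas API and not the code in this PR, and it inspects the input values where the PR's diff inspects the computed categories.

```python
import numpy as np
import pandas as pd


def should_preserve_object(values) -> bool:
    """Hypothetical helper (not pandas API) sketching option 1."""
    # Only pandas containers are candidates; arrays and lists keep inference.
    if not isinstance(values, (pd.Series, pd.Index)):
        return False
    # Only an explicit object dtype is a candidate for preservation.
    if values.dtype != object:
        return False
    # Preserve object only when at least one element is not a str.
    return not all(isinstance(x, str) for x in values)


all_strings = pd.Series(["foo", "bar"], dtype="object")
mixed = pd.Series(["foo", 1], dtype="object")
raw_array = np.array(["foo", 1], dtype="object")

print(should_preserve_object(all_strings))
print(should_preserve_object(mixed))
print(should_preserve_object(raw_array))
```

Under this rule an all-string object Series would still have its categories inferred as str, which is the case the linked issue is about.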

pandas/tests/extension/test_categorical.py

jbrockmendel · 2025-08-20T17:05:50Z

pandas/core/arrays/categorical.py

                    codes, categories = factorize(values, sort=False)
                    if dtype.ordered:
-                        # raise, as we don't have a sortable data structure and so
-                        # the user should give us one by specifying categories


why is this removed?

I felt the comments were redundant because the TypeError message already explains the situation clearly. The new logic added here also detects whether the input is a pandas Series or Index with "object" dtype and then forces the categories to use object dtype.

I do not have a strong preference, though, and am happy to add it back.

please add it back to keep the diff focused

niruta25 · 2025-08-28T06:24:51Z

@jbrockmendel Thank you for your comments; I have addressed them, although I doubt that will fix all the failing tests. Please let me know your thoughts, and whether you prefer either of the two options above.

niruta25 · 2025-08-29T04:34:08Z

I tried the first option, and all the test cases now pass. Let me know your thoughts.

pandas/core/arrays/categorical.py

jbrockmendel · 2025-09-19T19:32:39Z

pandas/core/arrays/categorical.py

+                # If we should prserve object dtype, force categories to object dtype
+                if preserve_object_dtpe:
+                    # Only preserve object dtype if not all elements are strings
+                    if not all(isinstance(x, str) for x in categories):


why is this check necessary?

Always preserving object dtype for categories when constructing a Categorical from a pandas Series or Index with dtype="object" is a behavioral change that affects a wide range of pandas internals and user-facing APIs.
To make sure other functionality doesn't break, I preserve object dtype only when not all elements are strings.

i dont think that is necessary

the more i look at it, the more i think this misses the point of the motivating issue. i mean, wouldn't the new test this PR adds pass on main?

rhshadrach · 2025-10-04T17:15:43Z

@niruta25 - are you interested in continuing here? If not, I can pick this up.

niruta25 · 2025-10-05T05:15:06Z

Hey, I would like to continue working on this if that's OK. I should resolve the comments by this weekend.

niruta25 · 2025-10-06T05:38:27Z

@rhshadrach @jbrockmendel addressed the comments. Please review.

jbrockmendel · 2025-10-19T18:54:36Z

pandas/core/arrays/categorical.py

                dtype = CategoricalDtype(categories, values.dtype.pyarrow_dtype.ordered)
            else:
+                # Check for pandas Series/ Index with object dtye
+                preserve_object_dtpe = False


typo dtpe -> dtype

jbrockmendel · 2025-10-19T18:57:52Z

doc/source/whatsnew/v3.0.0.rst

 - Bug in :meth:`DataFrame.pivot` and :meth:`DataFrame.set_index` raising an ``ArrowNotImplementedError`` for columns with pyarrow dictionary dtype (:issue:`53051`)
 - Bug in :meth:`Series.convert_dtypes` with ``dtype_backend="pyarrow"`` where empty :class:`CategoricalDtype` :class:`Series` raised an error or got converted to ``null[pyarrow]`` (:issue:`59934`)
-
+- Bug in :class:`Categorical` where constructing from a pandas :class:`Series` or :class:`Index` with ``dtype='object'`` did not preserve the categories' dtype as ``object``; now the dtype is preserved as ``object`` for these cases, while numpy arrays and Python sequences with ``dtype='object'`` continue to infer the most specific dtype (for example, ``str`` if all elements are strings).


issue ref at the end

jbrockmendel · 2025-10-19T19:18:11Z

pandas/tests/arrays/categorical/test_constructors.py

+
+            # Series/Index with object dtype: infer string
+            # dtype if all elements are strings
+            assert cat_from_ser.categories.inferred_type == "string"


checking inferred_type isn't going to give us what we want. check the categories.dtype directly
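
A small sketch of why the two checks differ, assuming only the public `Index.inferred_type` and `Index.dtype` attributes: `inferred_type` describes the values an Index holds, so it can report "string" for an object-dtype Index and does not show whether the object dtype was preserved.

```python
import pandas as pd

# An all-string Index stored with an explicit object dtype.
obj_idx = pd.Index(["foo", "bar"], dtype="object")

# inferred_type is computed from the values, so it reports "string" here...
print(obj_idx.inferred_type)

# ...while dtype reports how the values are actually stored.
print(obj_idx.dtype)
```

Asserting on `categories.dtype` is therefore the check that tells the object and str cases apart.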

jbrockmendel · 2025-10-19T19:25:27Z

@niruta25 i updated your branch with the necessary changes in #62757. If you'd like to pull those changes into this branch we can merge this instead.

jbrockmendel · 2025-10-24T17:19:37Z

closed by #62757

niruta25 added 9 commits June 24, 2025 14:10

slack link update

c4e1c18

Merge pull request #2 from niruta25/niruta-#61690-slack

e1a893d

slack link update issue pandas-dev#61690

Merge branch 'pandas-dev:main' into main

cfa767f

Merge branch 'pandas-dev:main' into main

c0ae870

Merge branch 'pandas-dev:main' into main

5188b81

Merge branch 'pandas-dev:main' into main

b63a723

Merge branch 'pandas-dev:main' into main

0fb42cc

object

9216954

whatsnew

8f460ac

niruta25 changed the title ~~Niruta issue61778~~ BUG: creating Categorical from pandas Index/Series with "object" dtype infers string Aug 9, 2025

jbrockmendel reviewed Aug 20, 2025


niruta25 added 2 commits August 27, 2025 22:51

some comments

87a54fe

comment restore

cddc574

niruta25 added 3 commits August 28, 2025 00:27

assertionerror fix

e83e4f9

rst changes

5ed039a

rst import error

9b4b2d9

Merge branch 'main' into niruta_issue61778

4855994

jbrockmendel reviewed Sep 19, 2025


change condition

1b81162

jbrockmendel reviewed Oct 19, 2025

jbrockmendel mentioned this pull request Oct 19, 2025

BUG: Categorical(Series[object]) not preserving categories.dtype as object #62757

Merged

5 tasks

jbrockmendel closed this Oct 24, 2025
