[Prep-refactor-2] Move categorical feature detection logic into different file #717

alanprior · 2026-01-08T13:18:05Z

Issue

Please link the corresponding GitHub issue. If an issue does not already exist,
please open one to describe the bug or feature request before creating a pull request.

This allows us to discuss the proposal and helps avoid unnecessary work.

Motivation and Context

Public API Changes

No Public API changes
Yes, Public API changes (Details below)

How Has This Been Tested?

Checklist

The changes have been tested locally.
Documentation has been updated (if the public API or usage changes).
An entry has been added to CHANGELOG.md (if relevant for users).
The code follows the project's style guidelines.
I have considered the impact of these changes on the public API.

chatgpt-codex-connector · 2026-01-08T13:18:11Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

CLAassistant · 2026-01-08T13:18:14Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

gemini-code-assist

Code Review

This pull request successfully moves the infer_categorical_features function to a new, more appropriate location in src/tabpfn/preprocessing/typing.py, which improves the project structure. My review focuses on two main points: suggesting a performance improvement in the moved function, and addressing the numerous TODO DETECT comments that have been added. While these comments are useful notes for future work, it's best practice to track them as issues in your project's issue tracker to ensure they are not lost and can be properly managed. I've left specific comments on each of these.

gemini-code-assist · 2026-01-08T13:20:26Z

src/tabpfn/architectures/base/encoders.py

            x: The input tensor.
            **kwargs: Additional keyword arguments (unused).
        """
+        # TODO DETECT: We detect here empty features, this could happen in advance.


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:26Z

src/tabpfn/classifier.py

        # as handle `np.object` arrays or otherwise `object` dtype pandas columns.

        if not self.differentiable_input:
+            # TODO DETECT: this part is very important, as detection occurs here. We shall replace this for sure.


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/classifier.py

            self.preprocessor_ = ord_encoder

        else:  # Minimal preprocessing for prompt tuning
+            # TODO DETECT: Unsure if and how we want to change this flow, as we would like to deprecate inferred_categorical_indices_?


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/preprocessing/steps/encode_categorical_features_step.py

        X: np.ndarray,
        categorical_features: list[int],
    ) -> tuple[ColumnTransformer | None, list[int]]:
+        # TODO DETECT: This area should be aware of what are the categorical features


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/preprocessing/steps/remove_constant_features_step.py

    def _fit(  # type: ignore
        self, X: np.ndarray | torch.Tensor, categorical_features: list[int]
    ) -> list[int]:
+        # TODO DETECT: We would like to detect 'constant' in advance, and pass it to here


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/preprocessing/typing.py

+import pandas as pd
+
+
+# TODO DETECT: This function should probably be decomposed to operate on a Series rather than matrix/dataframe, so we can decouple


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/preprocessing/typing.py

+        The indices of inferred categorical features.
+    """
+    # We presume everything is numerical and go from there
+    maybe_categoricals = () if provided is None else provided


For better performance, consider converting provided to a set before the loop. Checking for an item's existence in a list or tuple has a time complexity of O(n), while for a set it's O(1) on average. This can make a noticeable difference if provided contains many feature indices.

Suggested change

-    maybe_categoricals = () if provided is None else provided
+    maybe_categoricals = set(provided) if provided is not None else set()
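
To illustrate the suggestion above, here is a minimal sketch of the kind of membership check it refers to. The helper names normalize_provided and flag_provided_columns are hypothetical and are not part of TabPFN; the real infer_categorical_features does more than this. The sketch only shows that converting provided to a set once keeps each later "in" check O(1) on average, and that a None input becomes an empty set.

```python
from __future__ import annotations

from collections.abc import Sequence


def normalize_provided(provided: Sequence[int] | None) -> set[int]:
    """Hypothetical helper: turn user-provided indices into a set.

    Mirrors the suggested one-liner; None becomes an empty set and
    duplicate indices collapse.
    """
    return set(provided) if provided is not None else set()


def flag_provided_columns(
    n_features: int, provided: Sequence[int] | None
) -> list[bool]:
    """Hypothetical loop: mark which column indices the user provided."""
    maybe_categoricals = normalize_provided(provided)
    # Each membership test is an average O(1) set lookup instead of an
    # O(n) scan over a list or tuple.
    return [i in maybe_categoricals for i in range(n_features)]


if __name__ == "__main__":
    print(flag_provided_columns(4, [1, 3]))
```

Whether the change matters in practice depends on how many indices are passed; for a handful of columns the difference is unlikely to be measurable.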

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/utils.py

    cat_indices: Sequence[int | str] | None,
    numeric_dtype: Literal["float32", "float64"] = "float64",
 ) -> pd.DataFrame:
+    # TODO DETECT: This function does both detection and conversion. We would like to decouple, and store it


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/utils.py

            X = pd.DataFrame(X, copy=True)
            convert_dtype = True
        elif X.dtype.kind in STRING_DTYPE_KINDS:
+            # TODO DETECT: We would like to detect in advance that there are strings


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

gemini-code-assist · 2026-01-08T13:20:27Z

src/tabpfn/utils.py


    Note that this function sometimes mutates its input.
    """
+    # TODO DETECT: This function should be aware of the types.


This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

alanprior · 2026-01-08T20:05:46Z

Some of this will already be handled in #720.

Closing.

alanarazi7 added 3 commits January 8, 2026 15:09

Commenting potential changes for detection code

9df9789

Reverting removed comments

fac02b0

Moving infer_categorical_features to another file

435cdfb

alanprior requested a review from a team as a code owner January 8, 2026 13:18

alanprior requested review from adrian-prior and removed request for a team January 8, 2026 13:18

alanprior assigned bejaeger Jan 8, 2026

gemini-code-assist bot reviewed Jan 8, 2026

bejaeger removed the request for review from adrian-prior January 8, 2026 13:28

alanprior closed this Jan 8, 2026

5 participants