@alanprior

Issue

Please link the corresponding GitHub issue. If an issue does not already exist,
please open one to describe the bug or feature request before creating a pull request.

This allows us to discuss the proposal and helps avoid unnecessary work.

Motivation and Context


Public API Changes

  • No Public API changes
  • Yes, Public API changes (Details below)

How Has This Been Tested?


Checklist

  • The changes have been tested locally.
  • Documentation has been updated (if the public API or usage changes).
  • An entry has been added to CHANGELOG.md (if relevant for users).
  • The code follows the project's style guidelines.
  • I have considered the impact of these changes on the public API.

@alanprior alanprior requested a review from a team as a code owner January 8, 2026 13:18
@alanprior alanprior requested review from adrian-prior and removed request for a team January 8, 2026 13:18
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request successfully moves the infer_categorical_features function to a new, more appropriate location in src/tabpfn/preprocessing/typing.py, which improves the project structure. My review focuses on two main points: suggesting a performance improvement in the moved function, and addressing the numerous TODO DETECT comments that have been added. While these comments are useful notes for future work, it's best practice to track them as issues in your project's issue tracker to ensure they are not lost and can be properly managed. I've left specific comments on each of these.

x: The input tensor.
**kwargs: Additional keyword arguments (unused).
"""
# TODO DETECT: We detect here empty features, this could happen in advance.

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

# as handle `np.object` arrays or otherwise `object` dtype pandas columns.

if not self.differentiable_input:
# TODO DETECT: this part is very important, as detection occurs here. We shall replace this for sure.

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

self.preprocessor_ = ord_encoder

else: # Minimal preprocessing for prompt tuning
# TODO DETECT: Unsure if and how we want to change this flow, as we would like to deprecate inferred_categorical_indices_?

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

    X: np.ndarray,
    categorical_features: list[int],
) -> tuple[ColumnTransformer | None, list[int]]:
    # TODO DETECT: This area should be aware of what are the categorical features

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

def _fit(  # type: ignore
    self, X: np.ndarray | torch.Tensor, categorical_features: list[int]
) -> list[int]:
    # TODO DETECT: We would like to detect 'constant' in advance, and pass it to here

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.
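
To make the TODO above concrete for a tracking issue, here is a minimal sketch of detecting constant columns ahead of fitting. The helper name `find_constant_columns` is hypothetical and not part of the PR; it assumes a 2D NumPy array and treats NaN as unequal to itself, so an all-NaN column is not flagged.

```python
import numpy as np


def find_constant_columns(X: np.ndarray) -> list[int]:
    """Return indices of columns whose values are all identical.

    Hypothetical helper for illustration only; assumes a 2D array.
    """
    if X.shape[0] == 0:
        # No rows: nothing can be judged constant.
        return []
    # Compare every row against the first row, column by column.
    constant_mask = (X == X[0:1, :]).all(axis=0)
    return np.flatnonzero(constant_mask).tolist()
```

The resulting indices could then be passed into the fitting step instead of being recomputed there, which is the decoupling the TODO describes.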

import pandas as pd


# TODO DETECT: This function should probably be decomposed to operate on a Series rather than matrix/dataframe, so we can decouple

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

The indices of inferred categorical features.
"""
# We presume everything is numerical and go from there
maybe_categoricals = () if provided is None else provided

medium

For better performance, consider converting provided to a set before the loop. Checking for an item's existence in a list or tuple has a time complexity of O(n), while for a set it's O(1) on average. This can make a noticeable difference if provided contains many feature indices.

Suggested change
- maybe_categoricals = () if provided is None else provided
+ maybe_categoricals = set(provided) if provided is not None else set()
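
As a rough illustration of the set-conversion suggestion above: the helper `select_candidate_columns` and its arguments are hypothetical and only demonstrate the membership check, not the actual body of `infer_categorical_features`.

```python
from typing import Optional, Sequence


def select_candidate_columns(
    n_columns: int, provided: Optional[Sequence[int]]
) -> list[int]:
    """Return the column indices that the caller flagged as categorical.

    Hypothetical helper for illustration only.
    """
    # Build the set once; each membership test is then O(1) on average
    # instead of O(n) for a list or tuple.
    maybe_categoricals = set(provided) if provided is not None else set()
    return [ix for ix in range(n_columns) if ix in maybe_categoricals]
```

For a handful of indices the difference is negligible; it only matters when `provided` is large relative to the number of columns scanned.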

    cat_indices: Sequence[int | str] | None,
    numeric_dtype: Literal["float32", "float64"] = "float64",
) -> pd.DataFrame:
    # TODO DETECT: This function does both detection and conversion. We would like to decouple, and store it

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

    X = pd.DataFrame(X, copy=True)
    convert_dtype = True
elif X.dtype.kind in STRING_DTYPE_KINDS:
    # TODO DETECT: We would like to detect in advance that there are strings

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.
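
If this TODO is moved to an issue, a small sketch of the up-front check may help scope it. The name `has_string_dtype` and the constant `ASSUMED_STRING_KINDS` are hypothetical; the real contents of `STRING_DTYPE_KINDS` are not shown in this PR, so the kinds below are an assumption.

```python
import numpy as np

# Assumed for illustration: "U" is NumPy's unicode kind, "S" its bytes kind.
ASSUMED_STRING_KINDS = frozenset("US")


def has_string_dtype(X: np.ndarray) -> bool:
    """Report whether the array itself has a string dtype.

    Hypothetical helper; it does not inspect `object` arrays element-wise.
    """
    return X.dtype.kind in ASSUMED_STRING_KINDS
```

Running such a check once before validation would let later steps receive the result rather than re-deriving it from the dtype.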

Note that this function sometimes mutates its input.
"""
# TODO DETECT: This function should be aware of the types.

medium

This TODO comment highlights a potential future improvement. To ensure this is tracked and addressed, it would be better to create an issue in your project's issue tracker and remove this comment from the code. This helps in maintaining a clean codebase and managing technical debt effectively.

@bejaeger bejaeger removed the request for review from adrian-prior January 8, 2026 13:28
@alanprior alanprior closed this Jan 8, 2026
@alanprior
Author

Some of this will already be handled in:
#720

Closing.
