-
Notifications
You must be signed in to change notification settings - Fork 539
[Prep-refactor-2] Move categorical feature detection logic into different file #717
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request successfully moves the infer_categorical_features function to a new, more appropriate location in src/tabpfn/preprocessing/typing.py, which improves the project structure. My review focuses on two main points: suggesting a performance improvement in the moved function, and addressing the numerous TODO DETECT comments that have been added. While these comments are useful notes for future work, it's best practice to track them as issues in your project's issue tracker to ensure they are not lost and can be properly managed. I've left specific comments on each of these.
| x: The input tensor. | ||
| **kwargs: Additional keyword arguments (unused). | ||
| """ | ||
| # TODO DETECT: We detect here empty features, this could happen in advance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| # as handle `np.object` arrays or otherwise `object` dtype pandas columns. | ||
|
|
||
| if not self.differentiable_input: | ||
| # TODO DETECT: this part is very important, as detection occurs here. We shall replace this for sure. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| self.preprocessor_ = ord_encoder | ||
|
|
||
| else: # Minimal preprocessing for prompt tuning | ||
| # TODO DETECT: Unsure if and how we want to change this flow, as we would like to deprecate inferred_categorical_indices_? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| X: np.ndarray, | ||
| categorical_features: list[int], | ||
| ) -> tuple[ColumnTransformer | None, list[int]]: | ||
| # TODO DETECT: This area should be aware of what are the categorical features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| def _fit( # type: ignore | ||
| self, X: np.ndarray | torch.Tensor, categorical_features: list[int] | ||
| ) -> list[int]: | ||
| # TODO DETECT: We would like to detect 'constant' in advance, and pass it to here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| import pandas as pd | ||
|
|
||
|
|
||
| # TODO DETECT: This function should probably be decomposed to operate on a Series rather than matrix/dataframe, so we can decouple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| The indices of inferred categorical features. | ||
| """ | ||
| # We presume everything is numerical and go from there | ||
| maybe_categoricals = () if provided is None else provided |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For better performance, consider converting provided to a set before the loop. Checking for an item's existence in a list or tuple has a time complexity of O(n), while for a set it's O(1) on average. This can make a noticeable difference if provided contains many feature indices.
| maybe_categoricals = () if provided is None else provided | |
| maybe_categoricals = set(provided) if provided is not None else set() |
| cat_indices: Sequence[int | str] | None, | ||
| numeric_dtype: Literal["float32", "float64"] = "float64", | ||
| ) -> pd.DataFrame: | ||
| # TODO DETECT: This function does both detection and conversion. We would like to decouple, and store it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| X = pd.DataFrame(X, copy=True) | ||
| convert_dtype = True | ||
| elif X.dtype.kind in STRING_DTYPE_KINDS: | ||
| # TODO DETECT: We would like to detect in advance that there are strings |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Note that this function sometimes mutates its input. | ||
| """ | ||
| # TODO DETECT: This function should be aware of the types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
Some of this will already be handled in: Closing. |
Issue
Please link the corresponding GitHub issue. If an issue does not already exist,
please open one to describe the bug or feature request before creating a pull request.
This allows us to discuss the proposal and helps avoid unnecessary work.
Motivation and Context
Public API Changes
How Has This Been Tested?
Checklist
CHANGELOG.md(if relevant for users).