Conversation

@bejaeger
Contributor

@bejaeger bejaeger commented Jan 8, 2026

This is another broad refactoring that changes no functionality and introduces no new objects.
I think it's useful for getting an overview before we start defining better abstractions in the next step.

Changes

  • Splits the initialization code in classifier.py into separate paths for differentiable and standard inputs.
  • Moves code to more appropriate files. Note that the naming here might still change and is not set in stone.
  • Groups code and shares the data tagging and cleaning logic between classifier and regressor.
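
To illustrate the first bullet, here is a minimal sketch of what splitting the two initialization paths could look like. Every name in it (`InitPathSketch`, `differentiable_input`, `_initialize_for_standard_input`, `_initialize_for_differentiable_input`) is hypothetical and not taken from classifier.py; it only shows the dispatch pattern, not TabPFN's actual logic.

```python
from dataclasses import dataclass, field


@dataclass
class InitPathSketch:
    """Hypothetical stand-in for an estimator; not TabPFN's real class."""

    differentiable_input: bool = False
    init_log: list = field(default_factory=list)

    def fit(self, X, y):
        # Dispatch once, so each path can be read and tested in isolation.
        if self.differentiable_input:
            self._initialize_for_differentiable_input(X, y)
        else:
            self._initialize_for_standard_input(X, y)
        return self

    def _initialize_for_standard_input(self, X, y):
        # Standard path: validate the data before it is used.
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of rows")
        self.init_log.append("standard: validated and cleaned")

    def _initialize_for_differentiable_input(self, X, y):
        # Differentiable path: leave the inputs untouched so that gradients
        # could flow through them (an assumption made for this sketch).
        self.init_log.append("differentiable: inputs passed through")
```

Keeping the branch in one place means a later abstraction can replace the `if` without touching either path.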

@bejaeger
Contributor Author

bejaeger commented Jan 8, 2026

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a significant refactoring effort that improves the codebase's structure by grouping related logic into new modules. The separation of validation, data cleaning, and feature tagging logic is well-executed and enhances maintainability. The refactoring of the initialization logic in the classifier and regressor is also a good improvement. I've identified a couple of minor issues: an incorrect TODO comment and a misleading error message, for which I've provided suggestions. Overall, this is a high-quality refactoring.

@bejaeger bejaeger changed the title from "[Pre-refactor-2] Refactor processing of differential logic and group code" to "[Prep-refactor-2] Refactor processing of differential logic and group code" Jan 8, 2026
@bejaeger bejaeger marked this pull request as ready for review January 8, 2026 17:52
@bejaeger bejaeger requested a review from a team as a code owner January 8, 2026 17:52
@bejaeger bejaeger requested review from Copilot and oscarkey and removed request for a team and Copilot January 8, 2026 17:52
@bejaeger bejaeger requested review from LeoGrin and alanprior and removed request for oscarkey January 8, 2026 17:53
@alanprior

All in all, the changes feel like a good step towards what we want to do. We really do need to split the code into more logical files than we had. A few general comments:

  1. It's hard for a reviewer to verify that nothing breaks. I would hope that we have enough test coverage and AI agents wandering around, but I can't realistically catch a bad import and the like.

  2. I think some functionality is unclear or non-intuitive even for people who by now have some experience with TabPFN :) For instance, I'm not familiar with differentiable input / prompt tuning (?), and I believe there are other such areas. I wonder whether slightly more documentation explaining these cases would help. In the same spirit, I encourage slightly more verbose names where possible. For instance, "tag.py" could be more explicit. This is only one example of a broader approach I believe we should consider in general, but especially now that we are redesigning the code: the word 'tag' will stick from here on. Do we feel it's the best name for it?

  3. Structure: I'm unsure whether some files belong inside or outside folders, and this is a good chance to think about the right skeleton. For instance, it's great that we are breaking down utils.py; perhaps utils could become a folder of unrelated files with more specific names, e.g. hardware and whatever else we'll have down the road?

Anyway - all of this is very subjective, and shouldn't stop us from moving fast. As long as (1) holds, feel free to go ahead.
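
On point (1), one cheap guard against a bad import after files move is a smoke test that imports every submodule of a package. The sketch below uses only the standard library (`importlib`, `pkgutil`); the helper name `import_all_submodules` is hypothetical and nothing here is taken from the TabPFN test suite.

```python
import importlib
import pkgutil


def import_all_submodules(package_name):
    """Import a package and every submodule; return {module_name: error}."""
    failures = {}
    package = importlib.import_module(package_name)
    # Packages expose __path__; plain modules have no submodules to walk.
    search_path = getattr(package, "__path__", [])
    walker = pkgutil.walk_packages(
        search_path,
        prefix=f"{package_name}.",
        onerror=lambda name: None,  # failures are recorded in the loop below
    )
    for module_info in walker:
        try:
            importlib.import_module(module_info.name)
        except Exception as error:  # e.g. a stale import after a file move
            failures[module_info.name] = error
    return failures
```

A test could then assert that the returned dictionary is empty for the package under refactoring. This only catches errors raised at import time, so it complements rather than replaces the existing tests.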

Copilot AI review requested due to automatic review settings January 9, 2026 08:50
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors preprocessing and validation logic by consolidating shared code between classifier and regressor, separating differentiable and standard input paths, and organizing functionality into dedicated modules. The changes improve code maintainability without altering functionality.

Key changes:

  • Created validation.py module to centralize input validation logic
  • Split classifier initialization into separate methods for differentiable vs standard inputs
  • Introduced tag_features_and_sanitize_data helper to consolidate data preprocessing
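
As a rough illustration of the last key change, the sketch below shows one shared helper that both estimators call, so the cleaning rules cannot drift apart. The real signature of `tag_features_and_sanitize_data` is not shown in this conversation; the function `tag_and_sanitize_sketch`, its `max_unique_for_categorical` parameter, and the unique-count heuristic are all assumptions made for the example.

```python
import math


def tag_and_sanitize_sketch(rows, max_unique_for_categorical=3):
    """Hypothetical helper: replace None with NaN and tag each column.

    Returns (clean_rows, tags), where tags[i] is "categorical" or "numerical".
    """
    clean_rows = [
        [math.nan if value is None else float(value) for value in row]
        for row in rows
    ]
    tags = []
    for column in zip(*clean_rows):
        # Ignore missing values when counting distinct entries.
        unique_values = {v for v in column if not math.isnan(v)}
        is_categorical = len(unique_values) <= max_unique_for_categorical
        tags.append("categorical" if is_categorical else "numerical")
    return clean_rows, tags


class ClassifierSketch:
    def fit(self, X, y):
        self.X_, self.feature_tags_ = tag_and_sanitize_sketch(X)
        return self


class RegressorSketch:
    def fit(self, X, y):
        # Same helper as the classifier, so both see identical cleaning.
        self.X_, self.feature_tags_ = tag_and_sanitize_sketch(X)
        return self
```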

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

File                                        Description
src/tabpfn/validation.py                    New module centralizing validation logic previously scattered across utils and base
src/tabpfn/preprocessing/clean.py           New module for data cleaning operations (dtype fixing, NA processing)
src/tabpfn/preprocessing/type_detection.py  New module for categorical feature inference
src/tabpfn/preprocessing/initialization.py  New module with tag_features_and_sanitize_data helper
src/tabpfn/classifier.py                    Split _initialize_dataset_preprocessing into separate methods for differentiable vs standard inputs
src/tabpfn/regressor.py                     Refactored to use new validation/preprocessing modules and handle ensemble config reuse
src/tabpfn/utils.py                         Removed validation functions moved to dedicated modules
src/tabpfn/base.py                          Removed check_cpu_warning function moved to validation module
tests/test_utils.py                         Updated imports to reflect new module structure
tests/test_regressor_interface.py           Fixed mock patch to use mock.patch.object
changelog/720.added.md                      Added changelog entry
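
The tests/test_regressor_interface.py entry mentions `mock.patch.object`. The actual change is not shown in this conversation, so the following is only a generic sketch of that standard-library call; `HardwareSketch` and `describe` are hypothetical names. Unlike `mock.patch`, which takes a dotted import string, `mock.patch.object` takes the object itself plus an attribute name, so there is no string to update when code moves between modules.

```python
from unittest import mock


class HardwareSketch:
    """Hypothetical class standing in for whatever the real test patches."""

    def device_name(self):
        return "cpu"


def describe(hardware):
    return f"running on {hardware.device_name()}"


def test_describe_uses_patched_device():
    # Patch the attribute on the class object directly, without a dotted path.
    with mock.patch.object(
        HardwareSketch, "device_name", return_value="gpu"
    ) as patched:
        assert describe(HardwareSketch()) == "running on gpu"
        patched.assert_called_once_with()
    # Outside the context manager the original method is restored.
    assert describe(HardwareSketch()) == "running on cpu"
```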

@bejaeger
Contributor Author

bejaeger commented Jan 9, 2026

Thanks a lot for the review @alanprior !

I did some renaming and commented on the things I kept as is.

RE your general remarks:

  1. When we change the implementations, we should benchmark the model on real data. Before that, I think we can live with the tests and reviews and proceed carefully with the refactor.
  2. Changed some of the names and hope this is fine. However, I wouldn't go overboard with explaining specialized finetuning methods in this base class.
  3. I agree for the most part. We can think about that going forward.

Collaborator

@LeoGrin LeoGrin left a comment

Love it!!

@bejaeger bejaeger force-pushed the ben/preref-refactor-diff-input-processing branch from 9930223 to 66557af Compare January 9, 2026 16:13
@alanprior

> Thanks a lot for the review @alanprior !
>
> I did some renaming and commented on the things I kept as is.
>
> RE your general remarks:
>
>   1. When we change the implementations, we should benchmark the model on real data. Before that, I think we can live with the tests and reviews and proceed carefully with the refactor.
>   2. Changed some of the names and hope this is fine. However, I wouldn't go overboard with explaining specialized finetuning methods in this base class.
>   3. I agree for the most part. We can think about that going forward.

Sounds good!

@bejaeger bejaeger enabled auto-merge (squash) January 10, 2026 08:28
@bejaeger bejaeger disabled auto-merge January 10, 2026 08:28
@bejaeger bejaeger enabled auto-merge (squash) January 10, 2026 08:28
@bejaeger bejaeger merged commit 4ee3fe2 into main Jan 10, 2026
13 checks passed

4 participants