[Prep-refactor-2] Refactor processing of differential logic and group code #720

bejaeger · 2026-01-08T17:21:54Z

This is another broad refactoring that changes no functionality and introduces no new objects.
I think it's useful to get an overview before starting to define better abstractions in the next step.

Changes

- Splits the initialization code in classifier.py into separate paths for differentiable and standard inputs.
- Moves code to more appropriate files. Note that the naming might still change and is not set in stone.
- Groups code and shares the data tagging and cleaning logic between classifier and regressor.
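
The first change can be pictured with a minimal sketch. Every name in it (`SketchClassifier`, `_initialize_for_standard_input`, `_initialize_for_differentiable_input`, `init_log`) is hypothetical and not taken from the TabPFN codebase; it only illustrates dispatching to two separate initialization methods instead of branching inside a single one.

```python
from dataclasses import dataclass, field


@dataclass
class SketchClassifier:
    """Hypothetical stand-in for an estimator; not the real TabPFN class."""

    differentiable_input: bool = False
    init_log: list = field(default_factory=list)

    def fit(self, X, y):
        # Single dispatch point: each input kind gets its own method.
        if self.differentiable_input:
            X, y = self._initialize_for_differentiable_input(X, y)
        else:
            X, y = self._initialize_for_standard_input(X, y)
        self.X_, self.y_ = X, y
        return self

    def _initialize_for_standard_input(self, X, y):
        # Standard path: check lengths and coerce every value to float.
        if len(X) != len(y):
            raise ValueError("X and y must have the same number of rows")
        self.init_log.append("standard")
        return [[float(v) for v in row] for row in X], list(y)

    def _initialize_for_differentiable_input(self, X, y):
        # Differentiable path (assumption): hand the inputs through untouched,
        # so a gradient-tracking tensor would not be copied or cast.
        self.init_log.append("differentiable")
        return X, y
```

With this shape, each path can later be moved or tested on its own without touching the other.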

bejaeger · 2026-01-08T17:21:55Z

This change is part of the following stack:

- [Prep-refactor-1] Move preprocessing logic into different files #713
- [Prep-refactor-2] Refactor processing of differential logic and group code #720 ◀

Change managed by git-spice.

gemini-code-assist

Code Review

This pull request is a significant refactoring effort that improves the codebase's structure by grouping related logic into new modules. The separation of validation, data cleaning, and feature tagging logic is well-executed and enhances maintainability. The refactoring of the initialization logic in the classifier and regressor is also a good improvement. I've identified a couple of minor issues: an incorrect TODO comment and a misleading error message, for which I've provided suggestions. Overall, this is a high-quality refactoring.

src/tabpfn/regressor.py

src/tabpfn/validation.py

chatgpt-codex-connector · 2026-01-08T17:52:48Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

src/tabpfn/validation.py

src/tabpfn/preprocessing/type_detection.py

src/tabpfn/classifier.py

alanprior · 2026-01-08T19:58:16Z

So all in all it feels like the changes are a good step towards what we want to do. It is indeed very necessary to start splitting into more logical files rather than what we had. A few general comments:

1. It's obviously a bit hard for the reviewer to make sure that nothing really breaks. I would hope that we have enough test coverage and AI agents wandering around, but obviously it will be impossible for me to catch a bad import etc.
2. I think that some functionalities are unclear or non-intuitive even for people who by now have some experience with TabPFN :) For instance, I'm not familiar with differentiable input / prompt tuning (?), and I believe that there are other areas. I wonder whether adding slightly more documentation explaining such cases would be relevant. In the same spirit, I encourage slightly more verbose names when possible. For instance, "tag.py" could be more explicit; this is only an example of a broader approach I believe we should consider in general, but especially now that we redesign the code: the word 'tag' will stay from here on. Do we feel it's the best name for it?
3. Structure: I'm unsure whether some files should belong outside/inside folders, and it's a good chance to think about what would be the correct skeleton. For instance, it's great we are breaking down utils.py; perhaps utils could be a folder with many unrelated files with more specific names, e.g. hardware and whatever else we'll have down the road?

Anyway - all of this is very subjective, and shouldn't stop us from moving fast. As long as (1) holds, feel free to go ahead.

src/tabpfn/validation.py

Copilot

Pull request overview

This PR refactors preprocessing and validation logic by consolidating shared code between classifier and regressor, separating differentiable and standard input paths, and organizing functionality into dedicated modules. The changes improve code maintainability without altering functionality.

Key changes:

- Created validation.py module to centralize input validation logic
- Split classifier initialization into separate methods for differentiable vs standard inputs
- Introduced tag_features_and_sanitize_data helper to consolidate data preprocessing
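
As a rough illustration of the last point, the sketch below shows one shared helper called by both estimators. The signature, the categorical heuristic, and the names `tag_and_sanitize`, `ToyClassifier`, and `ToyRegressor` are all assumptions made for this example; the PR's actual `tag_features_and_sanitize_data` may look quite different.

```python
import math


def tag_and_sanitize(rows, max_unique_for_categorical=2):
    """Hypothetical helper: returns (cleaned_rows, categorical_column_indices)."""
    # Cleaning step: map missing values (None) to NaN and cast the rest to float.
    cleaned = [[math.nan if v is None else float(v) for v in row] for row in rows]
    # Tagging step: a column with few distinct non-NaN values is tagged categorical.
    n_cols = len(cleaned[0]) if cleaned else 0
    categorical = []
    for col in range(n_cols):
        distinct = {row[col] for row in cleaned if not math.isnan(row[col])}
        if len(distinct) <= max_unique_for_categorical:
            categorical.append(col)
    return cleaned, categorical


class ToyClassifier:
    def fit(self, X, y):
        # Shared preprocessing, then classifier-specific target handling.
        self.X_, self.categorical_idx_ = tag_and_sanitize(X)
        self.classes_ = sorted(set(y))
        return self


class ToyRegressor:
    def fit(self, X, y):
        # Same shared preprocessing, then regressor-specific target handling.
        self.X_, self.categorical_idx_ = tag_and_sanitize(X)
        self.y_mean_ = sum(y) / len(y)
        return self
```

Keeping the feature handling in one function means a fix to cleaning or tagging reaches both estimators at once; only the target handling stays estimator-specific.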

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| `src/tabpfn/validation.py` | New module centralizing validation logic previously scattered across utils and base |
| `src/tabpfn/preprocessing/clean.py` | New module for data cleaning operations (dtype fixing, NA processing) |
| `src/tabpfn/preprocessing/type_detection.py` | New module for categorical feature inference |
| `src/tabpfn/preprocessing/initialization.py` | New module with `tag_features_and_sanitize_data` helper |
| `src/tabpfn/classifier.py` | Split `_initialize_dataset_preprocessing` into separate methods for differentiable vs standard inputs |
| `src/tabpfn/regressor.py` | Refactored to use new validation/preprocessing modules and handle ensemble config reuse |
| `src/tabpfn/utils.py` | Removed validation functions moved to dedicated modules |
| `src/tabpfn/base.py` | Removed `check_cpu_warning` function moved to validation module |
| `tests/test_utils.py` | Updated imports to reflect new module structure |
| `tests/test_regressor_interface.py` | Fixed mock patch to use `mock.patch.object` |
| `changelog/720.added.md` | Added changelog entry |
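
The `mock.patch.object` entry refers to the standard-library `unittest.mock.patch.object`, which patches an attribute on an object the test already holds instead of resolving a dotted import-path string. That tends to be more robust when code moves between modules, as it does in this PR. A minimal sketch, where `Pipeline`, `_load`, and `run_with_patched_load` are hypothetical names used only for illustration:

```python
from unittest import mock


class Pipeline:
    """Hypothetical class used only for this illustration."""

    def _load(self):
        return "real"

    def run(self):
        return self._load().upper()


def run_with_patched_load():
    # patch.object takes the target object and an attribute name, so there is
    # no "package.module.Pipeline._load" string to keep in sync with file moves.
    with mock.patch.object(Pipeline, "_load", return_value="fake") as patched:
        result = Pipeline().run()
    # The mock replaced the class attribute, so it was called without arguments.
    patched.assert_called_once_with()
    return result
```

Once the `with` block exits, the original `_load` is restored automatically.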

src/tabpfn/validation.py

src/tabpfn/preprocessing/clean.py

src/tabpfn/regressor.py

src/tabpfn/classifier.py

bejaeger · 2026-01-09T08:55:00Z

Thanks a lot for the review @alanprior !

I did some renaming and commented on the things I kept as is.

RE your general remarks:

1. When we change the implementations, we should benchmark the model on real data. Before that, I think we can live with the tests and reviews and proceed carefully with the refactor.
2. Changed some of the names; I hope this is fine. However, I wouldn't go overboard with explaining specialized finetuning methods in this base class.
3. I agree for the most part. We can think about that going forward.

LeoGrin

Love it!!

changelog/720.added.md

alanprior · 2026-01-09T21:01:49Z

Sounds good!

refactor processing of differential logic and group code

380f4a4

gemini-code-assist bot reviewed Jan 8, 2026

bejaeger added 2 commits January 8, 2026 18:32

fix error msg

bc175a0

add to changelog and ensure sklearn compatibility

8264b0a

bejaeger changed the title ~~[Pre-refactor-2] Refactor processing of differential logic and group code~~ [Prep-refactor-2] Refactor processing of differential logic and group code Jan 8, 2026

bejaeger added 3 commits January 8, 2026 18:42

add correct changelog

b860197

wip

7871e4e

reorder functions and make some private

743843e

bejaeger marked this pull request as ready for review January 8, 2026 17:52

bejaeger requested a review from a team as a code owner January 8, 2026 17:52

bejaeger requested review from Copilot and oscarkey and removed request for a team and Copilot January 8, 2026 17:52

bejaeger requested review from LeoGrin and alanprior and removed request for oscarkey January 8, 2026 17:53