Refactor of data operations, trainers and models.#73

Merged
rafaelpadilla merged 177 commits into dev from eduardo/refactor_data_operations
Apr 24, 2026

Collaborator

@czewski czewski commented Apr 2, 2026

This PR is a core architectural refactor of the 3W Toolkit focused on consistency, extensibility, and reproducibility.

  • Core architecture refactor: new/expanded base abstractions (BaseTrainer, BasePipeline, BaseTransform, DatasetOutputs, Instantiable) and cleanup/removal of older base layers.

  • Data operations redesign: preprocessing/feature extraction split into modular classes + adapters, with lazy dataset wrappers (TransformedDataset, SubsetDataset, TransformDataset).

  • Trainer split/generalization: monolithic trainer.py removed; dedicated torch_trainer.py and sklearn_trainer.py added, with class-weight support and seed/reproducibility logic reworked.

  • Pipeline + CV support: pipeline reworked for training vs cross-validation flows, fold splitting utility added (utils/data_splitter.py), and CV-related metadata/assessment wiring updated.

  • Assessment/reporting rework: major assessment/model_assess.py rewrite and report-generation/template-manager adjustments.

  • Dataset/utilities updates: CSV loader moved under dataset/, parquet dataset typing/behavior updates, model recorder/data utils refactors.

  • Test suite overhaul: tests reorganized by module (tests/core, tests/trainer, tests/preprocessing, etc.) with broad new coverage replacing older flat tests.

  • Docs/CI/demos refresh: new toolkit/versioning/contributing docs, workflow/lint script updates, and notebooks moved from old docs/overviews locations into toolkit/demos and dataset/demos.
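
To illustrate what "lazy" means for the dataset wrappers listed above, here is a minimal sketch. It is not the toolkit's implementation: the class name `LazyTransformedDataset`, its constructor arguments, and the `double` helper are all hypothetical. The only point is that the transform runs when an item is accessed, not when the wrapper is built.

```python
from typing import Any, Callable, List, Sequence


class LazyTransformedDataset:
    """Hypothetical wrapper: applies `transform` on access, not at construction."""

    def __init__(self, base: Sequence[Any], transform: Callable[[Any], Any]) -> None:
        self.base = base
        self.transform = transform

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, index: int) -> Any:
        # The transform runs only here, and only for the requested item.
        return self.transform(self.base[index])


calls: List[int] = []  # records which raw items were actually transformed


def double(x: int) -> int:
    calls.append(x)
    return 2 * x


ds = LazyTransformedDataset([1, 2, 3], double)
first = ds[0]  # only the first item is transformed at this point
```

After `ds[0]`, the `calls` list holds a single entry, which shows that the remaining items have not been touched yet.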

Issues affected:

Closes #22: We added support for automatic class_weights in the TrainerConfig.

Closes #23: We now support TorchModel or SkLearnModels, so we no longer need a list of supported models: every "TorchModel" instance is supported by the "TorchTrainer".

Closes #64: Each model now has a wrapper that handles save/load, and we still use the ModelRecorder class to point to the correct directory.

Closes #41: We will keep supporting this function because we will eventually need to support older versions of the 3W dataset, which are .csv files. We did, however, move this method to the dataset folder.

Closes #32: We completely refactored the test suites; all unit tests pass after this refactor.

Closes #14: Now that the Instantiable class has been added to all configs, we will no longer remove the Config files.

Closes #11: We reworked the seed logic within the Trainer/Model classes.

Closes #10: This was largely solved along with #64, and we also replaced the use of prints in the majority of our classes.

Closes #8: We now have the MetricRegistry class, which allows only mapped metrics.
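
As a rough illustration of the "only mapped metrics" idea behind #8, the sketch below shows a registry that rejects unknown metric names. It is an assumption-laden example, not the toolkit's MetricRegistry: the class name `SimpleMetricRegistry`, its `register`/`get` methods, and the `accuracy` helper are hypothetical.

```python
from typing import Callable, Dict, Sequence

MetricFn = Callable[[Sequence[int], Sequence[int]], float]


class SimpleMetricRegistry:
    """Hypothetical registry: only metrics registered by name can be looked up."""

    def __init__(self) -> None:
        self._metrics: Dict[str, MetricFn] = {}

    def register(self, name: str, fn: MetricFn) -> None:
        self._metrics[name] = fn

    def get(self, name: str) -> MetricFn:
        # Fail early on names that were never mapped to a function.
        if name not in self._metrics:
            known = ", ".join(sorted(self._metrics))
            raise KeyError(f"Unknown metric '{name}'. Registered: {known}")
        return self._metrics[name]


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    # Fraction of positions where the prediction matches the label.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


registry = SimpleMetricRegistry()
registry.register("accuracy", accuracy)
score = registry.get("accuracy")([0, 1, 1, 0], [0, 1, 0, 0])
```

Requesting a name that was never registered raises a `KeyError` that lists the known metrics, so a typo in a config surfaces immediately.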

Issues that can be solved with this refactor:

By creating this pull request, I confirm that I have read and fully accept and agree with one of the Petrobras' Contributor License Agreements (CLAs):

Our CLAs are based on the Apache Software Foundation's CLAs:

czewski and others added 30 commits March 7, 2026 15:00
… instead use unstandardized moment, as it is stable.
…adilla/3W into eduardo/refactor_data_operations
Collaborator

@Mathtzt Mathtzt left a comment

@czewski, @thadeuluiz and @rafaelpadilla, as agreed in the internal meeting, I have marked as resolved everything that has already been adjusted, keeping the ones that are still pending. Again, I thank you for all your attention to the observations.

@czewski
Collaborator Author

czewski commented Apr 16, 2026

> @czewski, @thadeuluiz and @rafaelpadilla, as agreed in the internal meeting, I have marked as resolved everything that has already been adjusted, keeping the ones that are still pending. Again, I thank you for all your attention to the observations.

Thanks! I've added the missing docstrings and marked all your comments as resolved.

At this point we have only one pending comment to resolve, related to the torch device type hint.

This was referenced Apr 21, 2026
@rafaelpadilla rafaelpadilla merged commit f5cb1fb into dev Apr 24, 2026
8 checks passed

Labels

enhancement New feature or request

7 participants