Refactor of data operations, trainers and models. #73
Merged
rafaelpadilla merged 177 commits into dev on Apr 24, 2026
…do/refactor_data_operations
… instead use unstandardized moment, as it is stable.
…proved documentation
…adilla/3W into eduardo/refactor_data_operations
Mathtzt (Collaborator) requested changes on Apr 16, 2026 and left a comment:
@czewski, @thadeuluiz and @rafaelpadilla, as agreed in the internal meeting, I have marked as resolved everything that has already been adjusted, keeping the ones that are still pending. Again, I thank you for all your attention to the observations.
Collaborator
Author
Thanks! I've added the missing docstrings and marked all your comments as resolved. At this point we have only one pending comment to resolve, related to the torch device type hint.
…adilla/3W into eduardo/refactor_data_operations
…lti-class inputs with label binarization (one-hot)
This was referenced Apr 21, 2026
This PR is a core architectural refactor of the 3W Toolkit focused on consistency, extensibility, and reproducibility.
Core architecture refactor: new/expanded base abstractions (BaseTrainer, BasePipeline, BaseTransform, DatasetOutputs, Instantiable) and cleanup/removal of older base layers.
Data operations redesign: preprocessing/feature extraction split into modular classes + adapters, with lazy dataset wrappers (TransformedDataset, SubsetDataset, TransformDataset).
Trainer split/generalization: monolithic trainer.py removed; dedicated torch_trainer.py and sklearn_trainer.py added, with class-weight support and seed/reproducibility logic reworked.
Pipeline + CV support: pipeline reworked for training vs cross-validation flows, fold splitting utility added (utils/data_splitter.py), and CV-related metadata/assessment wiring updated.
Assessment/reporting rework: major assessment/model_assess.py rewrite and report-generation/template-manager adjustments.
Dataset/utilities updates: CSV loader moved under dataset/, parquet dataset typing/behavior updates, model recorder/data utils refactors.
Test suite overhaul: tests reorganized by module (tests/core, tests/trainer, tests/preprocessing, etc.) with broad new coverage replacing older flat tests.
Docs/CI/demos refresh: new toolkit/versioning/contributing docs, workflow/lint script updates, and notebooks moved from old docs/overviews locations into toolkit/demos and dataset/demos.
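To illustrate what "lazy dataset wrappers" can mean in the data operations item above, here is a minimal sketch in plain Python. It is not the toolkit's implementation: the class names `LazyTransformed` and `LazySubset` and the `scale` function are hypothetical and only stand in for the general idea that a transform runs when an item is read, not when the wrapper is built.

```python
from typing import Callable, Generic, List, Sequence, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class LazyTransformed(Generic[T, U]):
    """Hypothetical wrapper: applies `transform` only when an item is read."""

    def __init__(self, base: Sequence[T], transform: Callable[[T], U]) -> None:
        self.base = base
        self.transform = transform

    def __len__(self) -> int:
        return len(self.base)

    def __getitem__(self, index: int) -> U:
        # The transform is deferred until this point.
        return self.transform(self.base[index])


class LazySubset(Generic[T]):
    """Hypothetical wrapper: exposes only the selected indices of `base`."""

    def __init__(self, base: Sequence[T], indices: Sequence[int]) -> None:
        self.base = base
        self.indices: List[int] = list(indices)

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, index: int) -> T:
        # Map the subset position back to a position in the wrapped dataset.
        return self.base[self.indices[index]]


calls: List[float] = []  # records which raw items were actually transformed


def scale(x: float) -> float:
    calls.append(x)
    return x * 0.5


raw = [2.0, 4.0, 6.0, 8.0]
dataset = LazySubset(LazyTransformed(raw, scale), [1, 3])
```

Building `dataset` does not call `scale` at all; reading `dataset[0]` transforms only the raw item at index 1. Composing the two wrappers this way is one possible design for a fold in cross-validation, where each fold reads a subset of a shared transformed dataset.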
Issues affected:
Closes #22: We added support for automatic class_weights in TrainerConfig.
Closes #23: We now support TorchModel or SkLearnModels, so we no longer need a list of supported models: every TorchModel instance is supported by TorchTrainer.
Closes #64: Each model now has a wrapper that handles save/load, and we still use the ModelRecorder class to point to the correct directory.
Closes #41: We will keep supporting this function, because we will eventually need to support older versions of the 3W dataset, which are .csv files. We did, however, move this method to the dataset folder.
Closes #32: We completely refactored the test suites, and all unit tests pass with this refactor.
Closes #14: With the Instantiable class added to all configs, we no longer plan to remove the Config files.
Closes #11: We reworked the seed logic within the Trainer/Model classes.
Closes #10: This was partly solved by #64, and we also replaced the use of prints in the majority of our classes.
Closes #8: The MetricRegistry class allows only mapped metrics.
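For readers unfamiliar with automatic class weights (the subject of #22), the sketch below shows one common rule. The PR text does not say which rule TrainerConfig applies, so this is an assumption: it uses the "balanced" heuristic, weight = n_samples / (n_classes * class_count), which is the same rule scikit-learn's `compute_class_weight` uses for `class_weight="balanced"`. The function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Hypothetical helper: weight each class inversely to its frequency.

    Assumes the 'balanced' heuristic; the toolkit may use a different rule.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("labels must not be empty")
    n_samples = sum(counts.values())
    n_classes = len(counts)
    # Rare classes get weights above 1, frequent classes below 1.
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Three samples of class 0 and one sample of class 1.
weights = balanced_class_weights([0, 0, 0, 1])
```

With these labels the minority class 1 receives weight 2.0 and the majority class 0 receives 2/3, so each class contributes equally to a weighted loss in aggregate.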
Issues that can be solved with this refactor:
By creating this pull request, I confirm that I have read and fully accept and agree with one of the Petrobras' Contributor License Agreements (CLAs):
Our CLAs are based on the Apache Software Foundation's CLAs: