This project was built from scratch with one goal:
To deeply understand how an ML system works end-to-end — from notebooks to a production-ready pipeline.
I rewrote every line of this project from zero because I wanted to learn everything (I used AI only as a tutor, not to write the code). Along the way I found inconsistencies and things that could be done better (e.g., using uv instead of pip), and I tried to apply best practices while understanding why they matter.
- Python
- scikit-learn
- MLflow
- Pandas / NumPy
- uv (dependency management)
- GitHub Actions (CI)
Decided to use uv as the dependency manager instead of pip.
- Introduced dependency groups (training, inference, dev)
- Avoided installing unnecessary dependencies in each environment
Result:
- Better reproducibility
- Cleaner environments
- Lighter deployments
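A minimal sketch of how dependency groups might look in `pyproject.toml` (the project name, group names, and packages shown here are illustrative, not the project's actual file):

```toml
[project]
name = "ml-pipeline"                      # hypothetical project name
version = "0.1.0"
requires-python = ">=3.11"
dependencies = ["pandas", "numpy"]        # shared by every environment

[dependency-groups]
training = ["scikit-learn", "mlflow"]
inference = ["scikit-learn"]
dev = ["pytest", "ruff"]
```

With uv, each environment then installs only what it needs, e.g. `uv sync --group training` for a training run, which keeps inference images free of training-only packages.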
Rebuilt the GitHub Actions pipeline in a more modular way and integrated uv.
- Structured pipeline from training → build → image publication
- Reduced coupling between steps

Result:
- More reliable CI
- Easier to maintain and extend
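The modular pipeline can be sketched as separate workflow jobs with explicit dependencies (job names, action versions, entry points, and the publish step below are illustrative assumptions, not the project's actual workflow):

```yaml
name: ci

on: [push]

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5          # install uv in the runner
      - run: uv sync --group training        # only training dependencies
      - run: uv run python train.py          # hypothetical entry point
      - uses: actions/upload-artifact@v4
        with: {name: model, path: artifacts/}

  build:
    needs: train                             # coupled only through artifacts
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/download-artifact@v4
        with: {name: model, path: artifacts/}
      - run: docker build -t ml-pipeline:latest .

  publish:
    needs: build
    runs-on: ubuntu-latest
    steps:
      - run: echo "push image to registry"   # placeholder for the real push
```

Because each job only depends on the previous job's artifacts, a step can be changed or rerun without touching the others.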
Removed a feature that depended directly on the target (price). Result:
- Avoided data leakage
- More realistic model behavior
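As an illustration of why such a feature leaks (the dataset and column names here are hypothetical, not the project's), any column computed from the target hands the model the answer during training and cannot be computed at inference time, when the target is unknown:

```python
import pandas as pd

# Hypothetical housing data where "price" is the target.
df = pd.DataFrame({
    "sqm": [50, 80, 120],
    "rooms": [2, 3, 4],
    "price": [100_000, 180_000, 300_000],
})

# Leaky feature: derived from the target itself.
df["price_per_sqm"] = df["price"] / df["sqm"]

# Safe feature matrix: drop the target and everything derived from it.
leaky = ["price", "price_per_sqm"]
X = df.drop(columns=leaky)
y = df["price"]
```

Dropping the derived column trades a deceptively good training score for behavior the model can actually reproduce in production.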
Some features were hardcoded in the pipeline.
- Removed the hardcoding
- Standardized feature handling

Result:
- Avoids column mismatches
- Safer training and inference
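One way to standardize feature handling is a single helper shared by the training and inference paths, so the expected feature list lives in exactly one place (a hedged sketch; the feature names are hypothetical):

```python
import pandas as pd

# Single source of truth for the feature list (hypothetical names).
EXPECTED_FEATURES = ["sqm", "rooms", "age"]

def select_features(df: pd.DataFrame) -> pd.DataFrame:
    """Validate and order columns; fail loudly on a mismatch."""
    missing = set(EXPECTED_FEATURES) - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # Fixed order: the model always sees columns in the same positions.
    return df[EXPECTED_FEATURES]

df = pd.DataFrame({"rooms": [3], "age": [12], "sqm": [80], "extra": [0]})
X = select_features(df)
```

Failing loudly on a missing column turns a silent mismatch into an immediate, debuggable error.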
Feature selection (RFE) was done in notebooks but not used in training.
- Integrated the selected features into the training pipeline

Result:
- Consistency between experimentation and production
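Wiring RFE into the training step itself might look roughly like this (a sketch on synthetic data; the project's actual estimator and feature count may differ):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real dataset.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       random_state=0)

# Recursive feature elimination down to 3 features.
selector = RFE(estimator=LinearRegression(), n_features_to_select=3)
selector.fit(X, y)

# Indices of the features RFE kept; persist these for inference.
selected = [i for i, keep in enumerate(selector.support_) if keep]

# Train the final model on the selected features only.
model = LinearRegression().fit(X[:, selected], y)
```

Running selection inside the pipeline, rather than copying results out of a notebook, guarantees training always uses the same features the experiments chose.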
Added a features.pkl file that is packaged into the image. Result:
- Full control over which features are used
- Easier reproducibility
- Prevents silent errors
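Persisting the feature list can be as simple as pickling the selected names next to the model artifact so the serving image consumes exactly the training-time features (a sketch with hypothetical feature names and a temp-dir path):

```python
import pickle
import tempfile
from pathlib import Path

# Feature names chosen at training time (hypothetical).
selected_features = ["sqm", "rooms", "age"]

# Write features.pkl alongside the other model artifacts.
path = Path(tempfile.gettempdir()) / "features.pkl"
path.write_bytes(pickle.dumps(selected_features))

# At inference time, load the list and validate incoming data against it.
loaded = pickle.loads(path.read_bytes())
```

If the serving code refuses to run when `loaded` does not match the incoming columns, a feature drift between training and deployment fails fast instead of silently producing wrong predictions.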
- Data processing
- Feature engineering
- Feature selection
- Feature tracking (features.pkl)
- Model training
- CI pipeline → build → image publication
Currently improving the project by building a frontend using Reflex (instead of Streamlit) to interact with the model.