Authors: Hina Bandukwala, Julia Everitt, Sean McKay & Yimeng Xia
Students of UBC MDS cohort 8
In this project, we build a classification model using logistic regression to predict wine origin. Our final classifier performed well on unseen test data, with an accuracy of 98.15%. Given the practical value of wine origin prediction, our model could be deployed for business use, providing a faster and more accurate way to classify wine origin than traditional methods that rely on experts with sufficient knowledge and experience.
This project uses a data set comprising 13 chemical measurements from 178 Italian wine samples of three distinct cultivars from the same region. Dating from 1991, the data set was collected and contributed by M. Forina and Stefan Aeberhard. It is accessible from the UC Irvine Machine Learning Repository and can be found here.
The final report can be found here.
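For readers who want a feel for the approach before running the full pipeline, here is a minimal sketch of a logistic regression classifier on the same wine data. It is not the project's own code: it assumes the copy of the UCI wine data bundled with scikit-learn (`load_wine`), a 70/30 stratified split, standard scaling, and default hyperparameters, so its accuracy may differ from the figure reported above.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# scikit-learn ships a copy of the UCI wine data: 178 samples, 13 features, 3 classes.
X, y = load_wine(return_X_y=True)

# Assumed split (70/30, stratified by cultivar); the project's scripts may split differently.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=123
)

# Scale the features, then fit a multiclass logistic regression.
wine_pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
wine_pipeline.fit(X_train, y_train)

# Accuracy on the held-out samples.
test_accuracy = wine_pipeline.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.4f}")
```

Wrapping the scaler and the classifier in one pipeline keeps the scaling parameters learned from the training split only, which avoids leaking test information into the model.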
Install and launch Docker on your device, then clone this GitHub repository. Navigate to the root of this project using the command line.
Run docker compose up jupyter-lab. This will output two URLs once complete; select the one beginning with http://127.0.0.1:8888/lab? and paste it into your browser to launch JupyterLab. Open a terminal window and navigate to the work directory, then run the command conda activate wine-origin-prediction to activate the environment.
Navigate to the root of the project directory using the command line and run make clean to remove all files previously generated by the analysis. Then run make all to reproduce the analysis and regenerate all files from scratch.
Once Docker is set up, the following commands can be used to run the analysis. Copy and paste them into the terminal at the project root to reproduce our analysis step by step:
# Fetch data from the web, save, and split
python scripts/fetch_split_data.py --output-raw-path='data/raw' --output-processed-path='data/processed/'
# Preprocess data and save preprocessor object
python scripts/preprocessing.py --train-data ./data/processed/train.csv --test-data ./data/processed/test.csv --variable-data ./data/raw/variables.csv --output-file-path ./data/processed/scaled_wine_train.csv ./data/processed/scaled_wine_test.csv --output-preprocessor ./results/models/preprocessor_model --output-metadata-path ./data/processed/preprocessor_model
# Perform EDA and save plots
python scripts/eda.py --input_path='data/processed/scaled_wine_train.csv' --output_figure_path='results/figures/' --plot_width=150 --plot_height=100
# Fit model and optimize hyperparameters
python scripts/fit_wine_classifier.py --training-data='data/processed/train.csv' --preprocessor='results/models/preprocessor_model.pickle' --pipeline-to='results/models/' --plot-to='results/figures/' --seed=123
# Evaluate model on full train/test set
python scripts/evaluation_test.py --input-test-path='data/processed/test.csv' --pipeline-from='results/models/wine_pipeline.pickle' --target-col='class' --results-to='results/tables/'
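The commands above pass a saved preprocessor between steps as a `.pickle` file. The sketch below shows one way such a round trip can work; it is an illustration only, assuming a scikit-learn `StandardScaler` as the preprocessor, made-up feature values, and a temporary directory in place of `results/models/`. The project's scripts may build and store the object differently.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the training features (values are made up for illustration).
X_train = np.array([[13.2, 1.8], [12.4, 2.5], [14.1, 1.6], [12.9, 3.1]])

# Fit the preprocessor on the training data only.
preprocessor = StandardScaler().fit(X_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical location; the project writes under results/models/.
    model_path = Path(tmp_dir) / "preprocessor_model.pickle"

    # Save the fitted object so a later step can reuse it.
    with open(model_path, "wb") as f:
        pickle.dump(preprocessor, f)

    # Load it back, as a downstream script would.
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

# The restored object applies the same scaling as the original.
X_new = np.array([[13.0, 2.0]])
same_output = np.allclose(preprocessor.transform(X_new), restored.transform(X_new))
print(same_output)
```

Only load pickle files you generated yourself or otherwise trust, since unpickling can execute arbitrary code.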
Open the project in a terminal and navigate to the tests folder, then run the following command:
pytest
Shut down the container by pressing Ctrl + C in the terminal where you launched Docker. Run docker compose rm to finish cleaning up.
Contribution guidelines for this project
Tests are located in the tests folder. To test a function, run pytest <file-path> in the terminal at the root of the project.
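As a rough guide to the shape pytest expects, here is a small sketch of a test file. Both the file name and the `accuracy` helper are hypothetical and are not taken from this repository; in a real test you would import the function under test from the project's source instead of defining it inline.

```python
# Hypothetical contents of tests/test_accuracy.py (not a file from this repository).


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("inputs must not be empty")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def test_accuracy_counts_matches():
    # Three of the four predictions match the true labels.
    assert accuracy([1, 2, 3, 1], [1, 2, 3, 2]) == 0.75


def test_accuracy_rejects_length_mismatch():
    # Inputs of different lengths should raise instead of being truncated silently.
    try:
        accuracy([1, 2], [1])
    except ValueError:
        return
    raise AssertionError("expected ValueError for mismatched lengths")
```

pytest collects functions whose names start with `test_` from files named `test_*.py`, so a file like this can be run on its own by passing its path to pytest.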
Dependencies are managed using Docker and are based on the Jupyter minimal-notebook image, with exact details available in the Dockerfile. Additionally, the environment details can be found in environment.yaml.
- Python and packages listed in environment.yaml here
Software licensed under the MIT License, non-software content licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) License. See LICENSE.md for more information.