Group Number: 39
Members: Mohamed-Obay Alshaer (300170489), Samih Karroum (300188957)
Course: CSI4142 - Fundamentals of Data Science
Instructor: Caroline Barrière
Term: Winter 2025
Submission Date: February 25, 2025
This repository contains our submission for Assignment 2 of CSI4142 - Fundamentals of Data Science. The focus of this assignment is Data Cleaning, specifically:
- Duplicate detection
- Data validation processes
- Imputation for missing values
The assignment is implemented entirely in Python and documented using Jupyter Notebooks to ensure transparency, reproducibility, and clarity in our analysis.
This repository consists of three branches:
- Main: Contains only this README file.
- dataset-1: Contains the Jupyter Notebook for cleaning data using a "Clean Data Checker." This tool detects various types of errors, such as:
  - Data type errors
  - Range errors
  - Format inconsistencies
  - Duplicate entries
  - Missing values
- dataset-2: Contains the Jupyter Notebook focusing on data imputation. This includes testing different imputation techniques, such as:
  - Mean/Median/Mode imputation
  - Regression-based imputation
  - Correlation-based imputation
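As a rough illustration of the kinds of checks listed for the dataset-1 branch, the sketch below flags duplicate rows, missing values, and out-of-range values in a small list of records. It is not the notebook's actual "Clean Data Checker": the sample records, the column names (`id`, `age`, `city`), the 0 to 120 age range, and the function names are all assumptions made for this example, and it uses only the Python standard library so it runs without extra dependencies.

```python
# Hypothetical sample records; column names and values are illustrative only.
rows = [
    {"id": 1, "age": 34, "city": "Ottawa"},
    {"id": 2, "age": None, "city": "Toronto"},
    {"id": 3, "age": 212, "city": "Montreal"},
    {"id": 1, "age": 34, "city": "Ottawa"},
]


def find_duplicates(rows):
    """Return indices of rows that exactly repeat an earlier row."""
    seen = set()
    duplicates = []
    for i, row in enumerate(rows):
        # Sort by column name so key order in the dict does not matter.
        key = tuple(sorted(row.items(), key=lambda kv: kv[0]))
        if key in seen:
            duplicates.append(i)
        else:
            seen.add(key)
    return duplicates


def find_missing(rows, column):
    """Return indices of rows where the column is absent or None."""
    return [i for i, row in enumerate(rows) if row.get(column) is None]


def find_range_errors(rows, column, low, high):
    """Return indices of rows whose non-missing value lies outside [low, high]."""
    return [
        i
        for i, row in enumerate(rows)
        if row.get(column) is not None and not (low <= row[column] <= high)
    ]


print("duplicates:", find_duplicates(rows))
print("missing age:", find_missing(rows, "age"))
print("age out of range:", find_range_errors(rows, "age", 0, 120))
```

Each function returns row indices rather than modifying the data, so the flagged rows can be reviewed before anything is dropped or corrected.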
Each dataset and its processing steps are explained in detail in the corresponding Jupyter Notebook.
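To give a sense of two of the imputation techniques named for the dataset-2 branch, here is a minimal sketch of mean/median/mode imputation and of regression-based imputation with a single predictor. It is an assumption-laden illustration, not the notebook's implementation: the function names `impute_constant` and `impute_regression`, the use of `None` as the missing-value marker, and the sample values are all hypothetical, and the regression is an ordinary least-squares line fitted by hand with the standard library.

```python
import statistics


def impute_constant(values, strategy="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill_functions = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in values]


def impute_regression(x, y):
    """Fill missing y values using a least-squares line fitted on complete (x, y) pairs."""
    pairs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    mean_x = statistics.mean([xi for xi, _ in pairs])
    mean_y = statistics.mean([yi for _, yi in pairs])
    sxx = sum((xi - mean_x) ** 2 for xi, _ in pairs)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [intercept + slope * xi if yi is None else yi for xi, yi in zip(x, y)]


# Hypothetical column with one missing entry.
values = [2.0, 4.0, None, 4.0, 10.0]
print(impute_constant(values, "mean"))
print(impute_constant(values, "median"))
print(impute_constant(values, "mode"))

# Hypothetical predictor x and target y, where y is missing at x = 3.
x = [1, 2, 3, 4]
y = [2.0, 4.0, None, 8.0]
print(impute_regression(x, y))
```

Constant-value imputation ignores relationships between columns, while the regression variant uses a correlated column to estimate each missing value, which is why comparing the two on the same data can be informative.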
- Clone the repository:
    git clone <repo_url>
    cd <repo_directory>
- Check out the branch you want to work with:
    git checkout dataset-1   # For Clean Data Checker
    git checkout dataset-2   # For Data Imputation
- Open Jupyter Notebook:
    jupyter notebook
- Navigate to the corresponding notebook and run the cells sequentially.
This assignment was completed as part of the CSI4142 - Fundamentals of Data Science course under the guidance of Professor Caroline Barrière at the University of Ottawa.
For any questions or clarifications, please reach out via email.
Mohamed-Obay Alshaer & Samih Karroum
Winter 2025