The primary goal of this project is to learn and apply the essential steps in data analysis, with a strong focus on data cleaning and exploratory data analysis (EDA). The dataset used is the classic Titanic dataset, which contains data quality issues such as missing (null) values. These issues show why data cleaning is a foundational step.
Properly cleaned data is more reliable for the later stages of analysis and can also be used to build machine learning models. This project aims to develop a solid understanding of how to handle real-world datasets and prepare them for further analysis or predictive tasks.
This project was implemented using Python, along with the following libraries:
- Pandas – for data manipulation and cleaning
- Matplotlib – for data visualization
- Seaborn – for enhanced statistical plotting
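As a rough illustration of the cleaning step, the sketch below counts missing values and then handles them with Pandas. The sample rows are hypothetical, the column names (`Survived`, `Sex`, `Age`, `Embarked`) assume the common Kaggle layout of the Titanic data, and the choice of median imputation and row dropping is only one possible strategy, not necessarily the one used in this project.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; column names assume the common Kaggle Titanic layout.
raw = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
    "Embarked": ["S", "C", "S", None, "Q", "S"],
})

# Step 1: count missing values per column to see where cleaning is needed.
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Step 2: fill missing ages with the median age (one possible imputation choice).
cleaned = raw.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())

# Step 3: drop the rows that still lack a port of embarkation.
cleaned = cleaned.dropna(subset=["Embarked"]).reset_index(drop=True)
print(cleaned)
```

With the real dataset, `raw` would instead come from something like `pd.read_csv(...)`, and the right strategy for each column depends on how much data is missing and how the column will be used later.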
Several insights were uncovered during the exploratory data analysis process:
- There were more male passengers (288) than female passengers (211) in the dataset analyzed.
- A majority of male passengers did not survive (only 73 out of 288 survived), while most female passengers survived (196 out of 211). This is consistent with the "women and children first" policy during the evacuation.
- The survival rate for male passengers was only 25.35%, while female passengers had a survival rate of 92.89%. This large gap indicates a strong association between gender and survival chances.
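The figures above can be reproduced with a short group-by. This is a minimal sketch that rebuilds a two-column table from the reported counts rather than loading the project's actual data; the column names `Sex` and `Survived` are assumptions based on the common Kaggle layout.

```python
import pandas as pd

# Rebuild a minimal table from the counts reported above:
# 73 of 288 male passengers and 196 of 211 female passengers survived.
rows = (
    [{"Sex": "male", "Survived": 1}] * 73
    + [{"Sex": "male", "Survived": 0}] * (288 - 73)
    + [{"Sex": "female", "Survived": 1}] * 196
    + [{"Sex": "female", "Survived": 0}] * (211 - 196)
)
df = pd.DataFrame(rows)

# Survivors, group size, and survival rate (in percent) for each sex.
summary = df.groupby("Sex")["Survived"].agg(survived="sum", total="count")
summary["rate_pct"] = (summary["survived"] / summary["total"] * 100).round(2)
print(summary)
```

The same grouped data could be plotted with Seaborn, for example with `sns.countplot(data=df, x="Sex", hue="Survived")`, to show the gap visually.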
This project can serve as a strong starting point for further exploration. Suggested next steps include:
- Expanding into advanced data analysis, such as correlation heatmaps, hypothesis testing, or multivariate analysis.
- Applying machine learning models (e.g., logistic regression, decision trees, or random forests) to predict survival based on passenger features.
- Building a dashboard to make insights more interactive and accessible.