This project applies unsupervised machine learning (K-Means clustering) to a Road Traffic Accident dataset to identify distinct patterns in the data. The primary goal is to segment accidents into meaningful clusters and statistically test whether these clusters are associated with Accident Severity (Slight, Serious, or Fatal Injury). The results are intended to help in understanding and mitigating high-risk accident scenarios.
Cluster drivers and vehicles to discover patterns of high accident risk, helping in road safety analysis and preventive policy design.
Data Preprocessing
- Target Mapping: The nominal Accident_severity feature was mapped to an ordinal scale (1 to 3, where 3 is Fatal Injury) for validation purposes.
- Cyclical Encoding: Features like Time and Day_of_week were transformed using Sine/Cosine encoding so that values adjacent on the cycle (for example 23:59 and 00:00, or Sunday and Monday) stay close together in the feature space used by the clustering algorithm.
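The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy rows, the exact severity label strings, the `HH:MM:SS` time format, and the new column names (`severity_ordinal`, `time_sin`, and so on) are all assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows standing in for the real dataset (values are illustrative only).
df = pd.DataFrame({
    "Time": ["17:02:00", "01:06:00", "23:45:00"],
    "Day_of_week": ["Monday", "Sunday", "Friday"],
    "Accident_severity": ["Slight Injury", "Serious Injury", "Fatal Injury"],
})

# Ordinal target: 1 = Slight, 2 = Serious, 3 = Fatal (used for validation only).
# The label strings are assumed; check them against the actual CSV.
severity_order = {"Slight Injury": 1, "Serious Injury": 2, "Fatal Injury": 3}
df["severity_ordinal"] = df["Accident_severity"].map(severity_order)

# Time of day as fractional hours, then projected onto a circle with period 24.
parts = df["Time"].str.split(":", expand=True).astype(int)
hour = parts[0] + parts[1] / 60.0
df["time_sin"] = np.sin(2 * np.pi * hour / 24)
df["time_cos"] = np.cos(2 * np.pi * hour / 24)

# Day of week as an index 0..6, projected onto a circle with period 7.
days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
day_idx = df["Day_of_week"].map({d: i for i, d in enumerate(days)})
df["day_sin"] = np.sin(2 * np.pi * day_idx / 7)
df["day_cos"] = np.cos(2 * np.pi * day_idx / 7)
```

Each cyclical feature becomes a (sin, cos) pair, so every encoded point lies on the unit circle and the distance between two times depends only on how far apart they are around the cycle.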
Preprocessing Pipeline
- Standard Scaling was applied to numerical features.
- One-Hot Encoding was applied to categorical features.
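One common way to combine these two steps in scikit-learn is a `ColumnTransformer`. The sketch below assumes that approach; the column names and sample values are hypothetical and the notebook may organize its pipeline differently.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns and values, for illustration only.
df = pd.DataFrame({
    "Number_of_vehicles_involved": [2, 1, 3, 2],
    "Number_of_casualties": [1, 1, 4, 2],
    "Weather_conditions": ["Normal", "Raining", "Normal", "Fog or mist"],
})

numeric_cols = ["Number_of_vehicles_involved", "Number_of_casualties"]
categorical_cols = ["Weather_conditions"]

preprocessor = ColumnTransformer(
    transformers=[
        # Zero mean, unit variance for numerical features.
        ("num", StandardScaler(), numeric_cols),
        # One binary column per category; unseen categories encode as all zeros.
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)

X = preprocessor.fit_transform(df)
```

Scaling matters here because K-Means relies on Euclidean distance, so an unscaled feature with a large range would otherwise dominate the cluster assignment.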
Clustering Model
- Model: K-Means Clustering
- Parameter: The number of clusters (k) was set to 2.
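A minimal sketch of fitting K-Means with k = 2 using scikit-learn is given here. The synthetic matrix `X` stands in for the preprocessed features, and `n_init` and `random_state` are assumed values chosen for reproducibility, not settings taken from the notebook.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for the preprocessed feature matrix: two separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(50, 4)),
    rng.normal(5.0, 0.5, size=(50, 4)),
])

# k = 2 as stated above; n_init and random_state are assumptions.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)  # one cluster id (0 or 1) per row

centers = kmeans.cluster_centers_  # shape: (2, n_features)
```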
To set up and run this project, you will need a Python environment with the necessary libraries.
- Ensure you have Python 3.8+ installed.
- Install the required libraries using:
pip install pandas numpy scikit-learn matplotlib
- Place your dataset as RTA Dataset.csv in the same directory as the notebook.
- The entire project workflow is contained within a single notebook: ML_Mini_Proj_CS413_CS399.ipynb.
- Open the notebook.
- Ensure the data file path (as mentioned above) is correct.
- Run all cells sequentially. The notebook will perform data cleaning, preprocessing, K-Means clustering, statistical validation (Chi-Squared Test), and generate visualizations (PCA plot and Average Severity Bar Chart).
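The statistical validation step can be illustrated with a Chi-Squared test of independence between cluster label and severity. The sketch below assumes `scipy.stats.chi2_contingency` is used (SciPy is installed as a dependency of scikit-learn); the toy data and the column names `cluster` and `severity_ordinal` are hypothetical.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy cluster assignments and ordinal severities (illustrative only).
result = pd.DataFrame({
    "cluster": [0] * 6 + [1] * 6,
    "severity_ordinal": [1, 1, 1, 1, 2, 1, 2, 3, 3, 2, 3, 1],
})

# Contingency table: rows are clusters, columns are severity levels.
contingency = pd.crosstab(result["cluster"], result["severity_ordinal"])

# Null hypothesis: cluster membership and severity are independent.
chi2, p_value, dof, expected = chi2_contingency(contingency)

# Average ordinal severity per cluster (the quantity shown in the bar chart).
avg_severity = result.groupby("cluster")["severity_ordinal"].mean()
```

A small `p_value` suggests that severity is distributed differently across clusters. With counts as small as in this toy table the test is not reliable; the real dataset is assumed to be large enough for the expected frequencies to be adequate.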
- CS399_CS413_average_severity_per_cluster.png → Average Accident Severity per Cluster
- PCA scatter plot & bar chart of accident rate per cluster
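Plots of this kind could be produced roughly as follows. This is only a sketch on synthetic data: the feature matrix, cluster labels, severity values, and the output filename `example_cluster_plots.png` are all placeholders, and the notebook's own plotting code may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for the preprocessed features, cluster labels, and severity.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 6)),
    rng.normal(3.0, 1.0, size=(50, 6)),
])
labels = np.array([0] * 50 + [1] * 50)
severity = rng.integers(1, 4, size=100)  # ordinal severity in {1, 2, 3}

# Project the features onto the first two principal components.
coords = PCA(n_components=2).fit_transform(X)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: PCA scatter plot colored by cluster.
ax1.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="viridis", s=12)
ax1.set_xlabel("PC 1")
ax1.set_ylabel("PC 2")
ax1.set_title("Clusters in PCA space")

# Right panel: average ordinal severity per cluster.
means = [severity[labels == k].mean() for k in (0, 1)]
ax2.bar(["Cluster 0", "Cluster 1"], means)
ax2.set_ylabel("Average severity (1-3)")
ax2.set_title("Average Accident Severity per Cluster")

fig.tight_layout()
fig.savefig("example_cluster_plots.png")  # placeholder filename
```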