A machine learning project that identifies traffic risk patterns in Madrid's road network using graph-aware clustering algorithms. The system analyzes traffic intensity, weather conditions, and road connectivity to classify segments into risk personas across different scenarios.
- Overview
- Features
- Installation
- Usage
- Project Structure
- Technical Implementation
- Expected Outputs
- Dependencies
This project implements a novel approach to traffic risk assessment by combining:
- Graph-based features: Leveraging road network topology using NetworkX
- Unsupervised clustering: K-Means algorithm with hyperparameter optimization
- Scenario analysis: Multiple traffic conditions (weekday/weekend, time-of-day, weather)
- Stability tracking: Cross-scenario segment behavior analysis
- Graph-Aware Feature Engineering: Computes neighbor traffic patterns using adjacency matrices
- Automated Hyperparameter Tuning: Grid search across 108+ parameter combinations
- Multi-Metric Evaluation: F1-score, silhouette score, Davies-Bouldin index, and more
- Interactive Dashboard: Real-time visualization with Streamlit and Plotly
- Scenario Comparison: Weekday morning/evening, weekend, and rainy conditions
- Stability Classification: Identifies stable high-risk, low-risk, and condition-sensitive segments
- Python 3.8+
- Jupyter Notebook or VS Code with Jupyter extension
- Clone or download the project
cd ml-project
- Install dependencies
pip install pandas numpy networkx scikit-learn plotly streamlit
- Download the Madrid Traffic Dataset
a. Visit the Mendeley Data repository: https://data.mendeley.com/datasets/697ht4f65b/2
b. Download the Complete Dataset (contains data from 554 sensors across 30 months)
c. Uncompress the downloaded file (ZIP format)
d. Move the uncompressed folder into your project directory
- Verify the dataset structure
traffic-risk-clustering/
├── project_main.ipynb
├── streamlit_dashboard.py
└── Enriched Traffic Datasets for Madrid/
└── Complete dataset/
└── 0095ccde8f905222e8791d32c6f3b9dc/
├── MTD_complete_data.csv
├── MTD_id_longitude_latitude.csv
└── MTD_adj_matrix.npy
- Execute the notebook
jupyter notebook project_main.ipynb
Or open in VS Code and run all cells.
- Pipeline stages (automated):
- Data loading and preprocessing
- Graph construction from adjacency matrix
- Scenario filtering (4 scenarios)
- Feature engineering with neighbor aggregation
- Hyperparameter tuning via grid search
- Model training and evaluation
- Stability analysis
- Export results to CSV files
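The graph-construction stage above can be sketched in a few lines. This is a minimal, self-contained illustration using a toy 4x4 matrix in place of the real `MTD_adj_matrix.npy` (which, per the pipeline output, yields 553 nodes and 50,186 edges):

```python
import numpy as np
import networkx as nx

# Toy stand-in for MTD_adj_matrix.npy: a small symmetric adjacency matrix.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])

# Build an undirected graph whose nodes are road-segment indices.
G = nx.from_numpy_array(adj)

print(G.number_of_nodes(), G.number_of_edges())  # 4 3
```

With the real dataset you would load the matrix via `np.load("MTD_adj_matrix.npy")` first; the construction call is the same.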
After running the main notebook:
streamlit run streamlit_dashboard.py
Access the dashboard at http://localhost:8501
Dashboard features:
- Interactive map visualization with risk overlays
- Scenario performance metrics
- Stability class distributions
- Hyperparameter comparison charts
├── project_main.ipynb # Main ML pipeline
├── streamlit_dashboard.py # Interactive visualization dashboard
├── detailed_metrics.csv # Model performance metrics (generated)
├── segment_stability.csv # Stability classifications (generated)
├── hyperparameter_results.csv # Tuning results (generated)
└── Enriched Traffic Datasets for Madrid/  # Data directory
Missing Values: Features with more than 20% missing data are dropped; remaining NaN values are imputed with the column median.
Normalization: StandardScaler applied to all features before clustering to ensure equal weighting.
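The preprocessing rules above (drop features with >20% missing values, median-impute the rest, then standardize) can be sketched as follows; the toy DataFrame is a hypothetical stand-in for the sensor readings:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the per-segment sensor features.
df = pd.DataFrame({
    "intensity": [10.0, np.nan, 30.0, 40.0, 20.0],      # 20% missing: kept
    "temperature": [15.0, 16.0, np.nan, 14.0, 15.0],    # 20% missing: kept
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80%: dropped
})

# Drop features with more than 20% missing values.
keep = df.columns[df.isna().mean() <= 0.20]
df = df[keep]

# Impute remaining NaNs with the column median.
df = df.fillna(df.median())

# Standardize so every feature has zero mean and unit variance.
X = StandardScaler().fit_transform(df)

print(list(df.columns))  # ['intensity', 'temperature']
```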
Feature Engineering:
- Time-based features: hour, day_of_week, month, is_weekend
- Graph-aware features: neighbor_traffic_mean computed from adjacency relationships
- Aggregated features: mean traffic intensity, temperature, precipitation, wind, lanes per segment
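The graph-aware `neighbor_traffic_mean` feature can be computed directly from the adjacency matrix: one matrix-vector product gives the sum of each segment's neighbors' traffic, and dividing by node degree turns it into a mean. A minimal sketch with toy values:

```python
import numpy as np

# Toy adjacency (segments 0-3) and per-segment mean traffic intensity.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
traffic = np.array([10.0, 20.0, 30.0, 40.0])

# adj @ traffic sums each segment's neighbors' traffic; divide by degree
# (guarding against isolated nodes) to get the neighbor mean.
deg = adj.sum(axis=1)
neighbor_traffic_mean = (adj @ traffic) / np.where(deg > 0, deg, 1)

print(neighbor_traffic_mean)  # [25. 25. 10. 20.]
```

For example, segment 0 borders segments 1 and 2, so its value is (20 + 30) / 2 = 25.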
Algorithm: K-Means clustering with k-means++ initialization.
Design Justification:
- K-Means suits the continuous feature space and performs efficiently on 553 road segments
- k-means++ initialization reduces sensitivity to random starting points
- Graph-aware features capture spatial dependencies critical for traffic patterns
Hyperparameter Grid:
- Number of clusters: [2, 3, 4]
- Initialization method: ['k-means++', 'random']
- Max iterations: [300, 500]
- Risk threshold percentile: [60, 70, 80]
- Three feature set combinations
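The grid above multiplies out to 3 x 2 x 2 x 3 x 3 = 108 combinations, which can be enumerated with `itertools.product`. The sketch below scores only the clustering-relevant slice of the grid on synthetic data, using silhouette score in place of the project's F1-based selection; the feature-set names are placeholders:

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# The grid from the list above; the three feature sets are placeholder names.
grid = {
    "n_clusters": [2, 3, 4],
    "init": ["k-means++", "random"],
    "max_iter": [300, 500],
    "risk_percentile": [60, 70, 80],
    "feature_set": ["base", "graph", "base+graph"],
}
combos = list(product(*grid.values()))
print(len(combos))  # 108

# Fit and score each (k, init) pair on synthetic stand-in data.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
best = max(
    (
        (k, init, silhouette_score(
            X,
            KMeans(n_clusters=k, init=init, n_init=10,
                   random_state=0).fit_predict(X),
        ))
        for k, init in product(grid["n_clusters"], grid["init"])
    ),
    key=lambda t: t[2],  # keep the combination with the best silhouette
)
print(best[:2])
```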
Data Splitting: Data aggregated per segment with scenario-based validation (no temporal leakage).
Clustering Metrics:
- Silhouette Score: Measures cluster separation (higher is better)
- Davies-Bouldin Index: Evaluates cluster compactness (lower is better)
- Calinski-Harabasz Score: Ratio of between-cluster to within-cluster dispersion
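All three clustering metrics are available in scikit-learn and take the feature matrix plus the predicted labels. A minimal sketch on synthetic blobs standing in for the per-segment feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic data standing in for the standardized per-segment features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=3, init="k-means++", n_init=10,
                random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)         # higher is better, in [-1, 1]
dbi = davies_bouldin_score(X, labels)     # lower is better, >= 0
chs = calinski_harabasz_score(X, labels)  # higher is better

print(round(sil, 3), round(dbi, 3), round(chs, 1))
```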
Classification Metrics (after cluster-to-risk mapping):
- Accuracy, Precision, Recall, F1-Score
- Confusion matrices for each scenario
Model Selection: Best model selected by maximizing F1-score, balancing precision and recall.
Bias-Variance Trade-off:
- Low cluster counts (k=2) may underfit, missing nuanced patterns
- High cluster counts (k=4) risk overfitting to noise
- Grid search identifies optimal balance through cross-scenario validation
Regularization: Feature standardization and median imputation prevent extreme values from dominating.
All functions include docstrings explaining inputs, outputs, and processing logic. Code comments detail:
- Preprocessing steps (datetime conversion, numeric coercion)
- Training loops (parameter grid iteration, metric computation)
- Evaluation logic (majority voting for cluster-to-risk mapping)
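The majority-vote cluster-to-risk mapping mentioned above can be sketched as follows. The toy cluster assignments and binary risk labels (1 = traffic above the chosen risk-threshold percentile) are hypothetical:

```python
import numpy as np

# Hypothetical per-segment cluster assignments and binary risk labels.
clusters = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2])
risk     = np.array([1, 1, 0, 0, 0, 0, 1, 1, 1])

# Majority vote: a cluster is labeled "high risk" iff most of its members are.
cluster_to_risk = {
    int(c): int(np.bincount(risk[clusters == c], minlength=2).argmax())
    for c in np.unique(clusters)
}
predicted = np.array([cluster_to_risk[int(c)] for c in clusters])

print(cluster_to_risk)  # {0: 1, 1: 0, 2: 1}
```

Comparing `predicted` against `risk` then yields the accuracy/precision/recall/F1 figures and confusion matrices reported per scenario.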
Libraries imported successfully!
Configuration loaded!
Loading data...
Original data shape: (45941664, 17)
Preprocessed data shape: (45941664, 21)
Building graph...
Graph created with 553 nodes and 50186 edges
Processing Scenario: weekday_morning_dry
...
Best F1 Score: 0.XXXX
- detailed_metrics.csv: Per-scenario model performance
- Columns: scenario, best_f1_score, best_accuracy, silhouette_score, etc.
- segment_stability.csv: Segment-level stability classifications
- Columns: segment_id, stability_class, high_risk_percentage, n_scenarios
- hyperparameter_results.csv: All tested parameter combinations with scores
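The `stability_class` column can be derived from `high_risk_percentage` (the share of scenarios in which a segment lands in a high-risk cluster). The cutoffs below are hypothetical, chosen only to illustrate the three classes:

```python
# Hypothetical cutoffs: >= 75% of scenarios high-risk -> stable high-risk,
# <= 25% -> stable low-risk, anything in between -> condition-sensitive.
def stability_class(high_risk_percentage):
    if high_risk_percentage >= 75:
        return "stable_high_risk"
    if high_risk_percentage <= 25:
        return "stable_low_risk"
    return "condition_sensitive"

# A segment high-risk in 3 of 4 scenarios -> 75% -> stable high-risk.
print(stability_class(75.0))  # stable_high_risk
print(stability_class(0.0))   # stable_low_risk
print(stability_class(50.0))  # condition_sensitive
```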
- 🗺️ Interactive Map: All scenarios displayed with segments colored by stability class
- 📊 Performance Metrics: F1 Score and Accuracy comparison across scenarios
- 📈 Stability Distribution: Heatmap showing risk percentage distribution
- 🔍 Feature Importance: Bar chart showing which features are most important
pandas>=1.3.0 # Data manipulation
numpy>=1.21.0 # Numerical operations
networkx>=2.6.0 # Graph construction and analysis
scikit-learn>=1.0.0 # Clustering algorithms and metrics
plotly>=5.3.0 # Interactive visualizations
streamlit>=1.12.0 # Dashboard framework
Install all at once:
pip install pandas numpy networkx scikit-learn plotly streamlit
This project demonstrates:
- Unsupervised learning for pattern discovery in unlabeled traffic data
- Graph-based feature engineering leveraging network topology
- Systematic hyperparameter optimization with multi-metric evaluation
- Model validation through scenario-based testing and stability analysis
Author: Shaurya Srivastava
Course: BITS F464 - Machine Learning
Date: 3rd December 2025