
🚦 Graph-Aware Traffic Risk Persona Clustering

A machine learning project that identifies traffic risk patterns in Madrid's road network using graph-aware clustering algorithms. The system analyzes traffic intensity, weather conditions, and road connectivity to classify segments into risk personas across different scenarios.

🎯 Overview

This project implements a novel approach to traffic risk assessment by combining:

  • Graph-based features: Leveraging road network topology using NetworkX
  • Unsupervised clustering: K-Means algorithm with hyperparameter optimization
  • Scenario analysis: Multiple traffic conditions (weekday/weekend, time-of-day, weather)
  • Stability tracking: Cross-scenario segment behavior analysis

✨ Features

  • Graph-Aware Feature Engineering: Computes neighbor traffic patterns using adjacency matrices
  • Automated Hyperparameter Tuning: Grid search across 108+ parameter combinations
  • Multi-Metric Evaluation: F1-score, silhouette score, Davies-Bouldin index, and more
  • Interactive Dashboard: Real-time visualization with Streamlit and Plotly
  • Scenario Comparison: Weekday morning/evening, weekend, and rainy conditions
  • Stability Classification: Identifies stable high-risk, low-risk, and condition-sensitive segments

🔧 Installation

Prerequisites

  • Python 3.8+
  • Jupyter Notebook or VS Code with Jupyter extension

Setup

  1. Clone or download the project
cd traffic-risk-clustering
  2. Install dependencies
pip install pandas numpy networkx scikit-learn plotly streamlit
  3. Download the Madrid Traffic Dataset

    a. Visit the Mendeley Data repository: https://data.mendeley.com/datasets/697ht4f65b/2

    b. Download the Complete Dataset (contains data from 554 sensors across 30 months)

    c. Uncompress the downloaded file (ZIP format)

    d. Move the uncompressed folder into your project directory

  4. Verify dataset structure

    traffic-risk-clustering/
    ├── project_main.ipynb
    ├── streamlit_dashboard.py
    └── Enriched Traffic Datasets for Madrid/
        └── Complete dataset/
            └── 0095ccde8f905222e8791d32c6f3b9dc/
                ├── MTD_complete_data.csv
                ├── MTD_id_longitude_latitude.csv
                └── MTD_adj_matrix.npy
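A quick sanity check for the layout above can be done with the standard library alone. This is a sketch: the paths mirror the tree shown here, and the hash folder name may differ in other versions of the dataset.

```python
from pathlib import Path

# Paths mirror the tree above; adjust the hash folder if your download differs.
DATA_ROOT = (Path("Enriched Traffic Datasets for Madrid")
             / "Complete dataset" / "0095ccde8f905222e8791d32c6f3b9dc")
EXPECTED = ["MTD_complete_data.csv",
            "MTD_id_longitude_latitude.csv",
            "MTD_adj_matrix.npy"]

missing = [name for name in EXPECTED if not (DATA_ROOT / name).exists()]
if missing:
    print("Missing files:", missing)
else:
    print("Dataset layout looks correct.")
```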

🚀 Usage

Running the Main Pipeline

  1. Execute the notebook
jupyter notebook project_main.ipynb

Or open in VS Code and run all cells.

  2. Pipeline stages (automated):
  • Data loading and preprocessing
  • Graph construction from adjacency matrix
  • Scenario filtering (4 scenarios)
  • Feature engineering with neighbor aggregation
  • Hyperparameter tuning via grid search
  • Model training and evaluation
  • Stability analysis
  • Export results to CSV files

Launching the Dashboard

After running the main notebook:

streamlit run streamlit_dashboard.py

Access the dashboard at http://localhost:8501

Dashboard features:

  • Interactive map visualization with risk overlays
  • Scenario performance metrics
  • Stability class distributions
  • Hyperparameter comparison charts

📁 Project Architecture

├── project_main.ipynb # Main ML pipeline
├── streamlit_dashboard.py # Interactive visualization dashboard
├── detailed_metrics.csv # Model performance metrics (generated)
├── segment_stability.csv # Stability classifications (generated)
├── hyperparameter_results.csv # Tuning results (generated)
└── Enriched Traffic Datasets for Madrid/ # Data directory

🔬 Technical Implementation

1. Dataset Preprocessing

Missing Values: Features with >20% missing data are dropped; remaining NaN values imputed with median values.

Normalization: StandardScaler applied to all features before clustering to ensure equal weighting.
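These two steps can be sketched on a toy frame (column names are illustrative, not the dataset's actual schema):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame standing in for the raw sensor data (illustrative column names).
df = pd.DataFrame({
    "intensity":      [10.0, np.nan, 30.0, 40.0, 50.0],
    "temperature":    [15.0, 16.0, np.nan, 14.0, 13.0],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Drop features with more than 20% missing values.
keep = df.columns[df.isna().mean() <= 0.20]
df = df[keep]

# Impute remaining NaNs with the column median.
df = df.fillna(df.median())

# Standardize so every feature contributes equally to K-Means distances.
X = StandardScaler().fit_transform(df)
```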

Feature Engineering:

  • Time-based features: hour, day_of_week, month, is_weekend
  • Graph-aware features: neighbor_traffic_mean computed from adjacency relationships
  • Aggregated features: mean traffic intensity, temperature, precipitation, wind, lanes per segment
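The neighbor aggregation reduces to one matrix product: multiply the adjacency matrix by the per-segment intensity vector and divide by node degree. A sketch with a toy graph (the real pipeline uses the 553×553 adjacency matrix):

```python
import numpy as np

# Toy adjacency matrix and per-segment mean intensities (illustrative values).
adj = np.array([
    [0, 1, 1],
    [1, 0, 0],
    [1, 0, 0],
], dtype=float)
intensity = np.array([10.0, 20.0, 40.0])

degree = adj.sum(axis=1)
# Average traffic over each segment's graph neighbors;
# isolated nodes (degree 0) fall back to 0.
neighbor_traffic_mean = np.divide(adj @ intensity, degree,
                                  out=np.zeros_like(intensity),
                                  where=degree > 0)
print(neighbor_traffic_mean)  # prints: [30. 10. 10.]
```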

2. Model Design

Algorithm: K-Means clustering with k-means++ initialization.

Design Justification:

  • K-Means suits the continuous feature space and performs efficiently on 553 road segments
  • k-means++ initialization reduces sensitivity to random starting points
  • Graph-aware features capture spatial dependencies critical for traffic patterns

Hyperparameter Grid:

  • Number of clusters: [2, 3, 4]
  • Initialization method: ['k-means++', 'random']
  • Max iterations: [300, 500]
  • Risk threshold percentile: [60, 70, 80]
  • Three feature set combinations
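The search over this grid is a plain nested loop. The sketch below scores candidates with silhouette on random data for brevity, whereas the project selects by F1-score after mapping clusters to risk labels; the feature-set and threshold dimensions are omitted here.

```python
import itertools
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))  # stand-in for the per-segment feature matrix

grid = {
    "n_clusters": [2, 3, 4],
    "init": ["k-means++", "random"],
    "max_iter": [300, 500],
}

best = None
for k, init, max_iter in itertools.product(*grid.values()):
    labels = KMeans(n_clusters=k, init=init, max_iter=max_iter,
                    n_init=10, random_state=42).fit_predict(X)
    score = silhouette_score(X, labels)
    if best is None or score > best[0]:
        best = (score, {"n_clusters": k, "init": init, "max_iter": max_iter})

print("best params:", best[1])
```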

3. Training & Evaluation

Data Splitting: Data is aggregated per segment, and validation is scenario-based to avoid temporal leakage.

Clustering Metrics:

  • Silhouette Score: Measures cluster separation (higher is better)
  • Davies-Bouldin Index: Evaluates cluster compactness (lower is better)
  • Calinski-Harabasz Score: Ratio of between-cluster to within-cluster dispersion
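Each of these is a single call in scikit-learn. A sketch on toy blobs, with the metric's direction noted inline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

rng = np.random.default_rng(1)
# Two well-separated toy blobs standing in for segment features.
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)),
               rng.normal(4.0, 0.3, (40, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(f"silhouette:        {silhouette_score(X, labels):.3f}")        # higher is better
print(f"davies-bouldin:    {davies_bouldin_score(X, labels):.3f}")    # lower is better
print(f"calinski-harabasz: {calinski_harabasz_score(X, labels):.1f}") # higher is better
```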

Classification Metrics (after cluster-to-risk mapping):

  • Accuracy, Precision, Recall, F1-Score
  • Confusion matrices for each scenario

Model Selection: Best model selected by maximizing F1-score, balancing precision and recall.

Bias-Variance Trade-off:

  • Low cluster counts (k=2) may underfit, missing nuanced patterns
  • High cluster counts (k=4) risk overfitting to noise
  • Grid search identifies optimal balance through cross-scenario validation

Regularization: Feature standardization and median imputation prevent extreme values from dominating.

4. Documentation

All functions include docstrings explaining inputs, outputs, and processing logic. Code comments detail:

  • Preprocessing steps (datetime conversion, numeric coercion)
  • Training loops (parameter grid iteration, metric computation)
  • Evaluation logic (majority voting for cluster-to-risk mapping)
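The majority-vote mapping mentioned above can be sketched like this. The values are toys; in the real pipeline the threshold comes from the tuned risk-threshold percentile and intensities are per-segment aggregates.

```python
import numpy as np

# Toy cluster assignments and per-segment mean intensities.
clusters = np.array([0, 0, 0, 1, 1, 1])
intensity = np.array([10.0, 12.0, 55.0, 60.0, 70.0, 80.0])

# Ground-truth risk labels from the risk-threshold percentile (70th here).
threshold = np.percentile(intensity, 70)
risk = (intensity > threshold).astype(int)

# Majority vote: a cluster is "high risk" if most of its members are.
cluster_to_risk = {c: int(risk[clusters == c].mean() >= 0.5)
                   for c in np.unique(clusters)}
predicted = np.array([cluster_to_risk[c] for c in clusters])
print(predicted)  # prints: [0 0 0 1 1 1]
```

The `predicted` labels are then compared against `risk` to compute accuracy, precision, recall, and F1 per scenario.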

📊 Expected Outputs

Console Outputs (during execution)

Libraries imported successfully!
Configuration loaded!
Loading data...
Original data shape: (45941664, 17)
Preprocessed data shape: (45941664, 21)
Building graph...
Graph created with 553 nodes and 50186 edges
Processing Scenario: weekday_morning_dry
...
Best F1 Score: 0.XXXX

Generated CSV Files

  1. detailed_metrics.csv: Per-scenario model performance

    • Columns: scenario, best_f1_score, best_accuracy, silhouette_score, etc.
  2. segment_stability.csv: Segment-level stability classifications

    • Columns: segment_id, stability_class, high_risk_percentage, n_scenarios
  3. hyperparameter_results.csv: All tested parameter combinations with scores

Dashboard Visualizations

  • 🗺️ Interactive Map: All scenarios displayed with segments colored by stability class
  • 📊 Performance Metrics: F1 Score and Accuracy comparison across scenarios
  • 📈 Stability Distribution: Heatmap showing risk percentage distribution
  • 🔍 Feature Importance: Bar chart ranking the features that most influence cluster assignments

📦 Dependencies

pandas>=1.3.0 # Data manipulation
numpy>=1.21.0 # Numerical operations
networkx>=2.6.0 # Graph construction and analysis
scikit-learn>=1.0.0 # Clustering algorithms and metrics
plotly>=5.3.0 # Interactive visualizations
streamlit>=1.12.0 # Dashboard framework

Install all at once:

pip install pandas numpy networkx scikit-learn plotly streamlit

🎓 Academic Context

This project demonstrates:

  • Unsupervised learning for pattern discovery in unlabeled traffic data
  • Graph-based feature engineering leveraging network topology
  • Systematic hyperparameter optimization with multi-metric evaluation
  • Model validation through scenario-based testing and stability analysis

Author: Shaurya Srivastava
Course: BITS F464 - Machine Learning
Date: 3rd December 2025
