A machine learning project that identifies traffic risk patterns in Madrid's road network using graph-aware clustering algorithms. The system analyzes traffic intensity, weather conditions, and road connectivity to classify segments into risk personas across different scenarios.
- Overview
- Features
- Installation
- Usage
- Project Structure
- Technical Implementation
- Expected Outputs
- Dependencies
This project implements a novel approach to traffic risk assessment by combining:
- Graph-based features: Leveraging road network topology using NetworkX
- Unsupervised clustering: K-Means algorithm with hyperparameter optimization
- Scenario analysis: Multiple traffic conditions (weekday/weekend, time-of-day, weather)
- Stability tracking: Cross-scenario segment behavior analysis
- Graph-Aware Feature Engineering: Computes neighbor traffic patterns using adjacency matrices
- Automated Hyperparameter Tuning: Grid search across 108+ parameter combinations
- Multi-Metric Evaluation: F1-score, silhouette score, Davies-Bouldin index, and more
- Interactive Dashboard: Real-time visualization with Streamlit and Plotly
- Scenario Comparison: Weekday morning/evening, weekend, and rainy conditions
- Stability Classification: Identifies stable high-risk, low-risk, and condition-sensitive segments
- Python 3.8+
- Jupyter Notebook or VS Code with Jupyter extension
- Clone or download the project
cd ml-project
- Install dependencies
pip install pandas numpy networkx scikit-learn plotly streamlit
- Download the Madrid Traffic Dataset
a. Visit the Mendeley Data repository: https://data.mendeley.com/datasets/697ht4f65b/2
b. Download the Complete Dataset (contains data from 554 sensors across 30 months)
c. Uncompress the downloaded file (ZIP format)
d. Move the uncompressed folder into your project directory
- Verify the dataset structure
traffic-risk-clustering/
├── project_main.ipynb
├── streamlit_dashboard.py
└── Enriched Traffic Datasets for Madrid/
└── Complete dataset/
└── 0095ccde8f905222e8791d32c6f3b9dc/
├── MTD_complete_data.csv
├── MTD_id_longitude_latitude.csv
└── MTD_adj_matrix.npy
- Execute the notebook
jupyter notebook project_main.ipynb
Or open in VS Code and run all cells.
- Pipeline stages (automated):
- Data loading and preprocessing
- Graph construction from adjacency matrix
- Scenario filtering (4 scenarios)
- Feature engineering with neighbor aggregation
- Hyperparameter tuning via grid search
- Model training and evaluation
- Stability analysis
- Export results to CSV files
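The graph-construction stage above can be sketched in a few lines. This is a minimal, self-contained illustration using a toy 4x4 matrix in place of the real `MTD_adj_matrix.npy` (which, per the pipeline output, yields 553 nodes and 50,186 edges):

```python
import numpy as np
import networkx as nx

# Toy stand-in for MTD_adj_matrix.npy: a small symmetric adjacency matrix.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
])

# Build an undirected graph whose nodes are road-segment indices.
G = nx.from_numpy_array(adj)

print(G.number_of_nodes(), G.number_of_edges())  # 4 3
```

With the real dataset you would load the matrix via `np.load("MTD_adj_matrix.npy")` first; the construction call is the same.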
After running the main notebook:
streamlit run streamlit_dashboard.py
Access the dashboard at http://localhost:8501
Dashboard features:
- Interactive map visualization with risk overlays
- Scenario performance metrics
- Stability class distributions
- Hyperparameter comparison charts
├── project_main.ipynb # Main ML pipeline
├── streamlit_dashboard.py # Interactive visualization dashboard
├── detailed_metrics.csv # Model performance metrics (generated)
├── segment_stability.csv # Stability classifications (generated)
├── hyperparameter_results.csv # Tuning results (generated)
└── Enriched Traffic Datasets for Madrid/  # Data directory
Missing Values: Features with more than 20% missing data are dropped; remaining NaN values are imputed with the column median.
Normalization: StandardScaler applied to all features before clustering to ensure equal weighting.
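The preprocessing rules above (drop features with >20% missing values, median-impute the rest, then standardize) can be sketched as follows; the toy DataFrame is a hypothetical stand-in for the sensor readings:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy frame standing in for the per-segment sensor features.
df = pd.DataFrame({
    "intensity": [10.0, np.nan, 30.0, 40.0, 20.0],      # 20% missing: kept
    "temperature": [15.0, 16.0, np.nan, 14.0, 15.0],    # 20% missing: kept
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],  # 80%: dropped
})

# Drop features with more than 20% missing values.
keep = df.columns[df.isna().mean() <= 0.20]
df = df[keep]

# Impute remaining NaNs with the column median.
df = df.fillna(df.median())

# Standardize so every feature has zero mean and unit variance.
X = StandardScaler().fit_transform(df)

print(list(df.columns))  # ['intensity', 'temperature']
```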
Feature Engineering:
- Time-based features: hour, day_of_week, month, is_weekend
- Graph-aware features: neighbor_traffic_mean computed from adjacency relationships
- Aggregated features: mean traffic intensity, temperature, precipitation, wind, lanes per segment
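The graph-aware `neighbor_traffic_mean` feature can be computed directly from the adjacency matrix: one matrix-vector product gives the sum of each segment's neighbors' traffic, and dividing by node degree turns it into a mean. A minimal sketch with toy values:

```python
import numpy as np

# Toy adjacency (segments 0-3) and per-segment mean traffic intensity.
adj = np.array([
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
traffic = np.array([10.0, 20.0, 30.0, 40.0])

# adj @ traffic sums each segment's neighbors' traffic; divide by degree
# (guarding against isolated nodes) to get the neighbor mean.
deg = adj.sum(axis=1)
neighbor_traffic_mean = (adj @ traffic) / np.where(deg > 0, deg, 1)

print(neighbor_traffic_mean)  # [25. 25. 10. 20.]
```

For example, segment 0 borders segments 1 and 2, so its value is (20 + 30) / 2 = 25.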
Algorithm: K-Means clustering with k-means++ initialization.
Design Justification:
- K-Means suits the continuous feature space and performs efficiently on 553 road segments
- k-means++ initialization reduces sensitivity to random starting points
- Graph-aware features capture spatial dependencies critical for traffic patterns
Hyperparameter Grid:
- Number of clusters: [2, 3, 4]
- Initialization method: ['k-means++', 'random']
- Max iterations: [300, 500]
- Risk threshold percentile: [60, 70, 80]
- Three feature set combinations
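The grid above multiplies out to 3 x 2 x 2 x 3 x 3 = 108 combinations, which can be enumerated with `itertools.product`. The sketch below scores only the clustering-relevant slice of the grid on synthetic data, using silhouette score in place of the project's F1-based selection; the feature-set names are placeholders:

```python
from itertools import product

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# The grid from the list above; the three feature sets are placeholder names.
grid = {
    "n_clusters": [2, 3, 4],
    "init": ["k-means++", "random"],
    "max_iter": [300, 500],
    "risk_percentile": [60, 70, 80],
    "feature_set": ["base", "graph", "base+graph"],
}
combos = list(product(*grid.values()))
print(len(combos))  # 108

# Fit and score each (k, init) pair on synthetic stand-in data.
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
best = max(
    (
        (k, init, silhouette_score(
            X,
            KMeans(n_clusters=k, init=init, n_init=10,
                   random_state=0).fit_predict(X),
        ))
        for k, init in product(grid["n_clusters"], grid["init"])
    ),
    key=lambda t: t[2],  # keep the combination with the best silhouette
)
print(best[:2])
```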
Data Splitting: Data aggregated per segment with scenario-based validation (no temporal leakage).
Clustering Metrics:
- Silhouette Score: Measures cluster separation (higher is better)
- Davies-Bouldin Index: Evaluates cluster compactness (lower is better)
- Calinski-Harabasz Score: Ratio of between-cluster to within-cluster dispersion
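All three clustering metrics are available in scikit-learn and take the feature matrix plus the predicted labels. A minimal sketch on synthetic blobs standing in for the per-segment feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (silhouette_score, davies_bouldin_score,
                             calinski_harabasz_score)

# Synthetic data standing in for the standardized per-segment features.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)
labels = KMeans(n_clusters=3, init="k-means++", n_init=10,
                random_state=42).fit_predict(X)

sil = silhouette_score(X, labels)         # higher is better, in [-1, 1]
dbi = davies_bouldin_score(X, labels)     # lower is better, >= 0
chs = calinski_harabasz_score(X, labels)  # higher is better

print(round(sil, 3), round(dbi, 3), round(chs, 1))
```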
Classification Metrics (after cluster-to-risk mapping):
- Accuracy, Precision, Recall, F1-Score
- Confusion matrices for each scenario
Model Selection: Best model selected by maximizing F1-score, balancing precision and recall.
Bias-Variance Trade-off:
- Low cluster counts (k=2) may underfit, missing nuanced patterns
- High cluster counts (k=4) risk overfitting to noise
- Grid search identifies optimal balance through cross-scenario validation
Regularization: Feature standardization and median imputation prevent extreme values from dominating.
All functions include docstrings explaining inputs, outputs, and processing logic. Code comments detail:
- Preprocessing steps (datetime conversion, numeric coercion)
- Training loops (parameter grid iteration, metric computation)
- Evaluation logic (majority voting for cluster-to-risk mapping)
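The majority-vote cluster-to-risk mapping mentioned above can be sketched as follows. The toy cluster assignments and binary risk labels (1 = traffic above the chosen risk-threshold percentile) are hypothetical:

```python
import numpy as np

# Hypothetical per-segment cluster assignments and binary risk labels.
clusters = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2])
risk     = np.array([1, 1, 0, 0, 0, 0, 1, 1, 1])

# Majority vote: a cluster is labeled "high risk" iff most of its members are.
cluster_to_risk = {
    int(c): int(np.bincount(risk[clusters == c], minlength=2).argmax())
    for c in np.unique(clusters)
}
predicted = np.array([cluster_to_risk[int(c)] for c in clusters])

print(cluster_to_risk)  # {0: 1, 1: 0, 2: 1}
```

Comparing `predicted` against `risk` then yields the accuracy/precision/recall/F1 figures and confusion matrices reported per scenario.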
Libraries imported successfully!
Configuration loaded!
Loading data...
Original data shape: (45941664, 17)
Preprocessed data shape: (45941664, 21)
Building graph...
Graph created with 553 nodes and 50186 edges
Processing Scenario: weekday_morning_dry
...
Best F1 Score: 0.XXXX
- detailed_metrics.csv: Per-scenario model performance
- Columns: scenario, best_f1_score, best_accuracy, silhouette_score, etc.
- segment_stability.csv: Segment-level stability classifications
- Columns: segment_id, stability_class, high_risk_percentage, n_scenarios
- hyperparameter_results.csv: All tested parameter combinations with scores
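The `stability_class` column can be derived from `high_risk_percentage` (the share of scenarios in which a segment lands in a high-risk cluster). The cutoffs below are hypothetical, chosen only to illustrate the three classes:

```python
# Hypothetical cutoffs: >= 75% of scenarios high-risk -> stable high-risk,
# <= 25% -> stable low-risk, anything in between -> condition-sensitive.
def stability_class(high_risk_percentage):
    if high_risk_percentage >= 75:
        return "stable_high_risk"
    if high_risk_percentage <= 25:
        return "stable_low_risk"
    return "condition_sensitive"

# A segment high-risk in 3 of 4 scenarios -> 75% -> stable high-risk.
print(stability_class(75.0))  # stable_high_risk
print(stability_class(0.0))   # stable_low_risk
print(stability_class(50.0))  # condition_sensitive
```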
- 🗺️ Interactive Map: All scenarios displayed with segments colored by stability class
- 📊 Performance Metrics: F1 Score and Accuracy comparison across scenarios
- 📈 Stability Distribution: Heatmap showing risk percentage distribution
- 🔍 Feature Importance: Bar chart showing which features are most important
pandas>=1.3.0 # Data manipulation
numpy>=1.21.0 # Numerical operations
networkx>=2.6.0 # Graph construction and analysis
scikit-learn>=1.0.0 # Clustering algorithms and metrics
plotly>=5.3.0 # Interactive visualizations
streamlit>=1.12.0 # Dashboard framework
Install all at once:
pip install pandas numpy networkx scikit-learn plotly streamlit
This project demonstrates:
- Unsupervised learning for pattern discovery in unlabeled traffic data
- Graph-based feature engineering leveraging network topology
- Systematic hyperparameter optimization with multi-metric evaluation
- Model validation through scenario-based testing and stability analysis
Author: Shaurya Srivastava
Course: BITS F464 - Machine Learning
Date: 3rd December 2025