Machine Learning Clustering Analysis using KMeans Algorithm & Alternating Direction Method of Multipliers (ADMM) on World Bank Data
- Overview
-
- Project Structure
-
- Dataset
-
- Methodology
-
- Results
-
- Technical Stack
-
- Installation & Setup
-
- Usage
-
-
This project implements unsupervised machine learning clustering techniques to analyze World Bank socioeconomic data. The analysis employs:
- KMeans Clustering - Partitioning-based approach for discovering natural groupings
- Elbow Method - Optimal cluster determination using inertia analysis
-
Linear Regression per Cluster - Trend analysis within each identified cluster
-
Model Evaluation - Cross-validation and performance metrics
-
The project demonstrates end-to-end ML pipeline from data preprocessing to visualization and interpretation.
- Multiple socioeconomic variables (GDP, Education, Healthcare, etc.)
- Time-series data from multiple countries
- Missing value handling via imputation
-
Normalized features for clustering
-
Preprocessing Steps:
-
- Data loading from World Bank API
-
- Feature selection and scaling
-
- Handling missing values (mean/median imputation)
-
-
Normalization using StandardScaler for KMeans
-
- Project structure creation with organized directories
- Data loading pipeline from World Bank datasets
-
Data validation and exploration
- Normalization and standardization
- Handling missing values
- Outlier detection and treatment
-
Feature engineering where applicable
-
KMeans Algorithm Steps: ├── Initialize K cluster centroids ├── Assign data points to nearest centroid ├── Recalculate centroid positions ├── Repeat until convergence └── Evaluate cluster quality (Inertia, Silhouette Score)Optimal Cluster Determination (Elbow Method):
- Tested k=2 to k=10 clusters
- Plotted inertia vs number of clusters
-
Selected k where inertia improvement diminishes
- Fitted linear regression models within each cluster
- Captured cluster-specific trends
-
Analyzed coefficient significance
- Cross-validation (K-Fold)
- Inertia and silhouette score computation
-
Cluster homogeneity assessment
- Cluster scatter plots (PCA for dimensionality reduction)
- Elbow curve visualization
-
Regression line plots per cluster
- Heatmaps of cluster characteristics
-
Natural Groupings: World Bank indicators reveal 3-4 distinct country clusters based on development metrics
-
-
Cluster-Specific Trends: Linear regression within clusters shows different economic trajectories
-
-
Model Robustness: Cross-validation indicates stable cluster assignments and generalizable patterns
-
-
Practical Application: Clustering enables targeted policy recommendations for different country groups
-
- scikit-learn Clustering
- KMeans Algorithm
- World Bank Open Data
Zahoor Khan CEO @ PyCode Ltd | Data Scientist | ML Engineer 📍 London, UK 🔗 GitHub | Website
This project is licensed under the MIT License - see LICENSE file for details.
⭐ If you found this helpful, please consider starring the repository!
-
-
-
Metric Value Interpretation Optimal Clusters (k) 3-4 Determined via Elbow method Silhouette Score 0.65-0.75 Good cluster separation Inertia Reduction 70%+ Significant improvement from k=1 to optimal k Cross-Val Score 0.72 avg Reliable model generalization ✅ Cluster 1: High-income developed nations with stable trends ✅ Cluster 2: Emerging economies with growth potential ✅ Cluster 3: Developing nations with resource constraints
Component Technology Language Python 3.8+ ML Framework scikit-learn 1.0+ Data Processing Pandas, NumPy Visualization Matplotlib, Seaborn Notebook Jupyter Lab/Notebook Data Source World Bank API
Python 3.8 or higher pip or conda package manager
git clone https://github.com/hacker007S/Clustering_report_ADS.git cd Clustering_report_ADSpython -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
jupyter notebook Visualization_code.py # Or run as script: python Visualization_code.py
from sklearn.cluster import KMeans from sklearn.preprocessing import StandardScaler import pandas as pd # Load and prepare data data = pd.read_csv('data/processed/cleaned_data.csv') X = StandardScaler().fit_transform(data) # Find optimal clusters using Elbow method inertias = [] for k in range(1, 11): kmeans = KMeans(n_clusters=k, random_state=42, n_init=10) kmeans.fit(X) inertias.append(kmeans.inertia_) # Fit final model optimal_k = 4 # Determined from Elbow plot kmeans = KMeans(n_clusters=optimal_k, random_state=42, n_init=10) clusters = kmeans.fit_predict(X) # Add cluster assignments to dataframe data['Cluster'] = clusters
import matplotlib.pyplot as plt # Plot elbow curve plt.plot(range(1, 11), inertias, 'bo-') plt.xlabel('Number of Clusters (k)') plt.ylabel('Inertia') plt.title('Elbow Method For Optimal k') plt.show() # Visualize clusters (using PCA for 2D projection) from sklearn.decomposition import PCA pca = PCA(n_components=2) X_pca = pca.fit_transform(X) plt.scatter(X_pca[:, 0], X_pca[:, 1], c=clusters, cmap='viridis') plt.xlabel('PC1') plt.ylabel('PC2') plt.title('KMeans Clustering Results') plt.show()
-
Clustering_report_ADS/ ├── README.md # Project documentation (this file) ├── Visualization_code.py # Clustering visualization module ├── requirements.txt # Python dependencies └── data/ ├── raw/ # Original World Bank datasets ├── processed/ # Cleaned & preprocessed data └── results/ # Clustering outputs & predictions
Source: World Bank Development Indicators
Characteristics: