by Golib Sanaev
A reproducible data science pipeline to forecast international migration flows (1990–2023, extended to 2030) using demographic, economic, and human development indicators.
Built with Linear Regression and Random Forest models, and interpreted using SHAP explainability methods.
You can explore the full analysis directly here:
- Data Preparation & Cleaning
- Exploratory Data Analysis (EDA)
- Feature Engineering & Modeling
- Model Interpretation & Scenario Analysis
- Forecasting & Validation
Author: Golib Sanaev · LinkedIn · GitHub
Problem Statement:
International migration is driven by intertwined economic, demographic, and social dynamics. Understanding these drivers and projecting future migration trends are essential for policy and planning.
Goal:
To build a reproducible forecasting pipeline that models net migration (per 1,000 people) for 168 countries (1990–2023) using open data from the World Bank and UNDP, and extends forecasts through 2030 under multiple socioeconomic scenarios.
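The target is a rate, not a raw count. A minimal sketch of that conversion, assuming net migration counts and total population are available for the same country-year (the function name is illustrative and not taken from the repository):

```python
def net_migration_per_1000(net_migration: float, population: float) -> float:
    """Return net migration per 1,000 inhabitants for one country-year."""
    if population <= 0:
        raise ValueError("population must be positive")
    return net_migration / population * 1000.0


# Illustrative numbers only: 50,000 net arrivals in a population of 10 million.
rate = net_migration_per_1000(50_000, 10_000_000)
print(rate)
```

Negative values indicate net emigration; dividing by population makes small and large countries comparable.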
Methods:
- Automated data extraction from World Bank WDI API
- Manual integration of UNDP Human Development Index (HDI)
- Feature engineering (lags, caps, interactions, scaling)
- Machine learning (Linear Regression, Random Forest)
- Explainability with SHAP values
- Forecasting with scenario-based inference and uncertainty intervals
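The feature-engineering step listed above (lags, caps, interactions, scaling) might look roughly like the following sketch. The column names (`country`, `year`, `gdp_growth`, `pop_growth`), the 1st/99th percentile caps, and the z-score scaling are assumptions chosen for illustration; the notebooks may use different indicators and settings.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add a lag, a capped column, an interaction, and a z-scored column.

    Column names and cap percentiles are hypothetical, not the
    repository's actual configuration.
    """
    out = df.sort_values(["country", "year"]).copy()

    # Lag: previous year's GDP growth within each country.
    out["gdp_growth_lag1"] = out.groupby("country")["gdp_growth"].shift(1)

    # Cap: clip extreme values to the 1st and 99th percentiles.
    lower, upper = out["gdp_growth"].quantile([0.01, 0.99])
    out["gdp_growth_capped"] = out["gdp_growth"].clip(lower=lower, upper=upper)

    # Interaction: product of two drivers.
    out["gdp_x_pop"] = out["gdp_growth_capped"] * out["pop_growth"]

    # Scaling: z-score (a real pipeline would fit this on training years only).
    col = out["pop_growth"]
    out["pop_growth_z"] = (col - col.mean()) / col.std(ddof=0)
    return out


toy = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002, 2000, 2001, 2002],
    "gdp_growth": [2.0, 3.0, -1.0, 5.0, 4.0, 40.0],
    "pop_growth": [1.0, 1.1, 1.2, 2.0, 2.1, 2.2],
})
features = engineer_features(toy)
print(features[["country", "year", "gdp_growth_lag1", "gdp_growth_capped"]])
```

Grouping by country before shifting keeps one country's last year from leaking into the next country's first year.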
- Economic and demographic factors (GDP growth, unemployment, fertility) dominate migration variability.
- HDI and population growth contribute significantly to explaining migration intensity.
- Regional heterogeneity: income groups and regional aggregates show distinct migration patterns.
- Forecast performance is stable, with strong temporal generalization (1990–2023).
- Forecasting update: the pipeline now extends migration projections through 2030, generating baseline, high-growth, crisis, and demographic-pressure scenarios with 90% prediction intervals and global/regional aggregation outputs.
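One way to express such scenarios is as additive shocks applied to a baseline feature row before it is passed to the fitted model. The sketch below is only an illustration: the feature names and shock sizes are hypothetical and are not the values used in the notebooks.

```python
# Hypothetical additive shocks (percentage points); not the repository's values.
SCENARIOS = {
    "baseline": {},
    "high_growth": {"gdp_growth": 1.5, "unemployment": -1.0},
    "crisis": {"gdp_growth": -3.0, "unemployment": 2.5},
    "demographic_pressure": {"pop_growth": 0.5, "fertility": 0.3},
}


def apply_scenario(features: dict[str, float], scenario: str) -> dict[str, float]:
    """Return a copy of `features` with the scenario's shocks added."""
    shocks = SCENARIOS[scenario]
    unknown = set(shocks) - set(features)
    if unknown:
        raise KeyError(f"scenario refers to missing features: {sorted(unknown)}")
    return {name: value + shocks.get(name, 0.0) for name, value in features.items()}


base_row = {"gdp_growth": 2.0, "unemployment": 6.0, "pop_growth": 1.0, "fertility": 2.1}
for name in SCENARIOS:
    print(name, apply_scenario(base_row, name))
```

Each adjusted row would then be scored by the trained model, so differences between scenarios come only from the shocked inputs.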
├── data/
│   ├── raw/                                  # Original World Bank & UNDP data
│   │   ├── hdr-data.xlsx                     # Manual UNDP HDI data
│   │   ├── wdi_data.csv                      # Extracted via src/migration/wdi_data.py
│   │   └── wdi_metadata.csv                  # WDI metadata
│   └── processed/                            # Cleaned and transformed datasets
│       ├── countries_clean.csv               # Country-level cleaned dataset
│       ├── aggregates_clean.csv              # Regional/income group aggregates
│       ├── countries_only.csv                # Filtered country subset (no aggregates)
│       ├── aggregates_only.csv               # Filtered aggregates subset
│       ├── dropped_countries.csv             # Countries removed due to missingness
│       ├── model_ready.csv                   # Final dataset for ML training
│       ├── model_ready.parquet               # Optimized parquet version
│       ├── wdi_hdr.csv                       # Combined WDI + HDI merged dataset
│       └── .gitkeep
│
├── data_reserve/                             # Backup merged dataset
│   └── wdi_hdr_2025-10-13.csv
│
├── src/migration/
│   ├── wdi_data.py                           # Downloads WDI indicators
│   ├── merge_data.py                         # Merges WDI & HDI data
│   └── __init__.py
│
├── notebooks/
│   ├── 01-data-preparation-cleaning.ipynb
│   ├── 02-exploratory-data-analysis.ipynb
│   ├── 03-feature-engineering-modeling.ipynb
│   ├── 04-model-interpretation-scenario-analysis.ipynb
│   └── 05-forecasting-validation.ipynb
│
├── models/                                   # Trained models & artifacts
│   ├── random_forest_model.pkl               # Final trained Random Forest model
│   ├── X_columns.pkl                         # Feature column order used in training
│   ├── 03_rf_feature_importance.csv          # SHAP / permutation feature importance
│   ├── 03_results_summary.csv                # Cross-validation and training metrics
│   └── .gitkeep
│
├── outputs/                                  # Evaluation and forecasting results
│   ├── backtest_metrics_by_fold.csv          # Fold-level performance metrics
│   ├── backtest_diagnostics_by_income.csv    # Metrics aggregated by income group
│   ├── backtest_oof_predictions.csv          # Out-of-fold predictions
│   ├── forecast_results_2024_2030.csv        # Clean future forecasts (2024–2030)
│   ├── forecast_global_trends.csv            # Global scenario mean trends
│   ├── residuals_vs_pred.png                 # Residual plot visualization
│   └── .gitkeep
│
├── docs/                                     # Executed HTML notebooks and figures
│   ├── 01-data-preparation-cleaning.html
│   ├── 02-exploratory-data-analysis.html
│   ├── 03-feature-engineering-modeling.html
│   ├── 04-model-interpretation-scenario-analysis.html
│   ├── 05-forecasting-validation.html
│   └── correlation_heatmap_country_level.png
│
├── DATA_INSTRUCTIONS.md                      # How to download UNDP HDI data
├── README.md                                 # Project documentation
└── pyproject.toml                            # uv project configuration
Programming language:
- Python 3.12
Core libraries:
- pandas, numpy, matplotlib, seaborn
- scikit-learn, shap, joblib
- pathlib, tqdm, warnings
Tools:
- JupyterLab
- uv (for dependency and environment management)
- Git & GitHub
Sources:
- World Bank – World Development Indicators (WDI)
  - Downloaded automatically via: `uv run python -m src.migration.wdi_data`
  - Produces `wdi_data.csv` and `wdi_metadata.csv` in `data/raw/`
- UNDP – Human Development Index (HDI)
  - Download manually from UNDP Data Center
  - Save as `hdr-data.xlsx` in `data/raw/`
  - See detailed guide: `DATA_INSTRUCTIONS.md`
- Merged Dataset (HDI + WDI)
  - Created using: `uv run python -m src.migration.merge_data`
  - Output: `data/processed/wdi_hdr.csv`
- Backup (for reproducibility)
  - A reference copy is stored as `data_reserve/wdi_hdr_2025-10-13.csv`
- Cleans and merges WDI and HDI datasets
- Removes aggregates and incomplete countries
- Converts net migration to per 1,000 people
- Examines indicator distributions and missingness
- Visualizes migration trends (global, regional, income-group)
- Produces correlation heatmaps and outlier diagnostics
- Constructs target and feature matrices
- Applies lags, interactions, and scaling
- Implements time-aware validation splits
- Trains and interprets Random Forest using SHAP values
- Runs economic and demographic what-if scenarios
- Extends migration forecasts through 2030
- Uses expanding-window and rolling-origin temporal validation
- Estimates 90 % empirical prediction intervals from residuals
- Generates baseline, growth, crisis, and demographic pressure scenarios
- Exports clean global and regional forecast artifacts
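The temporal validation and residual-based intervals described above can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: the helper names, the first test year (2015), the three-year horizon, and the synthetic residuals are all hypothetical.

```python
import numpy as np


def expanding_window_folds(years, first_test_year, horizon=1):
    """Yield (train_years, test_years) pairs in which training data always
    precedes the test window and grows by `horizon` years per fold."""
    years = sorted(set(years))
    last = years[-1]
    start = first_test_year
    while start <= last:
        train = [y for y in years if y < start]
        test = [y for y in years if start <= y < start + horizon]
        if train and test:
            yield train, test
        start += horizon


def empirical_interval(point_forecast, residuals, coverage=0.90):
    """Shift a point forecast by the empirical residual quantiles.

    Residuals are defined as (actual - predicted) on held-out folds.
    """
    alpha = (1.0 - coverage) / 2.0
    lo, hi = np.quantile(residuals, [alpha, 1.0 - alpha])
    return point_forecast + lo, point_forecast + hi


folds = list(expanding_window_folds(range(1990, 2024), first_test_year=2015, horizon=3))
print(len(folds), folds[0][1], folds[-1][1])

rng = np.random.default_rng(0)
oof_residuals = rng.normal(loc=0.0, scale=5.0, size=1000)  # synthetic residuals
print(empirical_interval(2.0, oof_residuals))
```

Because the interval comes from out-of-fold residuals rather than a distributional assumption, it inherits any skew in the model's errors.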
| Metric | Cross-Validation |
|---|---|
| MAE | ~3.0–3.5 |
| RMSE | ~5.5–6.5 |
| R² | 0.60–0.65 |
| 90% PI coverage | ≈ 0.75 |
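For reference, the four metrics in the table can be computed from held-out predictions as in the sketch below. The function name and toy numbers are illustrative only; they are not the project's data or results.

```python
import numpy as np


def regression_metrics(y_true, y_pred, lower, upper):
    """Compute MAE, RMSE, R², and prediction-interval coverage."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    errors = y_true - y_pred
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    return {
        "MAE": float(np.mean(np.abs(errors))),
        "RMSE": float(np.sqrt(np.mean(errors ** 2))),
        "R2": float(1.0 - ss_res / ss_tot),
        # Share of actual values that fall inside their interval.
        "PI_coverage": float(np.mean((y_true >= lower) & (y_true <= upper))),
    }


# Toy values, for illustration only.
y_true = [1.0, -2.0, 4.0, 0.0]
y_pred = [0.0, -1.0, 3.0, 2.0]
lower = [p - 1.5 for p in y_pred]
upper = [p + 1.5 for p in y_pred]
print(regression_metrics(y_true, y_pred, lower, upper))
```

Coverage below the nominal level means the intervals are narrower than the errors actually observed on held-out data.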
Interpretation:
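The Random Forest model performs consistently across temporal validation folds and explains roughly 60–65% of the variance in net migration rates (R² 0.60–0.65).
Accuracy is broadly consistent across income groups and is highest for high-income economies. The nominal 90% prediction intervals cover about 75% of observations in cross-validation, so they are narrower than their nominal level implies.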
Top predictive drivers:
GDP growth, unemployment, population growth, HDI, and fertility rate.
Most variable regions:
Sub-Saharan Africa, Middle East & North Africa, and Europe & Central Asia.
Forecasting results (2024–2030):
Baseline forecasts show stable global migration inflows, while high-growth and crisis scenarios diverge moderately, reflecting macroeconomic sensitivity and demographic pressures.
# Clone repository
git clone https://github.com/gsanaev/forecasting-migration-flows-ml.git
cd forecasting-migration-flows-ml
# Sync environment (installs Python and dependencies)
uv sync
Ensure raw data is available in data/raw/, then run:
# 1. Retrieve World Bank data
uv run python -m src.migration.wdi_data
# 2. Merge with HDI dataset
uv run python -m src.migration.merge_data
Finally, execute notebooks in sequence:
# 1. notebooks/01-data-preparation-cleaning.ipynb
# 2. notebooks/02-exploratory-data-analysis.ipynb
# 3. notebooks/03-feature-engineering-modeling.ipynb
# 4. notebooks/04-model-interpretation-scenario-analysis.ipynb
# 5. notebooks/05-forecasting-validation.ipynb
Context:
Independent research on global migration forecasting using open data and interpretable machine learning.
Period:
2025
Author:
Golib Sanaev
GitHub: @gsanaev
Email: gsanaev@gmail.com
LinkedIn: golib-sanaev
- StackFuel – for supporting applied ML learning
- World Bank and UNDP – for open datasets
- scikit-learn, SHAP, and pandas communities – for transparent ML tools
This repository was created and maintained by Golib Sanaev, Data Scientist specializing in forecasting, econometrics, and applied machine learning.
If you find this project insightful, please give it a star!