gsanaev/forecasting-migration-flows-ml

🌍 Forecasting Migration Flows with Machine Learning 🚀

by Golib Sanaev

A reproducible data science pipeline to forecast international migration flows (1990–2023, extended to 2030) using demographic, economic, and human development indicators.
Built with Linear Regression and Random Forest models, and interpreted using SHAP explainability methods.

Python Version License: MIT Status


📘 View Results Online

You can explore the full analysis directly here:


📊 Project Overview

Author: Golib Sanaev · LinkedIn · GitHub

Problem Statement:
International migration is driven by intertwined economic, demographic, and social dynamics. Understanding these drivers and projecting future migration trends are essential for policy and planning.

Goal:
To build a reproducible forecasting pipeline that models net migration (per 1,000 people) for 168 countries (1990–2023) using open data from the World Bank and UNDP, and extends forecasts through 2030 under multiple socioeconomic scenarios.

Methods:

  • Automated data extraction from World Bank WDI API
  • Manual integration of UNDP Human Development Index (HDI)
  • Feature engineering (lags, caps, interactions, scaling)
  • Machine learning (Linear Regression, Random Forest)
  • Explainability with SHAP values
  • Forecasting with scenario-based inference and uncertainty intervals

🎯 Key Insights

  • 📈 Economic and demographic factors (GDP growth, unemployment, fertility) dominate migration variability.
  • 🔍 HDI and population growth contribute significantly to explaining migration intensity.
  • 🌍 Regional heterogeneity: Income groups and regional aggregates show distinct migration patterns.
  • 💡 Forecast performance is stable with strong temporal generalization (1990–2023).
  • 📅 Forecasting update: The pipeline now extends migration projections through 2030, generating baseline, high-growth, crisis, and demographic-pressure scenarios with 90 % prediction intervals and global/regional aggregation outputs.

πŸ“ Repository Structure

├── data/
│   ├── raw/                         # Original World Bank & UNDP data
│   │   ├── hdr-data.xlsx            # Manual UNDP HDI data
│   │   ├── wdi_data.csv             # Extracted via src/migration/wdi_data.py
│   │   └── wdi_metadata.csv         # WDI metadata
│   └── processed/                   # Cleaned and transformed datasets
│       ├── countries_clean.csv          # Country-level cleaned dataset
│       ├── aggregates_clean.csv         # Regional/income group aggregates
│       ├── countries_only.csv           # Filtered country subset (no aggregates)
│       ├── aggregates_only.csv          # Filtered aggregates subset
│       ├── dropped_countries.csv        # Countries removed due to missingness
│       ├── model_ready.csv              # Final dataset for ML training
│       ├── model_ready.parquet          # Optimized parquet version
│       ├── wdi_hdr.csv                  # Combined WDI + HDI merged dataset
│       └── .gitkeep
│
├── data_reserve/                    # Backup merged dataset
│   └── wdi_hdr_2025-10-13.csv
│
├── src/migration/
│   ├── wdi_data.py                  # Downloads WDI indicators
│   ├── merge_data.py                # Merges WDI & HDI data
│   └── __init__.py
│
├── notebooks/
│   ├── 01-data-preparation-cleaning.ipynb
│   ├── 02-exploratory-data-analysis.ipynb
│   ├── 03-feature-engineering-modeling.ipynb
│   ├── 04-model-interpretation-scenario-analysis.ipynb
│   └── 05-forecasting-validation.ipynb
│
├── models/                          # Trained models & artifacts
│   ├── random_forest_model.pkl          # Final trained Random Forest model
│   ├── X_columns.pkl                    # Feature column order used in training
│   ├── 03_rf_feature_importance.csv     # SHAP / permutation feature importance
│   ├── 03_results_summary.csv           # Cross-validation and training metrics
│   └── .gitkeep
│
├── outputs/                         # Evaluation and forecasting results
│   ├── backtest_metrics_by_fold.csv         # Fold-level performance metrics
│   ├── backtest_diagnostics_by_income.csv   # Metrics aggregated by income group
│   ├── backtest_oof_predictions.csv         # Out-of-fold predictions
│   ├── forecast_results_2024_2030.csv       # Clean future forecasts (2024–2030)
│   ├── forecast_global_trends.csv           # Global scenario mean trends
│   ├── residuals_vs_pred.png                # Residual plot visualization
│   └── .gitkeep
│
├── docs/                            # Executed HTML notebooks and figures
│   ├── 01-data-preparation-cleaning.html
│   ├── 02-exploratory-data-analysis.html
│   ├── 03-feature-engineering-modeling.html
│   ├── 04-model-interpretation-scenario-analysis.html
│   ├── 05-forecasting-validation.html
│   └── correlation_heatmap_country_level.png
│
├── DATA_INSTRUCTIONS.md             # How to download UNDP HDI data
├── README.md                        # Project documentation
└── pyproject.toml                   # uv project configuration

🔧 Technologies Used

Programming language:

  • Python 3.12

Core libraries:

  • pandas, numpy, matplotlib, seaborn
  • scikit-learn, shap, joblib
  • pathlib, tqdm, warnings

Tools:

  • JupyterLab
  • uv (for dependency and environment management)
  • Git & GitHub

📊 Data

Sources:

  • 🌐 World Bank – World Development Indicators (WDI)

    • Downloaded automatically via:
      uv run python -m src.migration.wdi_data
    • Produces wdi_data.csv and wdi_metadata.csv in data/raw/
  • 🧭 UNDP – Human Development Index (HDI)

    • Downloaded manually (see DATA_INSTRUCTIONS.md) and stored as data/raw/hdr-data.xlsx
  • 🔗 Merged Dataset (HDI + WDI)

    • Created using:
      uv run python -m src.migration.merge_data
    • Output: data/processed/wdi_hdr.csv
  • πŸ—ƒοΈ Backup (for reproducibility)

    • A reference copy is stored as data_reserve/wdi_hdr_2025-10-13.csv
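
As a rough illustration of the automated WDI extraction step, the sketch below builds a request URL for the public World Bank Indicators API (v2) and flattens its JSON response, which arrives as a two-element list of paging metadata followed by observation rows. It is not the code in src/migration/wdi_data.py: the function names are hypothetical, the choice of the net migration indicator SM.POP.NETM is an assumption, and the sample payload uses placeholder values rather than real data.

```python
from urllib.parse import urlencode

WDI_BASE = "https://api.worldbank.org/v2"


def build_wdi_url(indicator, start_year, end_year, countries="all", per_page=1000):
    """Build a World Bank API v2 URL for one indicator and a year range."""
    query = urlencode(
        {"format": "json", "date": f"{start_year}:{end_year}", "per_page": per_page},
        safe=":",  # keep the year range readable instead of percent-encoding ":"
    )
    return f"{WDI_BASE}/country/{countries}/indicator/{indicator}?{query}"


def parse_wdi_payload(payload):
    """Flatten a decoded API response into country/year/value records."""
    # Error responses carry a single element, so there may be no data rows.
    if len(payload) < 2 or payload[1] is None:
        return []
    return [
        {
            "country": row["countryiso3code"],
            "year": int(row["date"]),
            "value": row["value"],  # None when the observation is missing
        }
        for row in payload[1]
    ]


url = build_wdi_url("SM.POP.NETM", 1990, 2023)

# Placeholder payload with made-up values, shaped like a real response.
sample_payload = [
    {"page": 1, "pages": 1},
    [
        {"countryiso3code": "XXX", "date": "2020", "value": 123.0},
        {"countryiso3code": "XXX", "date": "2021", "value": None},
    ],
]
records = parse_wdi_payload(sample_payload)
```

Fetching the URL (for example with urllib.request) and looping over pages are left out so that the sketch runs without network access.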

🤖 Methodology

Data Preparation

  • Cleans and merges WDI and HDI datasets
  • Removes aggregates and incomplete countries
  • Converts net migration to per 1,000 people
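
The per-1,000 conversion in the last step is a simple normalization by population. A minimal sketch, assuming net migration is a head count and population is the total for the same year (the function name is hypothetical and the notebooks most likely do this column-wise in pandas):

```python
def net_migration_per_1000(net_migration: float, population: float) -> float:
    """Return net migration expressed per 1,000 inhabitants."""
    if population <= 0:
        raise ValueError("population must be positive")
    return net_migration * 1000.0 / population


# Illustrative values only: 50,000 net arrivals in a population of 10 million.
rate = net_migration_per_1000(50_000, 10_000_000)
```

Expressing the target as a rate keeps large and small countries on a comparable scale, which matters when a single model is trained across 168 countries.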

Exploratory Data Analysis (EDA)

  • Examines indicator distributions and missingness
  • Visualizes migration trends (global, regional, income-group)
  • Produces correlation heatmaps and outlier diagnostics

Feature Engineering & Modeling

  • Constructs target and feature matrices
  • Applies lags, interactions, and scaling
  • Implements time-aware validation splits
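
Two of the steps above, per-country lags and time-aware splits, can be sketched without any libraries. This is an illustration under assumptions, not the notebook code: the row layout (dicts with "country" and "year" keys), the column name gdp_growth, and both function names are hypothetical.

```python
from collections import defaultdict


def add_lag(rows, column, lag=1):
    """Add `<column>_lag<lag>` to each row, computed within each country.

    The lag is looked up by calendar year, so a gap in the series yields None
    instead of silently borrowing a value from an earlier year.
    """
    by_country = defaultdict(dict)
    for row in rows:
        by_country[row["country"]][row["year"]] = row[column]
    lag_name = f"{column}_lag{lag}"
    return [
        {**row, lag_name: by_country[row["country"]].get(row["year"] - lag)}
        for row in rows
    ]


def expanding_window_splits(years, first_test_year):
    """Yield (train_years, test_year): train on all years before the test year."""
    ordered = sorted(set(years))
    for test_year in ordered:
        train = [y for y in ordered if y < test_year]
        if test_year >= first_test_year and train:
            yield train, test_year


rows = [
    {"country": "AAA", "year": 2000, "gdp_growth": 1.0},
    {"country": "AAA", "year": 2001, "gdp_growth": 2.0},
    {"country": "BBB", "year": 2001, "gdp_growth": 3.0},
]
lagged = add_lag(rows, "gdp_growth")
splits = list(expanding_window_splits(range(2018, 2022), first_test_year=2020))
```

Splitting by year rather than by random row keeps every test observation strictly later than its training data, which avoids leaking future information into the model.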

Model Interpretation & Scenario Analysis

  • Trains and interprets Random Forest using SHAP values
  • Runs economic and demographic what-if scenarios
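
A what-if scenario of this kind amounts to shifting selected inputs and comparing the new prediction with the baseline. The sketch below assumes additive shocks and uses a stand-in linear function with made-up coefficients in place of the trained Random Forest; the feature names, shock sizes, and function names are all hypothetical.

```python
def apply_scenario(features, shocks):
    """Return a copy of `features` with additive shocks applied to named columns."""
    unknown = set(shocks) - set(features)
    if unknown:
        raise KeyError(f"unknown feature(s): {sorted(unknown)}")
    return {name: value + shocks.get(name, 0.0) for name, value in features.items()}


def scenario_effects(predict, features, scenarios):
    """Predicted change relative to the baseline for each named scenario."""
    baseline = predict(features)
    return {
        name: predict(apply_scenario(features, shocks)) - baseline
        for name, shocks in scenarios.items()
    }


def toy_predict(x):
    """Stand-in model with made-up coefficients, used only for illustration."""
    return 0.5 * x["gdp_growth"] - 0.25 * x["unemployment"]


features = {"gdp_growth": 2.0, "unemployment": 6.0}
scenarios = {
    "high_growth": {"gdp_growth": 2.0},
    "crisis": {"gdp_growth": -4.0, "unemployment": 4.0},
}
effects = scenario_effects(toy_predict, features, scenarios)
```

Any callable that maps a feature dict to a number can be passed as `predict`, so a fitted model's prediction method could be wrapped in the same way.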

Forecasting & Validation

  • Extends migration forecasts through 2030
  • Uses expanding-window and rolling-origin temporal validation
  • Estimates 90 % empirical prediction intervals from residuals
  • Generates baseline, growth, crisis, and demographic pressure scenarios
  • Exports clean global and regional forecast artifacts
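
One common way to build empirical prediction intervals from residuals is to add the lower and upper residual quantiles to the point forecast. The sketch below follows that approach under stated assumptions: residuals are defined as actual minus predicted, quantiles use linear interpolation, and one pooled residual distribution is used for all countries. Whether the notebook pools residuals or interpolates in the same way is not stated here, and the function names are hypothetical.

```python
def empirical_quantile(values, q):
    """Linear-interpolation quantile of `values` for 0 <= q <= 1."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    weight = position - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight


def prediction_interval(point_forecast, residuals, level=0.90):
    """Shift the point forecast by the residual quantiles that bracket `level`."""
    tail = (1 - level) / 2
    return (
        point_forecast + empirical_quantile(residuals, tail),
        point_forecast + empirical_quantile(residuals, 1 - tail),
    )


def coverage(actuals, intervals):
    """Share of actual values that fall inside their interval."""
    hits = sum(low <= y <= high for y, (low, high) in zip(actuals, intervals))
    return hits / len(actuals)


# Illustrative residuals (actual minus predicted), not taken from the project.
residuals = [float(r) for r in range(-10, 11)]
low, high = prediction_interval(2.0, residuals)
share_covered = coverage([0.0, 20.0], [(low, high), (low, high)])
```

Comparing `coverage` on held-out folds with the nominal level shows whether the intervals are too narrow or too wide.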

📈 Results Summary

Metric               Cross-Validation
MAE                  ~3.0–3.5
RMSE                 ~5.5–6.5
R²                   0.60–0.65
90 % PI coverage     ≈ 0.75
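
For reference, the three error metrics in the table follow their standard definitions. The plain-Python sketch below spells them out on illustrative numbers; the project itself most likely relies on the equivalent scikit-learn functions, which is an assumption.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Illustrative values only, not project results.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = {
    "mae": mae(actual, predicted),
    "rmse": rmse(actual, predicted),
    "r2": r2(actual, predicted),
}
```

MAE and RMSE are in the units of the target (net migration per 1,000 people), and RMSE exceeds MAE when a few large errors dominate, as in the table above.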

Interpretation:
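The Random Forest model shows stable performance across temporal validation folds and explains roughly 60–65 % of the variance in net migration rates.
Accuracy is broadly consistent across income groups and is best among high-income economies.
Empirical coverage of the 90 % prediction intervals is about 0.75, below the nominal level, so the intervals are somewhat too narrow.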

Top predictive drivers:
GDP growth, unemployment, population growth, HDI, and fertility rate.

Most variable regions:
Sub-Saharan Africa, Middle East & North Africa, and Europe & Central Asia.

Forecasting results (2024–2030):
Baseline forecasts show stable global migration inflows, while high-growth and crisis scenarios diverge moderately, reflecting macroeconomic sensitivity and demographic pressures.


🚀 Reproducibility

Setup

# Clone repository
git clone https://github.com/gsanaev/forecasting-migration-flows-ml.git
cd forecasting-migration-flows-ml

# Sync environment (installs Python and dependencies)
uv sync

Execution

Ensure the manually downloaded UNDP HDI file (hdr-data.xlsx, see DATA_INSTRUCTIONS.md) is available in data/raw/, then run:

# 1. Retrieve World Bank data
uv run python -m src.migration.wdi_data

# 2. Merge with HDI dataset
uv run python -m src.migration.merge_data

Finally, execute notebooks in sequence:

# 1. notebooks/01-data-preparation-cleaning.ipynb
# 2. notebooks/02-exploratory-data-analysis.ipynb
# 3. notebooks/03-feature-engineering-modeling.ipynb
# 4. notebooks/04-model-interpretation-scenario-analysis.ipynb
# 5. notebooks/05-forecasting-validation.ipynb

🎓 About This Project

Context:
Independent research on global migration forecasting using open data and interpretable machine learning.

Period:
2025

Author:
Golib Sanaev


📞 Contact

GitHub: @gsanaev
Email: gsanaev@gmail.com
LinkedIn: golib-sanaev


πŸ™ Acknowledgements

  • StackFuel β€” for supporting applied ML learning
  • World Bank and UNDP β€” for open datasets
  • scikit-learn, SHAP, and pandas communities β€” for transparent ML tools

This repository was created and maintained by Golib Sanaev, Data Scientist specializing in forecasting, econometrics, and applied machine learning.

⭐ If you find this project insightful, please give it a star!

About

Forecasting international migration flows using Machine Learning and explainable AI methods (SHAP, regression, feature analysis).
