Skip to content

michizler/data-science-projects

Repository files navigation

Data Science & Analytics Portfolio

Turning complex data into decisions that matter — across finance, construction, telecommunications, and macroeconomics.

R Python Power BI React Scikit-learn FastAPI


About This Portfolio

This repository brings together four end-to-end data science projects spanning different industries, techniques, and business problems. Each project follows a consistent approach: start with a real business question, explore and prepare the data rigorously, build and validate models against statistical standards, and translate the results into actionable recommendations with quantified impact.

The work covers the full analytics spectrum — from interactive dashboards and hypothesis testing through regression modelling, time series forecasting, and machine learning classification deployed via REST APIs.

What You'll Find

Domain Techniques Deliverables
Macroeconomics Dashboard design, DAX, data modelling Interactive Power BI report
Construction MLR, logistic regression, hypothesis testing Predictive formula, interactive calculator
Telecommunications Decision trees, RFM analysis, FastAPI deployment Deployed churn prediction API
Finance Holt-Winters, ARIMA, ARIMA+GARCH 365-day stock price forecast with volatility

Projects

1. 🌍 Global Economic Outlook — Interactive Dashboard (2001–2020)

Exploring 20 years of economic performance across 180+ countries using the IMF World Economic Outlook dataset.

Power BI

📖 Expand Full Project Details

The Problem

A thinktank scenario required an analytical tool to help policymakers and the general public understand patterns of economic performance at the country and country-group level. The raw IMF dataset — 44 indicators across 180+ countries over 20 years — was too dense for manual analysis.

What I Built

An interactive Power BI dashboard built on a star schema data model with one fact table and two dimension tables. The dashboard presents six headline indicators through multiple coordinated visualisations, all responsive to country, region, income group, and year-range filters.

Data Model

┌──────────────────┐       ┌──────────────────┐
│   Countries       │       │   Indicators      │
│ (Dimension)       │       │ (Dimension)       │
├──────────────────┤       ├──────────────────┤
│ CountryKey (PK)   │       │ Indicator Code    │
│ ISO               │◄──┐   │ Description       │
│ Country           │   │   └────────┬─────────┘
│ Region            │   │            │
│ Income Group      │   │            │ 1:*
└──────────────────┘   │   ┌────────▼─────────┐
              1:* (ISO) └───┤  Economic Data     │
                            │  (Fact)            │
                            └──────────────────┘

Key Visualisations

  • KPI Cards with Sparklines — six headline indicators with embedded trendlines and colour-coded cues
  • Top 5 Bar Chart — dynamically ranks best-performing countries by GDP growth
  • Interactive Line Chart — tracks any indicator over time via tile slicer switching
  • Radar Chart — multi-dimensional economic profile powered by a SWITCH DAX measure
  • Flexible Slicers — Country, Region, Income Group, Year Range

DAX Measures

Custom measures for each indicator (card callout + sparkline pairs), a DATATABLE-based radar chart, and a field parameter for dynamic line chart switching.

Tech Stack

Power BI · Power Query · DAX · Star Schema · Excel

Data

Source Records Variables Period
IMF World Economic Outlook 180+ countries 44 indicators 2001–2020

📂 View Project →


2. 🏗️ Predicting Concrete Compressive Strength

Reducing material costs by £21,600 per project through regression analysis on 1,030 historical mix trials.

📖 Expand Full Project Details

The Problem

StrataForge Construction Materials Ltd., a UK-based firm specialising in commercial foundations and high-load structural slabs, had accumulated 1,030 concrete mix trial records but still relied on experience-based decision making. Conservative cement-heavy mixes were inflating costs and carbon emissions, while occasional 28-day strength failures triggered expensive rework cycles.

The business needed answers to three questions: What drives compressive strength? Does fly ash reduce performance? Can we predict strength before pouring?

What I Built

A complete statistical analysis pipeline in R, progressing through 12 candidate regression models via forward stepwise selection, a logistic regression classifier for fly ash detection, and three hypothesis tests that proved fly ash can safely replace cement.

The Final Model

concrete_strength = 23.914
                  + 0.0974 × cement
                  − 2.545 × ln(superplasticizer)
                  − 0.2374 × water
                  + 9.759 × ln(age)
                  + 0.0683 × slag
Metric Value
81.35%
All coefficients significant p < 0.001
VIF (multicollinearity) All between 1.01 – 1.49
Assumptions passed 5 / 5

Key Findings

Finding Evidence Business Impact
Cement and age are strongest drivers Highest positive coefficients Optimise cement dosage based on required curing time
Water reduces strength β = −0.2374 Minimise water content; use superplasticizer for workability
Fly ash does NOT reduce strength Kruskal-Wallis p = 0.2324 Confidently substitute cement with cheaper, greener fly ash
Concrete category irrelevant Kruskal-Wallis p = 0.3364 Coarse vs fine texture has no effect on strength

Quantified Business Impact

Metric Before (Conservative) After (Optimised)
Cement per m³ 400 kg 310 kg
Slag per m³ 0 kg 100 kg
Cement cost per m³ £48.00 £37.20
Total cost (2,000 m³ project) £96,000 £74,400
Saving £21,600 (22.5%)
CO₂ reduction ~180 tonnes

Deliverables

  • Multiple Linear Regression model (R² = 81.35%)
  • Logistic regression fly ash classifier (AIC = 145.12)
  • Three hypothesis tests (Kruskal-Wallis, Chi-Square)
  • Interactive React/Vite presentation with live strength calculator

Tech Stack

R · ggplot2 · dplyr · readxl · corrplot · car · caret · RVAideMemoire · React · Vite

Data

Source Records Variables Type
StrataForge materials lab 1,030 mix trials 9 continuous Clean, no missing values

📂 View Project →


3. 📡 Predicting Customer Attrition — Reder Telecommunications

Building and deploying a 97%-accuracy churn prediction model via FastAPI for a telecom provider with 15% annual attrition.

📖 Expand Full Project Details

The Problem

Reder Telecommunications, a Norwegian telecom provider headquartered in Oslo, was losing customers at a rate of 15% annually — with acquisition costs 5–7x higher than retention. The company had over 2,000 customer records with 20+ attributes (many stored as nested JSON), but no predictive capability to identify at-risk customers before they left.

Four key obstacles compounded the problem: limited per-segment analytics, pricing inconsistencies across identical usage patterns, no proactive outage management, and purely reactive retention — interventions only after intent to leave was signalled.

What I Built

An end-to-end machine learning pipeline from raw nested-JSON data to a deployed FastAPI prediction endpoint:

  1. Data Preprocessing — parsed 9 nested JSON columns using ast.literal_eval, normalised into separate DataFrames, merged via CustomerID (12,483 rows × 55 columns)
  2. Feature Engineering — RFM analysis (Recency, Frequency, Monetary) with quintile binning, one-hot/label/target encoding
  3. Feature Selection — Mutual Information scoring identified top 20 predictors; payment behaviour features dominated
  4. Model Training — Logistic Regression baseline (97% accuracy), then Decision Tree with RandomizedSearchCV hyperparameter tuning (4-fold CV)
  5. Deployment — FastAPI REST API with Pydantic validation, serialised model via pickle

Top Features by Mutual Information

Rank Feature MI Score
1 late_payment_rate 0.97
2 payment_risk_score 0.95
3 total_late_payments 0.93
4 total_interactions 0.58
5 TimeSpent(minutes) 0.57
6 NPS 0.52

Payment behaviour is the single strongest predictor of churn — more than engagement, satisfaction, or demographics.

Model Performance

Metric Logistic Regression Decision Tree (Tuned)
Accuracy 97.0% 96.9%
Precision ~97% ~97%
Recall ~97% ~97%
F1-Score ~97% ~97%

Best hyperparameters: criterion='log_loss', splitter='random', max_depth=300, min_samples_split=6, min_samples_leaf=4, max_features='log2'

API Deployment

# POST /predict
{
  "records": [{ "late_payment_rate": 0.4, "NPS": 3, ... }]
}

# Response
{
  "prediction": 1,        # 1 = will churn
  "probability": 0.847    # 84.7% confidence
}

Cloud Cost Analysis

Provider Estimated Monthly Best For
GCP Cloud Run $30–75 POC (pay-per-request)
Azure App Service $45–90 Enterprise integration
AWS ECS Fargate $55–95 Max flexibility

Deliverables

  • Trained Decision Tree classifier (model.pkl)
  • FastAPI prediction endpoint (app.py)
  • RFM customer segmentation
  • Cloud deployment cost comparison (AWS vs Azure vs GCP)
  • Interactive React/Vite presentation

Tech Stack

Python · Pandas · NumPy · Scikit-learn · FastAPI · Pydantic · Uvicorn · Pickle · Jupyter · React · Vite

Data

Source Records Raw Variables Engineered Features
Reder Telecom CRM 2,000+ customers 20+ (incl. nested JSON) 55 after normalisation

📂 View Project →


4. 📈 Time Series Modelling for Share Price Prediction — US Bancorp

Forecasting USB stock prices using an iterative approach through Holt-Winters, ARIMA, and ARIMA-GARCH — demonstrating why financial data requires specialised treatment.

📖 Expand Full Project Details

The Problem

Financial markets exhibit volatility clustering — periods where large price swings follow large swings, and calm follows calm. This violates the constant-variance assumption of standard forecasting methods. An investment firm needed a model that could predict not just the expected price direction, but the range of uncertainty around those forecasts.

What I Built

Three progressively sophisticated models fitted to 47 years (11,835 daily observations) of US Bancorp stock data, with rigorous diagnostic testing at each stage:

Model Ljung-Box p-value Verdict
Holt-Winters (α=0.942, γ=FALSE) < 2.2e-16 ✗ Residuals autocorrelated
ARIMA(1,1,0) via auto.arima < 2.2e-16 ✗ Same volatility problem
ARIMA(1,0,1) + sGARCH(1,1) 0.2369 ✓ White noise residuals

The iterative failure-diagnosis-improvement cycle is intentional — it builds a rigorous justification for why GARCH is necessary on financial data.

The Breakthrough: GARCH

By modelling on log returns instead of raw prices and adding a sGARCH(1,1) conditional variance equation via the rugarch package, the model captured both the mean dynamics and time-varying volatility.

Parameter Value Interpretation
μ 0.000603 Mean daily return (~16% annualised)
ar1 −0.7185 Autoregressive coefficient
ma1 0.7068 Moving average coefficient
α₁ 0.0938 ARCH effect (impact of previous shock)
β₁ 0.8924 GARCH persistence (volatility memory)
α₁ + β₁ 0.986 Highly persistent volatility

All parameters significant at p < 0.01. Log-Likelihood: 34,139.66.

365-Day Forecast

Horizon Predicted Price Change
Last observed $31.93
Day 90 ~$33.73 +5.6%
Day 180 ~$35.39 +10.8%
Day 365 ~$39.79 +24.6%

Unlike Holt-Winters and ARIMA, the GARCH model also provides a volatility forecast (sigma) at each step — essential for confidence intervals and risk management.

Deliverables

  • Three fitted models with full diagnostic output
  • 365-day price forecast with return-to-price conversion
  • Volatility (sigma) forecast showing decay from 6.7% to 1.8%
  • Model performance comparison screenshots
  • Interactive React/Vite presentation (17 slides)

Tech Stack

R · quantmod · xts · forecast · rugarch · highcharter · tseries · ggplot2 · React · Vite

Data

Source Records Variables Period
Yahoo Finance (USB) 11,835 daily observations 7 (OHLCV + Adj Close) May 1973 – Apr 2020

📂 View Project →


Skills & Tools

Languages & Frameworks

Category Technologies
Statistical Computing R (ggplot2, dplyr, forecast, rugarch, car, caret, corrplot)
Machine Learning Python (Scikit-learn, Pandas, NumPy, Jupyter)
API Development FastAPI, Pydantic, Uvicorn
Frontend React, Vite, JavaScript/JSX
Dashboarding Power BI (DAX, Power Query, Star Schema)

Techniques

Area Methods
Regression Multiple Linear Regression, Logistic Regression, Forward Stepwise Selection
Classification Decision Trees, Logistic Regression, RandomizedSearchCV, Mutual Information
Time Series Holt-Winters, ARIMA, sGARCH, Decomposition, ADF Testing, Ljung-Box
Hypothesis Testing Kruskal-Wallis, Chi-Square, Shapiro-Wilk, VIF
Feature Engineering RFM Analysis, Log Transformations, Target/One-Hot/Label Encoding
Data Preparation JSON Normalisation, Outlier Analysis, Correlation Matrices
Deployment REST APIs, Model Serialisation (Pickle), Cloud Cost Analysis (AWS/Azure/GCP)

Repository Structure

data-science-projects/
│
├── imf-globaleconomy-analytics/        # Power BI dashboard (IMF 2001–2020)
│   ├── dashboard/                       # .pbix file + live URL
│   ├── dax-measures/                    # DAX formulas for all visuals
│   ├── source-data/                     # IMF Excel datasets
│   └── report-documentation/            # Full project report (PDF)
│
├── strataforge-prediction/             # Concrete strength prediction (R)
│   ├── models/                          # Linear + logistic regression scripts
│   ├── hypothesis-tests/                # Kruskal-Wallis, Chi-Square
│   ├── preprocessing/                   # Data loading & transformation
│   ├── source-data/                     # 1,030 mix trial records
│   ├── strataforge-presentation/        # Interactive React slide deck
│   └── report-documentation/            # Full project report (PDF)
│
├── reder-analytics/                    # Telecom churn prediction (Python)
│   ├── preprocessing/                   # JSON parsing, normalisation, RFM
│   ├── model/                           # Training, tuning, serialisation
│   ├── app.py                           # FastAPI prediction endpoint
│   ├── reder-presentation/              # Interactive React slide deck
│   └── report-documentation/            # Full project report (PDF)
│
├── usb-stock-timeseries-model/         # Share price forecasting (R)
│   ├── eda/                             # Exploration & decomposition
│   ├── models/                          # Holt-Winters, ARIMA, GARCH scripts
│   ├── public/model-performance/        # Diagnostic plot screenshots
│   ├── source-data/                     # USB.csv (11,835 records)
│   ├── usbforecast-presentation/        # Interactive React slide deck
│   └── report-documentation/            # Full project report (PDF)
│
└── README.md                           # ← You are here

What Ties These Projects Together

Each project follows the same disciplined approach, regardless of domain:

1. Start with the business question. Every project begins with a real problem facing a real organisation — not a dataset looking for a use case. The StrataForge project asks "can we cut cement costs without sacrificing strength?" The Reder project asks "which customers will leave, and can we stop them?" The question shapes every subsequent decision.

2. Let the data dictate the method. The USB project is the clearest example: Holt-Winters and ARIMA both failed because the data exhibited volatility clustering. Rather than forcing a method, the iterative diagnostic process revealed why it failed and what was needed — leading to GARCH as a principled solution, not an arbitrary choice.

3. Validate rigorously. No model is presented without diagnostic evidence. The concrete regression passed all five classical assumptions (linearity, independence, normality, homoscedasticity, no multicollinearity). The GARCH model passed the Ljung-Box test where two predecessors failed. The churn classifier was evaluated on precision, recall, F1, and accuracy. Trust is earned through evidence.

4. Quantify the impact. Results are translated into the language the business cares about. Not just "R² = 81.35%" but "£21,600 saved per project." Not just "97% accuracy" but "5–7x cheaper than acquiring a new customer." Not just "p = 0.2369" but "this is the only model whose residuals are indistinguishable from white noise."

5. Make it accessible. Every project includes an interactive React/Vite presentation that communicates the full analysis to non-technical stakeholders — complete with live calculators, model comparisons, and worked cost examples.


Contact

Open to opportunities in data science, analytics engineering, and quantitative modelling.

📧 Email · 💼 LinkedIn · 🐙 GitHub

About

Data science portfolio spanning finance, construction, telecommunications, and macroeconomics — featuring regression modelling, time series forecasting, ML classification, and interactive dashboards.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors