OULAD Student Success Analytics Platform (Flagship v2)

🚀 A cloud-ready, production-style analytics platform that transforms exploratory notebook work into a modular, CI-validated, containerized student retention system.

This repository demonstrates:

  • End-to-end ETL → Time-aware ML → SHAP explainability
  • Postgres-backed BI marts
  • Threshold & spike alerting
  • Offline A/B experiment simulation
  • ROI sensitivity modeling
  • Dockerized execution + CI quality gates

The original exploratory notebook is preserved: oulad-student-success-prediction.ipynb.


⭐ What Makes This Different

Unlike typical ML portfolio projects, this repository:

  • Separates exploration from production-grade code
  • Implements time-based validation to prevent data leakage
  • Publishes BI-ready database marts
  • Includes alerting and experiment planning
  • Enforces formatting, linting, and CI gates
  • Supports containerized and cloud-ready deployment

⚡ 2-Minute Reviewer Overview

One command runs the full workflow:

ETL → Feature Engineering → Time-aware Model → SHAP → BI Marts → Alerts → A/B Simulation → ROI → Executive Summary

Demo mode ensures reproducibility even without raw OULAD data.

Executive Overview

This repository presents a production-style student retention analytics platform built on the OULAD (Open University Learning Analytics Dataset). It transforms exploratory notebook work into a modular pipeline with reproducible execution, database-backed marts, alerting, experimentation, and executive reporting.


Business Context (Student Retention)

Universities lose tuition revenue, and students suffer worse outcomes, when at-risk learners are not identified early. This platform converts weekly behavior and assessment signals into:

  • risk scoring for intervention teams,
  • operational threshold/spike alerting,
  • offline experiment planning support,
  • ROI sensitivity analysis for budget decisions.

Architecture Diagram

Local / Container Runtime

Raw OULAD CSVs (or Demo Generator)
          |
          v
      ETL Layer
          |
          v
 Feature Engineering (time-aware)
          |
          v
Model Train/Eval + SHAP Explain
          |
          v
Prediction + BI Marts + Alerts
          |
          v
A/B Simulation + ROI + Executive Report
          |
          v
Postgres (primary) / SQLite (fallback)

Production AWS Reference Architecture

[S3 Raw Zone] --> [ECS Scheduled Pipeline Task] --> [RDS Postgres Marts]
       |                    |                           |
       |                    v                           v
       |             [CloudWatch Logs/Alarms]     [Power BI Service]
       |
       +--> [S3 Artifacts: metrics, SHAP, reports]

Data Pipeline Design

Entrypoint: src/pipeline.py

  1. Extract OULAD data (src/etl/extract.py), with DEMO MODE fallback when raw files are missing.
  2. Transform weekly student-level event data (src/etl/transform.py).
  3. Load processed data and initialize DB schema (src/etl/load.py).
  4. Build time-aware features (src/features/build_features.py).
  5. Train risk model using week-based split (src/model/train.py).
  6. Evaluate performance (src/model/evaluate.py).
  7. Generate SHAP explainability artifacts (src/model/explain.py).
  8. Score full weekly history and publish marts (src/model/predict.py, src/marts/build_marts.py).
  9. Trigger alerts and log them (src/alerts/alert.py).
  10. Run offline A/B simulation + ROI grid (src/experiments/ab_simulation.py).
  11. Produce executive summary (reports/executive_summary.md).
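
Each stage is a plain module call, so the orchestrator itself stays small. A minimal sketch of how such an entrypoint could chain the stages (the .run(...) names and signatures here are illustrative assumptions, not the repo's exact API):

# Hypothetical orchestration sketch; see src/pipeline.py for the real entrypoint.
from src.etl import extract, transform, load
from src.features import build_features
from src.model import train, evaluate, explain, predict
from src.marts import build_marts
from src.alerts import alert
from src.experiments import ab_simulation

def run_pipeline(demo: bool = False) -> None:
    raw = extract.run(demo=demo)            # DEMO MODE fallback if raw files are missing
    weekly = transform.run(raw)             # weekly student-level events
    load.run(weekly)                        # init DB schema + load processed data
    features = build_features.run(weekly)   # time-aware features
    model = train.run(features)             # week-based train/test split
    evaluate.run(model, features)
    explain.run(model, features)            # SHAP / importance artifacts
    scores = predict.run(model, features)   # score full weekly history
    build_marts.run(scores)                 # publish BI marts
    alert.run(scores)                       # threshold + spike alerts
    ab_simulation.run(scores)               # offline A/B + ROI grid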

Time-Based ML Validation

To avoid leakage from random splits, the model is evaluated with strict time ordering:

  • Train: week < SPLIT_WEEK
  • Test: week >= SPLIT_WEEK

This approximates real operations where future student behavior must be predicted from historical data.

Marts are generated as a full-history time series across all available weeks (0..max week) per run date. CURRENT_WEEK is an optional snapshot override used only for alerts and experiments.
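
A minimal pandas sketch of the split rule, assuming the feature table carries a week column (the column name is an assumption):

import pandas as pd

SPLIT_WEEK = 7  # mirrors the SPLIT_WEEK environment variable

def time_split(df: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    # Train strictly on earlier weeks; test on SPLIT_WEEK onward.
    train = df[df["week"] < SPLIT_WEEK]
    test = df[df["week"] >= SPLIT_WEEK]
    return train, test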

Explainability

  • sklearn backend: SHAP artifacts are generated (outputs/shap_top_features.json, outputs/shap_summary.png).
  • pytorch/tensorflow backends: permutation importance is generated and written to outputs/shap_top_features.json with the same JSON schema as sklearn.
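
The permutation-importance fallback is straightforward with scikit-learn; a hedged sketch that writes the shared JSON schema (the feature/importance key names are an assumption):

import json
from sklearn.inspection import permutation_importance

def top_features(model, X, y, out="outputs/shap_top_features.json", k=10):
    # X is assumed to be a pandas DataFrame so feature names come from X.columns.
    result = permutation_importance(model, X, y, n_repeats=5, random_state=0)
    ranked = sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1])[:k]
    with open(out, "w") as f:
        json.dump([{"feature": n, "importance": float(s)} for n, s in ranked], f, indent=2)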

Alerting System

Implemented alerts:

  • Threshold alert: high-risk share exceeds configurable threshold.
  • Spike alert: week-over-week mean risk increase exceeds configured percentage.

Alerts are saved to outputs/alerts/alert_latest.md and inserted into alert_log.
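
A minimal sketch of both rules over a frame of weekly risk scores (the risk/week column names and the 0.5 high-risk cutoff are illustrative assumptions):

import pandas as pd

HIGH_RISK_THRESHOLD = 0.25       # mirrors HIGH_RISK_THRESHOLD
RISK_SPIKE_THRESHOLD_PCT = 0.10  # mirrors RISK_SPIKE_THRESHOLD_PCT

def check_alerts(scores: pd.DataFrame, week: int) -> list[str]:
    alerts = []
    current = scores[scores["week"] == week]
    high_share = (current["risk"] >= 0.5).mean()   # assumed high-risk cutoff
    if high_share > HIGH_RISK_THRESHOLD:
        alerts.append(f"Threshold alert: high-risk share {high_share:.1%}")
    prev_mean = scores.loc[scores["week"] == week - 1, "risk"].mean()
    curr_mean = current["risk"].mean()
    if prev_mean and (curr_mean - prev_mean) / prev_mean > RISK_SPIKE_THRESHOLD_PCT:
        alerts.append(f"Spike alert: mean risk up {(curr_mean - prev_mean) / prev_mean:.1%} week-over-week")
    return alerts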

Experiment Framework

Offline experiment simulation on top-K at-risk students:

  • seeded control/treatment assignment,
  • uplift scenarios: 3%, 5%, 8%,
  • bootstrap confidence intervals,
  • two-proportion z-test,
  • persisted experiment rows in experiment_results.
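
The z-test itself is a few lines; a self-contained sketch using scipy (pass counts and group sizes are the inputs you would take from experiment_results):

from math import sqrt
from scipy.stats import norm

def two_prop_ztest(pass_c: int, n_c: int, pass_t: int, n_t: int) -> float:
    # Two-sided p-value for H0: control and treatment pass rates are equal.
    p_pool = (pass_c + pass_t) / (n_c + n_t)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = (pass_t / n_t - pass_c / n_c) / se
    return 2 * norm.sf(abs(z))

# Illustrative call with 1,000 students per arm:
# two_prop_ztest(pass_c=400, n_c=1000, pass_t=450, n_t=1000)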

ROI Modeling

ROI sensitivity grid is generated via:

ROI = incremental_passes * value_per_pass - intervention_cost

Output: reports/roi_sensitivity.csv.
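
A sketch of how the grid could be produced from this formula (all grid values below are illustrative, not the repo's defaults):

import pandas as pd

rows = []
for uplift in (0.03, 0.05, 0.08):              # planning-assumption uplift scenarios
    for value_per_pass in (500, 1000, 2000):   # illustrative value of one extra pass
        for cost in (5_000, 10_000):           # illustrative intervention budgets
            incremental_passes = uplift * 1_000  # assumes 1,000 treated students
            rows.append({
                "uplift": uplift,
                "value_per_pass": value_per_pass,
                "intervention_cost": cost,
                "roi": incremental_passes * value_per_pass - cost,
            })
pd.DataFrame(rows).to_csv("reports/roi_sensitivity.csv", index=False)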

Validation

Recommended local quality gates:

python -m compileall src
ruff check src tests
black --check src tests
pytest -q
python -m src.pipeline --demo

These same checks are wired into .github/workflows/daily_pipeline.yml.

Assumptions & Limitations

  • A/B results are an offline simulation, not causal proof from live experimentation.
  • Uplift scenarios (3%, 5%, 8%) are planning assumptions.
  • Synthetic demo mode is for reproducibility and CI, not for production decisions.
  • Model and feature definitions are intentionally lightweight for portfolio demonstration.

Outputs (Key Files)

Core metrics & predictions

  • outputs/metrics_latest.json
  • outputs/predictions_latest.csv (runtime output; not tracked)
  • outputs/shap_top_features.json

Marts (BI-ready samples)

  • outputs/marts/student_risk_daily_sample.csv
  • outputs/marts/course_summary_daily_sample.csv

Alerts, experiments, and reports

  • outputs/alerts/alert_latest.md
  • outputs/experiments/assignment_latest.csv
  • reports/ab_test_report.md
  • reports/roi_sensitivity.csv
  • reports/executive_summary.md

Cloud Deployment Plan

  • Storage abstraction in src/storage.py: LocalStorage + real S3Storage (boto3).
  • Database: Postgres by DATABASE_URL, SQLite fallback for local portability.
  • Compute: Dockerized app service (docker/Dockerfile, docker-compose.yml).
  • Observability: structured logs ready for CloudWatch-style ingestion.
  • BI layer: marts aligned for Power BI connectivity.
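
A minimal sketch of that storage seam (the method names are assumptions; boto3's upload_file call is standard):

import shutil
from pathlib import Path
import boto3

class LocalStorage:
    def __init__(self, root: str = "outputs"):
        self.root = Path(root)

    def put(self, src: str, key: str) -> str:
        dest = self.root / key
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy(src, dest)
        return str(dest)

class S3Storage:
    def __init__(self, bucket: str, prefix: str = ""):
        self.bucket, self.prefix = bucket, prefix
        self.client = boto3.client("s3")

    def put(self, src: str, key: str) -> str:
        full_key = f"{self.prefix}/{key}".lstrip("/")
        self.client.upload_file(src, self.bucket, full_key)  # standard boto3 upload
        return f"s3://{self.bucket}/{full_key}"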

How to Run

Postgres-backed demo flow (recommended)

  1. Start Postgres:
docker compose up -d postgres
  2. Install dependencies:
pip install -r requirements.txt
  3. Point the pipeline to Postgres (the pipeline falls back to SQLite only when DATABASE_URL is unset):
export DATABASE_URL=postgresql://oulad:oulad@localhost:5432/oulad_analytics
  4. Run the full demo pipeline and publish marts:
make run
  5. Verify tables and row counts:
make verify-postgres

Local SQLite fallback

If DATABASE_URL is not set, the same make run command writes to local SQLite at data/processed/pipeline.db.

Optional Deep Learning Backends (PyTorch / TensorFlow)

Sklearn remains the default baseline. PyTorch/TensorFlow are optional and only used when MODEL_BACKEND is set explicitly.

# PyTorch backend
pip install -r requirements-pt.txt
MODEL_BACKEND=pytorch python -m src.pipeline --demo

# TensorFlow backend
pip install -r requirements-tf.txt
MODEL_BACKEND=tensorflow python -m src.pipeline --demo

PyTorch installation may vary by OS/CUDA; if needed, use the official PyTorch install command for your platform.

Optional environment variables

  • DATABASE_URL
  • PIPELINE_DEMO_MODE=true|false
  • SPLIT_WEEK=7
  • CURRENT_WEEK=<optional snapshot week override>
  • HIGH_RISK_THRESHOLD=0.25
  • RISK_SPIKE_THRESHOLD_PCT=0.10
  • MODEL_BACKEND=sklearn|pytorch|tensorflow
  • STORAGE_BACKEND=local|s3
  • AWS_REGION=us-east-1
  • S3_BUCKET=<your-bucket>
  • S3_PREFIX=oulad-artifacts
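
A sketch of how these might be read with safe defaults (assuming src/config.py follows the usual os.getenv pattern):

import os

DATABASE_URL = os.getenv("DATABASE_URL")                     # unset -> SQLite fallback
DEMO_MODE = os.getenv("PIPELINE_DEMO_MODE", "false").lower() == "true"
SPLIT_WEEK = int(os.getenv("SPLIT_WEEK", "7"))
CURRENT_WEEK = os.getenv("CURRENT_WEEK")                     # optional snapshot override
HIGH_RISK_THRESHOLD = float(os.getenv("HIGH_RISK_THRESHOLD", "0.25"))
RISK_SPIKE_THRESHOLD_PCT = float(os.getenv("RISK_SPIKE_THRESHOLD_PCT", "0.10"))
MODEL_BACKEND = os.getenv("MODEL_BACKEND", "sklearn")
STORAGE_BACKEND = os.getenv("STORAGE_BACKEND", "local")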

Mart Verification SQL (required)

Run these after the pipeline to confirm full-history weekly marts:

select min(week), max(week), count(distinct week) from student_risk_daily;
select week, count(*) from student_risk_daily group by week order by week limit 20;
select min(week), max(week), count(*) from course_summary_daily;
select week, count(*) from course_summary_daily group by week order by week limit 20;

Expected behavior on real/demo OULAD-like data: count(distinct week) > 1 for student_risk_daily.

Power BI: Connect to Postgres

  1. In Power BI Desktop, select Get Data → PostgreSQL database.
  2. Server: localhost.
  3. Database: oulad_analytics.
  4. Credentials: username oulad, password oulad.
  5. Choose Import mode for quickest demo setup.
  6. Select these tables:
    • student_risk_daily
    • course_summary_daily
    • experiment_results
    • alert_log
  7. Click Load and build visuals (examples: risk trend by week with legend run_date, module heatmap by weekly high_risk_rate, alert timeline by run_ts).

S3 Artifact Publishing

Set these env vars: STORAGE_BACKEND, AWS_REGION, S3_BUCKET, S3_PREFIX.

When STORAGE_BACKEND=s3, the pipeline uploads only small/stable artifacts:

  • outputs/metrics_latest.json
  • outputs/shap_top_features.json
  • outputs/marts/*.csv
  • outputs/alerts/*.md
  • reports/*.md
  • reports/*.csv
  • outputs/artifacts_manifest.json

The manifest acts as run audit evidence (run_id, timestamp, model backend, db mode, file sizes, and storage URIs).

Example:

export STORAGE_BACKEND=s3
export AWS_REGION=us-east-1
export S3_BUCKET=my-oulad-artifacts
export S3_PREFIX=prod
python -m src.pipeline --demo

AWS credentials are intentionally not stored in this repo. Use IAM roles, AWS SSO, or a local AWS profile.
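
For reference, an illustrative way to assemble the manifest fields listed above (the key names and layout are assumptions, not the repo's exact schema):

import json, time, uuid
from pathlib import Path

manifest = {
    "run_id": str(uuid.uuid4()),
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    "model_backend": "sklearn",
    "db_mode": "postgres",
    "artifacts": [
        {"path": p.as_posix(), "bytes": p.stat().st_size,
         "uri": f"s3://my-oulad-artifacts/prod/{p.name}"}   # hypothetical bucket/prefix
        for p in Path("outputs").glob("*.json")
    ],
}
Path("outputs/artifacts_manifest.json").write_text(json.dumps(manifest, indent=2))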

CI/CD

GitHub Actions (.github/workflows/daily_pipeline.yml) runs on push, nightly schedule, and manual dispatch. It performs compile/lint/test checks and runs the demo pipeline before uploading artifacts.

Project Structure

src/
  config.py
  storage.py
  pipeline.py
  utils/logging.py
  etl/{extract.py,transform.py,load.py}
  features/build_features.py
  model/{train.py,predict.py,evaluate.py,explain.py}
  marts/build_marts.py
  alerts/alert.py
  experiments/ab_simulation.py

db/{schema.sql,marts.sql}
docker/Dockerfile
docker-compose.yml
.github/workflows/daily_pipeline.yml
tests/test_pipeline_smoke.py
Makefile

Future Improvements

  • Replace offline simulation with live randomized intervention experiments.
  • Add model drift monitoring, retraining cadence, and model registry.
  • Extend security controls (RLS/IAM) and data contracts for enterprise rollout.

Quickstart (Postgres + dbt + Power BI)

Prerequisites

  • Docker Desktop
  • Python 3.10+ virtual environment
  • dbt-postgres
  • Power BI Desktop

1) Start Postgres

docker compose up -d postgres

2) Install dependencies

pip install -r requirements.txt
pip install dbt-postgres

3) Set DATABASE_URL

macOS/Linux

export DATABASE_URL=postgresql://oulad:oulad@localhost:5432/oulad_analytics

Windows PowerShell

$env:DATABASE_URL = "postgresql://oulad:oulad@localhost:5432/oulad_analytics"

Windows CMD

set DATABASE_URL=postgresql://oulad:oulad@localhost:5432/oulad_analytics

4) Ingest raw CSVs to raw.*

python scripts/ingest_raw_postgres.py

5) Run Python ML pipeline (writes ml.student_risk_scores + backward-compatible legacy tables)

python -m src.pipeline

Demo mode remains available when needed:

python -m src.pipeline --demo

6) Build marts with dbt

cd dbt_oulad
dbt run
dbt test

7) Power BI connection

  • Connector: PostgreSQL database
  • Server: localhost
  • Port: 5432
  • Database: oulad_analytics
  • Username: oulad
  • Password: oulad

Load only mart tables:

  • mart.mart_student_risk_weekly
  • mart.mart_course_summary_weekly

Optional metrics table:

  • ml.student_risk_scores

If Power BI prompts for encryption on the local Docker Postgres, choose an unencrypted connection for local development.

Makefile helpers

make postgres-up
make ingest-raw
make pipeline-ml
make dbt-run
make dbt-test
make verify-postgres

Windows fallback (if make is unavailable)

docker compose up -d postgres
python scripts/ingest_raw_postgres.py
python -m src.pipeline
cd dbt_oulad
dbt run
dbt test
