🚀 A cloud-ready, production-style analytics platform that transforms exploratory notebook work into a modular, CI-validated, containerized student retention system.
This repository demonstrates:
- End-to-end ETL → Time-aware ML → SHAP explainability
- Postgres-backed BI marts
- Threshold & spike alerting
- Offline A/B experiment simulation
- ROI sensitivity modeling
- Dockerized execution + CI quality gates
The original exploratory notebook is preserved: oulad-student-success-prediction.ipynb.
Unlike typical ML portfolio projects, this repository:
- Separates exploration from production-grade code
- Implements time-based validation to prevent data leakage
- Publishes BI-ready database marts
- Includes alerting and experiment planning
- Enforces formatting, linting, and CI gates
- Supports containerized and cloud-ready deployment
One command runs the full workflow:
ETL → Feature Engineering → Time-aware Model → SHAP → BI Marts → Alerts → A/B Simulation → ROI → Executive Summary
Demo mode ensures reproducibility even without raw OULAD data.
This repository presents a production-style student retention analytics platform built from the OULAD domain. It transforms exploratory notebook work into a modular pipeline with reproducible execution, database-backed marts, alerting, experimentation, and executive reporting.
Universities lose tuition revenue and student outcomes when at-risk learners are not identified early. This platform converts weekly behavior and assessment signals into:
- risk scoring for intervention teams,
- operational threshold/spike alerting,
- offline experiment planning support,
- ROI sensitivity analysis for budget decisions.
Raw OULAD CSVs (or Demo Generator)
|
v
ETL Layer
|
v
Feature Engineering (time-aware)
|
v
Model Train/Eval + SHAP Explain
|
v
Prediction + BI Marts + Alerts
|
v
A/B Simulation + ROI + Executive Report
|
v
Postgres (primary) / SQLite (fallback)
[S3 Raw Zone] --> [ECS Scheduled Pipeline Task] --> [RDS Postgres Marts]
| | |
| v v
| [CloudWatch Logs/Alarms] [Power BI Service]
|
+--> [S3 Artifacts: metrics, SHAP, reports]
Entrypoint: `src/pipeline.py`
- Extract OULAD data (`src/etl/extract.py`), with DEMO MODE fallback when raw files are missing.
- Transform weekly student-level event data (`src/etl/transform.py`).
- Load processed data and initialize the DB schema (`src/etl/load.py`).
- Build time-aware features (`src/features/build_features.py`).
- Train the risk model using a week-based split (`src/model/train.py`).
- Evaluate performance (`src/model/evaluate.py`).
- Generate SHAP explainability artifacts (`src/model/explain.py`).
- Score the full weekly history and publish marts (`src/model/predict.py`, `src/marts/build_marts.py`).
- Trigger alerts and log them (`src/alerts/alert.py`).
- Run offline A/B simulation + ROI grid (`src/experiments/ab_simulation.py`).
- Produce the executive summary (`reports/executive_summary.md`).
To avoid random leakage, the model is evaluated with time ordering:
- Train: `week < SPLIT_WEEK`
- Test: `week >= SPLIT_WEEK`
This approximates real operations where future student behavior must be predicted from historical data.
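The split described above can be sketched in a few lines; this is a minimal illustration assuming a DataFrame with a `week` column and the default `SPLIT_WEEK` of 7, not the repo's actual implementation:

```python
import pandas as pd

SPLIT_WEEK = 7  # configurable via the SPLIT_WEEK env var

def time_split(df: pd.DataFrame):
    """Split rows so the model never trains on the weeks it must predict."""
    train = df[df["week"] < SPLIT_WEEK]
    test = df[df["week"] >= SPLIT_WEEK]
    return train, test

weeks = pd.DataFrame({"week": range(10), "risk": [0.1] * 10})
train, test = time_split(weeks)
# train covers weeks 0-6, test covers weeks 7-9
```

A random shuffle split would instead leak late-semester behavior into training, inflating offline metrics.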
Marts are generated as full history time series across all available weeks (0..max week) per run date. CURRENT_WEEK is now an optional snapshot override for alerts/experiments only.
- sklearn backend: SHAP artifacts are generated (`outputs/shap_top_features.json`, `outputs/shap_summary.png`).
- pytorch/tensorflow backends: permutation importance is generated and written to `outputs/shap_top_features.json` with the same JSON schema as sklearn.
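The permutation-importance fallback can be sketched from first principles: shuffle one feature at a time and measure how much accuracy drops. This is an illustrative self-contained version with a toy model, not the code used by the actual backends:

```python
import numpy as np

def permutation_importance(predict_fn, X, y, feature_names, seed=0):
    """Importance of a feature = accuracy lost when that feature is shuffled."""
    rng = np.random.default_rng(seed)
    base = np.mean((predict_fn(X) > 0.5) == y)
    scores = {}
    for j, name in enumerate(feature_names):
        Xp = X.copy()
        rng.shuffle(Xp[:, j])  # break the feature/target link for column j
        perm = np.mean((predict_fn(Xp) > 0.5) == y)
        scores[name] = float(base - perm)
    return scores

# Toy model: risk depends only on the first feature.
X = np.column_stack([np.linspace(0, 1, 200), np.zeros(200)])
y = (X[:, 0] > 0.5).astype(int)
scores = permutation_importance(lambda X: X[:, 0], X, y, ["clicks", "noise"])
# "clicks" gets a large positive score; "noise" stays at zero
```

Because the score is just a difference of accuracies, the result serializes to the same flat `{feature: importance}` JSON shape regardless of model backend.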
Implemented alerts:
- Threshold alert: high-risk share exceeds configurable threshold.
- Spike alert: week-over-week mean risk increase exceeds configured percentage.
Alerts are saved to `outputs/alerts/alert_latest.md` and inserted into `alert_log`.
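Both rules can be sketched as pure functions over the risk series. The defaults mirror the `HIGH_RISK_THRESHOLD` and `RISK_SPIKE_THRESHOLD_PCT` settings; the 0.5 score cutoff for counting a student as high-risk, and the function names, are illustrative assumptions:

```python
def threshold_alert(risk_scores, high_risk_threshold=0.25, cutoff=0.5):
    """Fire when the share of students scored above `cutoff` exceeds the threshold."""
    high_risk_share = sum(r >= cutoff for r in risk_scores) / len(risk_scores)
    return high_risk_share > high_risk_threshold

def spike_alert(prev_week_mean, this_week_mean, spike_pct=0.10):
    """Fire when week-over-week mean risk grows by more than `spike_pct`."""
    if prev_week_mean == 0:
        return False
    return (this_week_mean - prev_week_mean) / prev_week_mean > spike_pct

print(threshold_alert([0.9, 0.8, 0.1, 0.2]))  # 50% high-risk share -> True
print(spike_alert(0.20, 0.25))                # +25% week-over-week jump -> True
```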
Offline experiment simulation on top-K at-risk students:
- seeded control/treatment assignment,
- uplift scenarios: 3%, 5%, 8%,
- bootstrap confidence intervals,
- two-proportion z-test,
- persisted experiment rows in `experiment_results`.
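The significance check can be sketched with a textbook pooled two-proportion z-test; details of the actual implementation in `src/experiments/ab_simulation.py` may differ:

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test: is the treatment pass rate different from control?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # two-sided p-value via the normal CDF (erf form)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

z, p = two_proportion_z(success_a=400, n_a=1000, success_b=450, n_b=1000)
# a 40% vs 45% pass rate over 1000 students per arm is significant at p < 0.05
```

Bootstrap confidence intervals complement this by quantifying uncertainty in the uplift estimate itself rather than only testing the null.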
ROI sensitivity grid is generated via:
ROI = incremental_passes * value_per_pass - intervention_cost
Output: `reports/roi_sensitivity.csv`.
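The sensitivity grid is the formula above evaluated over a range of assumptions. A minimal sketch, where the cohort size, dollar values, and uplift list are illustrative rather than the repo's actual grid:

```python
# Sweep ROI = incremental_passes * value_per_pass - intervention_cost
# over a set of uplift assumptions.
def roi_grid(top_k, uplifts, value_per_pass, cost_per_student):
    rows = []
    for uplift in uplifts:
        incremental_passes = top_k * uplift
        cost = top_k * cost_per_student
        roi = incremental_passes * value_per_pass - cost
        rows.append({"uplift": uplift, "roi": roi})
    return rows

grid = roi_grid(top_k=500, uplifts=[0.03, 0.05, 0.08],
                value_per_pass=3000, cost_per_student=100)
# uplift 0.03 -> ROI = 500*0.03*3000 - 500*100 = -5000 (net loss)
# uplift 0.05 -> ROI = 25000; uplift 0.08 -> ROI = 70000
```

The grid makes the break-even uplift explicit, which is the number budget owners actually negotiate over.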
Recommended local quality gates:
python -m compileall src
ruff check src tests
black --check src tests
pytest -q
python -m src.pipeline --demo

These same checks are wired into `.github/workflows/daily_pipeline.yml`.
- A/B results are offline simulation, not causal proof from live experimentation.
- Uplift scenarios (3%, 5%, 8%) are planning assumptions.
- Synthetic demo mode is for reproducibility and CI, not for production decisions.
- Model and feature definitions are intentionally lightweight for portfolio demonstration.
- `outputs/metrics_latest.json`
- `outputs/predictions_latest.csv` (runtime output; not tracked)
- `outputs/shap_top_features.json`
- `outputs/marts/student_risk_daily_sample.csv`
- `outputs/marts/course_summary_daily_sample.csv`
- `outputs/alerts/alert_latest.md`
- `outputs/experiments/assignment_latest.csv`
- `reports/ab_test_report.md`
- `reports/roi_sensitivity.csv`
- `reports/executive_summary.md`
- Storage abstraction in `src/storage.py`: `LocalStorage` + real `S3Storage` (boto3).
- Database: Postgres via `DATABASE_URL`, SQLite fallback for local portability.
- Compute: Dockerized app service (`docker/Dockerfile`, `docker-compose.yml`).
- Observability: structured logs ready for CloudWatch-style ingestion.
- BI layer: marts aligned for Power BI connectivity.
- Start Postgres: `docker compose up -d postgres`
- Install dependencies: `pip install -r requirements.txt`
- Point the pipeline to Postgres (it falls back to SQLite only when `DATABASE_URL` is unset): `export DATABASE_URL=postgresql://oulad:oulad@localhost:5432/oulad_analytics`
- Run the full demo pipeline and publish marts: `make run`
- Verify tables and row counts: `make verify-postgres`

If `DATABASE_URL` is not set, the same `make run` command writes to local SQLite at `data/processed/pipeline.db`.
Sklearn remains the default baseline. PyTorch/TensorFlow are optional and only used when MODEL_BACKEND is set explicitly.
# PyTorch backend
pip install -r requirements-pt.txt
MODEL_BACKEND=pytorch python -m src.pipeline --demo
# TensorFlow backend
pip install -r requirements-tf.txt
MODEL_BACKEND=tensorflow python -m src.pipeline --demo

PyTorch installation may vary by OS/CUDA; if needed, use the official PyTorch install command for your platform.
- `DATABASE_URL`
- `PIPELINE_DEMO_MODE=true|false`
- `SPLIT_WEEK=7`
- `CURRENT_WEEK=<optional snapshot week override>`
- `HIGH_RISK_THRESHOLD=0.25`
- `RISK_SPIKE_THRESHOLD_PCT=0.10`
- `MODEL_BACKEND=sklearn|pytorch|tensorflow`
- `STORAGE_BACKEND=local|s3`
- `AWS_REGION=us-east-1`
- `S3_BUCKET=<your-bucket>`
- `S3_PREFIX=oulad-artifacts`
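Settings like these are typically read with environment-variable defaults. A minimal sketch of the pattern; the actual `src/config.py` may structure this differently:

```python
import os

def load_config() -> dict:
    """Read pipeline settings from the environment, with safe defaults."""
    return {
        "database_url": os.getenv("DATABASE_URL"),  # None -> SQLite fallback
        "demo_mode": os.getenv("PIPELINE_DEMO_MODE", "false").lower() == "true",
        "split_week": int(os.getenv("SPLIT_WEEK", "7")),
        "high_risk_threshold": float(os.getenv("HIGH_RISK_THRESHOLD", "0.25")),
        "risk_spike_threshold_pct": float(os.getenv("RISK_SPIKE_THRESHOLD_PCT", "0.10")),
        "model_backend": os.getenv("MODEL_BACKEND", "sklearn"),
        "storage_backend": os.getenv("STORAGE_BACKEND", "local"),
    }

cfg = load_config()
```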
Run these after the pipeline to confirm full-history weekly marts:
select min(week), max(week), count(distinct week) from student_risk_daily;
select week, count(*) from student_risk_daily group by week order by week limit 20;
select min(week), max(week), count(*) from course_summary_daily;
select week, count(*) from course_summary_daily group by week order by week limit 20;

Expected behavior on real/demo OULAD-like data: count(distinct week) > 1 for student_risk_daily.
- In Power BI Desktop, select Get Data → PostgreSQL database.
- Server: `localhost`.
- Database: `oulad_analytics`.
- Credentials: username `oulad`, password `oulad`.
- Choose Import mode for the quickest demo setup.
- Select these tables: `student_risk_daily`, `course_summary_daily`, `experiment_results`, `alert_log`.
- Click Load and build visuals (examples: risk trend by `week` with legend `run_date`, module heatmap by weekly `high_risk_rate`, alert timeline by `run_ts`).
Set these env vars: STORAGE_BACKEND, AWS_REGION, S3_BUCKET, S3_PREFIX.
When STORAGE_BACKEND=s3, the pipeline uploads only small/stable artifacts:
- `outputs/metrics_latest.json`
- `outputs/shap_top_features.json`
- `outputs/marts/*.csv`
- `outputs/alerts/*.md`
- `reports/*.md`
- `reports/*.csv`
- `outputs/artifacts_manifest.json`
The manifest acts as run audit evidence (run_id, timestamp, model backend, db mode, file sizes, and storage URIs).
Example:
export STORAGE_BACKEND=s3
export AWS_REGION=us-east-1
export S3_BUCKET=my-oulad-artifacts
export S3_PREFIX=prod
python -m src.pipeline --demo

AWS credentials are intentionally not stored in this repo. Use IAM roles, AWS SSO, or a local AWS profile.
GitHub Actions (.github/workflows/daily_pipeline.yml) runs on push, nightly schedule, and manual dispatch. It performs compile/lint/test checks and runs the demo pipeline before uploading artifacts.
src/
config.py
storage.py
pipeline.py
utils/logging.py
etl/{extract.py,transform.py,load.py}
features/build_features.py
model/{train.py,predict.py,evaluate.py,explain.py}
marts/build_marts.py
alerts/alert.py
experiments/ab_simulation.py
db/{schema.sql,marts.sql}
docker/Dockerfile
docker-compose.yml
.github/workflows/daily_pipeline.yml
tests/test_pipeline_smoke.py
Makefile
- Replace offline simulation with live randomized intervention experiments.
- Add model drift monitoring, retraining cadence, and model registry.
- Extend security controls (RLS/IAM) and data contracts for enterprise rollout.
- Docker Desktop
- Python 3.10+ virtual environment
- `dbt-postgres`
- Power BI Desktop

docker compose up -d postgres
pip install -r requirements.txt
pip install dbt-postgres

macOS/Linux
export DATABASE_URL=postgresql://oulad:oulad@localhost:5432/oulad_analytics

Windows PowerShell

$env:DATABASE_URL = "postgresql://oulad:oulad@localhost:5432/oulad_analytics"

Windows CMD

set DATABASE_URL=postgresql://oulad:oulad@localhost:5432/oulad_analytics

python scripts/ingest_raw_postgres.py
python -m src.pipeline

Demo mode remains available when needed:

python -m src.pipeline --demo

cd dbt_oulad
dbt run
dbt test

- Connector: PostgreSQL database
- Server: `localhost`
- Port: `5432`
- Database: `oulad_analytics`
- Username: `oulad`
- Password: `oulad`
Load only mart tables:
- `mart.mart_student_risk_weekly`
- `mart.mart_course_summary_weekly`
Optional metrics table:
- `ml.student_risk_scores`
If Power BI prompts for encryption on local Docker Postgres, choose unencrypted connection for local development.
make postgres-up
make ingest-raw
make pipeline-ml
make dbt-run
make dbt-test
make verify-postgres

docker compose up -d postgres
python scripts/ingest_raw_postgres.py
python -m src.pipeline
cd dbt_oulad
dbt run
dbt test