DSAIB Python for Data Science - Final Project - Lad team

leottawa/Churn-Predictions


Churn Prediction (10-Day Horizon)

Predict users likely to churn within the next 10 days. The workflow builds leakage-safe daily features, handles imbalance with undersampling, and trains an XGBoost model tuned for high recall and F1 on the churn class. SHAP is used to trim the feature set without losing performance.

Tracked files

  • v8/preprocess.py — Feature engineering pipeline. Builds a user–day grid, shifts targets to a 10-day horizon, and engineers engagement/session rollups (7–10/30-day windows), ratios (ads per play, thumbs-up rate, plays/session), recency gaps, active streaks, calendar flags, plan-change signals, playlist/friend totals, and tenure. Entry point: UserChurnPreprocessor(train_path=..., window_days=10).run().
  • v8/exploration.ipynb — EDA. Reviews schema, missingness (length/artist/song), class imbalance (~6% churn), numerical correlations/outliers, weekly/daily seasonality, and sanity checks (e.g., no events after cancellation).
  • v8/feature_selection.ipynb — SHAP-based reduction. Fits XGBoost on full features, ranks importance, then greedily tests top-k subsets; F1 stabilizes around 66 features while keeping recall.
  • v8/pred.ipynb — Modeling and inference. Randomized search for XGBoost hyperparameters with 5-fold CV; each train fold is undersampled to 50/50 while val folds stay imbalanced. Uses the selected feature subset, scores test data, averages last 10 days with recency weighting, and applies a 0.4 threshold for the churn flag/submission.
  • .gitignore — Ignores everything except the files above (and this README).

Data assumptions

  • Input files (ignored by git): data/train.parquet, data/test.parquet, containing 50 days of user event logs with columns like userId, ts, page, sessionId, song, artist, length, registration, etc.
  • Target: for each user–day, label = 1 if the user's churn date (first Cancellation Confirmation event) falls within the following 10 days.
  • Imbalance: churn rate ~6%; the pipeline prioritizes recall to avoid missing churners.
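The target definition above can be sketched with pandas. This is a minimal illustration on a hypothetical three-event log, not the project's actual `UserChurnPreprocessor` code; column names (`userId`, `day`, `page`) follow the data assumptions listed above.

```python
import pandas as pd

# Hypothetical mini event log; the real data holds ~50 days of events per user.
events = pd.DataFrame({
    "userId": ["u1", "u1", "u1"],
    "day": pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-12"]),
    "page": ["NextSong", "NextSong", "Cancellation Confirmation"],
})

WINDOW_DAYS = 10

# Churn date = first Cancellation Confirmation per user.
churn = (events[events["page"] == "Cancellation Confirmation"]
         .groupby("userId")["day"].min().rename("churn_date"))

# Expand each user to a full daily calendar (so inactive days are explicit),
# then label days whose churn date falls within the next 10 days.
grid = (events.groupby("userId")["day"]
        .agg(["min", "max"])
        .apply(lambda r: pd.date_range(r["min"], r["max"]), axis=1)
        .explode().rename("day").reset_index())
grid = grid.merge(churn, on="userId", how="left")
horizon = (grid["churn_date"] - grid["day"]).dt.days
grid["label"] = ((horizon >= 0) & (horizon <= WINDOW_DAYS)).astype(int)
```

Expanding to the full calendar before labeling is what makes the label leakage-safe: each row only looks forward from its own day.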

Preprocessing pipeline (v8/preprocess.py)

  • Load & normalize timestamps; cast userId to string; add ts_dt, day, and registration_dt.
  • Build user span (first/last activity) and expand to a full user–day calendar so inactivity is explicit.
  • Add churn date (min Cancellation Confirmation per user) and 10-day horizon target.
  • Membership: latest daily level (paid/free), forward-filled per user.
  • Engagement features: daily counts for plays, thumbs up/down, ads, help/error, logout, sessions, play time, searches, unique songs/artists; 7–10-day and 30-day rolling sums.
  • Ratios: ads per play, thumbs-up rate, plays per session (short and long windows).
  • Recency & streaks: days since last event/play; active streak length.
  • Session features: mean/max session length per day plus rolling means (short/long).
  • Temporal flags: day of week, weekend flag, day of month.
  • Plan-change signals: daily counts of submitted/confirmed upgrades/downgrades, plus rolling sums (short/long).
  • Playlist/friend activity: daily counts, rolling sums, cumulative totals.
  • Tenure: days since registration (per user).
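The rolling-window and recency features above can be sketched as follows. This is an illustrative fragment on a toy user–day grid with hypothetical column names (`plays`, `ads`), not the pipeline's actual implementation; the real code computes many more columns over both short and long windows.

```python
import numpy as np
import pandas as pd

# Toy user-day grid with daily counts; real grid comes from the calendar expansion.
daily = pd.DataFrame({
    "userId": ["u1"] * 5,
    "day": pd.date_range("2023-01-01", periods=5),
    "plays": [3, 0, 5, 2, 1],
    "ads": [1, 0, 2, 0, 0],
})

SHORT = 7  # short rolling window (days); the pipeline also uses a 30-day window

daily = daily.sort_values(["userId", "day"])
g = daily.groupby("userId")

# Rolling engagement sums per user.
daily["plays_7d"] = g["plays"].transform(lambda s: s.rolling(SHORT, min_periods=1).sum())
daily["ads_7d"] = g["ads"].transform(lambda s: s.rolling(SHORT, min_periods=1).sum())

# Ratio feature: ads per play over the short window (guard against /0).
daily["ads_per_play_7d"] = daily["ads_7d"] / daily["plays_7d"].clip(lower=1)

# Recency: days since the user's last play.
last_play = daily["day"].where(daily["plays"] > 0)
daily["days_since_play"] = (
    daily["day"] - last_play.groupby(daily["userId"]).ffill()
).dt.days
```

Because the grid includes zero-activity days, the rolling sums decay naturally during inactivity, which is what makes the recency/streak signals meaningful.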

Feature selection (v8/feature_selection.ipynb)

  • Train XGBoost on full features, compute SHAP values, rank features.
  • Greedy search over top-k lists; F1 peaks around k≈66 while preserving recall.
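The greedy top-k procedure can be sketched as below. To keep the example self-contained it uses synthetic data and a logistic regression's |coefficients| as a stand-in importance ranking; the actual notebook ranks features by mean |SHAP value| from a fitted XGBoost model, but the greedy loop is the same.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in data (the real input is the engineered churn features).
X, y = make_classification(n_samples=600, n_features=20, n_informative=5,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, stratify=y, random_state=0)

# Stand-in importance ranking from a full-feature fit.
# (The real notebook ranks by mean |SHAP value| from XGBoost.)
full = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
ranking = np.argsort(-np.abs(full.coef_[0]))

# Greedy search over top-k feature subsets, keeping the best validation F1.
best_k, best_f1 = None, -1.0
for k in range(1, X.shape[1] + 1):
    cols = ranking[:k]
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, cols], y_tr)
    f1 = f1_score(y_va, clf.predict(X_va[:, cols]))
    if f1 > best_f1:
        best_k, best_f1 = k, f1
```

In the project's setting this curve flattens around k≈66, which is why the reduced feature set keeps F1 and recall essentially unchanged.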

Modeling & inference (v8/pred.ipynb)

  • CV setup: 5-fold stratified; each train fold undersampled to 50/50; validation remains imbalanced to mirror reality.
  • Hyperparameters: randomized search across depth, learning rate, estimators, min child weight, subsample/colsample, gamma, lambda (200 candidates).
  • Objective/metrics: maximize F1 on churn class; track recall and AUC (~0.94 internally). Precision is lower by design due to high-recall focus.
  • Prediction logic: score test data on the selected features; average churn probabilities over the last 10 days with higher weight on recent days; threshold at 0.4 for final churn flag/submission.
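The recency-weighted averaging and thresholding step can be sketched like this. The probabilities and weighting scheme here are hypothetical (linearly increasing weights); the notebook's exact weights may differ, but the shape of the computation is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical per-user daily churn probabilities for the last 10 scored days.
scores = pd.DataFrame({
    "userId": ["u1"] * 10,
    "day": pd.date_range("2023-02-10", periods=10),
    "proba": np.linspace(0.2, 0.6, 10),
})

THRESHOLD = 0.4  # final decision threshold from the notebook

def recency_weighted(group: pd.DataFrame) -> float:
    # Linearly increasing weights: the most recent day counts most.
    g = group.sort_values("day")
    w = np.arange(1, len(g) + 1, dtype=float)
    return float(np.average(g["proba"], weights=w))

agg = scores.groupby("userId").apply(recency_weighted).rename("proba")
flags = (agg >= THRESHOLD).astype(int)
```

Averaging over the last 10 days smooths out day-to-day noise in the model's scores, while the recency weighting keeps the flag responsive to a user's most recent behavior.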

Usage (local)

  1. Place train.parquet / test.parquet in data/ (ignored by git).
  2. Build features:
    python -c "from v8.preprocess import UserChurnPreprocessor; UserChurnPreprocessor('data/train.parquet', window_days=10).run().to_parquet('data/train_features.parquet')"
  3. Open notebooks:
    • v8/exploration.ipynb for EDA context.
    • v8/feature_selection.ipynb to reproduce SHAP-based trimming.
    • v8/pred.ipynb to tune XGBoost and produce predictions/submission.

Results snapshot

  • Internal imbalanced split: high recall with AUC ~0.94; precision intentionally lower given the high-recall objective.
  • Balanced Kaggle-style eval shows lower accuracy due to distribution mismatch (real-world imbalance vs. balanced benchmark).
