Predict users likely to churn within the next 10 days. The workflow builds leakage-safe daily features, handles imbalance with undersampling, and trains an XGBoost model tuned for high recall and F1 on the churn class. SHAP is used to trim the feature set without losing performance.
- `v8/preprocess.py` — Feature engineering pipeline. Builds a user-day grid, shifts targets to a 10-day horizon, and engineers engagement/session rollups (7–10/30-day windows), ratios (ads per play, thumbs-up rate, plays per session), recency gaps, active streaks, calendar flags, plan-change signals, playlist/friend totals, and tenure. Entry point: `UserChurnPreprocessor(train_path=..., window_days=10).run()`.
- `v8/exploration.ipynb` — EDA. Reviews schema, missingness (`length`/`artist`/`song`), class imbalance (~6% churn), numerical correlations/outliers, weekly/daily seasonality, and sanity checks (e.g., no events after cancellation).
- `v8/feature_selection.ipynb` — SHAP-based reduction. Fits XGBoost on the full feature set, ranks importances, then greedily tests top-k subsets; F1 stabilizes around 66 features while preserving recall.
- `v8/pred.ipynb` — Modeling and inference. Randomized search over XGBoost hyperparameters with 5-fold CV; each training fold is undersampled to 50/50 while validation folds stay imbalanced. Uses the selected feature subset, scores test data, averages the last 10 days with recency weighting, and applies a 0.4 threshold for the churn flag/submission.
- `.gitignore` — Ignores everything except the files above (and this README).
- Input files (ignored by git): `data/train.parquet` and `data/test.parquet`, containing 50 days of user event logs with columns like `userId`, `ts`, `page`, `sessionId`, `song`, `artist`, `length`, `registration`, etc.
- Target: label = 1 if a user's churn date (first `Cancellation Confirmation`) occurs within the next 10 days of a given day.
- Imbalance: churn rate ~6%; the pipeline prioritizes recall to avoid missing churners.
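The target definition above can be sketched in a few lines of pandas. This is a minimal illustration on toy data: column names (`churn_date`, etc.) and the inclusive 10-day upper bound are assumptions, not necessarily the pipeline's exact conventions.

```python
import pandas as pd

# Toy user-day rows plus a per-user churn-date table (user "b" never churns).
user_day = pd.DataFrame({
    "userId": ["a", "a", "a", "b"],
    "day": pd.to_datetime(["2018-09-25", "2018-10-01", "2018-10-05", "2018-10-01"]),
})
churn = pd.DataFrame({"userId": ["a"], "churn_date": pd.to_datetime(["2018-10-10"])})

df = user_day.merge(churn, on="userId", how="left")  # non-churners get NaT

# label = 1 iff the first Cancellation Confirmation falls within the next
# 10 days of the row's day; NaT comparisons evaluate to False, so
# never-churning users are labeled 0 automatically.
horizon = pd.Timedelta(days=10)
df["label"] = (
    (df["churn_date"] > df["day"]) & (df["churn_date"] <= df["day"] + horizon)
).astype(int)
```

Only the rows within 10 days of the churn date receive a positive label; earlier rows for the same user stay negative, which is what makes the features leakage-safe to roll forward.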
- Load and normalize timestamps; cast `userId` to string; add `ts_dt`, `day`, and `registration_dt`.
- Build each user's activity span (first/last activity) and expand it to a full user-day calendar so inactivity is explicit.
- Add the churn date (first `Cancellation Confirmation` per user) and the 10-day-horizon target.
- Membership: latest daily level (paid/free), forward-filled per user.
- Engagement features: daily counts for plays, thumbs up/down, ads, help/error, logout, sessions, play time, searches, unique songs/artists; 7–10-day and 30-day rolling sums.
- Ratios: ads per play, thumbs-up rate, plays per session (short and long windows).
- Recency & streaks: days since last event/play; active streak length.
- Session features: mean/max session length per day plus rolling means (short/long).
- Temporal flags: day of week, weekend flag, day of month.
- Plan-change signals: daily counts of submitted/confirmed upgrades/downgrades, plus rolling sums (short/long).
- Playlist/friend activity: daily counts, rolling sums, cumulative totals.
- Tenure: days since registration (per user).
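Three of the steps above (user-day grid expansion, membership forward-fill, and rolling rollups with a ratio) can be sketched on toy data. Column names, the toy frame, and the divide-by-zero guard are illustrative assumptions, not the pipeline's actual code.

```python
import pandas as pd

events = pd.DataFrame({
    "userId": ["a", "a", "b"],
    "day": pd.to_datetime(["2018-10-01", "2018-10-03", "2018-10-02"]),
    "level": ["free", "paid", "paid"],
    "plays": [10, 4, 7],
    "ads": [1, 2, 0],
})

# User-day grid: one row per user per calendar day between first and last
# activity, so inactive days become explicit zero-activity rows.
span = events.groupby("userId")["day"].agg(["min", "max"])
grid = pd.concat(
    pd.DataFrame({"userId": uid, "day": pd.date_range(row["min"], row["max"])})
    for uid, row in span.iterrows()
)
df = grid.merge(events, on=["userId", "day"], how="left")
df[["plays", "ads"]] = df[["plays", "ads"]].fillna(0)

# Membership: forward-fill the latest observed level per user.
df["level"] = df.groupby("userId")["level"].ffill()

# Rolling sums per user (7-day short window shown; the pipeline also uses a
# 30-day long window) and an ads-per-play ratio guarded against zero plays.
for col in ["plays", "ads"]:
    df[f"{col}_7d"] = df.groupby("userId")[col].transform(
        lambda s: s.rolling(7, min_periods=1).sum()
    )
df["ads_per_play_7d"] = df["ads_7d"] / df["plays_7d"].clip(lower=1)
```

The explicit grid is what lets recency gaps and active streaks be computed as ordinary column operations instead of per-event scans.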
- Train XGBoost on full features, compute SHAP values, rank features.
- Greedy search over top-k lists; F1 peaks around k≈66 while preserving recall.
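The greedy top-k search can be sketched as follows. `ranked` and `score_subset` are placeholders: in the notebook, features are sorted by mean |SHAP| and scored by cross-validated F1 on the churn class; here a toy scorer that plateaus after three features stands in so the selection logic is visible in isolation.

```python
def smallest_stable_k(ranked, score_subset, ks, tol=0.005):
    """Return the smallest k whose score is within `tol` of the best k tried."""
    scores = {k: score_subset(ranked[:k]) for k in ks}
    best = max(scores.values())
    return min(k for k, s in scores.items() if s >= best - tol)

# Toy setup: features already ranked by importance, score plateaus at k=3.
ranked = ["f1", "f2", "f3", "f4", "f5"]
toy_f1 = lambda feats: min(len(feats), 3) / 3
k = smallest_stable_k(ranked, toy_f1, ks=range(1, 6))
```

With a real F1 scorer this picks the smallest subset whose score sits on the plateau, which is how the pipeline lands on roughly 66 features without losing recall.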
- CV setup: 5-fold stratified; each train fold undersampled to 50/50; validation remains imbalanced to mirror reality.
- Hyperparameters: randomized search across depth, learning rate, estimators, min child weight, subsample/colsample, gamma, lambda (200 candidates).
- Objective/metrics: maximize F1 on churn class; track recall and AUC (~0.94 internally). Precision is lower by design due to high-recall focus.
- Prediction logic: score test data on the selected features; average churn probabilities over the last 10 days with higher weight on recent days; threshold at 0.4 for final churn flag/submission.
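Two of the steps above can be sketched with plain NumPy on synthetic data: undersampling a training fold to 50/50 (validation folds stay imbalanced), and the recency-weighted average over the last 10 daily probabilities with the 0.4 threshold. The linear weights and the helper names are assumptions standing in for the notebook's actual code.

```python
import numpy as np

def undersample_5050(X, y, rng):
    """Keep all positives plus an equal-sized random sample of negatives."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = np.concatenate([pos, rng.choice(neg, size=len(pos), replace=False)])
    idx = rng.permutation(keep)
    return X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(200).reshape(100, 2)
y = np.array([1] * 6 + [0] * 94)            # ~6% positives, like the churn rate
X_bal, y_bal = undersample_5050(X, y, rng)  # fit each CV candidate on this

# Recency-weighted scoring: the most recent of the last 10 days gets weight 10.
probs = np.array([0.30, 0.32, 0.35, 0.38, 0.40, 0.42, 0.45, 0.48, 0.50, 0.55])
score = np.average(probs, weights=np.arange(1, 11))
flag = int(score >= 0.4)
```

Undersampling only the training fold keeps the validation estimate honest, while the weighted average lets a rising churn probability in the final days dominate the decision.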
- Place `train.parquet`/`test.parquet` in `data/` (ignored by git).
- Build features:

  ```bash
  python -c "from v8.preprocess import UserChurnPreprocessor; UserChurnPreprocessor('data/train.parquet', window_days=10).run().to_parquet('data/train_features.parquet')"
  ```

- Open the notebooks:
  - `v8/exploration.ipynb` for EDA context.
  - `v8/feature_selection.ipynb` to reproduce the SHAP-based trimming.
  - `v8/pred.ipynb` to tune XGBoost and produce predictions/submission.
- Internal imbalanced split: high recall with AUC ~0.94; precision is intentionally lower given the high-recall objective.
- A balanced Kaggle-style evaluation shows lower accuracy due to distribution mismatch (real-world imbalance vs. a balanced benchmark).