
Predict Future Sales — Kaggle

Tests · Python 3.11 · License: MIT

Time-series sales forecasting on the Kaggle Predict Future Sales competition (1C Russia, 60 shops × 22K items). LightGBM with hand-engineered lag features, hierarchical mean encodings, and momentum signals.

Final result: Public LB 0.87399 · Private LB 0.88034 (RMSE, target clipped to [0, 20]). Improvement over session baseline: −13.7 % RMSE (1.013 → 0.874).


Highlights

  • Public LB 0.87399 · Private 0.88034 — single LightGBM, 47 features, 751 trees
  • Validation fix: the old filtered split reported 0.7502 RMSE while the LB said 1.013; full block-33 validation reports 0.7933 RMSE and tracks the LB within ±0.01
  • 10 attempts logged: best 0.87399, worst 1.46139. 2 wins, 6 worse, 2 broken (groupby.apply bugs)
  • Negative findings quantified: Optuna HPO +0.027 LB RMSE · multi-seed averaging +0.010 · 3-model stack +0.014 (all worse than the 0.87399 baseline)
  • Pipeline: 8 M training rows, ~5 min end-to-end on M-series Mac
  • 9 unit tests in CI + reproducible from python -m src.train
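The competition metric is RMSE with the monthly target clipped to [0, 20]. A minimal sketch of that metric (the function name `clipped_rmse` is illustrative, not taken from this repo's code):

```python
import numpy as np

def clipped_rmse(y_true, y_pred, lo=0.0, hi=20.0):
    """RMSE after clipping targets and predictions to [lo, hi],
    matching the competition's evaluation on clipped monthly counts."""
    y_true = np.clip(np.asarray(y_true, dtype=float), lo, hi)
    y_pred = np.clip(np.asarray(y_pred, dtype=float), lo, hi)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# A true count of 25 clips to 20, so per-item errors are 1, 1, 0.
print(clipped_rmse([0, 25, 3], [1, 19, 3]))
```

Because clipping happens before the error is computed, overshooting a high-volume item beyond 20 costs nothing extra, which is why the pipeline clips its final predictions too.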

Result Trajectory

LB Progression

The path was not linear. After fixing the validation, the cleaned baseline (30 features, native categorical handling, clipped lags) already beat the starting point by 0.135. Adding lag-12 (year-over-year November signal) plus trend/momentum features pushed it further to 0.87399. Optuna, multi-seed averaging, stacking, and rolling features all hurt in this run — captured below as failure cases.
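The lag and momentum features described above can be sketched in a few lines of pandas. Column names follow the competition data; the exact feature set lives in src/train.py, and this toy frame only illustrates the mechanics:

```python
import pandas as pd

# Toy monthly grid: one row per (shop_id, item_id, date_block_num).
df = pd.DataFrame({
    "shop_id":        [0] * 6,
    "item_id":        [1] * 6,
    "date_block_num": [0, 1, 2, 3, 4, 5],
    "item_cnt_month": [2.0, 3.0, 0.0, 1.0, 4.0, 2.0],
})

df = df.sort_values(["shop_id", "item_id", "date_block_num"])
grp = df.groupby(["shop_id", "item_id"])["item_cnt_month"]

for lag in (1, 2, 3):                       # the repo also uses lags 6 and 12
    df[f"item_cnt_month_lag{lag}"] = grp.shift(lag)

# Momentum: month-over-month change of the lagged signal.
df["momentum_1"] = df["item_cnt_month_lag1"] - df["item_cnt_month_lag2"]
print(df[["date_block_num", "item_cnt_month_lag1", "momentum_1"]])
```

`groupby(...).shift()` keeps the result aligned with the original index, which matters: the broken attempts below used `groupby(...).apply(...)` for the same job and corrupted the alignment.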

Attempts Sorted


Final Architecture

```mermaid
flowchart TB
    A[sales_train.csv<br/>2.9M daily rows] --> B[Clean<br/>outliers, shop dedup, clip 0–20]
    B --> C[Cartesian grid<br/>shop × item × month]
    C --> D[Features<br/>lags 1/2/3/6/12 · hierarchical mean<br/>encodings · trend / momentum]
    D --> E[LightGBM<br/>47 features · 751 trees<br/>native categorical]
    E --> F[Predict + clip<br/>submission.csv]
```
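The Cartesian-grid step expands the sparse sales log so that months where a (shop, item) pair sold nothing appear as explicit zero rows. A minimal sketch, assuming the competition's column names (the real pipeline in src/train.py may restrict which items enter each month's grid):

```python
import itertools
import pandas as pd

# Sparse sales log: only months with at least one sale appear.
sales = pd.DataFrame({
    "shop_id":        [0, 1, 1],
    "item_id":        [10, 11, 10],
    "date_block_num": [0, 0, 1],
    "item_cnt_month": [2.0, 1.0, 3.0],
})

# Per month, take the product of shops and items seen that month.
rows = []
for block in sales["date_block_num"].unique():
    cur = sales[sales["date_block_num"] == block]
    rows.extend(itertools.product(
        cur["shop_id"].unique(), cur["item_id"].unique(), [block]))
grid = pd.DataFrame(rows, columns=["shop_id", "item_id", "date_block_num"])

# Left-join sales back; missing pairs become explicit zeros, then clip.
full = grid.merge(sales, on=["shop_id", "item_id", "date_block_num"], how="left")
full["item_cnt_month"] = full["item_cnt_month"].fillna(0.0).clip(0, 20)
print(len(full))  # block 0: 2 shops x 2 items = 4 rows; block 1: 1 row
```

Expanding per month rather than globally keeps the grid near 8M rows instead of 60 shops × 22K items × 34 months ≈ 45M.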

Key Results

| Metric | Value |
| --- | --- |
| Public LB RMSE | 0.87399 |
| Private LB RMSE | 0.88034 |
| Validation RMSE (block 33, all pairs) | 0.7933 |
| Best LightGBM iteration | 751 |
| Number of features | 47 |
| Training rows | 8.0 M |
| Validation rows | 214 K |
| Test rows | 214 K |
| Training time (M-series Mac) | ~5 min |
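The last pipeline stage (predict, clip, write submission.csv) is simple enough to sketch. Here `preds` stands in for the model's raw outputs; the column names match the Kaggle sample submission:

```python
import numpy as np
import pandas as pd

# Raw model outputs (illustrative values, not from the repo).
preds = np.array([-0.3, 4.2, 27.5])

submission = pd.DataFrame({
    "ID": np.arange(len(preds)),
    # Clip to the competition range: -0.3 -> 0.0, 27.5 -> 20.0.
    "item_cnt_month": np.clip(preds, 0, 20),
})
submission.to_csv("submission.csv", index=False)
print(submission["item_cnt_month"].tolist())  # [0.0, 4.2, 20.0]
```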

Top features (importance by gain)

Feature Importance

item_cnt_month_lag1 dominates — last month's sales is by far the strongest predictor. The next tier is composed of mean encodings (lag_mean, item_mean_lag1) and identity features (item_category_id, item_id). The full importance file is at reports/figures/feature_importance.csv.
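Gain importances like these can be turned into the ranked table saved at reports/figures/feature_importance.csv. With a fitted LightGBM booster the two inputs below would come from `booster.feature_name()` and `booster.feature_importance(importance_type="gain")`; here they are stubbed with illustrative values:

```python
import pandas as pd

# Stub inputs; in the real pipeline these come off the trained booster.
names = ["item_cnt_month_lag1", "lag_mean", "item_mean_lag1", "item_category_id"]
gains = [5200.0, 1900.0, 1750.0, 900.0]

imp = (pd.DataFrame({"feature": names, "gain": gains})
         .sort_values("gain", ascending=False)
         .reset_index(drop=True))
imp["gain_share"] = imp["gain"] / imp["gain"].sum()  # fraction of total gain
print(imp.head(2))
```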


Experiment Log — what worked, what didn't

| # | Approach | Features | Public LB | vs baseline | Verdict |
| --- | --- | --- | --- | --- | --- |
| start | Old quick model with filtered validation | 45 | 1.013 | n/a | Validation was lying |
| 1 | KubaMichalczyk-style + zeroing masks | 48 | > 1.0 | worse | digital_mask zeroed live items |
| 2 | + RMSE-optimal calibration offset | 48 | > 1.0 | worse | Bias fix doesn't help with regime mismatch |
| 3 | Cleaned baseline (honest val + clip lags) | 30 | 0.87836 | −13.3 % | ✅ Honest val, native categorical, clip lags |
| 4 | + lag_6, lag_12, trend, momentum | 47 | 0.87399 | −13.7 % | ✅ Winner: YoY signal + trend features |
| 5 | + seasonal / last-sale via groupby.apply | 53 | 1.42878 | broken | groupby.apply().shift() corrupted index |
| 6 | + yoy_trend + lag_mean_6 | 49 | 0.87918 | worse | yoy_trend was just noise (lag_1 − lag_12) |
| 7 | Optuna HPO (30 trials on val block 33) | 47 | 0.90132 | worse | Tuned for October artefacts, fails on November |
| 8 | Multi-seed averaging (5 seeds) | 47 | 0.88407 | worse | Models too correlated (same params) |
| 9 | Stack v2: Optuna + baseline + XGBoost | 47 | 0.88810 | worse | Meta-learner trained on May–Oct, test = Nov |
| 10 | Stack v3: + rolling means + CatBoost | 50 | 1.46139 | broken | Rolling features corrupted the matrix (cause not isolated) |

Best: attempt 4 (47 features). Improvement over session start: 1.013 → 0.87399 = −13.7 % RMSE.
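Attempts 5 and 10 both died on the same pandas pitfall: `groupby(...).apply(lambda s: s.shift(1))` returns a series whose index no longer lines up row-for-row with the source frame, so assigning it back silently misaligns values. A minimal reproduction on toy data (not the repo's actual feature code), with the fix:

```python
import pandas as pd

# Non-trivial index order, as you get after filtering or sampling.
df = pd.DataFrame({
    "shop_id":  [0, 1, 0, 1],
    "item_cnt": [1.0, 2.0, 3.0, 4.0],
}, index=[3, 1, 2, 0])

# BUG: apply regroups rows (and, on recent pandas, prepends the group key),
# so the result's index no longer matches df's row order.
bad = df.groupby("shop_id")["item_cnt"].apply(lambda s: s.shift(1))
print(bad.index.equals(df.index))  # False

# FIX: shift directly on the grouped series; the result keeps df's index.
df["item_cnt_lag1"] = df.groupby("shop_id")["item_cnt"].shift(1)
print(df)
```

`groupby(...).shift()` (like `transform`) is index-preserving by contract, which is why the working attempts use it for every lag feature.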


Reproduce

```shell
git clone https://github.com/Farxida/predict-future-sales
cd predict-future-sales

# 1. Get Kaggle data (see data/README.md)
kaggle competitions download -c competitive-data-science-predict-future-sales -p data/
unzip data/competitive-data-science-predict-future-sales.zip -d data/

# 2. Install
pip install -r requirements.txt

# 3. Train (builds cache + submission)
python -m src.train

# 4. Submit
kaggle competitions submit -c competitive-data-science-predict-future-sales \
  -f submissions/submission.csv -m "LightGBM 47 features"
```

End-to-end runtime: ~5 minutes on an M-series Mac; expect longer on typical cloud CPU instances.


Tech Stack

| Layer | Tools |
| --- | --- |
| Modelling | LightGBM 4.x (gradient boosting, native categorical handling) |
| Feature engineering | pandas, numpy, hand-rolled lag/mean encodings |
| Persistence | Parquet cache for the 8M-row feature matrix |
| Validation | Time-aware split: train < block 33, val = block 33, test = block 34 |
| Visualization | matplotlib |
| Testing | pytest |
| Reproducibility | conda + pinned requirements |
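The time-aware split above is a one-liner per partition: train on months before block 33, validate on block 33, predict block 34 (November 2015). A sketch on a toy frame (the real feature matrix has ~8M rows):

```python
import pandas as pd

# Toy feature matrix; date_block_num is the month index used by Kaggle.
df = pd.DataFrame({
    "date_block_num": [30, 31, 32, 33, 33, 34],
    "item_cnt_month": [1.0, 0.0, 2.0, 3.0, 1.0, float("nan")],
})

train = df[df["date_block_num"] < 33]    # all history before the val month
val   = df[df["date_block_num"] == 33]   # last labelled month
test  = df[df["date_block_num"] == 34]   # the month to predict
print(len(train), len(val), len(test))   # 3 2 1
```

A random row-level split would leak future months into training, which is exactly the "validation was lying" failure the experiment log starts from.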

Project Structure

```text
.
├── src/
│   ├── train.py                  # full pipeline: features → LightGBM → submission
│   └── utils.py                  # data loading, shop dedup, RMSE, submission writer
├── notebooks/                    # exploration: 01_eda, 02_baseline (full outputs preserved)
├── data/                         # gitignored; see data/README.md for setup
├── submissions/
│   └── submission.csv            # the LB 0.87399 prediction file
├── reports/figures/              # charts shown in this README
├── tests/                        # pytest smoke tests
├── requirements.txt
├── README.md
└── LICENSE
```

License

MIT

About

Kaggle Predict Future Sales — Public LB 0.87399 / Private 0.88034. LightGBM with 47 hand-engineered features. Documents 10 attempts including 8 rejected experiments.
