Time-series sales forecasting on the Kaggle Predict Future Sales competition (1C Russia, 60 shops × 22K items). LightGBM with hand-engineered lag features, hierarchical mean encodings, and momentum signals.
Final result: Public LB 0.87399 · Private LB 0.88034 (RMSE, target clipped to [0, 20]). Improvement over session baseline: −13.7 % RMSE (1.013 → 0.87399).
- Public LB 0.87399 · Private 0.88034 — single LightGBM, 47 features, 751 trees
- Validation fix — filtered-val RMSE of 0.7502 lied against LB 1.013; full block-33 validation scores 0.7933 RMSE and tracks the LB within ±0.01
- 10 attempts logged: best 0.87399, worst 1.46139 — 2 wins, 5 worse, 2 broken (groupby.apply bugs)
- Negative findings quantified: Optuna HPO −0.027 LB · multi-seed avg −0.010 · 3-model stack −0.014
- Pipeline: 8 M training rows, ~5 min end-to-end on M-series Mac
- 9 unit tests in CI; reproducible with `python -m src.train`
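For reference, the clipped-RMSE metric quoted throughout (targets and predictions clipped to [0, 20]) can be sketched as follows; the function name is illustrative, not from the repo:

```python
import numpy as np

def clipped_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Competition metric: RMSE after clipping both sides to [0, 20]."""
    y_true = np.clip(y_true, 0, 20)
    y_pred = np.clip(y_pred, 0, 20)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# A raw count of 30 clips to 20, so only the first element contributes error:
clipped_rmse(np.array([0.0, 5.0, 30.0]), np.array([1.0, 5.0, 20.0]))  # ≈ 0.5774
```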
The path was not linear. After fixing the validation, the cleaned baseline (30 features, native categorical handling, clipped lags) already beat the starting point by 0.135. Adding lag-12 (year-over-year November signal) plus trend/momentum features pushed it further to 0.87399. Optuna, multi-seed averaging, stacking, and rolling features all hurt in this run — captured below as failure cases.
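The lag features themselves can be sketched as a merge against a copy of the grid with `date_block_num` shifted forward, which stays correct even when a (shop, item) pair skips months, unlike a positional `groupby().shift()`. `add_lag` is an illustrative helper and the momentum formulas are assumptions, not the repo's exact definitions:

```python
import pandas as pd

def add_lag(df: pd.DataFrame, col: str, lag: int) -> pd.DataFrame:
    """Attach `col` from `lag` months earlier for the same (shop, item) pair."""
    shifted = df[["date_block_num", "shop_id", "item_id", col]].copy()
    shifted["date_block_num"] += lag  # value from month m answers for month m + lag
    shifted = shifted.rename(columns={col: f"{col}_lag{lag}"})
    return df.merge(shifted, on=["date_block_num", "shop_id", "item_id"], how="left")

def add_momentum(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical trend/momentum signals built from the lags above."""
    df["delta_lag1_lag2"] = df["item_cnt_month_lag1"] - df["item_cnt_month_lag2"]
    df["yoy_ratio"] = df["item_cnt_month_lag1"] / (df["item_cnt_month_lag12"] + 1)
    return df
```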
```mermaid
flowchart TB
    A[sales_train.csv<br/>2.9M daily rows] --> B[Clean<br/>outliers, shop dedup, clip 0–20]
    B --> C[Cartesian grid<br/>shop × item × month]
    C --> D[Features<br/>lags 1/2/3/6/12 · hierarchical mean<br/>encodings · trend / momentum]
    D --> E[LightGBM<br/>47 features · 751 trees<br/>native categorical]
    E --> F[Predict + clip<br/>submission.csv]
```
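Step C — the Cartesian grid — can be sketched in pandas. `build_grid` is an assumed name; the column names (`date_block_num`, `shop_id`, `item_id`, `item_cnt_day`) follow the competition's schema:

```python
from itertools import product

import pandas as pd

def build_grid(sales: pd.DataFrame) -> pd.DataFrame:
    """Cross every shop active in a month with every item sold that month,
    so pairs with no sales become explicit zero rows (the repo additionally
    clips the target to [0, 20])."""
    frames = []
    for block, g in sales.groupby("date_block_num"):
        frames.append(pd.DataFrame(
            list(product([block], g["shop_id"].unique(), g["item_id"].unique())),
            columns=["date_block_num", "shop_id", "item_id"],
        ))
    grid = pd.concat(frames, ignore_index=True)
    monthly = (sales.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"]
                    .sum().rename("item_cnt_month").reset_index())
    return grid.merge(monthly, on=["date_block_num", "shop_id", "item_id"],
                      how="left").fillna({"item_cnt_month": 0})
```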
| Metric | Value |
|---|---|
| Public LB RMSE | 0.87399 |
| Private LB RMSE | 0.88034 |
| Validation RMSE (block 33, all pairs) | 0.7933 |
| Best LightGBM iteration | 751 |
| Number of features | 47 |
| Training rows | 8.0 M |
| Validation rows | 214 K |
| Test rows | 214 K |
| Training time (M-series Mac) | ~5 min |
`item_cnt_month_lag1` dominates — last month's sales count is by far the strongest predictor. The next tier comprises mean encodings (`lag_mean`, `item_mean_lag1`) and identity features (`item_category_id`, `item_id`). The full importance table is at `reports/figures/feature_importance.csv`.
| # | Approach | Features | Public LB | vs baseline | Verdict |
|---|---|---|---|---|---|
| start | Old quick model with filtered validation | 45 | 1.013 | — | Validation was lying |
| 1 | KubaMichalczyk-style + zeroing masks | 48 | > 1.0 | worse | digital_mask zeroed live items |
| 2 | + RMSE-optimal calibration offset | 48 | > 1.0 | worse | Bias fix doesn't help with regime mismatch |
| 3 | Cleaned baseline (honest val + clip lags) | 30 | 0.87836 | −13.3 % | ✅ Honest val, native categorical, clip lags |
| 4 | + lag_6, lag_12, trend, momentum | 47 | 0.87399 | −13.7 % | ✅ Winner: YoY signal + trend features |
| 5 | + seasonal / last-sale via groupby.apply | 53 | 1.42878 | broken | groupby.apply().shift() corrupted index |
| 6 | + yoy_trend + lag_mean_6 | 49 | 0.87918 | worse | yoy_trend was just noise (lag_1 − lag_12) |
| 7 | Optuna HPO (30 trials on val block 33) | 47 | 0.90132 | worse | Tuned for October artefacts, fails on November |
| 8 | Multi-seed averaging (5 seeds) | 47 | 0.88407 | worse | Models too correlated (same params) |
| 9 | Stack v2: Optuna + baseline + XGBoost | 47 | 0.88810 | worse | Meta-learner trained on May–Oct, test = Nov |
| 10 | Stack v3: + rolling means + CatBoost | 50 | 1.46139 | broken | Rolling features misaligned (root cause not diagnosed) |
Best: attempt 4 (47 features). Improvement over session start: 1.013 → 0.87399 = −13.7 % RMSE.
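Attempts 5 and 10 both died on the same class of bug, which is worth pinning down with a minimal repro: `groupby.apply` can return a result whose index no longer matches the source frame (pandas prepends the group key), so assigning it back misaligns rows, while the grouped-Series `.shift()` preserves the original index. Column names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "shop":  [1, 1, 2, 2],
    "month": [0, 1, 0, 1],
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# Safe: SeriesGroupBy.shift keeps the original row index,
# so the assignment aligns one-to-one with df.
df["sales_lag1"] = df.groupby("shop")["sales"].shift(1)

# Risky: apply concatenates per-group results with the group key
# prepended, producing a MultiIndex that no longer matches df —
# assigning this back is the attempt-5 style bug.
bad = df.groupby("shop", group_keys=True).apply(lambda g: g["sales"].shift(1))
print(type(bad.index).__name__)
```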
```bash
git clone https://github.com/Farxida/predict-future-sales
cd predict-future-sales

# 1. Get Kaggle data (see data/README.md)
kaggle competitions download -c competitive-data-science-predict-future-sales -p data/
unzip data/competitive-data-science-predict-future-sales.zip -d data/

# 2. Install
pip install -r requirements.txt

# 3. Train (builds cache + submission)
python -m src.train

# 4. Submit
kaggle competitions submit -c competitive-data-science-predict-future-sales \
    -f submissions/submission.csv -m "LightGBM 47 features"
```

End-to-end runtime: ~5 minutes on an M-series Mac, longer in the cloud.
| Layer | Tools |
|---|---|
| Modelling | LightGBM 4.x (gradient boosting, native categorical handling) |
| Feature engineering | pandas, numpy, hand-rolled lag/mean encodings |
| Persistence | Parquet cache for the 8M-row feature matrix |
| Validation | Time-aware split: train < block 33, val = block 33, test = block 34 |
| Visualization | matplotlib |
| Testing | pytest |
| Reproducibility | conda + pinned requirements |
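The split in the table maps directly onto `date_block_num` (month index 0–34 in the competition data); a minimal sketch with the block numbers hard-coded as in this project:

```python
import pandas as pd

def time_split(df: pd.DataFrame):
    """Time-aware split: months 0-32 train, month 33 (October 2015)
    validation, month 34 (November 2015) is the month to forecast."""
    train = df[df["date_block_num"] < 33]
    val = df[df["date_block_num"] == 33]
    test = df[df["date_block_num"] == 34]
    return train, val, test
```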
Project Structure
```text
.
├── src/
│   ├── train.py          # full pipeline: features → LightGBM → submission
│   └── utils.py          # data loading, shop dedup, RMSE, submission writer
├── notebooks/            # exploration: 01_eda, 02_baseline (full outputs preserved)
├── data/                 # gitignored; see data/README.md for setup
├── submissions/
│   └── submission.csv    # the LB 0.87399 prediction file
├── reports/figures/      # charts shown in this README
├── tests/                # pytest smoke tests
├── requirements.txt
├── README.md
└── LICENSE
```
MIT


