Time-series sales forecasting on the Kaggle Predict Future Sales competition (1C Russia, 60 shops × 22K items). LightGBM with hand-engineered lag features, hierarchical mean encodings, and momentum signals.
Final result: Public LB 0.87399 · Private LB 0.88034 (RMSE, target clipped to [0, 20]). Improvement over session baseline: −13.7 % RMSE (1.013 → 0.87399).
- Public LB 0.87399 · Private 0.88034 — single LightGBM, 47 features, 751 trees
- Validation fix — filtered-val RMSE of 0.7502 lied against LB 1.013; full block-33 validation scores 0.7933 RMSE and tracks the LB within ±0.01
- 10 attempts logged: best 0.87399, worst 1.46139 — 2 wins, 5 worse, 2 broken (groupby.apply bugs)
- Negative findings quantified: Optuna HPO −0.027 LB · multi-seed avg −0.010 · 3-model stack −0.014
- Pipeline: 8 M training rows, ~5 min end-to-end on M-series Mac
- 9 unit tests in CI; reproducible with `python -m src.train`
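For reference, the clipped-RMSE metric quoted throughout (targets and predictions clipped to [0, 20]) can be sketched as follows; the function name is illustrative, not from the repo:

```python
import numpy as np

def clipped_rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Competition metric: RMSE after clipping both sides to [0, 20]."""
    y_true = np.clip(y_true, 0, 20)
    y_pred = np.clip(y_pred, 0, 20)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# A raw count of 30 clips to 20, so only the first element contributes error:
clipped_rmse(np.array([0.0, 5.0, 30.0]), np.array([1.0, 5.0, 20.0]))  # ≈ 0.5774
```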
The path was not linear. After fixing the validation, the cleaned baseline (30 features, native categorical handling, clipped lags) already beat the starting point by 0.135. Adding lag-12 (year-over-year November signal) plus trend/momentum features pushed it further to 0.87399. Optuna, multi-seed averaging, stacking, and rolling features all hurt in this run — captured below as failure cases.
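The lag features themselves can be sketched as a merge against a copy of the grid with `date_block_num` shifted forward, which stays correct even when a (shop, item) pair skips months, unlike a positional `groupby().shift()`. `add_lag` is an illustrative helper and the momentum formulas are assumptions, not the repo's exact definitions:

```python
import pandas as pd

def add_lag(df: pd.DataFrame, col: str, lag: int) -> pd.DataFrame:
    """Attach `col` from `lag` months earlier for the same (shop, item) pair."""
    shifted = df[["date_block_num", "shop_id", "item_id", col]].copy()
    shifted["date_block_num"] += lag  # value from month m answers for month m + lag
    shifted = shifted.rename(columns={col: f"{col}_lag{lag}"})
    return df.merge(shifted, on=["date_block_num", "shop_id", "item_id"], how="left")

def add_momentum(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical trend/momentum signals built from the lags above."""
    df["delta_lag1_lag2"] = df["item_cnt_month_lag1"] - df["item_cnt_month_lag2"]
    df["yoy_ratio"] = df["item_cnt_month_lag1"] / (df["item_cnt_month_lag12"] + 1)
    return df
```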
```mermaid
flowchart TB
    A[sales_train.csv<br/>2.9M daily rows] --> B[Clean<br/>outliers, shop dedup, clip 0–20]
    B --> C[Cartesian grid<br/>shop × item × month]
    C --> D[Features<br/>lags 1/2/3/6/12 · hierarchical mean<br/>encodings · trend / momentum]
    D --> E[LightGBM<br/>47 features · 751 trees<br/>native categorical]
    E --> F[Predict + clip<br/>submission.csv]
```
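Step C — the Cartesian grid — can be sketched in pandas. `build_grid` is an assumed name; the column names (`date_block_num`, `shop_id`, `item_id`, `item_cnt_day`) follow the competition's schema:

```python
from itertools import product

import pandas as pd

def build_grid(sales: pd.DataFrame) -> pd.DataFrame:
    """Cross every shop active in a month with every item sold that month,
    so pairs with no sales become explicit zero rows (the repo additionally
    clips the target to [0, 20])."""
    frames = []
    for block, g in sales.groupby("date_block_num"):
        frames.append(pd.DataFrame(
            list(product([block], g["shop_id"].unique(), g["item_id"].unique())),
            columns=["date_block_num", "shop_id", "item_id"],
        ))
    grid = pd.concat(frames, ignore_index=True)
    monthly = (sales.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"]
                    .sum().rename("item_cnt_month").reset_index())
    return grid.merge(monthly, on=["date_block_num", "shop_id", "item_id"],
                      how="left").fillna({"item_cnt_month": 0})
```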
| Metric | Value |
|---|---|
| Public LB RMSE | 0.87399 |
| Private LB RMSE | 0.88034 |
| Validation RMSE (block 33, all pairs) | 0.7933 |
| Best LightGBM iteration | 751 |
| Number of features | 47 |
| Training rows | 8.0 M |
| Validation rows | 214 K |
| Test rows | 214 K |
| Training time (M-series Mac) | ~5 min |
`item_cnt_month_lag1` dominates — last month's sales count is by far the strongest predictor. The next tier comprises mean encodings (`lag_mean`, `item_mean_lag1`) and identity features (`item_category_id`, `item_id`). The full importance table is at `reports/figures/feature_importance.csv`.
| # | Approach | Features | Public LB | vs baseline | Verdict |
|---|---|---|---|---|---|
| start | Old quick model with filtered validation | 45 | 1.013 | — | Validation was lying |
| 1 | KubaMichalczyk-style + zeroing masks | 48 | > 1.0 | worse | digital_mask zeroed live items |
| 2 | + RMSE-optimal calibration offset | 48 | > 1.0 | worse | Bias fix doesn't help with regime mismatch |
| 3 | Cleaned baseline (honest val + clip lags) | 30 | 0.87836 | −13.3 % | ✅ Honest val, native categorical, clip lags |
| 4 | + lag_6, lag_12, trend, momentum | 47 | 0.87399 | −13.7 % | ✅ Winner: YoY signal + trend features |
| 5 | + seasonal / last-sale via groupby.apply | 53 | 1.42878 | broken | groupby.apply().shift() corrupted index |
| 6 | + yoy_trend + lag_mean_6 | 49 | 0.87918 | worse | yoy_trend was just noise (lag_1 − lag_12) |
| 7 | Optuna HPO (30 trials on val block 33) | 47 | 0.90132 | worse | Tuned for October artefacts, fails on November |
| 8 | Multi-seed averaging (5 seeds) | 47 | 0.88407 | worse | Models too correlated (same params) |
| 9 | Stack v2: Optuna + baseline + XGBoost | 47 | 0.88810 | worse | Meta-learner trained on May–Oct, test = Nov |
| 10 | Stack v3: + rolling means + CatBoost | 50 | 1.46139 | broken | Rolling features misaligned (root cause not diagnosed) |
Best: attempt 4 (47 features). Improvement over session start: 1.013 → 0.87399 = −13.7 % RMSE.
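Attempts 5 and 10 both died on the same class of bug, which is worth pinning down with a minimal repro: `groupby.apply` can return a result whose index no longer matches the source frame (pandas prepends the group key), so assigning it back misaligns rows, while the grouped-Series `.shift()` preserves the original index. Column names here are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "shop":  [1, 1, 2, 2],
    "month": [0, 1, 0, 1],
    "sales": [10.0, 20.0, 30.0, 40.0],
})

# Safe: SeriesGroupBy.shift keeps the original row index,
# so the assignment aligns one-to-one with df.
df["sales_lag1"] = df.groupby("shop")["sales"].shift(1)

# Risky: apply concatenates per-group results with the group key
# prepended, producing a MultiIndex that no longer matches df —
# assigning this back is the attempt-5 style bug.
bad = df.groupby("shop", group_keys=True).apply(lambda g: g["sales"].shift(1))
print(type(bad.index).__name__)
```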
```bash
git clone https://github.com/Farxida/predict-future-sales
cd predict-future-sales

# 1. Get Kaggle data (see data/README.md)
kaggle competitions download -c competitive-data-science-predict-future-sales -p data/
unzip data/competitive-data-science-predict-future-sales.zip -d data/

# 2. Install
pip install -r requirements.txt

# 3. Train (builds cache + submission)
python -m src.train

# 4. Submit
kaggle competitions submit -c competitive-data-science-predict-future-sales \
    -f submissions/submission.csv -m "LightGBM 47 features"
```

End-to-end runtime: ~5 minutes on an M-series Mac, longer in the cloud.
| Layer | Tools |
|---|---|
| Modelling | LightGBM 4.x (gradient boosting, native categorical handling) |
| Feature engineering | pandas, numpy, hand-rolled lag/mean encodings |
| Persistence | Parquet cache for the 8M-row feature matrix |
| Validation | Time-aware split: train < block 33, val = block 33, test = block 34 |
| Visualization | matplotlib |
| Testing | pytest |
| Reproducibility | conda + pinned requirements |
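The split in the table maps directly onto `date_block_num` (month index 0–34 in the competition data); a minimal sketch with the block numbers hard-coded as in this project:

```python
import pandas as pd

def time_split(df: pd.DataFrame):
    """Time-aware split: months 0-32 train, month 33 (October 2015)
    validation, month 34 (November 2015) is the month to forecast."""
    train = df[df["date_block_num"] < 33]
    val = df[df["date_block_num"] == 33]
    test = df[df["date_block_num"] == 34]
    return train, val, test
```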
Project Structure
```text
.
├── src/
│   ├── train.py          # full pipeline: features → LightGBM → submission
│   └── utils.py          # data loading, shop dedup, RMSE, submission writer
├── notebooks/            # exploration: 01_eda, 02_baseline (full outputs preserved)
├── data/                 # gitignored; see data/README.md for setup
├── submissions/
│   └── submission.csv    # the LB 0.87399 prediction file
├── reports/figures/      # charts shown in this README
├── tests/                # pytest smoke tests
├── requirements.txt
├── README.md
└── LICENSE
```
MIT


