fix(tuning): cap LightGBM search space for thin datasets to prevent memorisation #321
Open
drussellmrichie wants to merge 1 commit into larsiusprime:master from drussellmrichie:fix/tuning-thin-dataset-search-space
Conversation
fix(tuning): cap LightGBM search space for thin datasets

With small model groups (e.g. <200 training samples), the Optuna tuner can select num_leaves values in the thousands, which causes severe memorisation: CV MAPE looks artificially low while out-of-sample performance collapses. For example, with ~101 training samples the tuner found num_leaves=1514, degrading ratio-study COD from ~40 to ~55.

Fix: before building the search space, compute n_train_per_fold ≈ n * (k-1)/k and cap num_leaves at max(8, n_train_per_fold // 4) and min_data_in_leaf at max(2, n_train_per_fold // 4). For large datasets (n >> 8192) the caps are above the original upper bounds and have no effect; for thin datasets they prevent the tuner from selecting tree complexities that cannot generalise. A verbose warning is printed when the cap takes effect.
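A minimal sketch of the capped search-space construction the commit message describes, assuming k-fold CV and Optuna's `suggest_int` API (the function name, lower bounds, and warning text are illustrative, not the PR's exact code):

```python
import optuna

def capped_lgbm_space(trial: optuna.Trial, n: int, k: int = 5, verbose: bool = False) -> dict:
    """Scale LightGBM tree-complexity bounds to the dataset size."""
    # Each CV fold trains on roughly (k-1)/k of the n samples.
    n_train_per_fold = n * (k - 1) // k

    # Cap complexity so each leaf averages at least ~4 samples, clamped to
    # the original ceilings (2048 and 500) so large datasets keep the
    # full search space.
    num_leaves_hi = min(2048, max(8, n_train_per_fold // 4))
    min_data_hi = min(500, max(2, n_train_per_fold // 4))

    if verbose and (num_leaves_hi < 2048 or min_data_hi < 500):
        print(
            f"Thin dataset (n={n}): capping num_leaves <= {num_leaves_hi}, "
            f"min_data_in_leaf <= {min_data_hi}"
        )

    return {
        "num_leaves": trial.suggest_int("num_leaves", 8, num_leaves_hi),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 2, min_data_hi),
    }
```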
Contributor
Thank you for your contribution. I affirm that this contributor has signed the CLA.

Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
drussellmrichie pushed a commit to drussellmrichie/openavmkit that referenced this pull request on Apr 15, 2026
Commits all patches from C:\projects\philly_open_avmkit\patches\ as a persistent local commit so they survive branch switches. Previously these patches existed only as working-directory edits and were lost when the fix/tuning-thin-dataset-search-space PR branch was checked out cleanly.

Patches applied:
- benchmark.py: _SMRContribContext/_DS classes; do_contributions threading; _write_model_results slim pkl + _model_features.json sidecar; per-group ind_vars override via group_overrides
- data.py: cKDTree import fix (scipy >= 1.14); astype(int) cast after .loc
- modeling.py: positional reset_index + concat to avoid a 421k^2 cartesian join
- pipeline.py: finalize_models run_* params; compute_model_contributions(); two-checkpoint SHAP resume flow
- sales_scrutiny_study.py: astype(str) on model_group before concatenation
- shap_analysis.py: numpy array truth-value fix; missing-feature warning + filter
- tuning.py: thin-dataset guard + dataset-scaled search-space caps (also in fbcad5a as upstream PR larsiusprime#321)
- utilities/cache.py: ArrowExtensionArray .sum() fix via .eq() + int()
- utilities/stats.py: median-impute NaN before sklearn/statsmodels fits
Problem
When a model group has very few training samples (e.g. <200), the Optuna tuner can select `num_leaves` values in the thousands. With ~101 training samples and `num_leaves=1514`, LightGBM effectively memorises the training folds: CV MAPE looks fine because the model can overfit each fold's tiny training set, but out-of-sample performance collapses.

Concrete example from Philadelphia AVM work: `residential_mf_large` has ~101 training sales. The tuner found `num_leaves=1514`, which degraded the ratio-study COD from ~40 to ~55 (the IAAO standard is ≤ 20 for residential). Manually fixing `num_leaves=15` recovered COD to ~40.

The same issue can affect `min_data_in_leaf`: with a range of `[20, 500]`, the upper bound can exceed the entire training-fold size, making every candidate split illegal.
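To make the failure arithmetic concrete, a quick back-of-envelope check (k = 5 folds is an assumption; the PR does not state the fold count here):

```python
n = 101                               # training sales in residential_mf_large
k = 5                                 # assumed CV fold count
n_train_per_fold = n * (k - 1) // k   # 80 samples per training fold

# num_leaves=1514 gives far more leaves than samples: every training row
# can occupy its own leaf, so the fold is memorised rather than modelled.
print(1514 > n_train_per_fold)        # True

# A min_data_in_leaf drawn near the top of [20, 500] exceeds the whole
# training fold, so no candidate split can satisfy the constraint.
print(500 > n_train_per_fold)         # True
```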
Fix

Before constructing the Optuna search space, compute the approximate training-fold size:
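A sketch of that computation, where `n` is the group's training-sample count and `k` the number of CV folds (variable names are assumptions; the actual snippet lives in tuning.py):

```python
# Each of the k CV folds trains on roughly (k-1)/k of the n samples.
n_train_per_fold = n * (k - 1) // k

# Cap tree complexity: each leaf must average at least ~4 samples,
# clamped to the original upper bounds (2048 and 500).
num_leaves_max = min(2048, max(8, n_train_per_fold // 4))
min_data_in_leaf_max = min(500, max(2, n_train_per_fold // 4))
```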
The `// 4` rule means each leaf covers at least ~4 samples on average, a conservative but reasonable floor for a regression tree. Both upper bounds are clamped to the original maximums (2048 and 500), so large datasets are unaffected.

A diagnostic `print` is emitted under `verbose=True` when the cap takes effect.

Behaviour at key dataset sizes
For n ≥ ~2560 the `num_leaves` cap has no effect (it reaches the original 2048 ceiling). For n ≥ ~2000 the `min_data_in_leaf` cap reaches 500 (the original ceiling).
- `num_leaves` in saved params is ≤ `n // 5`

🤖 Generated with Claude Code
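A hypothetical unit test mirroring the test-plan check above (with k = 5 folds, the `n_train_per_fold // 4` cap reduces to roughly `n // 5`; the test name and re-derived cap are illustrative, not the repository's actual tests):

```python
def test_num_leaves_capped_for_thin_dataset():
    # Thin group from the PR description: ~101 training sales, k=5 CV folds.
    n, k = 101, 5
    n_train_per_fold = n * (k - 1) // k            # 80
    num_leaves_max = min(2048, max(8, n_train_per_fold // 4))
    # The saved num_leaves can never exceed the cap, which is ~n // 5.
    assert num_leaves_max <= n // 5                # 20 <= 20
```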