fix(tuning): cap LightGBM search space for thin datasets to prevent memorisation #321
Open
drussellmrichie wants to merge 1 commit into larsiusprime:master from drussellmrichie:fix/tuning-thin-dataset-search-space
Conversation
fix(tuning): cap LightGBM search space for thin datasets

With small model groups (e.g. <200 training samples), the Optuna tuner can select num_leaves values in the thousands, which causes severe memorisation: CV MAPE looks artificially low while out-of-sample performance collapses. For example, with ~101 training samples the tuner found num_leaves=1514, degrading ratio-study COD from ~40 to ~55.

Fix: before building the search space, compute n_train_per_fold ≈ n * (k-1)/k and cap num_leaves at max(8, n_train_per_fold // 4) and min_data_in_leaf at max(2, n_train_per_fold // 4). For large datasets (n >> 8192) the caps are above the original upper bounds and have no effect; for thin datasets they prevent the tuner from selecting tree complexities that cannot generalise. A verbose warning is printed when the cap takes effect.
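A minimal sketch of the capped search-space construction the commit message describes, assuming k-fold CV and Optuna's `suggest_int` API (the function name, lower bounds, and warning text are illustrative, not the PR's exact code):

```python
import optuna

def capped_lgbm_space(trial: optuna.Trial, n: int, k: int = 5, verbose: bool = False) -> dict:
    """Scale LightGBM tree-complexity bounds to the dataset size."""
    # Each CV fold trains on roughly (k-1)/k of the n samples.
    n_train_per_fold = n * (k - 1) // k

    # Cap complexity so each leaf averages at least ~4 samples, clamped to
    # the original ceilings (2048 and 500) so large datasets keep the
    # full search space.
    num_leaves_hi = min(2048, max(8, n_train_per_fold // 4))
    min_data_hi = min(500, max(2, n_train_per_fold // 4))

    if verbose and (num_leaves_hi < 2048 or min_data_hi < 500):
        print(
            f"Thin dataset (n={n}): capping num_leaves <= {num_leaves_hi}, "
            f"min_data_in_leaf <= {min_data_hi}"
        )

    return {
        "num_leaves": trial.suggest_int("num_leaves", 8, num_leaves_hi),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 2, min_data_hi),
    }
```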
Contributor
Thank you for your contribution. I affirm that this contributor has signed the CLA.

Russell Richie seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
drussellmrichie pushed a commit to drussellmrichie/openavmkit that referenced this pull request on Apr 15, 2026
Commits all patches from C:\projects\philly_open_avmkit\patches\ as a persistent local commit so they survive branch switches. Previously these patches existed only as working-directory edits and were lost when the fix/tuning-thin-dataset-search-space PR branch was checked out cleanly.

Patches applied:
- benchmark.py: _SMRContribContext/_DS classes; do_contributions threading; _write_model_results slim pkl + _model_features.json sidecar; per-group ind_vars override via group_overrides
- data.py: cKDTree import fix (scipy >= 1.14); astype(int) cast after .loc
- modeling.py: positional reset_index + concat to avoid a 421k^2 cartesian join
- pipeline.py: finalize_models run_* params; compute_model_contributions(); two-checkpoint SHAP resume flow
- sales_scrutiny_study.py: astype(str) on model_group before concatenation
- shap_analysis.py: numpy array truth-value fix; missing-feature warning + filter
- tuning.py: thin-dataset guard + dataset-scaled search-space caps (also in fbcad5a as upstream PR larsiusprime#321)
- utilities/cache.py: ArrowExtensionArray .sum() fix via .eq() + int()
- utilities/stats.py: median-impute NaN before sklearn/statsmodels fits
Problem
When a model group has very few training samples (e.g. <200), the Optuna tuner can select `num_leaves` values in the thousands. With ~101 training samples and `num_leaves=1514`, LightGBM effectively memorises the training folds: CV MAPE looks fine because the model can overfit each fold's tiny training set, but out-of-sample performance collapses.

Concrete example from Philadelphia AVM work: `residential_mf_large` has ~101 training sales. The tuner found `num_leaves=1514`, which degraded the ratio-study COD from ~40 to ~55 (the IAAO standard is ≤ 20 for residential). Manually fixing `num_leaves=15` recovered COD to ~40.

The same issue can affect `min_data_in_leaf`: with a range of `[20, 500]`, the upper bound can exceed the entire training-fold size, making every candidate split illegal.
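To make the failure arithmetic concrete, a quick back-of-envelope check (k = 5 folds is an assumption; the PR does not state the fold count here):

```python
n = 101                               # training sales in residential_mf_large
k = 5                                 # assumed CV fold count
n_train_per_fold = n * (k - 1) // k   # 80 samples per training fold

# num_leaves=1514 gives far more leaves than samples: every training row
# can occupy its own leaf, so the fold is memorised rather than modelled.
print(1514 > n_train_per_fold)        # True

# A min_data_in_leaf drawn near the top of [20, 500] exceeds the whole
# training fold, so no candidate split can satisfy the constraint.
print(500 > n_train_per_fold)         # True
```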
Fix

Before constructing the Optuna search space, compute the approximate training-fold size:
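A sketch of that computation, where `n` is the group's training-sample count and `k` the number of CV folds (variable names are assumptions; the actual snippet lives in tuning.py):

```python
# Each of the k CV folds trains on roughly (k-1)/k of the n samples.
n_train_per_fold = n * (k - 1) // k

# Cap tree complexity: each leaf must average at least ~4 samples,
# clamped to the original upper bounds (2048 and 500).
num_leaves_max = min(2048, max(8, n_train_per_fold // 4))
min_data_in_leaf_max = min(500, max(2, n_train_per_fold // 4))
```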
The `// 4` rule means each leaf covers at least ~4 samples on average, a conservative but reasonable floor for a regression tree. Both upper bounds are clamped to the original maximums (2048 and 500), so large datasets are unaffected.

A diagnostic `print` is emitted under `verbose=True` when the cap takes effect.

Behaviour at key dataset sizes
For n ≥ ~2560 the `num_leaves` cap has no effect (it reaches the original 2048 ceiling). For n ≥ ~2000 the `min_data_in_leaf` cap reaches 500 (the original ceiling).
- `num_leaves` in saved params is ≤ `n // 5`

🤖 Generated with Claude Code
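A hypothetical unit test mirroring the test-plan check above (with k = 5 folds, the `n_train_per_fold // 4` cap reduces to roughly `n // 5`; the test name and re-derived cap are illustrative, not the repository's actual tests):

```python
def test_num_leaves_capped_for_thin_dataset():
    # Thin group from the PR description: ~101 training sales, k=5 CV folds.
    n, k = 101, 5
    n_train_per_fold = n * (k - 1) // k            # 80
    num_leaves_max = min(2048, max(8, n_train_per_fold // 4))
    # The saved num_leaves can never exceed the cap, which is ~n // 5.
    assert num_leaves_max <= n // 5                # 20 <= 20
```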