fix: impute NaN before variable-selection steps in stats.py #313

Open
drussellmrichie wants to merge 1 commit into larsiusprime:master from drussellmrichie:fix/stats-nan-imputation-variable-selection

Conversation

@drussellmrichie

Problem

sklearn (ElasticNet) and statsmodels (OLS/VIF) raise errors when input features contain NaN values. This is triggered when a dataset uses LightGBM native NaN-handling -- e.g. sparse binary indicators like has_garage where NaN means "not recorded" -- and the variable-selection pre-pass runs before LightGBM training.

Errors seen:

  • ValueError: Input X contains NaN (ElasticNet / sklearn)
  • MissingDataError: exog contains inf or nans (statsmodels OLS)

Fix

Add median imputation of NaN to the top of each of the four variable-selection functions:

  • calc_elastic_net_regularization
  • calc_p_values_recursive_drop
  • calc_t_values_recursive_drop
  • calc_vif_recursive_drop

Imputation is scoped to these pre-passes only: LightGBM training still receives the real NaN values and handles them natively at each split. A UserWarning is emitted listing the affected columns. Median imputation is intended as a neutral default for a variable-selection screen and is not expected to materially change which variables survive it.
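
For illustration only, the imputation step described above could look roughly like the sketch below. It is not the code from this PR: the helper name `impute_median_for_screen`, the dict-of-lists input, and the 0.0 fallback for all-NaN columns are assumptions made to keep the example dependency-free (the real functions presumably operate on DataFrames).

```python
import math
import statistics
import warnings


def impute_median_for_screen(columns):
    """Return a copy of `columns` with NaN replaced by each column's median.

    `columns` maps a feature name to a list of floats. The input is not
    modified, so the caller can still pass the original NaN values on to
    a model that handles them natively. (Hypothetical helper, not the PR's code.)
    """
    imputed = {}
    affected = []
    for name, values in columns.items():
        observed = [v for v in values if not math.isnan(v)]
        if len(observed) == len(values):
            imputed[name] = list(values)  # no NaN in this column
            continue
        affected.append(name)
        # An all-NaN column has no median; falling back to 0.0 is an assumption.
        fill = statistics.median(observed) if observed else 0.0
        imputed[name] = [fill if math.isnan(v) else v for v in values]
    if affected:
        # Tell the user which columns were changed for the screen.
        warnings.warn(
            f"Imputed NaN with column medians for variable selection: {affected}",
            UserWarning,
            stacklevel=2,
        )
    return imputed


features = {
    "sqft": [1200.0, 1500.0, float("nan"), 1800.0],
    "has_garage": [1.0, float("nan"), float("nan"), 1.0],
    "year_built": [1990.0, 2001.0, 1975.0, 2010.0],
}
# Only the screen sees imputed values; `features` keeps its NaN entries.
screen_input = impute_median_for_screen(features)
```

Because the helper returns a copy, `screen_input` can be passed to the variable-selection step while `features`, with its NaN entries intact, goes to LightGBM training.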

sklearn (ElasticNet) and statsmodels (OLS/VIF) raise errors when input
features contain NaN values. This is triggered in practice when a dataset
uses LightGBM's native NaN-handling (e.g. sparse binary indicators like
"has_garage" or "has_fireplace" where NaN means "not recorded") and runs
the variable-selection pre-pass before LightGBM training.

The fix adds median imputation of NaN to the top of each of the four
variable-selection functions:
  - calc_elastic_net_regularization
  - calc_p_values_recursive_drop
  - calc_t_values_recursive_drop
  - calc_vif_recursive_drop

Imputation is scoped to these pre-passes only: LightGBM training still
receives the real NaN values and handles them natively at each split.
A UserWarning is emitted listing the affected columns so that users are aware.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Contributor

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.
A maintainer will verify your signature and confirm it here by commenting with the following sentence:


I affirm that this contributor has signed the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
