fix: skip non-numeric columns in calc_r2 #318

Open
drussellmrichie wants to merge 1 commit into larsiusprime:master from drussellmrichie:fix/calc-r2-non-numeric-columns
@drussellmrichie
Bug

calc_r2 in openavmkit/utilities/stats.py calls data[var].astype(float) on every variable in ind_vars without first checking whether the column is numeric. Any string/categorical column triggers:

ValueError: could not convert string to float: '100A'

This crash occurs in practice when ind_vars includes columns like luc (land use code) or bldg_com_struct (commercial structure type), which are legitimate categorical features for LightGBM models but cannot be cast to float for OLS R² computation.
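
The failure can be reproduced outside openavmkit with a few lines of pandas. This is a minimal sketch: the `luc` values below are made up for illustration and are not taken from any real dataset, and the exact wording of the error message may vary between pandas versions.

```python
import pandas as pd

# Hypothetical land-use-code column; the values are illustrative only.
luc = pd.Series(["100", "100A", "200"], name="luc")

# The dtype check that the proposed guard relies on.
is_numeric = pd.api.types.is_numeric_dtype(luc)

# The cast that calc_r2 currently performs on every variable.
try:
    luc.astype(float)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print("numeric dtype:", is_numeric)
print("cast error:", error_message)
```

The cast fails on the first value that cannot be parsed as a number, which is why a single alphanumeric code such as `'100A'` is enough to abort the whole loop.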

Fix

Add a pd.api.types.is_numeric_dtype guard immediately after the existing ill-posed-model check (the len(data) < 3 or nunique() < 2 block). Non-numeric variables now receive NaN R² / adj-R² / coef_sign and are skipped cleanly via continue — consistent with the function's existing error-handling pattern for ill-posed models.

Before / After

Before:

        if len(data) < 3 or data[var].nunique() < 2:
            ...
            continue  # skip ill-posed models

        X = sm.add_constant(data[var].astype(float), has_constant='add')
        # ^ raises ValueError for string columns

After:

        if len(data) < 3 or data[var].nunique() < 2:
            ...
            continue  # skip ill-posed models

        # Skip non-numeric columns (e.g. string categoricals); .astype(float) would raise ValueError
        if not pd.api.types.is_numeric_dtype(data[var]):
            results["variable"].append(var)
            results["r2"].append(float("nan"))
            results["adj_r2"].append(float("nan"))
            results["coef_sign"].append(float("nan"))
            continue

        X = sm.add_constant(data[var].astype(float), has_constant='add')
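
To see the guard in a runnable context, here is a simplified stand-in for the loop. It is a sketch, not the actual `calc_r2`: the function name `calc_r2_sketch`, its signature, and the use of `np.corrcoef` in place of the statsmodels fit are all assumptions made to keep the example self-contained. Only the `results` keys and the two skip conditions come from the snippet above.

```python
import numpy as np
import pandas as pd


def calc_r2_sketch(df, ind_vars, dep_var):
    """Hypothetical stand-in: single-predictor R² per variable, NaN when skipped."""
    results = {"variable": [], "r2": [], "adj_r2": [], "coef_sign": []}
    nan = float("nan")

    for var in ind_vars:
        data = df[[var, dep_var]].dropna()

        # Existing check: too few rows or a constant predictor.
        ill_posed = len(data) < 3 or data[var].nunique() < 2
        # Proposed check: string/categorical columns cannot be cast to float.
        non_numeric = not pd.api.types.is_numeric_dtype(data[var])

        if ill_posed or non_numeric:
            results["variable"].append(var)
            results["r2"].append(nan)
            results["adj_r2"].append(nan)
            results["coef_sign"].append(nan)
            continue

        x = data[var].astype(float).to_numpy()
        y = data[dep_var].astype(float).to_numpy()

        # For one predictor plus an intercept, R² equals the squared correlation.
        r = np.corrcoef(x, y)[0, 1]
        r2 = r ** 2
        n = len(data)
        adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 2)

        results["variable"].append(var)
        results["r2"].append(float(r2))
        results["adj_r2"].append(float(adj_r2))
        results["coef_sign"].append(float(np.sign(r)))  # slope sign matches r

    return pd.DataFrame(results)


# Illustrative data only; column names mirror the examples in this PR.
df = pd.DataFrame({
    "sqft": [1000, 1500, 2000, 2500],
    "luc": ["100A", "200", "100A", "300"],
    "price": [100.0, 150.0, 200.0, 250.0],
})
out = calc_r2_sketch(df, ["sqft", "luc"], "price")
print(out)
```

With the guard in place, the `luc` row comes back with NaN in every metric column while `sqft` is evaluated as before, so one categorical column no longer aborts the run.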

Notes

  • pd.api.types.is_numeric_dtype is already available (pandas is a core dependency).
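  • The NaN result for non-numeric vars is a reasonable outcome: a raw string column has no single-predictor OLS R² unless it is first encoded numerically (for example as dummy variables). Callers that rank variables by R² with pandas sort_values will see these rows last, since NaN sorts last by default (na_position='last').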
  • No behavior change for numeric columns.
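
Two of the notes above can be checked with plain pandas. The sketch below uses made-up R² values and a hypothetical column of numeric-looking strings; neither comes from openavmkit. It shows where NaN rows land when sorting, and one edge case that reviewers may want to consider: a string-typed column whose values all parse as numbers is not a numeric dtype, so the guard would skip it even though the cast would succeed.

```python
import pandas as pd

# Made-up ranking table; NaN stands for a skipped (non-numeric) variable.
ranking = pd.DataFrame({
    "variable": ["luc", "sqft", "age"],
    "r2": [float("nan"), 0.62, 0.18],
})

# sort_values places NaN last by default (na_position="last"),
# regardless of the sort direction.
order_desc = ranking.sort_values("r2", ascending=False)["variable"].tolist()
order_asc = ranking.sort_values("r2", ascending=True)["variable"].tolist()

# Edge case: numbers stored as strings.
numeric_strings = pd.Series(["1", "2", "3"])
guard_skips = not pd.api.types.is_numeric_dtype(numeric_strings)
cast_values = numeric_strings.astype(float).tolist()

print(order_desc, order_asc)
print("guard skips numeric-looking strings:", guard_skips, "cast gives:", cast_values)
```

If columns of numbers stored as strings are expected in practice, converting them upstream (for example with `pd.to_numeric`) before calling `calc_r2` would keep them in the ranking.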

calc_r2 called data[var].astype(float) on every variable without
checking whether the column is numeric. Any string/categorical column
(e.g. luc, bldg_com_struct) raised:

    ValueError: could not convert string to float: '100A'

Add a pd.api.types.is_numeric_dtype guard immediately after the
existing ill-posed-model check. Non-numeric vars now get NaN R² and
are skipped cleanly, consistent with the rest of the function's
error-handling pattern.
@github-actions
github-actions Bot commented Apr 1, 2026

Thank you for your contribution.
Please sign our CLA at the following link:
Click here to sign the CLA.
A maintainer will verify your signature and confirm it here by commenting with the following sentence:


I affirm that this contributor has signed the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.
