Issue with Hedonic model when running lightgbm with categorical variables #282

@connorschwartz

When I run a hedonic model with lightgbm on a dataset that contains categorical variables, I get this error:

Traceback (most recent call last):
  File "C:\pro_housing_pittsburgh\openavmkit\notebooks\pipeline\03-model.py", line 62, in <module>
    results = from_checkpoint("3-model-02-finalize-models", finalize_models,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\pipeline.py", line 1156, in from_checkpoint
    return openavmkit.checkpoint.from_checkpoint(path, func, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\checkpoint.py", line 43, in from_checkpoint
    result = func(**params)
             ^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\pipeline.py", line 1677, in finalize_models
    openavmkit.benchmark.run_models(
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 937, in run_models
    mg_results = _run_models(
                 ^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 4418, in _run_models
    _run_hedonic_models(
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 3581, in _run_hedonic_models
    results = _predict_one_model(
              ^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 1830, in _predict_one_model
    results = predict_lightgbm(ds, lgbm_model, timing, verbose)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\modeling.py", line 3378, in predict_lightgbm
    y_pred_test = safe_predict(
                  ^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\modeling.py", line 1732, in safe_predict
    return callable(X, **params)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\venv\Lib\site-packages\lightgbm\basic.py", line 4771, in predict
    return predictor.predict(
           ^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\venv\Lib\site-packages\lightgbm\basic.py", line 1162, in predict
    data = _data_from_pandas(
           ^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\venv\Lib\site-packages\lightgbm\basic.py", line 855, in _data_from_pandas
    raise ValueError("train and valid dataset categorical_feature do not match.")
ValueError: train and valid dataset categorical_feature do not match.

I added some print statements and found that my categorical variables reach the predict() method as strings instead of Categoricals. The change in type is introduced during the data split for hedonic regression. I believe we just need to run the _clean_categoricals() method for lightgbm as well as for catboost (I made this change locally and it seems to have worked):

df_sales = _clean_categoricals(df_sales, fields_cat, settings)

I was previously running lightgbm and catboost in sequence. I recently removed catboost for performance reasons and started seeing the error at that point, so the issue might only occur when catboost is not also being run.
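For anyone who wants to see the mismatch outside the pipeline, here is a minimal pandas-only sketch of what I think is going on. It is not the openavmkit implementation: `align_categoricals`, the `neighborhood` and `sqft` columns, and the toy data are all made up for illustration, and I have not checked what _clean_categoricals() does internally. The idea is simply that the frame passed to predict() should carry the same category dtype that the training frame had.

```python
import pandas as pd


def align_categoricals(df_train, df_predict, cat_fields):
    """Hypothetical helper (not part of openavmkit): cast each field in
    cat_fields of df_predict to the categorical dtype used in df_train."""
    df_predict = df_predict.copy()
    for field in cat_fields:
        train_dtype = df_train[field].dtype
        if not isinstance(train_dtype, pd.CategoricalDtype):
            raise TypeError(f"{field!r} is not categorical in the training frame")
        # Values not seen during training become NaN after the cast.
        df_predict[field] = df_predict[field].astype(train_dtype)
    return df_predict


# Training frame: the categorical column has a category dtype.
df_train = pd.DataFrame(
    {
        "neighborhood": pd.Categorical(["A", "B", "A"]),
        "sqft": [1000, 1500, 1200],
    }
)

# Prediction frame: the same column arrives as plain strings,
# which is the situation described in this issue.
df_predict = pd.DataFrame({"neighborhood": ["B", "A"], "sqft": [1100, 1300]})

fixed = align_categoricals(df_train, df_predict, ["neighborhood"])
print(df_predict["neighborhood"].dtype, "->", fixed["neighborhood"].dtype)
```

Reusing the training dtype (rather than calling `astype("category")` on the prediction frame alone) keeps the category list identical on both sides, which I assume is what the lightgbm check is comparing.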
