When I run a hedonic model with lightgbm on a dataset that contains categorical fields, I get this error:
Traceback (most recent call last):
  File "C:\pro_housing_pittsburgh\openavmkit\notebooks\pipeline\03-model.py", line 62, in
    results = from_checkpoint("3-model-02-finalize-models", finalize_models,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\pipeline.py", line 1156, in from_checkpoint
    return openavmkit.checkpoint.from_checkpoint(path, func, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\checkpoint.py", line 43, in from_checkpoint
    result = func(**params)
             ^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\pipeline.py", line 1677, in finalize_models
    openavmkit.benchmark.run_models(
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 937, in run_models
    mg_results = _run_models(
                 ^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 4418, in _run_models
    _run_hedonic_models(
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 3581, in _run_hedonic_models
    results = _predict_one_model(
              ^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\benchmark.py", line 1830, in _predict_one_model
    results = predict_lightgbm(ds, lgbm_model, timing, verbose)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\modeling.py", line 3378, in predict_lightgbm
    y_pred_test = safe_predict(
                  ^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\openavmkit\modeling.py", line 1732, in safe_predict
    return callable(X, **params)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\venv\Lib\site-packages\lightgbm\basic.py", line 4771, in predict
    return predictor.predict(
           ^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\venv\Lib\site-packages\lightgbm\basic.py", line 1162, in predict
    data = _data_from_pandas(
           ^^^^^^^^^^^^^^^^^^
  File "C:\pro_housing_pittsburgh\openavmkit\venv\Lib\site-packages\lightgbm\basic.py", line 855, in _data_from_pandas
    raise ValueError("train and valid dataset categorical_feature do not match.")
ValueError: train and valid dataset categorical_feature do not match.
I put some print statements in the code and found that my categorical variables are being passed to the predict() method as strings instead of Categoricals. This is introduced during the data split for hedonic regression. I believe we just need to run the _clean_categoricals() method for lightgbm as well as for catboost. I made this change locally and it seems to work:
    df_sales = _clean_categoricals(df_sales, fields_cat, settings)
I was previously running lightgbm and catboost in sequence. I recently removed catboost for performance reasons and started seeing this error at that point, so the issue may only occur when catboost is not also being run.
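To make the mismatch concrete, here is a minimal pandas-only sketch of the kind of dtype alignment described above. It is not openavmkit's `_clean_categoricals()`; the helper name `align_categoricals`, the column names, and the choice to take the category set from the training frame are all assumptions made for illustration. The idea is only that the frame passed to predict() should carry the same Categorical dtype that the training frame had.

```python
import pandas as pd


def align_categoricals(df_train, df_predict, fields_cat):
    """Cast each field in fields_cat to one shared Categorical dtype.

    Hypothetical helper for illustration only; it is NOT the
    _clean_categoricals() function from openavmkit.
    """
    df_train = df_train.copy()
    df_predict = df_predict.copy()
    for field in fields_cat:
        # Assumption: the training frame defines the valid category set.
        categories = sorted(df_train[field].dropna().astype(str).unique())
        dtype = pd.CategoricalDtype(categories=categories)
        for df in (df_train, df_predict):
            col = df[field]
            # Stringify non-missing values, keep missing values missing.
            col = col.where(col.isna(), col.astype(str))
            # Values not seen in training become NaN under this dtype.
            df[field] = col.astype(dtype)
    return df_train, df_predict


# Hypothetical data: "neighborhood" arrives as plain strings after a split.
train = pd.DataFrame({"neighborhood": ["a", "b", "a"], "sqft": [1000, 1200, 900]})
pred = pd.DataFrame({"neighborhood": ["b", "c"], "sqft": [1100, 1300]})

train_fixed, pred_fixed = align_categoricals(train, pred, ["neighborhood"])
print(train_fixed["neighborhood"].dtype == pred_fixed["neighborhood"].dtype)
```

A quick check such as printing `df.dtypes` for the categorical fields right before the predict() call is one way to confirm whether the columns are still Categoricals at that point.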
Source for the _clean_categoricals() line quoted above: openavmkit/openavmkit/benchmark.py, line 1154 in 9c82b27