
Adds support to specify categorical features in lgbm learner #197

Open · wants to merge 8 commits into master from add-lgbm-categorical-features-support

Conversation

@fberanizo fberanizo commented May 20, 2022

Status

READY

Todo list

  • Documentation
  • Tests added and passed

Background context

Older versions of LightGBM used to allow passing `categorical_feature` through the `params` argument, but recent versions raise the following warning:

UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))

It is unclear (to me) whether the warning is merely misleading and the option still works. A LightGBM contributor said that the correct way to pass categorical features is through the Dataset object.

Description of the changes proposed in the pull request

Adds a new option, `categorical_features`, to `lgbm_classification_learner`. It accepts a list of column names that should be treated as categorical features. Further instructions can be found in https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst

LightGBM can offer good accuracy when using its native categorical features.
Unlike simple one-hot encoding, LightGBM can find the optimal split of categorical features.
Such an optimal split can provide much better accuracy than a one-hot encoding solution.

You can learn about this option in:
https://github.com/microsoft/LightGBM/blob/master/docs/Advanced-Topics.rst#categorical-feature-support
https://github.com/Microsoft/LightGBM/blob/v3.3.1/docs/Parameters.rst
The new parameter's type is `Union[List[str], str]` (either a list of column names or the string `'auto'`).
@fberanizo fberanizo force-pushed the add-lgbm-categorical-features-support branch from 63961f4 to bf4dc15 Compare May 20, 2022 01:09
@codecov-commenter

Codecov Report

Merging #197 (bf4dc15) into master (aeaa36c) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #197   +/-   ##
=======================================
  Coverage   94.24%   94.24%           
=======================================
  Files          32       32           
  Lines        1928     1928           
  Branches      258      258           
=======================================
  Hits         1817     1817           
  Misses         76       76           
  Partials       35       35           
Impacted Files Coverage Δ
src/fklearn/training/classification.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fberanizo fberanizo marked this pull request as ready for review May 20, 2022 01:16
@fberanizo fberanizo requested a review from a team as a code owner May 20, 2022 01:16
@fberanizo fberanizo changed the title Add lgbm categorical features support Adds support to specify categorical features in lgbm learner May 20, 2022
src/fklearn/training/classification.py (review comments resolved)
tests/training/test_classification.py (review comments resolved)
Previously, the source was the underlying numpy.array, but in order
to allow categorical_feature='auto' we need to pass a DataFrame.
@fberanizo fberanizo requested a review from jmoralez May 20, 2022 22:36
jmoralez previously approved these changes May 23, 2022
@fberanizo (Member Author)

I'm worried that the change below might alter the behavior of existing models (due to the use of pandas + categorical_feature='auto' setting).

Is this still a reasonable change?

- dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                       silent=True)
+ dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+                       silent=True, categorical_feature=categorical_features)

@jmoralez

I'd say this change is for the better because it's the same behavior as using LightGBM directly. However, taking a closer look at the code, I see that when predicting, the `values` attribute is used as well:

col_dict = {prediction_column: bst.predict(new_df[features].values)}

pred = clf.predict_proba(new_df[features].values)

So it'll definitely cause some headaches if we don't change it there as well. Yet for SHAP the dataframe is used:
shap_values = list(explainer.shap_values(new_df[features]))

I think it'd be best to use the DataFrame everywhere so as not to surprise the user, and using `values` isn't always more efficient than the DataFrame. Also, the DataFrame allows using the categorical features in their "raw" form, i.e. if we leave the `.values` there, the user will always have to convert them to integer codes.
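The point about `.values` can be seen directly in pandas (toy columns, not from this PR): on a mixed-dtype frame, `.values` falls back to a `dtype=object` ndarray and silently drops the categorical dtype that LightGBM needs, while passing the DataFrame preserves it:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red"]),
    "size": [1.0, 2.0, 3.0],
})

# .values on a mixed-dtype frame produces an object ndarray,
# losing the categorical dtype information
arr = df[["color", "size"]].values
print(arr.dtype)  # object

# The DataFrame itself keeps the categorical dtype intact
print(df["color"].dtype)  # category
```

This is why replacing `df[features].values` with `df[features]` in both training and prediction keeps the two code paths consistent.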

@jmoralez jmoralez self-requested a review May 30, 2022 19:16
tiagorm previously approved these changes Jul 21, 2022
@tiagorm tiagorm self-requested a review July 21, 2022 14:59
@tiagorm commented Jul 21, 2022

I'm worried that the change below might alter the behavior of existing models (due to the use of pandas + categorical_feature='auto' setting). Is this still a reasonable change?

- dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                       silent=True)
+ dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+                       silent=True, categorical_feature=categorical_features)

Agree with the above comments; this change itself looks good, but it is better to review the `.values` usages first.

Uses the DataFrame everywhere possible.
@fberanizo fberanizo dismissed stale reviews from tiagorm and jmoralez via 5a498d9 August 29, 2022 23:57
@fberanizo fberanizo added the review-request Waiting to be reviewed label Aug 30, 2022
@fberanizo
Member Author

fberanizo commented Aug 30, 2022

Hi guys!
Sorry for the (very) late response.
Recently, @isphus1973 and @hellenlima reached out to me asking about this feature, and I finally spared some time to finish it.

The recent commits:

  • Added the same changes to lgbm_regression_learner
  • Replaced occurrences of df.values with df, as you suggested


@isphus1973 left a comment


Fantastic

Labels: review-request (Waiting to be reviewed)
5 participants