
Adds support to specify categorical features in lgbm learner #197

Open · wants to merge 8 commits into master from add-lgbm-categorical-features-support

Conversation

@fberanizo fberanizo commented May 20, 2022

Status

READY

Todo list

  • Documentation
  • Tests added and passed

Background context

Older versions of LightGBM used to allow passing `categorical_feature` through the `params` argument, but recent versions raise the following warning:

UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
Please use categorical_feature argument of the Dataset constructor to pass this parameter.
  .format(key))

It is unclear (to me) whether the warning is merely misleading and the option still works. A LightGBM contributor said that the correct way to pass categorical features is through the Dataset object.

Description of the changes proposed in the pull request

Adds a new option, `categorical_features`, to `lgbm_classification_learner`. It accepts a list of column names that should be treated as categorical features. Further instructions can be found in https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst

LightGBM can offer good accuracy when using its native categorical features.
Unlike simple one-hot encoding, LightGBM can find the optimal split of categorical features.
Such an optimal split can provide much better accuracy than a one-hot encoding solution.

You can learn about this option in:
https://github.com/microsoft/LightGBM/blob/master/docs/Advanced-Topics.rst#categorical-feature-support
https://github.com/Microsoft/LightGBM/blob/v3.3.1/docs/Parameters.rst
The new parameter's type is `Union[List[str], str]` (either a list of column names or the string `'auto'`).
@fberanizo fberanizo force-pushed the add-lgbm-categorical-features-support branch from 63961f4 to bf4dc15 Compare May 20, 2022 01:09
@codecov-commenter

Codecov Report

Merging #197 (bf4dc15) into master (aeaa36c) will not change coverage.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master     #197   +/-   ##
=======================================
  Coverage   94.24%   94.24%           
=======================================
  Files          32       32           
  Lines        1928     1928           
  Branches      258      258           
=======================================
  Hits         1817     1817           
  Misses         76       76           
  Partials       35       35           
Impacted Files Coverage Δ
src/fklearn/training/classification.py 100.00% <100.00%> (ø)

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@fberanizo fberanizo marked this pull request as ready for review May 20, 2022 01:16
@fberanizo fberanizo requested a review from a team as a code owner May 20, 2022 01:16
@fberanizo fberanizo changed the title Add lgbm categorical features support Adds support to specify categorical features in lgbm learner May 20, 2022
src/fklearn/training/classification.py (review comments resolved)
tests/training/test_classification.py (review comments resolved)
Previously, the source was the underlying numpy.array, but in order
to allow categorical_feature='auto' we need to pass a DataFrame.
@fberanizo fberanizo requested a review from jmoralez May 20, 2022 22:36
jmoralez previously approved these changes May 23, 2022
@fberanizo (Member Author)

I'm worried that the change below might alter the behavior of existing models (due to the use of pandas + categorical_feature='auto' setting).

Is this still a reasonable change?

- dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                       silent=True)
+ dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+                       silent=True, categorical_feature=categorical_features)

@jmoralez

I'd say this change is for the better because it's the same behavior as using LightGBM directly. However, taking a closer look at the code, I see that when predicting, the `values` attribute is used as well:

col_dict = {prediction_column: bst.predict(new_df[features].values)}

pred = clf.predict_proba(new_df[features].values)

So it'll definitely cause some headaches if we don't change it there as well. Yet for SHAP the dataframe is used:
shap_values = list(explainer.shap_values(new_df[features]))

I think it'd be best to use the DataFrame everywhere so as not to surprise the user, and using `values` isn't always more efficient than the DataFrame. Also, the DataFrame allows using the categorical features in their "raw" form, i.e. if we leave the `.values` there, the user will always have to convert them to integer codes.
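The point about `.values` can be seen directly in pandas (toy columns, not from this PR): on a mixed-dtype frame, `.values` falls back to a `dtype=object` ndarray and silently drops the categorical dtype that LightGBM needs, while passing the DataFrame preserves it:

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column
df = pd.DataFrame({
    "color": pd.Categorical(["red", "green", "red"]),
    "size": [1.0, 2.0, 3.0],
})

# .values on a mixed-dtype frame produces an object ndarray,
# losing the categorical dtype information
arr = df[["color", "size"]].values
print(arr.dtype)  # object

# The DataFrame itself keeps the categorical dtype intact
print(df["color"].dtype)  # category
```

This is why replacing `df[features].values` with `df[features]` in both training and prediction keeps the two code paths consistent.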

@jmoralez jmoralez self-requested a review May 30, 2022 19:16
tiagorm previously approved these changes Jul 21, 2022
@tiagorm tiagorm self-requested a review July 21, 2022 14:59
@tiagorm commented Jul 21, 2022

I'm worried that the change below might alter the behavior of existing models (due to the use of pandas + categorical_feature='auto' setting). Is this still a reasonable change?

- dtrain = lgbm.Dataset(df[features].values, label=df[target], feature_name=list(map(str, features)), weight=weights,
-                       silent=True)
+ dtrain = lgbm.Dataset(df[features], label=df[target], feature_name=list(map(str, features)), weight=weights,
+                       silent=True, categorical_feature=categorical_features)

Agree with the above comments; this change itself looks good, but it is better to review the `.values` usages first.

Uses the DataFrame everywhere possible.
@fberanizo fberanizo dismissed stale reviews from tiagorm and jmoralez via 5a498d9 August 29, 2022 23:57
@fberanizo fberanizo added the review-request Waiting to be reviewed label Aug 30, 2022
@fberanizo
Member Author

fberanizo commented Aug 30, 2022

Hi guys!
Sorry for the (very) late response.
Recently, @isphus1973 and @hellenlima reached out to me asking about this feature, and I finally spared some time to finish it.

The recent commits:

  • Added the same changes to lgbm_regression_learner
  • Replaced occurrences of df.values with df, as you suggested


@isphus1973 left a comment


Fantastic

Labels: review-request (Waiting to be reviewed)
5 participants