Adding a TableVectorizer specialization for HistGradientBoosting #866

Closed
jeromedockes opened this issue Dec 15, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

@jeromedockes
Member

Problem Description

By default, the TableVectorizer one-hot encodes low-cardinality categorical columns. In the common case where a scikit-learn HistGradientBoosting{Regressor,Classifier} is the downstream estimator, that is suboptimal.
It is better to rely on the gradient boosting estimator's built-in handling of categories.

In scikit-learn 1.3, we need to transform them with an OrdinalEncoder and tell the estimator which features are categorical, e.g. categorical_features=["column_x", "column_y"].

In scikit-learn 1.4, we only need to make sure those features have a categorical dtype in the transformer's output and otherwise leave them unchanged; the estimator will recognize and encode them appropriately if it is initialized with categorical_features="from_dtype".

Feature Description

Maybe a subclass of TableVectorizer with different default parameters.
As a first step, we can address the scikit-learn >= 1.4 case, which is the easiest.
In older scikit-learn versions, the user needs to set categorical_features on the gradient boosting estimator, which is not something the TableVectorizer can do.

Alternative Solutions

No response

Additional Context

No response

@jeromedockes jeromedockes added the enhancement New feature or request label Dec 15, 2023
@GaelVaroquaux
Member

GaelVaroquaux commented Dec 15, 2023 via email

@TheooJ
Contributor

TheooJ commented Jun 18, 2024

Implemented in #926

@TheooJ TheooJ closed this as completed Jun 18, 2024