Problem Description
By default the TableVectorizer one-hot encodes low-cardinality categorical columns. In the common case where a scikit-learn HistGradientBoosting{Regressor,Classifier} is used as the downstream estimator, that is suboptimal.
The best thing to do is to rely on the gradient boosting estimator's built-in handling of categories.
In scikit-learn 1.3 we need to transform them with an OrdinalEncoder and tell the estimator which features are categorical, e.g. with categorical_features=["column_x", "column_y"].
In scikit-learn 1.4 we need to make sure those features have a categorical dtype in the transformer's output but otherwise leave them as they are; the estimator will then recognize and encode them appropriately if it is initialized with categorical_features="from_dtype".
Feature Description
Maybe a subclass of TableVectorizer that has different default parameters.
As a first step we can address the scikit-learn >= 1.4 case, which is the easiest.
In older scikit-learn versions the user needs to set categorical_features on the gradient boosting estimator, which is not something the TableVectorizer can do.
Alternative Solutions
No response
Additional Context
No response
Excellent!
I would also suggest using the MinHashEncoder instead of the GapEncoder for the high-cardinality strings. It is faster and leads to better predictions with tree-based models.