Problem Description
By default the TableVectorizer one-hot encodes low-cardinality categorical columns. In the common case where a scikit-learn HistGradientBoosting{Regressor,Classifier} is used as the downstream estimator, that is suboptimal.
The best thing to do is to rely on the gradient boosting estimator's built-in handling of categories.
In scikit-learn 1.3 we need to transform them with an OrdinalEncoder and tell the estimator which features are categorical, e.g. with categorical_features=["column_x", "column_y"].
In scikit-learn 1.4 we need to make sure those features have a categorical dtype in the transformer's output but otherwise leave them as they are; the estimator will then recognize and encode them appropriately if it is initialized with categorical_features="from_dtype".
Feature Description
Maybe a subclass of TableVectorizer that has different default parameters.
As a first step we can address the scikit-learn >= 1.4 case, which is the easiest.
In older scikit-learn versions the user needs to set categorical_features on the gradient boosting estimator, which is not something the TableVectorizer can do.
Alternative Solutions
No response
Additional Context
No response
Excellent!
I would also suggest using the MinHashEncoder instead of the GapEncoder for the high-cardinality strings. It is faster and leads to better predictions with tree-based models.