Request: selecting `variables` through a user-supplied function #589

david-cortes · 2023-01-13T10:10:29Z

Transformers in this library take a `variables` argument, which is expected to be a list of column names.

Oftentimes, variables follow some natural grouping, and one wants to apply a given transformer to all variables that match a naming pattern. This is relatively easy with a single modeling pipeline: create Python variables that hold the column names. But one often wants to try, for example, the same transformer pipeline with different groups of features, or with slight variations of earlier transformations. The exact list of variables then varies from one run to another, and the transformers need to be redefined each time.

It would be helpful if the transformers could also accept `variables` as a function that is applied to each column name and returns True or False to indicate whether the transformer applies to that variable.
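To make the request concrete, here is a minimal sketch of how a callable `variables` could be resolved into a list of column names. The helper `resolve_variables`, the `VariableSelector` alias, and the `_na` suffix are hypothetical names used only for illustration; none of them is part of this library's API.

```python
from typing import Callable, Iterable, List, Union

# Hypothetical type: either an explicit list of names or a predicate on a name.
VariableSelector = Union[List[str], Callable[[str], bool]]


def resolve_variables(columns: Iterable[str], variables: VariableSelector) -> List[str]:
    """Turn `variables` into an explicit list of column names.

    If `variables` is callable, it is applied to each column name and the
    names for which it returns True are kept, in their original order.
    Otherwise it is treated as an explicit list of names.
    """
    if callable(variables):
        return [name for name in columns if variables(name)]
    return list(variables)


columns = ["age", "age_na", "income", "income_na"]

# Select every column that does not carry the (assumed) "_na" suffix.
selected = resolve_variables(columns, lambda name: not name.endswith("_na"))

# An explicit list is passed through unchanged.
explicit = resolve_variables(columns, ["age"])
```

A transformer could call something like this at fit time, so the same transformer definition would adapt to whatever columns the incoming data frame has.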

solegalli · 2023-01-13T15:35:41Z

Hey @david-cortes

This sounds like a very specific case. I am not sure how widespread its use would be.

Would you be able to provide an example? I can't really picture the scenario.

Thank you

david-cortes · 2023-01-13T16:23:13Z

A quick example for now: suppose I have a data frame with numeric features that have missing values, and I want to process it as follows:

1. Add a missing indicator for each column.
2. Fill missing values with the mean.
3. Add squared versions of each column.

In this case, the binary missing-indicator columns should not get squared, since squaring a 0/1 column returns the same column. One way to achieve this would be to have the first transformer name the indicator columns with a given suffix, and then let the last transformer select the columns without that suffix.

You might then say that one can simply pass the column names directly to the last transformer, but then suppose that I want to try two different models using different subsets of the features, or that I want to apply them to two datasets sharing similar contents (e.g. data from 1-30 days ago and data from 31-60 days ago, which might have similar but not entirely equal column names).
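The three steps above can be sketched in plain pandas as follows. This is only an illustration under stated assumptions: the function names, the `_na` and `_sq` suffixes, and the toy data frames are all hypothetical, and the mean is computed on the same data it fills (a real transformer would learn it during fit).

```python
import numpy as np
import pandas as pd

SUFFIX = "_na"  # hypothetical suffix for missing-indicator columns


def add_missing_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Step 1: add a 0/1 missing indicator for each column."""
    out = df.copy()
    for col in df.columns:
        out[col + SUFFIX] = df[col].isna().astype(int)
    return out


def fill_with_mean(df: pd.DataFrame) -> pd.DataFrame:
    """Step 2: fill missing values with the column mean."""
    return df.fillna(df.mean())


def add_squares(df: pd.DataFrame, selector) -> pd.DataFrame:
    """Step 3: add squared versions of the columns accepted by `selector`."""
    out = df.copy()
    for col in df.columns:
        if selector(col):
            out[col + "_sq"] = df[col] ** 2
    return out


def is_original(name: str) -> bool:
    """Selector: True for columns that are not missing indicators."""
    return not name.endswith(SUFFIX)


def run(df: pd.DataFrame) -> pd.DataFrame:
    return add_squares(fill_with_mean(add_missing_indicators(df)), is_original)


# Two data frames with similar but not identical column names.
recent = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [2.0, 4.0, np.nan]})
older = pd.DataFrame({"x": [np.nan, 2.0], "z": [1.0, 3.0]})

recent_out = run(recent)
older_out = run(older)
```

The same `run` definition works on both data frames because the selector is evaluated against whatever column names are present, which is the behavior the request asks for.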

kylegilde · 2023-03-19T16:16:55Z

ColumnTransformer and make_column_selector support using callables to select columns.

david-cortes · 2023-03-19T20:10:23Z

> ColumnTransformer and make_column_selector support using callables to select columns.

But those transformers from scikit-learn oftentimes force conversions between DataFrames and matrices, which is undesirable for the kind of transformations that feature_engine does.

ClaudioSalvatoreArcidiacono · 2023-06-06T11:08:37Z

If you want ColumnTransformer to return a DataFrame, you can do so with the set_output method.

For example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

X = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "width": [3, 4, 5],
})
# OneHotEncoder(sparse_output=False) stands in for FeatureHasher here:
# pandas output needs dense results and transformers that provide
# feature names, which FeatureHasher does not.
ct = ColumnTransformer(
    [("cat_preprocess", OneHotEncoder(sparse_output=False), ["color"]),
     ("num_preprocess", MinMaxScaler(), ["width"])],
    # Keep the original feature names, without a transformer-name prefix,
    # in the output DataFrame
    verbose_feature_names_out=False,
)
# Make transform and fit_transform return a DataFrame (scikit-learn >= 1.2);
# `transform` is a keyword-only argument
ct.set_output(transform="pandas")

X_trans = ct.fit_transform(X)
