Request: selecting variables through a user-supplied function #589

Open
david-cortes opened this issue Jan 13, 2023 · 5 comments

@david-cortes
Contributor

Transformers in this library take an argument variables which is expected to be a list of column names.

Oftentimes, variables follow some natural grouping, and one wants to apply a given transformer to all variables that match a naming pattern. This is relatively easy with a single modeling pipeline: store the column names in Python variables. But one often wants to try, for example, the same transformer pipeline with different groups of features, or with slight variations of earlier transformations. The exact list of variables then varies from one run to another, and the transformers need to be redefined each time.

It would be helpful if the transformers could also accept variables as a function that is applied to each column name and returns True or False to indicate whether the transformer applies to that variable.
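A minimal sketch of the requested behaviour, for illustration only. The helper `resolve_variables` is hypothetical and is not part of feature_engine; it assumes that `variables` is either a list of column names or a predicate over a single column name.

```python
from typing import Callable, Iterable, List, Union

# Hypothetical helper, not part of feature_engine: it only illustrates the request.
VariableSpec = Union[Iterable[str], Callable[[str], bool]]


def resolve_variables(columns: Iterable[str], variables: VariableSpec) -> List[str]:
    """Return the column names selected by `variables`."""
    columns = list(columns)
    if callable(variables):
        # Predicate form: keep the columns for which the function returns True.
        return [col for col in columns if variables(col)]
    # List form: keep the names as given and fail loudly on unknown ones.
    selected = list(variables)
    missing = [col for col in selected if col not in columns]
    if missing:
        raise KeyError(f"Columns not found: {missing}")
    return selected


columns = ["age", "income", "age_na", "income_na"]
# Select by naming pattern instead of listing the names explicitly.
numeric_only = resolve_variables(columns, lambda name: not name.endswith("_na"))
# The explicit list form keeps working as before.
explicit = resolve_variables(columns, ["age", "income_na"])
```

With a predicate, the same transformer definition could be reused across datasets whose column lists differ, since the names are resolved only when the columns are known.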

@solegalli
Collaborator

Hey @david-cortes

This sounds like a very specific case. I am not sure how widespread its use would be.

Would you be able to provide an example? I can't really picture the scenario.

Thank you

@david-cortes
Contributor Author

A quick example for now: suppose I have a data frame with numeric features that have missing values, and I want to process it as follows:

  • Add a missing indicator for each column.
  • Fill missing values with the mean.
  • Add squared versions of each column.

In this case, the binary missing-indicator columns should not get squared, since squaring a 0/1 column returns the same column. One way to achieve this would be to have the first transformer name the indicator columns with a given suffix and then let the last transformer select the columns without that suffix.

You might then say that one can simply pass the column names directly to the last transformer, but then suppose that I want to try two different models using different subsets of the features, or that I want to apply them to two datasets sharing similar contents (e.g. data from 1-30 days ago and data from 31-60 days ago, which might have similar but not entirely equal column names).
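The three steps described above can be sketched in plain pandas. This is only an illustration of the idea, not feature_engine code: the function names, the `_na` and `_sq` suffixes, and the `is_original` predicate are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

NA_SUFFIX = "_na"  # assumed naming convention for the indicator columns


def add_missing_indicators(df: pd.DataFrame) -> pd.DataFrame:
    """Add a 0/1 indicator column for every existing column."""
    out = df.copy()
    for col in df.columns:
        out[col + NA_SUFFIX] = df[col].isna().astype(int)
    return out


def fill_with_mean(df: pd.DataFrame, select) -> pd.DataFrame:
    """Fill missing values with the column mean, for columns where select(name) is True."""
    out = df.copy()
    cols = [c for c in out.columns if select(c)]
    out[cols] = out[cols].fillna(out[cols].mean())
    return out


def add_squares(df: pd.DataFrame, select) -> pd.DataFrame:
    """Add a squared copy of every column where select(name) is True."""
    out = df.copy()
    for col in [c for c in df.columns if select(c)]:
        out[col + "_sq"] = df[col] ** 2
    return out


def is_original(name: str) -> bool:
    """Predicate on column names: skip the missing-indicator columns."""
    return not name.endswith(NA_SUFFIX)


df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, 2.0, 4.0]})
result = add_squares(
    fill_with_mean(add_missing_indicators(df), is_original), is_original
)
```

Because the last two steps receive a predicate rather than a fixed list, the same pipeline could be applied to a second data frame with similar but not identical column names.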

@kylegilde
Contributor

ColumnTransformer and make_column_selector support using callables to select columns.
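As a rough sketch of what this looks like in scikit-learn: the column specification of a ColumnTransformer can be a callable that receives the input data and returns the selected columns, and make_column_selector builds such a callable from a regex pattern. The data frame, the `_na` suffix, and the `original_columns` function below are assumptions made for the example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [3.0, 2.0, 4.0],
    "a_na": [0, 1, 0],
    "b_na": [1, 0, 0],
})


def original_columns(X: pd.DataFrame) -> list:
    """Custom callable: select the columns that do not carry the assumed suffix."""
    return [c for c in X.columns if not c.endswith("_na")]


# make_column_selector returns a callable that selects columns by regex.
indicator_columns = make_column_selector(pattern=r"_na$")

ct = ColumnTransformer(
    [
        ("scale", StandardScaler(), original_columns),
        ("keep", "passthrough", indicator_columns),
    ]
)
# The callables are evaluated on X at fit time, not when ct is defined.
X_trans = ct.fit_transform(X)
```

Since the selection happens at fit time, the same `ct` definition can be fitted on data frames with different column lists.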

@david-cortes
Contributor Author

> ColumnTransformer and make_column_selector support using callables to select columns.

But those transformers from scikit-learn oftentimes force conversions between DataFrames and matrices, which is undesirable for the kind of transformations that feature_engine does.

@ClaudioSalvatoreArcidiacono
Contributor

If you want ColumnTransformer to return a DataFrame, you can do so with the set_output method.

For example:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

X = pd.DataFrame({
    "documents": ["First item", "second one here", "Is this the last?"],
    "width": [3, 4, 5],
})

# Note: the FeatureHasher step from the scikit-learn docs example is replaced
# here. As far as I can tell, FeatureHasher defines no get_feature_names_out,
# so it cannot be configured through set_output; the text column is passed
# through unchanged instead.
ct = ColumnTransformer(
    [("text_passthrough", "passthrough", ["documents"]),
     ("num_preprocess", MinMaxScaler(), ["width"])],
    # Keep the original feature names in the output DataFrame
    verbose_feature_names_out=False,
)
# Ensure that transform returns a DataFrame (requires scikit-learn >= 1.2);
# `transform` is a keyword-only argument
ct.set_output(transform="pandas")

X_trans = ct.fit_transform(X)

4 participants