
Hacky Idea: ifelse in a pipeline #908

Open
koaning opened this issue May 25, 2024 · 10 comments
Labels
enhancement New feature or request

Comments

@koaning commented May 25, 2024

Problem Description

I am running benchmarks on many datasets. When a dataset contains a column called "date", I want to run a different pipeline.

At the moment I work around this with a custom estimator:

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline, make_union
from sklearn.preprocessing import FunctionTransformer, SplineTransformer
from skrub import SelectCols, TableVectorizer

class ConditionalDateFeaturizer(BaseEstimator, TransformerMixin):
    def __init__(self) -> None:
        self.spline_tfm = SplineTransformer(n_knots=12, extrapolation="periodic")

    def fit(self, X, y=None):
        # Build a different pipeline depending on whether a "date" column exists.
        if "date" in X.columns:
            self.pipeline_ = make_union(
                make_pipeline(
                    SelectCols("date"),
                    FunctionTransformer(datetime_feats),  # datetime_feats: my helper, not shown
                    self.spline_tfm,
                ),
                TableVectorizer(),
            )
        else:
            self.pipeline_ = TableVectorizer()
        self.pipeline_.fit(X, y)
        return self  # fit must return self for sklearn compatibility

    def transform(self, X, y=None):
        return self.pipeline_.transform(X)

I wonder, could skrub maybe offer a nicer way to do stuff like this?

Feature Description

I don't know if we want this, but it could be useful for large-scale model search across multiple datasets. I also don't know how easy it is to generalise, but I figured it was at least worth mentioning in an issue here.

Alternative Solutions

The custom estimator also works, but it can get hacky quite quickly once I want to repeat this pattern for other types of column features.

Additional Context

No response

@koaning added the enhancement (New feature or request) label on May 25, 2024
@jeromedockes (Member)

With the upcoming "Recipe" (or "PipeBuilder" or whatever its name will be), it
will be easy to apply a transformation to only some columns.
For example you would be able to do something like this:

>>> import pandas as pd
>>> import numpy as np
>>> from sklearn.base import BaseEstimator

>>> from skrub._pipe_builder import PipeBuilder
>>> from skrub import selectors as s
>>> from skrub import TableVectorizer

>>> class DatetimeSplines(BaseEstimator):
...     "dummy placeholder"
...     def fit_transform(self, X, y=None):
...         return self.transform(X)
... 
...     def transform(self, X):
...         print(f"\ntransform: {X.columns.tolist()}\n")
...         values = np.ones(X.shape[0])
...         return pd.DataFrame({"spline_0": values, "spline_1": values})

>>> pipe = (
...     PipeBuilder()
...     .apply(DatetimeSplines(), cols=s.all() & "date")
...     .apply(TableVectorizer())
... ).get_pipeline()


>>> df = pd.DataFrame({
...     "date": ["2020-01-02", "2021-04-03"],
...     "temp": [10.1, 17.5]
... })

The column "date" gets transformed by the spline transformer:

>>> pipe.fit_transform(df)

transform: ['date']

   temp  spline_0  spline_1
0  10.1       1.0       1.0
1  17.5       1.0       1.0

When there is no column matching the selector, the spline transformer is not applied:

>>> df = pd.DataFrame({
...     "not_date": ["2020-01-02", "2021-04-03"],
...     "temp": [10.1, 17.5]
... })

>>> pipe.fit_transform(df)
   not_date_year  not_date_month  not_date_day  not_date_total_seconds  temp
0         2020.0             1.0           2.0            1577923200.0  10.1
1         2021.0             4.0           3.0            1617408000.0  17.5

Does that more or less address the problem you are facing?

@jeromedockes (Member)

Having a conditional transformer might be useful when something more general than selecting columns is needed, though, such as "apply a PCA if there are more than 200 columns".

@jeromedockes (Member)

However, if the important part is not really the name "date" but rather applying
the spline transformer to datetime columns only, you might already be able to
use the TableVectorizer's datetime_transformer parameter? By passing your
transformer instead of the default DatetimeEncoder.

(note the snippet below does not run on the main branch but it does on that of PR #902)

import pandas as pd
import numpy as np
from sklearn.base import BaseEstimator

from skrub import TableVectorizer

class DatetimeSplines(BaseEstimator):
    "dummy placeholder"
    def fit_transform(self, X, y=None):
        return self.transform(X)

    def transform(self, X):
        print(f"\ntransform: {X.columns.tolist()}\n")
        values = np.ones(X.shape[0])
        return pd.DataFrame({"spline_0": values, "spline_1": values})

>>> vectorizer = TableVectorizer(datetime_transformer=DatetimeSplines())

>>> df = pd.DataFrame({
...     "date": ["2020-01-02", "2021-04-03"],
...     "temp": [10.1, 17.5]
... })


>>> vectorizer.fit_transform(df)

transform: ['date']

   spline_0  spline_1  temp
0       1.0       1.0  10.1
1       1.0       1.0  17.5


>>> df = pd.DataFrame({
...     "not_date": ["blue", "red"],
...     "temp": [10.1, 17.5]
... })

>>> vectorizer.fit_transform(df)
   not_date_red  temp
0           0.0  10.1
1           1.0  17.5

@koaning (Author) commented May 27, 2024

> Does that more or less address the problem you are facing?

I think it does, just one thing: how would the DatetimeSplines featurizer know which columns to select or ignore? Does the date column name need to be passed into the estimator? There may also be more than one date column in the dataframe.

@koaning (Author) commented May 27, 2024

> However, if the important part is not really the name "date" but rather applying the spline transformer to datetime columns only

Do we want to assume that the user ran their dataframe code, or do we want our library to infer that on their behalf? I am partly asking because polars and pandas handle the date stuff slightly differently. But I am also wondering about categorical types: do we only one-hot encode columns that are categorical?

@jeromedockes (Member)

For selecting all datetime columns you could use the skrub.selectors.any_date() selector. I just need to update the PipeBuilder branch with the current state of PR #902 and I'll show a snippet.

@jeromedockes (Member)

> Do we want to assume that the user ran their dataframe code, or do we want our library to infer that on their behalf? I am partly asking because polars and pandas handle the date stuff slightly differently. But I am also wondering about categorical types: do we only one-hot encode columns that are categorical?

I think we will have the TableVectorizer, which tries to guess on your behalf, and the PipeBuilder, which lets you build your own pipeline with more control over the different choices.

The TableVectorizer will one-hot encode anything that is strings or Categorical with a low cardinality. It will also try to parse strings as datetimes and apply the datetime_encoder if it succeeds.

@jeromedockes (Member) commented May 27, 2024

If you wanted to manually control your pipeline you could do something like:

import pandas as pd
import numpy as np

from skrub import ToDatetime
from skrub import selectors as s
from skrub._pipe_builder import PipeBuilder
from skrub._on_each_column import SingleColumnTransformer

class DatetimeSplines(SingleColumnTransformer):
    "dummy placeholder"
    def fit_transform(self, col, y=None):
        return self.transform(col)

    def transform(self, col):
        name = col.name
        print(f" ==> transform: {name}")
        values = np.ones(len(col))
        return pd.DataFrame({f"{name}_spline_0": values, f"{name}_spline_1": values})


pipe = (
    PipeBuilder()
    .apply(ToDatetime(), allow_reject=True)
    .apply(DatetimeSplines(), cols=s.any_date())
).get_pipeline()

>>> df = pd.DataFrame({
...     "A": ["2020-01-02", "2021-04-03"],
...     "B": [10.1, 17.5],
...     "C": ["2020-01-02T00:01:02", "2021-04-03T10:11:12"],
...     "D": ["red", "blue"],
... })
>>> df
            A     B                    C     D
0  2020-01-02  10.1  2020-01-02T00:01:02   red
1  2021-04-03  17.5  2021-04-03T10:11:12  blue

>>> pipe.fit_transform(df)
 ==> transform: A
 ==> transform: C
   A_spline_0  A_spline_1     B  C_spline_0  C_spline_1     D
0         1.0         1.0  10.1         1.0         1.0   red
1         1.0         1.0  17.5         1.0         1.0  blue

@jeromedockes (Member)

allow_reject means "let the ToDatetime transformer decide whether it should be applied to the column or not (and reject columns that don't look like dates)". By default it is false.

@jeromedockes (Member)

But if you want something completely automatic, e.g. when you are running on many datasets that you don't inspect manually, then you're probably better off using the TableVectorizer and letting it make those preprocessing choices for you.
It will apply all of these processing steps:

  • check input dataframe
    • fit_transform:
      • convert arrays to dataframes
      • ensure column names are strings
      • ensure column names are unique
      • check dataframe is not a pandas sparse dataframe
      • ensure dataframe is not lazy
    • transform:
      • same checks as fit_transform
      • check dataframe library is the same as in fit
      • check column names are the same as in fit
  • clean null strings
    • replace "N/A", "" etc with actual nulls
  • to datetime
    • try to parse strings as datetimes
    • ensure consistent output dtype (resolution + timezone awareness + timezone)
  • to float
    • try to convert anything but dates and categorical to float32
    • ensure consistent output dtype
  • clean categories (pandas)
    • ensure categories are strings stored with object dtype
    • ensure categorical columns don't contain pd.NA
    • ensure consistent output dtype
  • convert all remaining columns to string
  • convert pandas StringDtype to object & remove pd.NA
  • apply the user-defined transformers
    • low_cardinality_transformer (low-cardinality strings and categorical): by default one-hot encode, but you could use e.g. ToCategorical to take advantage of the HistGradientBoostingRegressor's categorical_features='from_dtype' option
    • high_cardinality_transformer (high-cardinality strings and categorical): by default GapEncoder; MinHashEncoder can be a good choice
    • datetime_encoder (dates & datetimes, including those that have been parsed from strings during preprocessing): by default DatetimeEncoder; you could replace it by the custom encoder with splines
    • numeric_encoder (numbers): by default, passthrough
  • try to convert all outputs to float32
