Ways to do RFE with skorch? #951

asmyoo · 2023-04-04T22:37:34Z

asmyoo
Apr 4, 2023

Hello,

I have a relatively small dataset and was wondering if there was a way to do backward sequential feature selection using skorch?

Sorry I know this is a fairly niche question and probably outside the scope of this package. I was just hopeful that maybe someone has done this and had a suggested package that works well with skorch for this. I am also definitely open to other ways of doing feature selection that work well with skorch

I'm used to using imblearn when using sklearn classifiers since it can select the number of features with the "parsimonious" setting/get detailed info using the get_metric_dict() function, but imblearn requires 2D data. I also tried sklearn's SequentialFeatureSelector which has fewer settings, but it also seems to require 2D data

BenjaminBossan · 2023-04-05T08:46:06Z

BenjaminBossan
Apr 5, 2023
Maintainer

(I converted the issue into a Discussion, I hope that's okay)

To better understand your question, the issue is not that skorch would not generally work with imblearn or SequentialFeatureSelector. The issue is that the data you work with is not a 2d array, which is expected by imblearn or sklearn, right?

I'm not aware of any existing solutions. If I had this problem, here are some things I would try (in order of how practical I think they are):

Turn the data into 2d data. Obviously, this doesn't really work with all types of data. E.g. if you have images, what would it mean to select a feature, a pixel? If this is an option for you depends on your data.
Wrap the data to "fool" imblearn/sklearn into thinking it's valid, e.g. by using your own Dataset class (skorch accepts Datasets as X input). This requires you to understand where the 2d assumption is made and how to sidestep it. Just as an example, the SliceDict in skorch makes it possible to use GridSearchCV with X being a dict, even though sklearn would normally accept it.
Write your own feature selection code based on the existing one. This is the most difficult, as it requires reading the code of the classes/functions you want to use, copy it, and make changes so that it works with your data.
If you think your case is common enough and/or could be easily fixed in those packages, open an issue there and ask them to make the adjustment.

I'm used to using imblearn when using sklearn classifiers since it can select the number of features with the "parsimonious" setting/get detailed info using the get_metric_dict() function

I couldn't find anything on this, could you provide a link? I thought imblearn was supposed to help with imbalanced classes, not with feature selection.

0 replies

asmyoo · 2023-04-05T22:59:18Z

asmyoo
Apr 5, 2023
Author

Hi,

Thank you so much for the thoughtful response! I'll try to see if I can fool sklearn using the Dataset class. I am using time series data so not sure how feasible this would be, but I haven't looked into it before so hopefully this will solve my problem!

So sorry I totally mixed up the package name I was using - I meant mlxtend: https://rasbt.github.io/mlxtend/api_subpackages/mlxtend.feature_selection/#sequentialfeatureselector

Thanks again for your help I appreciate it!

2 replies

BenjaminBossan Apr 6, 2023
Maintainer

So sorry I totally mixed up the package name I was using - I meant mlxtend

Got it, thx.

I'll try to see if I can fool sklearn using the Dataset class.

The problems you'll run into are probably twofold. On the one hand, sklearn (and probably mlxtend to a lesser degree) does a few validations on the input data to ensure it's 2d which is probably where your initial error comes from. Your custom data class would need to pretend it's 2d, e.g. by having an .ndim attribute that returns 2. Interactive debugging until your array passes the check_array check, or whatever causes the error, is your friend here.

On the other hand, you'll have to implement a way to select subsets of features, i.e. something like X_new = X[:, features]. How to implement this depends on your data.

I am using time series data so not sure how feasible this would be, but I haven't looked into it before so hopefully this will solve my problem!

Time series data might not lend itself well to this problem. E.g. I think it makes little sense to say, let's drop t-6 and t-41 because they contribute little to performance. But that might not be what you mean, it really depends on the use case.

asmyoo Apr 6, 2023
Author

Thank you so much, that makes a lot of sense. I appreciate it!

AvedikEkiz · 2023-07-26T06:36:20Z

AvedikEkiz
Jul 26, 2023

Hi! Had you any success with it? I'm currently working with EEG data which dimensinality is (number channels x number of time points) and i'm trying to implement wrapper feature (channel) selection method based on ATCNet network which takes input data of dimensionality described above. As far as I could do I adapted some code from internet to that purpose

from itertools import combinations
class SequentialBackwardSearch():
    '''
    Instantiate with Estimator and given number of features
    '''
    def __init__(self, k_features):
        #self.estimator = estimator
        self.k_features = k_features
         
    def fit(self, X, y):
        dim = X_train.shape[1]
        
        self.module = ATCNet(n_channels = dim,
                                     n_classes = 3,
                                     input_size_s = 3.0,
                                     sfreq = 500, tcn_activation=nn.ELU(),n_windows = 5)
        
        self.wraped = NeuralNetClassifier(model,
                                          max_epochs=10,
                                          criterion=nn.CrossEntropyLoss(),
                                          lr=0.02,
                                          iterator_train__shuffle=True,
                                          callbacks=[EpochScoring(scoring='accuracy', on_train=True, name='train_acc')],
                                          device = 'cuda')
        
        self.indices_ = tuple(range(dim))
        self.subsets_ = [self.indices_]
        score = self._calc_score(X, y, self.indices_,estimator = self.wraped)
        self.scores_ = [score]
        del.self.module
        del.self.wraped
        '''
        Iterate through all the dimensions until k_features is reached
        At the end of loop, dimension count is reduced by 1
        '''
        while dim > self.k_features:
            scores = []
            subsets = []

            self.module = ATCNet(n_channels = dim,
                                     n_classes = 3,
                                     input_size_s = 3.0,
                                     sfreq = 500, tcn_activation=nn.ELU(),n_windows = 5)
        
            self.wraped = NeuralNetClassifier(model,
                                          max_epochs=10,
                                          criterion=nn.CrossEntropyLoss(),
                                          lr=0.02,
                                          iterator_train__shuffle=True,
                                          callbacks=[EpochScoring(scoring='accuracy', on_train=True, name='train_acc')],
                                          device = 'cuda')
            '''
            Iterate through different combinations of features, train the model,
            record the score
            '''
            for p in combinations(self.indices_, r=dim - 1):
                score = self._calc_score(X, y, p, self.wraped)
                scores.append(score)
                subsets.append(p)
            #
            # Get the index of best score
            #
            best_score_index = np.argmax(scores)
            #
            # Record the best score
            #
            self.scores_.append(scores[best_score_index])
            #
            # Get the indices of features which gave best score
            #
            self.indices_ = subsets[best_score_index]
            #
            # Record the indices of features for best score
            #
            self.subsets_.append(self.indices_)
            dim -= 1 # Dimension is reduced by 1
            print(self.scores_)
     
    '''
    Transform training, test data set to the data set
    havng features which gave best score
    '''
    def transform(self, X):
        return X[:, self.indices_]
     
    '''
    Train models with specific set of features
    indices - indices of features
    '''
    def _calc_score(self, X, y, indices, estimator):

        self.scores__ = []
        self.cv = ShuffleSplit(2, test_size=0.3, random_state=42)
        self.scores__ = cross_val_score(estimator, X, y, cv=self.cv)
        score = self.scores__.mean()
        return score

but the problem is that model doenst reinitilize (doenst reset weights), maybe anyone got suggetions about it?

3 replies

BenjaminBossan Jul 26, 2023
Maintainer

I haven't checked your code in detail, but have a question: Where does model come from. Is it an initialized nn.Module?

AvedikEkiz Jul 26, 2023

Yes I found that mistake, I must fill first argument with self.module, but as I understood there is no way to declare model outside the function and cleare it inside?

BenjaminBossan Jul 26, 2023
Maintainer

but as I understood there is no way to declare model outside the function and cleare it inside?

Not really, because skorch does not know how to clear the parameters, as there is no standard way of doing that. This is why the preferred method is to pass an uninitialized nn.Module like so:

class MyModule(nn.Module):
    ...

net = NeuralNetClassifier(MyModule, ...)

If you have some kind of pretrained model, you can load it inside of __init__ of the MyModule class (here is a more complete example). Alternatively, you can also pass a function that returns that model:

def load_model():
    # model loading code here, returns an initialized nn.Module

net = NeuralNetClassifier(load_model, ...)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ways to do RFE with skorch? #951

{{title}}

Replies: 3 comments 5 replies

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

Ways to do RFE with skorch? #951

asmyoo Apr 4, 2023

Replies: 3 comments · 5 replies

BenjaminBossan Apr 5, 2023 Maintainer

asmyoo Apr 5, 2023 Author

BenjaminBossan Apr 6, 2023 Maintainer

asmyoo Apr 6, 2023 Author

AvedikEkiz Jul 26, 2023

BenjaminBossan Jul 26, 2023 Maintainer

AvedikEkiz Jul 26, 2023

BenjaminBossan Jul 26, 2023 Maintainer

asmyoo
Apr 4, 2023

Replies: 3 comments 5 replies

BenjaminBossan
Apr 5, 2023
Maintainer

asmyoo
Apr 5, 2023
Author

BenjaminBossan Apr 6, 2023
Maintainer

asmyoo Apr 6, 2023
Author

AvedikEkiz
Jul 26, 2023

BenjaminBossan Jul 26, 2023
Maintainer

BenjaminBossan Jul 26, 2023
Maintainer