
Generic preprocessing module #79

Closed
matteocao opened this issue May 8, 2022 · 2 comments · Fixed by #86
Labels
enhancement (New feature or request), top priority (The issue is to be solved as soon as possible, as it may block the usage of the library)

Comments

@matteocao
Contributor

matteocao commented May 8, 2022

Is your feature request related to a problem? Please describe.

The pain point is that, most often, raw datasets are not in the right input format or do not have the desired statistical characteristics. Furthermore, standard techniques like data augmentation need to be implemented.
Describe the solution you'd like

We build an API class (an abstract class) for the preprocessing -- a generic one.

It should look similar to this one:

from abc import ABC, abstractmethod

class AbstractPreprocessing(ABC):
    """The abstract class to define the interface of preprocessing
    """
    @abstractmethod
    def __call__(self, *args, **kwargs):
        """This method deals with datum-wise transformations"""
        pass

    @abstractmethod
    def fit_to_data(self, *args, **kwargs):
        """This method deals with getting dataset-level information"""
        pass
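
As a purely illustrative example (the Normalization class below is hypothetical, not part of the proposal), a concrete preprocessing implementing this interface could look like:

import numpy as np

class Normalization(AbstractPreprocessing):
    """Hypothetical example: standardise each datum with dataset-level statistics."""

    def fit_to_data(self, data):
        # Dataset-level pass: compute and store the statistics that
        # __call__ will need later.
        stacked = np.stack([np.asarray(x) for x in data])
        self.mean = stacked.mean(axis=0)
        self.std = stacked.std(axis=0) + 1e-8  # avoid division by zero

    def __call__(self, datum):
        # Datum-wise transformation using the state stored by fit_to_data.
        return (np.asarray(datum) - self.mean) / self.std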

Each of the methods shall be implemented, as they will be called automatically inside the Dataset classes:

  1. The output of __getitem__ will be transformed by item_transform. The data inside item_transform that are needed to perform the transformation will be stored in self. The methods dataset_level_data and batch_level_data will be called only once, before the first time __getitem__ is called (see the sketch after this list).
  2. The big advantage of this "on the fly" approach is that it will save a lot of memory -- and hopefully the transformations are not too heavy to compute.
  3. In case the transformations are computationally very heavy, it would be advisable to first transform all the data (not via this class! just create a new dataset) and then use that for the next steps.
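
A minimal sketch of point 1 (TransformedDataset is a hypothetical name, only meant to illustrate the lazy calling convention):

class TransformedDataset:
    """Hypothetical wrapper: fit once, then transform items on the fly."""

    def __init__(self, dataset, preprocessing: AbstractPreprocessing):
        self.dataset = dataset
        self.preprocessing = preprocessing
        self._fitted = False

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if not self._fitted:
            # One-time dataset-level pass, before the first item is served.
            self.preprocessing.fit_to_data(self.dataset)
            self._fitted = True
        # Datum-wise transformation applied on access.
        return self.preprocessing(self.dataset[idx])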

Describe alternatives you've considered

Only doing point 3 above (without 1 and 2). However, I find that using only that approach is always possible and much easier to implement, but it is less tied to the generic pipeline.

Additional context

@matteocao matteocao added the enhancement (New feature or request) label May 8, 2022
@raphaelreinauer
Collaborator

raphaelreinauer commented May 8, 2022

Thanks, Matteo, for the suggestion.
I think I understand what dataset_level_data is for, but I'm not sure what batch_level_data is for.
Could you give an example of what would go in batch_level_data?

When you train a model with preprocessed data, the transformations you apply are crucial for the model; hence you want to have a way of storing them and loading them later on. This is especially important when you want to deploy your model in a production environment, where you will not have access to the preprocessing transformations.
This can be quickly done by inheriting from the Hugging Face feature extractor mixin class (https://huggingface.co/docs/transformers/main_classes/feature_extractor).
Another advantage of inheriting from that class is that developers are already familiar with it, and they will be able to understand your code more easily. It would be great to be compatible with the Huggingface API, since developers could then efficiently take other models from the library and plug them into your framework.
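
For concreteness, a minimal sketch of that idea (NormalizationExtractor and its attributes are made up for illustration; only FeatureExtractionMixin and its save_pretrained/from_pretrained methods come from the transformers library):

from transformers.feature_extraction_utils import FeatureExtractionMixin

class NormalizationExtractor(FeatureExtractionMixin):
    """Hypothetical preprocessing whose parameters can be saved and reloaded."""

    def __init__(self, mean=0.0, std=1.0, **kwargs):
        self.mean = mean
        self.std = std
        super().__init__(**kwargs)

extractor = NormalizationExtractor(mean=3.2, std=0.8)
extractor.save_pretrained("./preprocessing")  # writes preprocessor_config.json
restored = NormalizationExtractor.from_pretrained("./preprocessing")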

@matteocao matteocao self-assigned this May 8, 2022
@matteocao matteocao added this to the Giotto-deep release milestone May 8, 2022
@matteocao matteocao added the mid priority (The issue is to be solved when possible, as this is a relevant addition to the library) label May 8, 2022
@raphaelreinauer
Collaborator

The preprocessing transforms in gtda.diagrams.preprocessing don't allow for normalizing the data, nor for filtering out the k most persistent points.
I also looked at the implementation of the filtration by thresholding in gtda.diagrams._utils:
https://github.com/giotto-ai/giotto-tda/blob/8d09a39403ca11b50605bf466c1aa9f4f3876e5f/gtda/diagrams/_utils.py#L80
and it seems like their implementation does not work for extended persistence diagrams and one-hot encoded homology dimensions.
I also don't understand the implementation; it looks much more complicated than it needs to be.
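
For reference, filtering the k most persistent points only takes a few lines; a sketch assuming a (n_points, 3) array with columns (birth, death, homology_dim) -- this layout is an assumption for illustration, not necessarily gtda's exact format:

import numpy as np

def keep_k_most_persistent(diagram, k):
    # Lifetime of each point: death minus birth.
    lifetimes = diagram[:, 1] - diagram[:, 0]
    # Indices of the k points with the largest lifetimes.
    idx = np.argsort(lifetimes)[-k:]
    return diagram[idx]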

This was referenced May 10, 2022
@matteocao matteocao linked a pull request May 12, 2022 that will close this issue
@matteocao matteocao added the top priority (The issue is to be solved as soon as possible, as it may block the usage of the library) label and removed the mid priority (The issue is to be solved when possible, as this is a relevant addition to the library) label May 17, 2022