
Generic preprocessing module #79

Closed
matteocao opened this issue May 8, 2022 · 2 comments · Fixed by #86
Labels
enhancement (New feature or request), top priority (The issue is to be solved as soon as possible, as it may block the usage of the library)

Comments

@matteocao
Contributor

matteocao commented May 8, 2022

Is your feature request related to a problem? Please describe.

The pain point is that, most often, raw datasets are not in the right input format or do not have the desired statistical characteristics. Furthermore, standard techniques like data augmentation need to be implemented.
Describe the solution you'd like

We build an API class (an abstract class) for the preprocessing -- a generic one.

It should look similar to this one:

from abc import ABC, abstractmethod

class AbstractPreprocessing(ABC):
    """The abstract class to define the interface of preprocessing
    """
    @abstractmethod
    def __call__(self, *args, **kwargs):
        """This method deals with datum-wise transformations"""
        pass

    @abstractmethod
    def fit_to_data(self, *args, **kwargs):
        """This method deals with getting dataset-level information"""
        pass
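
As a purely illustrative example (the Normalization class below is hypothetical, not part of the proposal), a concrete preprocessing implementing this interface could look like:

import numpy as np

class Normalization(AbstractPreprocessing):
    """Hypothetical example: standardise each datum with dataset-level statistics."""

    def fit_to_data(self, data):
        # Dataset-level pass: compute and store the statistics that
        # __call__ will need later.
        stacked = np.stack([np.asarray(x) for x in data])
        self.mean = stacked.mean(axis=0)
        self.std = stacked.std(axis=0) + 1e-8  # avoid division by zero

    def __call__(self, datum):
        # Datum-wise transformation using the state stored by fit_to_data.
        return (np.asarray(datum) - self.mean) / self.std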

Each of the methods shall be implemented, as they will be called automatically inside the Dataset classes:

  1. The output of __getitem__ will be transformed by item_transform. The data inside item_transform that are needed to perform the transformation will be stored in self. The methods dataset_level_data and batch_level_data will be called only once, before the first time __getitem__ is called (see the sketch after this list).
  2. The big advantage of this "on the fly" approach is that it will save a lot of memory -- and hopefully the transformations are not too heavy to compute.
  3. In case the transformations are computationally very heavy, it would be advisable to first transform all the data (not via this class! just create a new dataset) and then use that for the next steps.
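
A minimal sketch of point 1 (TransformedDataset is a hypothetical name, only meant to illustrate the lazy calling convention):

class TransformedDataset:
    """Hypothetical wrapper: fit once, then transform items on the fly."""

    def __init__(self, dataset, preprocessing: AbstractPreprocessing):
        self.dataset = dataset
        self.preprocessing = preprocessing
        self._fitted = False

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        if not self._fitted:
            # One-time dataset-level pass, before the first item is served.
            self.preprocessing.fit_to_data(self.dataset)
            self._fitted = True
        # Datum-wise transformation applied on access.
        return self.preprocessing(self.dataset[idx])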

Describe alternatives you've considered

Only doing point 3 above (without 1 and 2). However, I find that using only that approach is always possible and much easier to implement, but it is less tied to the generic pipeline.

Additional context

@matteocao matteocao added the enhancement (New feature or request) label May 8, 2022
@raphaelreinauer
Collaborator

raphaelreinauer commented May 8, 2022

Thanks, Matteo, for the suggestion.
I think I understand what dataset_level_data is for, but I'm not sure what batch_level_data is for.
Could you give an example of what would go in batch_level_data?

When you train a model with preprocessed data, the transformations you apply are crucial for the model; hence you want to have a way of storing them and loading them later on. This is especially important when you want to deploy your model in a production environment, where you will not have access to the preprocessing transformations.
This can be quickly done by inheriting from the Hugging Face feature extractor mixin class (https://huggingface.co/docs/transformers/main_classes/feature_extractor).
Another advantage of inheriting from that class is that developers are already familiar with it, and they will be able to understand your code more easily. It would be great to be compatible with the Huggingface API, since developers could then efficiently take other models from the library and plug them into your framework.
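
For concreteness, a minimal sketch of that idea (NormalizationExtractor and its attributes are made up for illustration; only FeatureExtractionMixin and its save_pretrained/from_pretrained methods come from the transformers library):

from transformers.feature_extraction_utils import FeatureExtractionMixin

class NormalizationExtractor(FeatureExtractionMixin):
    """Hypothetical preprocessing whose parameters can be saved and reloaded."""

    def __init__(self, mean=0.0, std=1.0, **kwargs):
        self.mean = mean
        self.std = std
        super().__init__(**kwargs)

extractor = NormalizationExtractor(mean=3.2, std=0.8)
extractor.save_pretrained("./preprocessing")  # writes preprocessor_config.json
restored = NormalizationExtractor.from_pretrained("./preprocessing")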

@matteocao matteocao self-assigned this May 8, 2022
@matteocao matteocao added this to the Giotto-deep release milestone May 8, 2022
@matteocao matteocao added the mid priority (The issue is to be solved when possible, as this is a relevant addition to the library) label May 8, 2022
@raphaelreinauer
Collaborator

The preprocessing transforms in gtda.diagrams.preprocessing don't allow for normalizing the data, nor for filtering out the k most persistent points.
I also looked at the implementation of the filtration by thresholding in gtda.diagrams._utils:
https://github.com/giotto-ai/giotto-tda/blob/8d09a39403ca11b50605bf466c1aa9f4f3876e5f/gtda/diagrams/_utils.py#L80
and it seems like their implementation does not work for extended persistence diagrams and one-hot encoded homology dimensions.
I also don't understand the implementation; it looks much more complicated than it needs to be.
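
For reference, filtering the k most persistent points only takes a few lines; a sketch assuming a (n_points, 3) array with columns (birth, death, homology_dim) -- this layout is an assumption for illustration, not necessarily gtda's exact format:

import numpy as np

def keep_k_most_persistent(diagram, k):
    # Lifetime of each point: death minus birth.
    lifetimes = diagram[:, 1] - diagram[:, 0]
    # Indices of the k points with the largest lifetimes.
    idx = np.argsort(lifetimes)[-k:]
    return diagram[idx]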

This was referenced May 10, 2022
@matteocao matteocao linked a pull request May 12, 2022 that will close this issue
@matteocao matteocao added the top priority (The issue is to be solved as soon as possible, as it may block the usage of the library) label and removed the mid priority (The issue is to be solved when possible, as this is a relevant addition to the library) label May 17, 2022