Template for Preprocessing Pipeline #85
The idea is very relevant; however, I think the
Besides this, the other methods are all reasonable! I will come up with a complete version.
They can also be classes implementing the __call__ method. This works because of duck typing.
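A minimal sketch of the duck-typing point above (the transform names are made up for illustration): a plain function and an instance of a class with __call__ are interchangeable as pipeline steps, because both respond to being called.

```python
# A function and a callable-class instance are interchangeable as transforms.

def lowercase(x):
    """Stateless transform: a plain function."""
    return x.lower()

class Truncate:
    """Stateful transform: configuration lives on the instance."""
    def __init__(self, length):
        self.length = length

    def __call__(self, x):
        return x[:self.length]

transforms = [lowercase, Truncate(5)]
result = "HELLO WORLD"
for t in transforms:
    result = t(result)  # both are invoked the same way
# result == "hello"
```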
The point is that you have more than one method you need to call on different occasions.
My suggestion would be to have a Transform base class. The pipeline class can concatenate a list of Transform objects into a single pipeline, which is the same as concatenating a list of callables. So this would be the interface:
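The interface itself did not survive in this thread, so here is a hedged sketch of what it could look like, assuming (as the discussion suggests) that a Transform exposes fit and __call__, and that a Pipeline is itself a Transform that chains callables:

```python
from abc import ABC, abstractmethod

class Transform(ABC):
    """Base class: every transform is callable, and optionally fittable."""
    def fit(self, data):
        # Default: stateless transform, nothing to fit.
        return self

    @abstractmethod
    def __call__(self, data):
        ...

class Pipeline(Transform):
    """Concatenates a list of Transform objects (or plain callables)."""
    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, data):
        for t in self.transforms:
            data = t(data)
        return data

# Because of duck typing, plain callables work as steps too:
pipe = Pipeline([str.lower, str.strip])
cleaned = pipe("  Persistence Diagram  ")  # "persistence diagram"
```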
see:
Could you please elaborate on what you mean by "There is not one single method that can be called"?
This second proposal is much closer to what I have already built. In particular, a few more comments:
the normalisation has to be fit (i.e. the mean and stddev are to be computed) on the output of the embeddings! Before that, when the data is just made of strings, having a normalisation does not really make sense.
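The ordering constraint can be illustrated with a toy sketch (class and variable names are hypothetical, and the "embedding" here is just string length): the mean and stddev are computed on the embedded vectors, never on the raw strings.

```python
import numpy as np

class Normalize:
    """Standardise vectors; must be fit on numeric data, i.e. embedding output."""
    def fit(self, vectors):
        self.mean = vectors.mean(axis=0)
        self.std = vectors.std(axis=0) + 1e-8  # avoid division by zero
        return self

    def __call__(self, vectors):
        return (vectors - self.mean) / self.std

# Hypothetical embedding step: strings -> vectors (here simply their lengths).
strings = ["a", "abc", "abcde"]
embedded = np.array([[len(s)] for s in strings], dtype=float)

norm = Normalize().fit(embedded)  # fit on embeddings, not on the strings
out = norm(embedded)
# out now has (approximately) zero mean and unit variance
```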
ad 1. I think it would make sense to also have the save/load functionality for the Pipeline, which would internally save/load each Transform in the pipeline. Or is the intention that there should be a way of just saving a pipeline configuration (i.e. creating JSON representations of arguments, etc.) to easily recreate it later?

ad 2. A Pipeline could also implement the fit/call functionality to add extra convenience. However, having the same type for both "normal" Transforms and Pipelines leads to ambiguities, which might be misunderstood. In my opinion it would make more sense to keep the types distinct: Transforms should be used as they are meant to be, and a Pipeline can already be implemented as almost the same thing, but with a different, pipeline-specific interface that still supports the fit/call functionality. Fitting a dataset to a whole pipeline makes a lot of sense to me.

ad 3. This is very similar to my point in "ad 2." — being clear and precise prevents a lot of mistakes. I therefore suggest having fixed private names for both the fit and transform functions in a base Transform class, to be called exactly as such, so we can avoid mix-ups and write robust code.
Thanks for the answer @raphaelreinauer! I think pipelines should be transformations because I have in mind the sklearn paradigm, in which pipelines have the same methods as a sklearn transformer: this means you can work with either (transforms or pipelines) without changing anything. Then, I would not implement
This one is solved. See:
Description: When dealing with persistence diagrams as input to machine learning models, one wants a generic way to handle data processing.
One such way is creating a Pipeline. This can be done by creating a generic Pipeline implemented with a chain of transformations. With that design, each transformation takes data as input and returns an output.
Users can easily add new transformations by registering them with the pipeline via a register method, so that the pipeline implementation itself does not have to change. Furthermore, the pipeline has to be easily storable as a JSON file so that it can be loaded later for inference. This is crucial since the trained model heavily depends on the transformations performed on the input data.
Template implementation:
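The original template code block did not survive in this copy; below is a hedged sketch of what a template along the lines described above could look like. All names (Pipeline, register, save, load, the step format) are assumptions, not the author's actual code.

```python
import json

class Pipeline:
    """Chain of named transformations applied in order."""
    _registry = {}  # maps registered names to transform classes

    @classmethod
    def register(cls, name):
        """Decorator: make a transform class known to the pipeline."""
        def wrap(transform_cls):
            cls._registry[name] = transform_cls
            return transform_cls
        return wrap

    def __init__(self, steps):
        # steps: list of (name, kwargs) pairs describing the configuration
        self.steps = steps
        self.transforms = [self._registry[n](**kw) for n, kw in steps]

    def __call__(self, data):
        for t in self.transforms:
            data = t(data)
        return data

    def save(self, path):
        """Persist only the configuration, so it can be recreated for inference."""
        with open(path, "w") as f:
            json.dump(self.steps, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls([(n, kw) for n, kw in json.load(f)])
```

Storing only the (name, kwargs) configuration keeps the JSON file small and decouples persistence from the transform classes themselves, matching the "recreate it later" reading of ad 1 above.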
Sample usage:
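The original usage snippet is also missing; here is a self-contained toy version of how usage might look under the chain-of-transformations design described above (Scale, Shift, and the minimal Pipeline are hypothetical stand-ins, not the actual transforms):

```python
class Scale:
    """Multiply every value by a fixed factor."""
    def __init__(self, factor):
        self.factor = factor

    def __call__(self, data):
        return [x * self.factor for x in data]

class Shift:
    """Add a fixed offset to every value."""
    def __init__(self, offset):
        self.offset = offset

    def __call__(self, data):
        return [x + self.offset for x in data]

class Pipeline:
    """Apply a list of transformations in order."""
    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, data):
        for t in self.transforms:
            data = t(data)
        return data

pipeline = Pipeline([Scale(2.0), Shift(1.0)])
out = pipeline([1.0, 2.0, 3.0])
print(out)  # [3.0, 5.0, 7.0]
```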