Preprocessing - What do we need? What do we have? #1
What do we have right now from that list?
I wonder how open the Rust language team(s) would be to a …
That's a good list! I'm not sure if it necessarily fits in 'preprocessing', but I would add tools for model selection:
If all estimator models implemented the same traits, we could use the same cross-validation framework over arbitrary learners. This could also include a common set of classification/regression metrics for model evaluation -- again, I'm not sure this is exactly 'preprocessing', but it definitely cuts across multiple areas of concern. Another possible thing to add: pipeline management. I haven't used the sklearn pipeline tools personally, but some mechanism to let users easily pipe a set of transformations and an estimator together might be useful.
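To make the shared-trait idea concrete, here is a minimal sketch of what a common estimator interface plus a generic k-fold cross-validation routine could look like. All names here (`Estimator`, `MeanModel`, `k_fold_indices`, `cross_val_mse`) are invented for illustration; nothing in the ecosystem defines them yet, and a real design would be generic over array types rather than `&[f64]`.

```rust
/// Hypothetical minimal supervised-estimator interface.
trait Estimator {
    fn fit(&mut self, x: &[f64], y: &[f64]);
    fn predict(&self, x: &[f64]) -> Vec<f64>;
}

/// A trivial model that always predicts the training-target mean.
struct MeanModel {
    mean: f64,
}

impl Estimator for MeanModel {
    fn fit(&mut self, _x: &[f64], y: &[f64]) {
        self.mean = y.iter().sum::<f64>() / y.len() as f64;
    }
    fn predict(&self, x: &[f64]) -> Vec<f64> {
        vec![self.mean; x.len()]
    }
}

/// Assign `n` sample indices round-robin into `k` folds.
fn k_fold_indices(n: usize, k: usize) -> Vec<Vec<usize>> {
    let mut folds: Vec<Vec<usize>> = vec![Vec::new(); k];
    for i in 0..n {
        folds[i % k].push(i);
    }
    folds
}

/// Mean squared error on each held-out fold, averaged across folds.
/// Works for *any* type implementing `Estimator`.
fn cross_val_mse<E: Estimator>(model: &mut E, x: &[f64], y: &[f64], k: usize) -> f64 {
    let folds = k_fold_indices(x.len(), k);
    let mut total = 0.0;
    for test in &folds {
        // Train on everything outside the current fold.
        let train: Vec<usize> = (0..x.len()).filter(|i| !test.contains(i)).collect();
        let (tx, ty): (Vec<f64>, Vec<f64>) = train.iter().map(|&i| (x[i], y[i])).unzip();
        model.fit(&tx, &ty);
        // Evaluate on the held-out fold.
        let px: Vec<f64> = test.iter().map(|&i| x[i]).collect();
        let py: Vec<f64> = test.iter().map(|&i| y[i]).collect();
        let preds = model.predict(&px);
        let mse = preds.iter().zip(&py).map(|(p, t)| (p - t).powi(2)).sum::<f64>()
            / py.len() as f64;
        total += mse;
    }
    total / k as f64
}

fn main() {
    let x: Vec<f64> = (0..10).map(|i| i as f64).collect();
    let y = x.clone(); // identity target
    let mut model = MeanModel { mean: 0.0 };
    println!("5-fold mean MSE: {:.3}", cross_val_mse(&mut model, &x, &y, 5));
}
```

The point is that `cross_val_mse` never mentions `MeanModel`; any learner implementing the trait plugs in unchanged, and the same shape would work for a shared metrics module.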
For missing-data representation, I feel like that should be handled at the DataFrame level (especially since the dataframe will likely be at least partially backed by Arrow, which already does this via a null bitmask), with imputation handled in the preprocessing library. The representation does get a bit tricky. I implemented a simple masked array in …
Relatedly, this NumPy missing-data proposal from 2011 is an interesting read, reviewing many of the issues that come up when implementing missing data.
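For anyone unfamiliar with the bitmask approach, here is a toy illustration of the idea: values are stored densely, and a separate validity bitmap (one bit per value, set = present) marks which entries are real. This is the general scheme Arrow uses, but the code below is an invented sketch (`MaskedColumn` is not a real Arrow or dataframe API), with a `mean` that skips nulls standing in for what an imputation step would consume.

```rust
/// Toy column: dense f64 values plus an Arrow-style validity bitmask.
/// Bit i set (LSB-first within each byte) means "value i is present".
struct MaskedColumn {
    values: Vec<f64>,
    validity: Vec<u8>,
}

impl MaskedColumn {
    fn from_options(data: &[Option<f64>]) -> Self {
        let mut values = Vec::with_capacity(data.len());
        let mut validity = vec![0u8; (data.len() + 7) / 8];
        for (i, v) in data.iter().enumerate() {
            match v {
                Some(x) => {
                    values.push(*x);
                    validity[i / 8] |= 1u8 << (i % 8); // mark valid
                }
                None => values.push(0.0), // placeholder under a null bit
            }
        }
        MaskedColumn { values, validity }
    }

    fn is_valid(&self, i: usize) -> bool {
        self.validity[i / 8] & (1u8 << (i % 8)) != 0
    }

    /// Mean over the non-null entries only.
    fn mean(&self) -> Option<f64> {
        let (mut sum, mut n) = (0.0, 0usize);
        for (i, v) in self.values.iter().enumerate() {
            if self.is_valid(i) {
                sum += *v;
                n += 1;
            }
        }
        if n == 0 { None } else { Some(sum / n as f64) }
    }
}

fn main() {
    let col = MaskedColumn::from_options(&[Some(1.0), None, Some(3.0)]);
    assert!(!col.is_valid(1));
    println!("mean ignoring nulls: {:?}", col.mean());
}
```

The nice property is that the values buffer stays a plain contiguous array (SIMD/BLAS-friendly), with nullness tracked out-of-band at one bit per element.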
A good list for a starting point.
The current ndarray-linalg lacks the components needed to implement PCA. As the scikit-learn documentation says, we need both a full SVD and a truncated SVD to implement the various types of PCA, but ndarray-linalg does not have a truncated SVD. I think these belong in linalg and are not limited to ML; ndarray-linalg could accept them.
Implementing the randomized truncated SVD solver would be quite useful. In scikit-learn it's the default solver for PCA and TruncatedSVD, and it is based on the paper by Halko et al., 2009. I think in practice it will often be faster for ML applications than a full SVD solver. Another topic is support for sparse data: TruncatedSVD in scikit-learn is often used on sparse data. In Rust, such a solver could be implemented e.g. on top of the sprs crate.
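To show the flavor of "compute only the leading components without a full SVD", here is the smallest possible sketch: power iteration recovering just the top singular value and right singular vector. To be clear, this is *not* the Halko et al. randomized algorithm (which uses a random range-finder plus QR and a small dense SVD), and `top_singular` is an invented name; it's only meant to illustrate why a truncated solver can touch the matrix through matrix-vector products alone, which is also exactly what makes sparse support natural.

```rust
// Dense matrix as rows of Vec<f64>, for illustration only.

/// y = A x
fn matvec(a: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    a.iter()
        .map(|row| row.iter().zip(x).map(|(r, v)| r * v).sum::<f64>())
        .collect()
}

/// y = A^T x
fn matvec_t(a: &[Vec<f64>], x: &[f64]) -> Vec<f64> {
    let cols = a[0].len();
    (0..cols)
        .map(|j| a.iter().zip(x).map(|(row, v)| row[j] * v).sum::<f64>())
        .collect()
}

/// Normalize in place, returning the original 2-norm.
fn normalize(v: &mut [f64]) -> f64 {
    let norm = v.iter().map(|x| x * x).sum::<f64>().sqrt();
    for x in v.iter_mut() {
        *x /= norm;
    }
    norm
}

/// Top singular value and right singular vector via power iteration
/// on A^T A. A truncated solver generalizes this idea to the top-k
/// subspace (and the randomized variant starts from k+p random vectors).
fn top_singular(a: &[Vec<f64>], iters: usize) -> (f64, Vec<f64>) {
    let mut v = vec![1.0; a[0].len()];
    normalize(&mut v);
    let mut sigma = 0.0;
    for _ in 0..iters {
        let mut av = matvec(a, &v); // A v
        sigma = normalize(&mut av); // ||A v|| estimates sigma_1
        v = matvec_t(a, &av); // A^T u
        normalize(&mut v);
    }
    (sigma, v)
}

fn main() {
    // diag(3, 1): the top singular value is 3, top right vector (1, 0).
    let a = vec![vec![3.0, 0.0], vec![0.0, 1.0]];
    let (sigma, v) = top_singular(&a, 50);
    println!("sigma ~ {:.4}, v ~ {:?}", sigma, v);
}
```

Since the matrix is only ever used through `matvec`/`matvec_t`, swapping in a sparse matrix-vector product (e.g. from sprs) would change nothing else in the solver.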
I think that type is already in the Rust language - it's …
Definitely - defining a …
Gotcha - I think it makes sense to spec out exactly what we need for each of those algorithms, and then we can start working on implementing the required primitives.
I managed to have a look at the NumPy document - if my understanding is correct, the …
Hey, I just wanted to add that MFCC/MFSC are common preprocessing steps for machine learning in the context of audio processing. If you want to build an ASR system, this decorrelates your pitch and formant functions and reduces the data complexity. They are also used in room classification, instrument detection - actually, anything that has to do with natural sound sources. A crate which does the windowing, transformation, etc. would be great!
I am not very familiar with the problem space, @bytesnake - could you provide some references and resources we can have a look at?
Here are some introductions to MFCCs
I wrote an MFCC library for a class recently; you can find it here: https://github.com/bytesnake/mfcc
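As a taste of what such a crate involves, here is one small piece of the MFCC pipeline: the Hz-to-mel conversion used when laying out the mel filterbank. The formulas are the standard HTK-style ones (mel = 2595 log10(1 + f/700)); everything else in the pipeline (framing, windowing, FFT, log energies, DCT) is omitted, and `mel_filter_centers` is an invented helper name, not an API from the mfcc crate above.

```rust
/// Convert a frequency in Hz to the mel scale (HTK-style formula).
fn hz_to_mel(hz: f64) -> f64 {
    2595.0 * (1.0 + hz / 700.0).log10()
}

/// Inverse mapping, mel back to Hz.
fn mel_to_hz(mel: f64) -> f64 {
    700.0 * (10f64.powf(mel / 2595.0) - 1.0)
}

/// Center frequencies (in Hz) of `n` triangular filters spaced
/// evenly on the mel scale between `f_min` and `f_max`.
fn mel_filter_centers(f_min: f64, f_max: f64, n: usize) -> Vec<f64> {
    let (m_lo, m_hi) = (hz_to_mel(f_min), hz_to_mel(f_max));
    (1..=n)
        .map(|i| m_lo + (m_hi - m_lo) * i as f64 / (n + 1) as f64)
        .map(mel_to_hz)
        .collect()
}

fn main() {
    // Filters bunch up at low frequencies, mirroring human pitch perception.
    let centers = mel_filter_centers(0.0, 8000.0, 10);
    println!("10 mel filter centers between 0 and 8 kHz: {:?}", centers);
}
```

The perceptual point shows up directly in the output: consecutive centers are much closer together at low frequencies than at high ones.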
Hey there, super interested in talking about ML in Rust!
Context: see rust-ml/discussion#1.
This is meant to be a list of functionality we want to implement (a roadmap?) - I have refrained from including more sophisticated methods, limiting it to what I believe to be a set of "core" routines we should absolutely offer.
For each piece of functionality I'd like to document what is already available in the Rust ecosystem.
This is meant to be a WIP list, so feel free to chip in, @jblondin, and edit/add things I might have missed.