Integration & synergy with vaex-ml #515
Replies: 2 comments 1 reply
-
Hey @JovanVeljanoski! This is cool in so many ways. I've been planning to do this exploration for so long but never got around to it. Thanks for doing this experiment! It's clear that there's a fruitful relationship to build between our two projects. It's cool because, as you say, there's not much work to do: you produce pandas dataframes and we consume them. It's almost too good to be true!

Clearly this is what we need to do with regard to users: we need an example notebook that we could duplicate on both our documentation websites.
What would vaex's chunked iteration look like on the river side? Something like this?

```python
for x_batch, y_batch in stream.iter_vaex(vaex_df, batch_size=10_000):
    model.learn_many(x_batch, y_batch)
```

Apart from this, I have a couple of candid questions for you:
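A `stream.iter_vaex` helper of this shape doesn't exist yet; as a rough sketch of the contract it would need to satisfy, here is a pandas-only stand-in. The function name, the `y_col` parameter, and the slicing loop are all illustrative assumptions; with a real vaex dataframe you would iterate `to_pandas_df(chunk_size=...)` instead of slicing:

```python
import pandas as pd

def iter_batches(df, y_col, batch_size=2):
    """Yield (X_batch, y_batch) pandas pairs from a dataframe.

    Hypothetical stand-in for the proposed `stream.iter_vaex`: with a
    vaex dataframe the loop body would consume chunks produced by
    `vaex_df.to_pandas_df(chunk_size=batch_size)` (which also
    pre-fetches the next chunk) rather than slicing with `.iloc`.
    """
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        # Split the chunk into features and target, as learn_many expects.
        yield chunk.drop(columns=[y_col]), chunk[y_col]

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
batches = list(iter_batches(df, "y", batch_size=2))
# 5 rows with batch_size=2 -> 3 batches of 2, 2, and 1 rows
```

Each `(X_batch, y_batch)` pair could then be fed straight into a model's `learn_many`.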
This is exciting!
-
Hi @MaxHalford, really happy to see the river project, and also thanks for the kind words in #399. @JovanVeljanoski started vaexio/vaex#1256, so I think the integration from our side should be merged and released soon. I think we should also have an example in our docs, and it would be great to have an example in your docs as well, possibly using the future

Regards,
Maarten
-
Hi @MaxHalford & co,
Great job on this library! It looks great and performs well in the (few) tests I've done.

I recently noticed #399. Thank you for the kind words. Last night I finally had a chance to try how river would integrate with vaex. I've not spent any time looking at the source; I am fully on the user side for now. I noticed that some models now support both online and mini-batch learning, which got me really excited.
So I did my "benchmark" test, which is substituting the `SGDRegressor` of scikit-learn with the `LinearModel` of river in this notebook. Using the latest version of vaex (4.0.0), the scikit-learn version of the notebook takes 3.6 minutes to train a model on 1 billion rows/samples with 15 features. The river version takes 2.9 minutes to train the linear model (same data, same hardware, same environment), so I would say this is a substantial improvement in performance. Again, great job! I've not tested the inference time yet.

In my mind, vaex as a (very large data) dataframe library and river as a streaming/online ML library go very well together: vaex dataframes implement `df.to_pandas_df(chunk_size=10_000)`, which returns a generator yielding pandas dataframes whose length is specified by the `chunk_size` argument. This method also does pre-fetching, so the next batch of data is ready by the time it is needed by river. And from the river side, well, you get a "stream" of pandas dataframes to learn on!

With all this in mind, I would very much like to (time permitting) write a small "plug-in" on the vaex side which would enable users to use river models in vaex-ml (so taking advantage of the vaex graph/pipeline/expression system). But before that I wanted to share this with you and get your thoughts and hopefully blessings. I am happy to contribute examples for both sides once this is done.
Also (time permitting), if you are interested in some kind of vaex dataframe connector, so users can use a vaex dataframe as the source of a river pipeline, I'd be happy to take a look at that (something similar to what I did in creme-ml some time ago).
Cheers,
J.