Integration & synergy with vaex-ml #515
Replies: 2 comments 1 reply
-
Hey @JovanVeljanoski! This is cool in so many ways. I've been planning to do this exploration for so long but never got around to it. Thanks for doing this experiment! It's clear that there's a fruitful relationship to build between our two projects. It's cool because, as you say, there's not much work to do: you produce pandas dataframes and we consume them. It's almost too good to be true!

Clearly this is what we need to do with regard to users: we need an example notebook that we could duplicate on both our documentation websites.
What would vaex's chunked iteration look like on the river side? Something like this?

```python
for x_batch, y_batch in stream.iter_vaex(vaex_df, batch_size=10_000):
    model.learn_many(x_batch, y_batch)
```

Apart from this, I have a couple of candid questions for you:
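A `stream.iter_vaex` helper of this shape doesn't exist yet; as a rough sketch of the contract it would need to satisfy, here is a pandas-only stand-in. The function name, the `y_col` parameter, and the slicing loop are all illustrative assumptions; with a real vaex dataframe you would iterate `to_pandas_df(chunk_size=...)` instead of slicing:

```python
import pandas as pd

def iter_batches(df, y_col, batch_size=2):
    """Yield (X_batch, y_batch) pandas pairs from a dataframe.

    Hypothetical stand-in for the proposed `stream.iter_vaex`: with a
    vaex dataframe the loop body would consume chunks produced by
    `vaex_df.to_pandas_df(chunk_size=batch_size)` (which also
    pre-fetches the next chunk) rather than slicing with `.iloc`.
    """
    for start in range(0, len(df), batch_size):
        chunk = df.iloc[start:start + batch_size]
        # Split the chunk into features and target, as learn_many expects.
        yield chunk.drop(columns=[y_col]), chunk[y_col]

df = pd.DataFrame({"x1": [1, 2, 3, 4, 5], "y": [2, 4, 6, 8, 10]})
batches = list(iter_batches(df, "y", batch_size=2))
# 5 rows with batch_size=2 -> 3 batches of 2, 2, and 1 rows
```

Each `(X_batch, y_batch)` pair could then be fed straight into a model's `learn_many`.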
This is exciting!
-
Hi @MaxHalford, really happy to see the river project, and also thanks for the kind words in #399. @JovanVeljanoski started vaexio/vaex#1256, so I think the integration from our side should be merged and released soon. I think we should also have an example in our docs, and it would be great to have an example in your docs as well, possibly using the future

Regards,
Maarten
-
Hi @MaxHalford & co,
Great job on this library! It looks great and performs well in the (few) tests I've done.

I recently noticed #399. Thank you for the kind words. Last night I finally had a chance to try how river would integrate with vaex. I've not spent any time looking at the source; I am fully on the user side for now. I noticed that some models now support both online and mini-batch learning, which got me really excited.
So I did my "benchmark" test, which is substituting the `SGDRegressor` of scikit-learn with the `LinearModel` of river in this notebook. Using the latest version of vaex (4.0.0), the scikit-learn version of the notebook takes 3.6 minutes to train a model on 1 billion rows/samples with 15 features. The river version takes 2.9 minutes to train the linear model (same data, same hardware, same environment), so I would say this is a substantial improvement in performance. Again, great job! I've not tested the inference time yet.

In my mind, vaex as a (very large data) dataframe library and river as a streaming/online ML library go very well together: vaex dataframes implement `df.to_pandas_df(chunk_size=10_000)`, which returns a generator yielding pandas dataframes whose length is specified by the `chunk_size` argument. This method also does pre-fetching, so the next batch of data is ready by the time it is needed by river. And from the river side, well, you get a "stream" of pandas dataframes to learn on!

With all this in mind, I would very much like to (time permitting) write a small "plug-in" on the vaex side which would enable users to use river models in vaex-ml (so taking advantage of the vaex graph/pipeline/expression system). But before that I wanted to share this with you and get your thoughts and hopefully blessings. I am happy to contribute examples for both sides once this is done.
Also (time permitting), if you are interested in some kind of vaex dataframe connector, so users can use a vaex dataframe as the source of a river pipeline, I'd be happy to take a look at that (something similar to what I did in creme-ml some time ago).
Cheers,
J.