Metric evaluation in stream learning - train test split #843
-
Hi! First of all, thanks to the developers for this awesome library. I'm not sure if this is the correct place to ask, as my question is not directly related to River, but I couldn't find a good answer elsewhere. I'm trying to compare the performance of batch learning vs stream learning; that is, I expect stream learning to perform better on datasets with significant concept drift. However, I still lack some understanding of stream learning, and I hope I can get my answer in this community.

My question is: do we split the dataset into train and test sets like in normal batch learning? I can't see a reason not to split them; otherwise I can't get a metric that reflects my model's performance in real operation. But none of the examples here, nor the papers about incremental learning that I've read, show the dataset being split. If it's not split, how can I make a fair comparison between batch learning and stream learning?

I also have some questions about River's evaluate module. What is the difference between using progressive_val_score and writing a for loop that predicts each input and updates the metric with its output? I noticed the evaluate module is much faster, but I'm not sure how to use it with a pandas DataFrame.

Lastly, how does incremental learning work in practice? It can't learn from unlabelled data (for supervised learning), correct? So after training, the model is no longer updated, right? Thanks
-
Hello there and welcome!
Usually, no. What you describe is akin to cross-validation, and is a batch machine learning concept. In an online setting, you usually do progressive validation. The consequence is that batch and online models are not really comparable.
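The predict-then-learn loop behind progressive validation can be sketched with no dependencies at all. The `RunningMean` model and the MAE bookkeeping below are made up purely for illustration; the point is the ordering: predict on a sample *before* its label is used for learning, so every prediction is an out-of-sample prediction.

```python
# Sketch of progressive (prequential) validation with a toy model.
# `RunningMean` is a hypothetical stand-in for any online model.

class RunningMean:
    """Predicts the mean of all targets seen so far."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def predict_one(self, x):
        return self.total / self.n if self.n else 0.0

    def learn_one(self, x, y):
        self.n += 1
        self.total += y

# A toy stream of (features, label) pairs.
stream = [({"i": i}, float(i % 2)) for i in range(100)]

model = RunningMean()
abs_errors = []
for x, y in stream:
    y_pred = model.predict_one(x)       # 1. predict before seeing the label
    abs_errors.append(abs(y - y_pred))  # 2. update the metric
    model.learn_one(x, y)               # 3. only then learn from the label

mae = sum(abs_errors) / len(abs_errors)
```

Every sample is thus used for testing exactly once and for training exactly once, which is why no separate hold-out split is needed.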
You could do cross-validation to evaluate an online model. If you're doing that, you're just treating batch as a special case of online.
That is essentially what `evaluate.progressive_val_score` does: it runs the same predict / update-the-metric / learn loop for you.
Incremental supervised models need labels to learn, yes. But there are also incremental unsupervised models. I hope that helps!