Unbalanced Data Streaming #445

Haoyu-R · 2021-01-19T16:09:12Z

Hi Team,

Thanks for this great library. I just found it today and am going to give it a try. I have a small question, though: in online learning we cannot control the order in which the data stream arrives (unlike offline learning, where we can shuffle). What happens if we keep receiving data from a single class? Will this degrade the model, or is there some mechanism under the hood to handle it?

Cheers.


raphaelsty · 2021-01-19T16:25:16Z

Maintainer

Hello,

Yes, the model will be influenced if a class is over-represented in the stream.

I don't know if this is exactly what you are looking for, but there is an imblearn module.

For example, the HardSamplingClassifier class allows you to store the n latest observations that are the most difficult to predict and, with some probability, re-train the model on them.

https://riverml.xyz/latest/api/imblearn/HardSamplingClassifier/

Also take a look at imblearn.RandomSampler, imblearn.RandomOverSampler, and imblearn.RandomUnderSampler, which may help you obtain the desired class distribution.

The drift module may also help you detect any change in trend.

Raphaël
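To make the under-sampling idea mentioned above more concrete, here is a minimal plain-Python sketch of rejection sampling on a stream. It is not river's implementation: the function name `undersample` and its parameters are hypothetical, chosen only for illustration. Each sample is kept with probability f(y) / (M * g(y)), where f is the desired class distribution, g is the class distribution observed so far, and M is the largest ratio f(c) / g(c) over the classes seen.

```python
import collections
import random


def undersample(stream, desired_dist, seed=None):
    """Yield a subset of (x, y) pairs whose class mix approaches desired_dist.

    Hypothetical helper for illustration; not part of river.
    """
    rng = random.Random(seed)
    counts = collections.Counter()
    n = 0
    for x, y in stream:
        counts[y] += 1
        n += 1
        # Class frequencies observed so far (g)
        g = {c: counts[c] / n for c in counts}
        # Largest desired/observed ratio among the classes seen so far (M)
        M = max(desired_dist.get(c, 0.0) / g[c] for c in g)
        if M == 0:
            # None of the classes seen so far is wanted: drop the sample
            continue
        keep_prob = desired_dist.get(y, 0.0) / (M * g[y])
        if rng.random() < keep_prob:
            yield x, y


# Synthetic stream: roughly 10% of the samples belong to class 1
data_rng = random.Random(0)
stream = [(i, 1 if data_rng.random() < 0.1 else 0) for i in range(20000)]

kept = list(undersample(stream, desired_dist={0: 0.5, 1: 0.5}, seed=42))
minority_share = sum(y for _, y in kept) / len(kept)
print(f"kept {len(kept)} of {len(stream)} samples, class 1 share: {minority_share:.2f}")
```

In a real pipeline, the model would learn only from the samples that are yielded. Note that this relies on running class frequencies, so it rebalances a mixed stream but does not interleave classes that arrive in long consecutive blocks.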

4 replies

MaxHalford Jan 19, 2021
Maintainer

Hi @Haoyu-R, everything Raphaël said is correct. You can read this blog post if you want to understand a bit more how the undersampling works. Also note that I ported the same to PyTorch in this package.

Haoyu-R Jan 19, 2021
Author

Hi Raphaël @raphaelsty and Max @MaxHalford ,

Thanks for the rapid reply. I just came up with an interesting case for further illustrating my question:

Suppose we want to train an ML model on a power-constrained device to recognize different patterns in the data collected by an on-board sensor. Because the device's memory is limited, we can only train the model in an online fashion. We train it by putting the device into different scenarios (think of these as different classes) one by one. That is, we train the model on one class using a lot of data streamed in that scenario, then switch to the next scenario/class and do the same, and so on. We do it this way because switching back and forth between scenarios is quite cumbersome.

I'm not sure how this will affect the performance of the model compared to a randomly shuffled data stream. But I will try to implement this using river and have a look. I am looking forward to your further input.

Cheers.

Haoyu-R Jan 19, 2021
Author

Hi Max Halford @MaxHalford,

Thanks for the information. I will definitely have a look tonight. Looking forward to using river! ;-)

Cheers.

MaxHalford Jan 19, 2021
Maintainer

> I'm not sure how this will affect the performance of the model compared to a randomly shuffled data stream. But I will try to implement this using river and have a look. I am looking forward to your further input.

The performance will definitely be degraded, trust me. I definitely recommend shuffling the data in some way. Note that we have a shuffle method for streaming data, but that might not alleviate the fact that you're showing the classes sequentially.
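The caveat about shuffling a stream can be illustrated with a small plain-Python sketch of a bounded-buffer shuffle. This is an assumption about how such a shuffle typically works, not river's code, and `buffered_shuffle` is a hypothetical name. With classes arriving in long consecutive blocks, a buffer much smaller than a block can only mix samples near the block boundaries.

```python
import random


def buffered_shuffle(stream, buffer_size, seed=None):
    """Shuffle a stream using a fixed-size buffer (assumes buffer_size >= 1).

    Hypothetical helper for illustration; not part of river.
    """
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and store the new item in its slot
        i = rng.randrange(buffer_size)
        yield buffer[i]
        buffer[i] = item
    # End of stream: emit the remaining items in random order
    rng.shuffle(buffer)
    yield from buffer


# Two classes shown one after the other, as in the sequential-scenario setup
labels = ["A"] * 1000 + ["B"] * 1000
shuffled = list(buffered_shuffle(labels, buffer_size=100, seed=0))

# Until the first "B" enters the buffer, only "A" can come out
print(set(shuffled[:900]))
```

Here the first 900 emitted labels are all "A", because no "B" has reached the buffer yet. Only a buffer comparable in size to a whole class block would mix the classes thoroughly, which defeats the purpose on a memory-limited device.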

HarshaAsh · 2021-05-05T09:03:35Z

There are two ways to give weights to different classes to handle unbalanced datasets:

1. Sampling techniques
2. Modifying the loss function

For sampling, there are upsampling, downsampling, hard sampling, and combined sampling techniques. These are available natively in river's imblearn module.

The other technique is to modify the loss function. Some loss functions in river are made for streaming natively, such as the cross-entropy loss, which has a parameter for class weights: https://riverml.xyz/dev/api/optim/losses/CrossEntropy/. Similarly, for batch training (neural nets), there are loss functions made specifically for class imbalance: https://towardsdatascience.com/handling-class-imbalanced-data-using-a-loss-specifically-made-for-it-6e58fd65ffab.
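As a rough illustration of the loss-weighting approach, here is a plain-Python sketch of a class-weighted binary log loss and its gradient with respect to the logit of a sigmoid output. The function names and the `class_weight` dictionary layout are assumptions made for this example, not river's API. The point is simply that mistakes on a heavily weighted class produce proportionally larger losses and updates.

```python
import math


def weighted_log_loss(y_true, p_pred, class_weight):
    """Binary log loss for one sample, scaled by the weight of its true class.

    Hypothetical helper for illustration; not part of river.
    """
    eps = 1e-15
    # Clip the probability to avoid log(0)
    p = min(max(p_pred, eps), 1 - eps)
    loss = -math.log(p) if y_true == 1 else -math.log(1 - p)
    return class_weight.get(y_true, 1.0) * loss


def weighted_gradient(y_true, p_pred, class_weight):
    """Gradient of the weighted loss w.r.t. the logit, for p_pred = sigmoid(logit)."""
    return class_weight.get(y_true, 1.0) * (p_pred - y_true)


# Assumed weights: the rare class 1 counts nine times as much as class 0
weights = {0: 1.0, 1: 9.0}

loss_rare = weighted_log_loss(1, 0.5, weights)
loss_common = weighted_log_loss(0, 0.5, weights)
print(loss_rare / loss_common)
```

With these assumed weights, an equally uncertain prediction costs nine times more on the rare class than on the common one, so a stochastic gradient step moves the model nine times further for that sample.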

0 replies
