Unbalanced Data Streaming #445
-
Hi Team, thanks for this great library. I just found it today and am going to give it a try. However, I have a small question: in online learning we cannot control the order in which the data stream arrives (unlike offline learning, where we can shuffle). What if we keep receiving data from a certain class? Will this degrade the model? Or is there some mechanism under the hood to handle this? Cheers.
-
There are two ways to handle unbalanced datasets by giving different weights to different classes: sampling techniques and modifying the loss function.
For sampling, the options are upsampling, downsampling, hard sampling, and combined sampling. These are available natively in the river package under imblearn.
The other technique is to modify the loss function. Some loss functions in river are made for streaming natively, such as the cross-entropy loss, which has a parameter for class weights: https://riverml.xyz/dev/api/optim/losses/CrossEntropy/. Similarly, for batch training (neural nets), there are loss functions made specifically for class-imbalanced data: https://towardsdatascience.com/handling-class-imbalanced-data-using-a-loss-specifically-made-for-it-6e58fd65ffab.
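To make the class-weight idea concrete, here is a minimal pure-Python sketch of a per-sample cross-entropy loss scaled by the weight of the true class. It only illustrates the principle and is not river's implementation; the function name `weighted_cross_entropy`, its arguments, and the example labels and weights are all hypothetical.

```python
import math


def weighted_cross_entropy(y_true, y_pred, class_weight=None):
    """Cross-entropy for one sample, scaled by the weight of its true class.

    y_true: the true label.
    y_pred: dict mapping each label to its predicted probability.
    class_weight: optional dict mapping labels to weights (default weight 1.0).
    """
    class_weight = class_weight or {}
    # Clip the probability to avoid log(0) when the model is confidently wrong.
    p = min(max(y_pred.get(y_true, 0.0), 1e-15), 1.0)
    return -class_weight.get(y_true, 1.0) * math.log(p)


# Hypothetical weights: errors on the rare "fraud" class cost five times more.
weights = {"fraud": 5.0, "ok": 1.0}
prediction = {"fraud": 0.5, "ok": 0.5}

loss_rare = weighted_cross_entropy("fraud", prediction, weights)
loss_common = weighted_cross_entropy("ok", prediction, weights)
```

With the same predicted probability, a mistake on the up-weighted class produces a proportionally larger loss, so each rare sample pulls harder on the model update.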
-
Hello,

The model is indeed likely to be affected if a class is over-represented in the stream.

I don't know if this is exactly what you are looking for, but there is an `imblearn` module. For example, the `HardSamplingClassifier` class allows you to store the latest observations that are the most difficult to predict and re-train the model on these data with some probability: https://riverml.xyz/latest/api/imblearn/HardSamplingClassifier/

Also take a look at `imblearn.RandomSampler`, `imblearn.RandomOverSampler` and `imblearn.RandomUnderSampler`, which may help you get the desired data distribution. The `drift` module may also help you detect any changes in trend.

Raphaël
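As a rough illustration of how under-sampling a stream toward a desired class distribution can work, here is a stdlib-only sketch. It is an assumption about the general idea, not river's actual code: the class `StreamUnderSampler` and its `accept` method are hypothetical names, and `desired_dist` is assumed to contain a strictly positive entry for every label that appears in the stream.

```python
import random
from collections import Counter


class StreamUnderSampler:
    """Decides, sample by sample, whether to train on an incoming label."""

    def __init__(self, desired_dist, seed=None):
        self.desired = desired_dist      # e.g. {"a": 0.5, "b": 0.5}
        self.seen = Counter()            # running class counts
        self.rng = random.Random(seed)

    def accept(self, y):
        self.seen[y] += 1
        total = sum(self.seen.values())
        # Observed class frequencies so far.
        actual = {c: n / total for c, n in self.seen.items()}
        # Pivot: the class that is scarcest relative to its desired share.
        pivot = min(actual, key=lambda c: actual[c] / self.desired[c])
        # The pivot class is always kept (ratio == 1); others are thinned out.
        ratio = (self.desired[y] / self.desired[pivot]) * (actual[pivot] / actual[y])
        return self.rng.random() < ratio


# Usage: a stream with nine "a" labels for every "b" label.
sampler = StreamUnderSampler({"a": 0.5, "b": 0.5}, seed=42)
kept = Counter()
for _ in range(200):
    for y in ["a"] * 9 + ["b"]:
        if sampler.accept(y):
            kept[y] += 1  # this is where the model would learn from the sample
```

The trade-off of under-sampling is that the model sees fewer majority-class samples overall; over-sampling keeps every sample and instead trains several times on the rare class.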