Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stratified sampling #1

Open
morinim opened this issue May 20, 2017 · 0 comments
Open

Stratified sampling #1

morinim opened this issue May 20, 2017 · 0 comments

Comments

@morinim
Copy link
Owner

morinim commented May 20, 2017

Once you split up the data into train, validation and test set, chances are close to 100% that your already skewed data becomes even more unbalanced for at least one of the three resulting sets.

Think about it: let’s say your data set contains 1000 records and of those 20 are labelled as “fraud”. As soon as you split up the data set into train, validation and test set, it is possible that the train set contains the majority of the “fraud”-records or maybe even all of them. Although not as likely, the same could happen for the validation or test set, which is even worse because then the machine learning algorithm has no chance at all to learn the hidden patterns of “fraud”-records.

This is why you shouldn’t use random sampling when constructing the three different data sets, but stratified sampling instead. It assures that the train, validation and test sets are well balanced. Therewith the already existing problem of skewed classes is not intensified, which is what you want when creating high quality models.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant