Classification dataset cleaning #2900
Conversation
@Samoed, could we merge this pull request?

Looks good to me; I have requested a review for a second opinion.

This is great! Could you point me to the script you used to clean these datasets? I wasn't able to find it in this PR or in the linked PR.

@isaac-chung, I didn't add it to the repository because the code is somewhat tangled and redundant in places. It also currently supports only classification tasks (except multilingual ones). I plan to improve it. Here is the link to the Gist with the code.

Great stuff! Let's commit this to the script folder as is, and then improve on it. At the end of the day, this code is used to make changes to the public datasets, so it should be (and should have been) recorded. The script folder is not super cleaned up anyway, so we shouldn't let perfect get in the way of good. Also, that way others can contribute for other tasks if they want to.

@isaac-chung, I agree with you; I have added my script.
I am continuing the work on data cleaning. This pull request is a follow-up to the work started in #2632: data cleaning has been performed for all datasets in the classification section except the multilingual ones, which I will handle in a separate pull request. The same cleaning rules described in the comment #2632 (comment) are applied here.

Additionally, I have introduced a new filter, `filter_one_sample_labels`, which is useful in rare cases. It is applied to datasets that lack a test split and contain classes with only one example. Such classes are removed because a single sample cannot be divided between the train and test sets.
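The idea behind `filter_one_sample_labels` can be sketched as follows. This is a hypothetical minimal re-implementation operating on a plain list of examples, not the PR's actual code (the real filter in the script works on Hugging Face datasets):

```python
from collections import Counter


def filter_one_sample_labels(examples: list[dict]) -> list[dict]:
    """Drop every class that has only a single example.

    A class with one sample cannot appear in both the train and test
    splits, so it must be removed before splitting. `examples` is a
    list of dicts with "text" and "label" keys (illustrative schema).
    """
    label_counts = Counter(ex["label"] for ex in examples)
    return [ex for ex in examples if label_counts[ex["label"]] > 1]
```

For example, given two samples of class `0` and one sample of class `1`, the singleton class `1` is dropped and both class-`0` samples are kept.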
The `RuNLUIntentClassification` task has been moved to the `multilingual` folder, as it is multilingual but was located in the `rus` folder.

Here is a comparison of scores for all modified datasets on the `intfloat/multilingual-e5-small` and `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` models.

You can find the table with the original sizes of the datasets here. The table showing the changes made to each dataset is available here. I am adding it as a separate file in the gist because the table is too large to include in a comment.