Classification dataset cleaning #2900

Conversation

AlexeyVatolin (Contributor) commented Jul 13, 2025

I am continuing the work on data cleaning. This pull request is a follow-up to the work started in #2632: data cleaning has been performed for all datasets in the classification section, except for multilingual ones, which I will handle in a separate pull request. The same cleaning rules described in #2632 (comment) are applied here.

Additionally, I have introduced a new filter, filter_one_sample_labels, which is useful in rare cases. It is applied to datasets that lack a test split and contain classes with only one example; such classes are removed because a single example cannot be divided between the train and test sets.
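A minimal sketch of what such a filter might look like (this is an illustrative reimplementation, not the PR's actual code, which lives in the gist linked below; the function name and signature are assumptions):

```python
from collections import Counter


def filter_one_sample_labels(texts, labels):
    """Drop every class that occurs only once in the dataset.

    A class with a single example cannot contribute to both the train
    and the test split, so its examples are removed before splitting.
    """
    counts = Counter(labels)
    keep = [i for i, label in enumerate(labels) if counts[label] > 1]
    return [texts[i] for i in keep], [labels[i] for i in keep]
```

For example, given labels `["x", "x", "y"]`, the single-example class `"y"` would be dropped and only the `"x"` examples kept.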

The RuNLUIntentClassification task has been moved to the multilingual folder, as it is multilingual, but it was located in the rus folder.

Here is a comparison of scores for all modified datasets on the intfloat/multilingual-e5-small and sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 models.

You can find the table with the original sizes of the datasets here.

The table showing the changes made to each dataset is available here.


I am adding this as a separate file in the gist because the table is too large to include in a comment.


@AlexeyVatolin AlexeyVatolin marked this pull request as ready for review July 13, 2025 17:21
AlexeyVatolin (Contributor, Author)

@Samoed, could we merge this pull request?

Samoed (Member) commented Jul 15, 2025

Looks good to me; I have requested a review for a second opinion.

isaac-chung (Collaborator)

This is great! Could you point me to the script you used to clean these datasets? I wasn't able to find it in this PR or in the linked PR.

AlexeyVatolin (Contributor, Author)

@isaac-chung, I didn't add it to the repository because the code is somewhat tangled and redundant in places, and it currently supports only classification tasks (excluding multilingual ones). I plan to improve it. Here is the link to the Gist with the code.

isaac-chung (Collaborator) commented Jul 16, 2025

Great stuff! Let's commit this to the script folder as is, and then improve on it. At the end of the day, this code is used to make changes to the public datasets, so it should be (and should have been) recorded. The script folder is not super cleaned up anyway, so we shouldn't let perfect get in the way of good.

Also, that way others can contribute too if they wanted to for other tasks.

AlexeyVatolin (Contributor, Author)

@isaac-chung, I agree; I have added my script.

@isaac-chung isaac-chung merged commit aef1e33 into embeddings-benchmark:main Jul 19, 2025
9 checks passed
@AlexeyVatolin AlexeyVatolin deleted the classification_dataset_cleaning_iter_2 branch July 19, 2025 09:00