Classification dataset cleaning #2900

Conversation

AlexeyVatolin (Contributor) commented Jul 13, 2025

I am continuing the work on data cleaning. This pull request is a follow-up to the work started in #2632: data cleaning has been performed for all datasets in the classification section, except for multilingual ones, which I will handle in a separate pull request. The same cleaning rules described in #2632 (comment) are applied here.

Additionally, I have introduced a new filter, filter_one_sample_labels, which is useful in rare cases. It is applied to datasets that lack a test split and contain classes with only one example; such classes are removed because a single example cannot be divided between the train and test sets.
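A minimal sketch of what such a filter might look like (this is an illustrative reimplementation, not the PR's actual code, which lives in the gist linked below; the function name and signature are assumptions):

```python
from collections import Counter


def filter_one_sample_labels(texts, labels):
    """Drop every class that occurs only once in the dataset.

    A class with a single example cannot contribute to both the train
    and the test split, so its examples are removed before splitting.
    """
    counts = Counter(labels)
    keep = [i for i, label in enumerate(labels) if counts[label] > 1]
    return [texts[i] for i in keep], [labels[i] for i in keep]
```

For example, given labels `["x", "x", "y"]`, the single-example class `"y"` would be dropped and only the `"x"` examples kept.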

The RuNLUIntentClassification task has been moved to the multilingual folder, as it is multilingual, but it was located in the rus folder.

Here is a comparison of scores for all modified datasets on the intfloat/multilingual-e5-small and sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 models.

You can find the table with the original sizes of the datasets here.

The table showing the changes made to each dataset is available here.


I am adding this as a separate file in the gist because the table is too large to include in a comment.


@AlexeyVatolin AlexeyVatolin marked this pull request as ready for review July 13, 2025 17:21
AlexeyVatolin (Contributor, Author)

@Samoed, could we merge this pull request?

Samoed (Member) commented Jul 15, 2025

Looks good to me; I have requested a review for a second opinion.

isaac-chung (Collaborator)

This is great! Could you point me to the script you used to clean these datasets? I wasn't able to find it in this PR or in the linked PR.

AlexeyVatolin (Contributor, Author)

@isaac-chung, I didn't add it to the repository because the code is somewhat tangled and redundant in places, and it currently supports only classification tasks (excluding multilingual ones). I plan to improve it. Here is the link to the Gist with the code.

isaac-chung (Collaborator) commented Jul 16, 2025

Great stuff! Let's commit this to the script folder as is, and then improve on it. At the end of the day, this code is used to make changes to the public datasets, so it should be (and should have been) recorded. The script folder is not super cleaned up anyway, so we shouldn't let perfect get in the way of good.

Also, that way others can contribute too if they wanted to for other tasks.

AlexeyVatolin (Contributor, Author)

@isaac-chung, I agree; I have added my script.

@isaac-chung isaac-chung merged commit aef1e33 into embeddings-benchmark:main Jul 19, 2025
9 checks passed
@AlexeyVatolin AlexeyVatolin deleted the classification_dataset_cleaning_iter_2 branch July 19, 2025 09:00