
[EVAL] Add TUMLU benchmark #577

Open
gaydmi opened this issue Feb 19, 2025 · 10 comments

gaydmi (Contributor) commented Feb 19, 2025

Hello!
We just released a benchmark for Turkic languages. Would it make sense for me to add it to lighteval?

Evaluation short description

  • Why is this evaluation interesting?
    It is the first native-language MMLU-style benchmark for low-resource Turkic languages.

  • How is it used in the community?
    Just released; it consists of multiple-choice high-school exam questions.

Evaluation metadata


clefourrier (Member) commented:

cc @hynky1999, this could interest you I feel!

clefourrier (Member) commented:

Is the dataset already on Hugging Face?

gaydmi (Contributor, Author) commented Feb 19, 2025

@clefourrier Not exactly (it's in gated repos), but everything is already on GitHub.

clefourrier (Member) commented:

Gated sounds fine, can you share the path?

hynky1999 (Collaborator) commented:

Hi, I think it would be a very nice addition; we already have TurkishMMLU (which I think is also part of your dataset, right?).

To add it, we would need the following:

  1. Have translation literals for the languages you want to add: (https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/multilingual/tasks.py#L2133)
  2. Add the dataset to hub
  3. Replace the TurkishMMLU with your dataset

Do you think you could do that? cc @gaydmi
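For reference, the three steps above typically come together as a community-task definition. The following is an untested sketch only: the prompt function, metric choice, subset name, and the exact `LightevalTaskConfig` fields are assumptions drawn from lighteval's community-task pattern and should be checked against the current API (the dataset path `jafarisbarov/TUMLU-mini` is the one shared later in this thread).

```python
# Hypothetical TUMLU task registration following lighteval's
# community-task pattern; field names and helper modules are
# assumptions to verify against the current lighteval API.
from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
import lighteval.tasks.default_prompts as prompt  # assumed prompt module

tumlu_azerbaijani = LightevalTaskConfig(
    name="tumlu:azerbaijani",
    suite=["community"],
    prompt_function=prompt.mmlu_harness,  # assumed MMLU-style prompt fn
    hf_repo="jafarisbarov/TUMLU-mini",
    hf_subset="azerbaijani",              # one subset per language
    evaluation_splits=["test"],
    metric=[Metrics.loglikelihood_acc],
)

# Community tasks are usually exposed via a module-level list:
TASKS_TABLE = [tumlu_azerbaijani]
```

One such config per language (or per language-subject pair, depending on the layout chosen below) would then replace the existing TurkishMMLU entries.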

ceferisbarov commented:

@gaydmi Thank you for bringing this up!

@hynky1999 I have a question. Our dataset can be split into subsets in three ways: (a) make each language a subset, (b) make each subject a subset, (c) make each language-subject combination a subset. Which one would you suggest? I could not find any similar examples in the repo.

gaydmi (Contributor, Author) commented Feb 24, 2025

@hynky1999 Hi, yes, working on it!
@ceferisbarov I personally think option (c) is the best, so we could just add new languages with their tasks.
Like in here: https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/multilingual/tasks.py#L2617

hynky1999 (Collaborator) commented:

I would say ideally use subsets for the languages, and then add a column to identify the actual task (subject). You can then use the hf_filter arg on the task.
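To illustrate the suggestion: with one subset per language and a subject column on each row, selecting a single task is just a row-level predicate, which is the shape lighteval's hf_filter argument expects (a callable from a dataset row to a bool). A minimal sketch, with hypothetical column names and plain dicts standing in for a `datasets` split:

```python
# Toy rows standing in for one language subset of TUMLU;
# the "subject" column name is a hypothetical choice.
rows = [
    {"question": "q1", "subject": "biology", "answer": "A"},
    {"question": "q2", "subject": "history", "answer": "C"},
    {"question": "q3", "subject": "biology", "answer": "B"},
]

def make_subject_filter(subject):
    """Build a row -> bool predicate, the shape hf_filter expects."""
    return lambda row: row["subject"] == subject

# datasets.Dataset.filter would apply the same predicate; here we
# filter plain dicts to keep the sketch self-contained.
biology_rows = list(filter(make_subject_filter("biology"), rows))
print([r["question"] for r in biology_rows])  # ['q1', 'q3']
```

The upside of this layout over per-pair subsets is that adding a new subject requires no new subset, only new rows.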

ceferisbarov commented:

Both options sound good to me. I have added the dataset to Hugging Face:

https://huggingface.co/datasets/jafarisbarov/TUMLU-mini

@gaydmi let me know if I can help in any other way.

hynky1999 (Collaborator) commented:

Awesome. cc @gaydmi, happy to review the PR once it's ready.
