CLI Tool for results dataframe on leaderboard #2454

Open · wants to merge 9 commits into main

Conversation

@ayush1298 (Collaborator) commented Mar 28, 2025

This PR adds a CLI tool for inspecting results and creating a dataframe from them, as suggested in issue #2174.
Closes #2174.

To use it, the command is:

mteb create-table --results <results_path> --models <model1> <model2> --benchmark <benchmark_name1> <benchmark_name2> --aggregation-level <aggregation_level> --output <output_file>
  1. results_path: Path to the results folder. Works with a local results repo or the official MTEB results repository.
  2. models: One or more model names, to compare multiple models side by side across different tasks.
  3. benchmark_name: Name of the benchmark (optional).
  4. aggregation_level: One of three aggregation levels:
  • subset: Results for each subset within each split for each task
  • split: Results aggregated over subsets for each split for each task
  • task: Results aggregated over subsets and splits for each task
  5. Multiple output formats: CSV, Excel, or Markdown (based on the file extension)
  6. output_path: Path to save the output tables

A sample run comparing the openai/text-embedding-3-large and openai/text-embedding-3-small models uses the following command:

mteb create-table --results https://github.com/embeddings-benchmark/results \
                 --models "openai/text-embedding-3-large" "openai/text-embedding-3-small" \
                 --benchmark "MTEB(Multilingual, v1)" \
                 --aggregation-level subset \
                 --output comparison_table4.csv

Depending on the aggregation level, we get different results. Since the full CSV file is very large, I am sharing below the results of these two models on the AmazonCounterfactualClassification task, which has test and validation splits and four subsets (de, en, en-ext, ja), to verify that the aggregation is actually working.

  • subset aggregation level (screenshot)
  • split aggregation level (screenshot)
  • task aggregation level (screenshot)

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

@Samoed (Member) left a comment

Great addition!

@ayush1298 (Collaborator, Author) commented Mar 28, 2025

Updated the PR description with the CLI command to run, a sample example, and the results and their verification.

@ayush1298 requested a review from Samoed March 29, 2025 09:56
@Samoed (Member) left a comment

Great! Can you create a test for this like test_create_meta_from_existing?

mteb/tests/test_cli.py

Lines 137 to 138 in 43adb0c

def test_create_meta_from_existing(existing_readme_name: str, gold_readme_name: str):
    """Test create_meta function directly as well as through the command line interface"""

@Samoed (Member) commented Mar 29, 2025

I tried to run this command, but it is not working with local dirs. I ran mteb create-table --results v2/ --output example.md, but it tried to load all results from the results repo
v2.zip

@ayush1298 (Collaborator, Author)

I tried to run this command, but it is not working with local dirs. I tried to run with mteb create-table --results v2/ --output example.md, but it tried to load all results from results repo v2.zip

There does seem to be some issue with a local dir, as I got the same thing. I will try to track it down. You can try with the GitHub URL given in the PR description; that was working as expected.

@ayush1298 (Collaborator, Author)

@Samoed, it's working now with a local path as well. Also, for testing it with a local results repo, should I add the results of both these models under tests/results?

@ayush1298 requested a review from Samoed March 31, 2025 13:15
@ayush1298 (Collaborator, Author)

@KennethEnevoldsen Can you please review this PR?

@Samoed requested a review from KennethEnevoldsen and removed the request for Samoed March 31, 2025 19:04
@Samoed (Member) left a comment

Great work!

@KennethEnevoldsen (Contributor) left a comment

Thanks for this PR.

Currently it seems like we haven't really added a meaningful interface in Python.

I think there are a few use cases that we can see:

  1. I have run a model and want to see how well it compares to another model
  2. I make a PR to embeddings-benchmark/results and want to create a markdown table of performances

There are probably more, but I think it is worth spending some more time on getting this one right.

Feel free to just outline how you think it might look in a test.

@pytest.mark.parametrize("benchmark", ["MTEB(eng, v1)", "MTEB(Multilingual, v1)"])
@pytest.mark.parametrize("aggregation_level", ["subset", "split", "task"])
@pytest.mark.parametrize("output_format", ["csv", "md", "xlsx"])
def test_create_table(
@KennethEnevoldsen (Contributor) left a comment

I would love an additional test that doesn't go through the CLI.

@ayush1298 (Collaborator, Author) left a comment

@KennethEnevoldsen, how do we test this without the CLI, given that we mainly depend on the output file that the CLI produces?
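One way to test without going through the CLI would be to call the underlying function directly and assert on the returned DataFrame instead of on a written file. A sketch only; the import path and signature of create_comparison_table are assumptions:

```python
import pandas as pd

# Import path assumed; adjust to wherever the helper actually lives.
from mteb.cli import create_comparison_table


def test_create_comparison_table_direct():
    """Hypothetical direct test: build the table in memory, no output file."""
    df = create_comparison_table(
        results="tests/results",  # assumed local fixture
        models=["sentence-transformers/average_word_embeddings_komninos"],
        aggregation_level="task",
    )
    assert isinstance(df, pd.DataFrame)
    assert not df.empty
```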

@Samoed (Member) commented Apr 2, 2025

Currently it seems like we haven't really added a meaningful interface in python.

I've created a table with create_comparison_table, and it worked fine. Do you want to create a table from a dict?

I make a PR to embeddings-benchmark/results and want to create a markdown table of performances

I think this table can be created using the current CLI. Just the model name needs to be specified.

I have run a model and want to see how well it compared to another model

You want to compare results from different directories?

@KennethEnevoldsen (Contributor) commented Apr 2, 2025

But there is no documentation on the Python interface and no tests for the functionality.

You want to compare results from different directories?

Taking a user perspective: I just evaluated my model (let's say a fine-tune of e5) and now want to compare it with the original model. I don't care what directory something is in.

@Samoed (Member) commented Apr 2, 2025

Taking a user perspective: I just evaluated my model (let's say a fine-tune of e5) and now want to compare it with the original model. I don't care what directory something is in.

I think this can be done with an additional command (or a flag on the current command) to compare current results with the leaderboard or with tasks from a folder, but maybe this can be done in another PR.

@ayush1298 (Collaborator, Author)

  1. I make a PR to embeddings-benchmark/results and want to create a markdown table of performances

Are you saying that the PR is not merged and the user wants to check results before merging?

But there is no documentation on the python interface and no tests for functionality

If you are asking about adding documentation related to this, I have added how to use it at:

mteb/mteb/cli.py

Lines 79 to 89 in 89db9ac

## Creating Comparison Tables
To create comparison tables between models based on various aggregation levels (task, split, or subset), use the `mteb create-table` command. For example:
```bash
mteb create-table --results results/ \
    --models "intfloat/multilingual-e5-small" "intfloat/multilingual-e5-base" \
    --benchmark "MTEB(eng, v1)" \
    --aggregation-level task \
    --output comparison_table.csv
```
"""
I also shared some of the results in the PR description. If you want me to add this to readme.md or to a separate file, I will add that at the end once we have finalized the rest.

Taking a user perspective: I just evaluated my model (let's say a fine-tune of e5) and now want to compare the it with the original model. I don't care what directory something is in.

I am assuming that the fine-tuned e5 has also been run and results have been produced for it, but they are not uploaded to the results repo. In that case the results will be saved under mteb/results, so the user can either pass that as the results path, or, as @Samoed suggested, we can add an additional argument that tells whether the model results are present in the results repo. We can set the default to true, and if the user passes false, we set the results path to mteb/results instead of the default results repo. This way we can also avoid the results argument (which takes results_path) by setting its default to the results repo (GitHub URL). One more thing that could be done, in load_results or maybe in a separate function, is to check directly whether the results folder is present locally (I don't know exactly how to do that); if it exists we would not clone the results repo again, and if it does not, the same function would clone the repo.
Then the user just has to give the names of the models to get the tables.
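A rough sketch of the local-path check described above, assuming the results argument is either a local directory or a Git URL; the helper name and the cache location are made up:

```python
import subprocess
from pathlib import Path

DEFAULT_RESULTS_REPO = "https://github.com/embeddings-benchmark/results"


def resolve_results_path(results: str = DEFAULT_RESULTS_REPO,
                         cache_dir: Path = Path(".mteb_results_cache")) -> Path:
    """Hypothetical helper: reuse a local results folder if it exists,
    otherwise clone the results repository once and reuse the clone."""
    local = Path(results)
    if local.exists() and local.is_dir():
        return local
    if not cache_dir.exists():
        subprocess.run(
            ["git", "clone", "--depth", "1", results, str(cache_dir)],
            check=True,
        )
    return cache_dir
```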

@KennethEnevoldsen (Contributor)

Hmm, I will skip many of the comments here. This is not to be rude, but rather to focus on what I think is the main problem.

While I agree that this PR targets the Issue, it does it in a very direct way. This is a problem because then the CLI dictates the Python interface. This makes it inflexible and hard to test (we can only test the input/output).

Here are a few examples of why I don't think create_comparison_table is a good Python interface:

  • if I already have a mteb.load_results(...) I would have to recreate that to get to a dataframe
  • it is a different interface than load_results and it is not compatible (e.g. exclude_meta is forced)
  • If I want to check something in the results before I make the table that is impossible
  • If something is missing in the table, I have no idea why/when it was filtered out

So here is an alternative suggestion:

models = ...  # desired models
benchmark = mteb.get_benchmark("MTEB(eng, v2)")

benchmark_results = benchmark.load_results(models=models)
df = benchmark_results.to_dataframe(aggregate="task")

This might not be perfect, but I think it is definitely more flexible than the current implementation.
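With an interface like that, the CLI handler could become a thin wrapper over the Python API. A sketch only, reusing the names from the suggestion above; the args fields mirror the existing flags, and none of this is a fixed API:

```python
import mteb


def create_table_command(args) -> None:
    """Hypothetical CLI handler built on top of the suggested Python interface."""
    benchmark = mteb.get_benchmark(args.benchmark)
    benchmark_results = benchmark.load_results(models=args.models)
    df = benchmark_results.to_dataframe(aggregate=args.aggregation_level)
    # Output format inferred from the file extension, as in the current PR.
    if args.output.endswith(".md"):
        df.to_markdown(args.output)
    elif args.output.endswith(".xlsx"):
        df.to_excel(args.output, index=False)
    else:
        df.to_csv(args.output, index=False)
```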

@ayush1298 (Collaborator, Author) commented Apr 4, 2025

@KennethEnevoldsen, sorry for the overly long comments previously.

if I already have a mteb.load_results(...) I would have to recreate that to get to a dataframe

Using the implementation you suggested, the user still has to run mteb.load_results(...) at least once. And in the current implementation, using the code below,

results_path = Path(results_repo)
if results_path.exists() and results_path.is_dir():
    logger.info(f"Using local results repository at {results_path}")
    return results_path

if it has run once, then it will reuse that path and not load it again when we use create_comparison_table again.

If I want to check something in the results before I make the table that is impossible

What kind of checking are you referring to here (any example)? I think our main aim behind this was just to provide an easy way of comparing results.

If something is missing in the table, I have no idea why/when it was filtered out

Am I missing something here? In the whole code we are not doing any specific filtering.

@Samoed (Member) commented Apr 4, 2025

I've used this command for creating tables (thanks for this feature!) in #2415 and I think there can be some improvements:

  1. Add a flag for transposing tables, because when I have 2-3 tasks and 5 models it would be simpler to have the tasks as headers (see the sketch after this list).
  2. load_results expects all data to be in a results folder, but I can have a different folder name, and it is a bit confusing to have to use custom_name/results just to create the table correctly:
    repo_directory = download_of_results(results_repo, download_latest=download_latest)
    model_paths = [p for p in (repo_directory / "results").glob("*") if p.is_dir()]
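For the first point, transposing could be a small post-processing step on the finished DataFrame, e.g. behind a --transpose flag. A sketch only; the flag and helper do not exist yet:

```python
import pandas as pd


def maybe_transpose(df: pd.DataFrame, transpose: bool) -> pd.DataFrame:
    """Hypothetical helper for a --transpose flag: swap rows and columns so
    that tasks become the column headers instead of the row index."""
    return df.T if transpose else df
```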

@KennethEnevoldsen (Contributor) commented Apr 6, 2025

Using implementation that you suggested...

if it run it once, then it will use that only and not load again

Hmm, yes, but if I already have it in memory, then I wouldn't. E.g.:

# analysis script
results = mteb.load_results(...)

# Some checks and analyses (e.g. task X is included, all tasks are run etc.)
# potentially custom filtering, e.g. mteb version higher than X

# create table
create_comparison_table(...)
# requires respecification of arguments
# does not guarantee that the results are the same as what is checked
# has to reload the results

Just to clarify, I think this is a good change, but I do want to standardize the way we convert the results object to tables (it should be the same code for the leaderboard, python API and CLI). So while it is good I don't think it is quite there yet.

If the suggested changes are unclear (which I can see that they might be), I can also write a suggested solution, and we can discuss it from there (but that will not be before Thursday).

@ayush1298 (Collaborator, Author)

@KennethEnevoldsen I understand some of it, but it would be good if you could write the suggested solution whenever you are free after Thursday. I will work on it then.

@KennethEnevoldsen (Contributor)

Perfect, I have set a reminder :)

KennethEnevoldsen added a commit that referenced this pull request Apr 14, 2025
- Added `get_results_table`. I was considering renaming it to `to_dataframe` to align with `tasks.to_dataframe`. WDYT?
- Added a tests for ModelResults and BenchmarksResults
- Added a few utility functions where needed
- Added docstring throughout ModelResults and BenchmarksResults
- Added todo comment for missing aspects - mostly v2 - but we join_revisions seems like it could use an update before then.

Prerequisite for #2454:

@ayush1298 can I ask you to review this PR as well? I hope this give an idea of what I was hinting at. Sorry that it took a while. I wanted to make sure to get it right.
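A minimal usage sketch of the utilities described in that commit; the benchmark.load_results and to_dataframe names follow the suggestion earlier in this thread and the commit message, and the exact signatures may differ in the merged #2542:

```python
import mteb

# Load results for a couple of models on a benchmark, then convert to a table.
models = ["intfloat/multilingual-e5-small", "intfloat/multilingual-e5-base"]
benchmark = mteb.get_benchmark("MTEB(eng, v1)")

benchmark_results = benchmark.load_results(models=models)
df = benchmark_results.to_dataframe(aggregate="task")
df.to_csv("comparison_table.csv")
```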
KennethEnevoldsen added a commit that referenced this pull request Apr 16, 2025
* fix: Added dataframe utilities to BenchmarkResults


* refactor to to_dataframe and combine common dependencies

* ibid

* fix revision joining after discussion with @x-tabdeveloping

* remove strict=True for zip() as it is a >3.9 feature

* updated mock cache
@ayush1298 (Collaborator, Author)

@KennethEnevoldsen After the merge of #2542, is there any work left to be done in this particular PR? I saw that you added "Prerequisite for https://github.com/embeddings-benchmark/mteb/pull/2454" in the description of that PR.
Sorry for replying after many days, I was busy because of exams.

@Samoed (Member) commented May 2, 2025

  1. I think you can first solve these problems: #2454 (comment)
  2. And these comments were left unresolved: #2454 (review)

@ayush1298 (Collaborator, Author)

  1. I think you can first solve these problems: #2454 (comment)
  2. And these comments were left unresolved: #2454 (review)

Aren't all of these solved after #2542? Kenneth wanted a function like to_dataframe() instead of the CLI, and he implemented it.

@Samoed (Member) commented May 2, 2025

No, they're not solved. #2542 added some functions to simplify working with BenchmarkResults and does not address these issues.

@ayush1298 (Collaborator, Author)

No, they're not solved. #2542 added some functions to simplify working with BenchmarkResults and does not address these issues.

So, we still want this CLI tool, right?
And using the updated functionality from #2542, we want to simplify this CLI tool, correct?
Should I continue in this PR only, or close it and create a new one?

@Samoed (Member) commented May 2, 2025

So, we still want this CLI tool, right?

Yes, and also a Python API.

And using the updated functionality from #2542, we want to simplify this CLI tool, correct?

Yes

Should I continue in this PR only, or close it and create a new one?

You can continue this PR

Samoed added a commit that referenced this pull request May 3, 2025
isaac-chung added a commit that referenced this pull request May 3, 2025
@KennethEnevoldsen (Contributor)

Yes, as @Samoed says, the PR that I made only adds the dataframe utilities required for this PR.

You would have to refactor this PR to rely on the dataframe utilities (to ensure that we consistently do it the same way).

@KennethEnevoldsen (Contributor)

@ayush1298 just pinging you here

@ayush1298 (Collaborator, Author)

@ayush1298 just pinging you here

Sorry, forgot about this one. Will start working on this again on weekends.

@KennethEnevoldsen (Contributor)

@ayush1298 seems like we never finished up this PR, do you still have time to work on it?

Successfully merging this pull request may close these issues:

Add CLI tool for inspecting and creating dataframe of the results on a give leaderboard