I have four questions.
- Is there an automated way to aggregate the results of a custom model so they can be used in a local leaderboard?
  Right now the results contain scores for both the validation and test splits, across several languages. I was not sure whether I should only report the average stats over all languages, and only for the `test` split. Can you please clarify? (For context, a minimal aggregation sketch is included after this list.)
- When I download the result table via `mteb.load_results(tasks=tasks)`, it gives me a bunch of warnings like:
  ```
  MTOPDomainClassification: Missing subsets {'en'} for split test
  MassiveIntentClassification: Missing subsets {'en'} for split test
  MassiveScenarioClassification: Missing subsets {'en'} for split test
  MassiveIntentClassification: Missing subsets {'en'} for split test
  ```
  Is this expected? If I want an apples-to-apples comparison, I assume I also need to remove the `en` results of the `test` split for these datasets from my custom model's results, right?
- What metrics are reported for each task type/dataset? For example, for classification tasks, do you report F1 score or accuracy? And a similar question for the other tasks such as retrieval, summarization, etc.
  From the paper I found the following, but wanted to confirm (a small metadata lookup sketch also follows the list):
{"BitextMining": "F1", "Classification": "accuracy", "Clustering": "v_measure",
"PairClassification": "cosine_ap", "Reranking": "map", "Retrieval": "ndcg_at_10",
"STS": "cosine_spearman", "Summarization": "cosine_spearman"}
- The per-dataset results reported on the leaderboard differ from the results that can be downloaded via `results = mteb.load_results(tasks=tasks)` for some models. For example, on the `STS17` dataset, `"google/gemini-embedding-exp-03-07"` shows `88.57` on the leaderboard, while `mteb.load_results(tasks=tasks)` gives `91.6`. Any idea why?
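
For context on the first question, this is roughly the kind of aggregation I have in mind for a local leaderboard. It is only a sketch: it assumes the standard per-task result JSON files that `mteb` writes (a `scores` dict keyed by split, with one entry per subset/language carrying a `main_score`), and the output path and split name below are just illustrative.

```python
# Sketch: build a small local leaderboard from the result files written by a
# custom model run. Assumes one JSON file per task with a "scores" dict keyed
# by split; each split entry carries a "main_score". Paths are illustrative.
import json
from pathlib import Path
from statistics import mean


def local_leaderboard(results_dir: str, split: str = "test") -> dict[str, float]:
    """Return {task_name: mean main_score over all subsets of `split`}."""
    per_task: dict[str, float] = {}
    for path in Path(results_dir).glob("*.json"):
        data = json.loads(path.read_text())
        entries = data.get("scores", {}).get(split, [])
        if not entries:
            continue  # e.g. model_meta.json, or a task without this split
        per_task[path.stem] = mean(entry["main_score"] for entry in entries)
    return per_task


scores = local_leaderboard("results/my-custom-model/no_revision_available")
for task_name, score in sorted(scores.items()):
    print(f"{task_name}: {score:.4f}")
print(f"Average over tasks: {mean(scores.values()):.4f}")
```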
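
For the metrics question, this is the kind of check I had in mind to confirm the mapping above, assuming a recent `mteb` version where `get_tasks()` is available and each task exposes `metadata.main_score`; the task names are only examples.

```python
import mteb

# Print the task type and the main metric reported for each task;
# the task names here are only examples.
for task in mteb.get_tasks(tasks=["Banking77Classification", "STS17", "SciFact"]):
    print(task.metadata.name, task.metadata.type, task.metadata.main_score)
```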