
Evaluate GritLM-7B on MTEB datasets #57

Open
ThisisXXZ opened this issue Nov 4, 2024 · 9 comments

ThisisXXZ commented Nov 4, 2024

I am trying to evaluate GritLM-7B on MTEB datasets using the provided script:

#!/bin/bash

python /home/e/e1347696/unified_encoder_decoder/src/eval/MTEB/eval_mteb.py \
    --model_name_or_path /home/e/e1347696/unified_encoder_decoder/model/GritLM-7B \
    --output_folder /home/e/e1347696/unified_encoder_decoder/src/results/GritLM-7B-mteb \
    --task_types Classification,Clustering,PairClassification,Reranking,Retrieval,STS,Summarization \
    --batch_size 32

However, the model seems to have been evaluated on only the following datasets:

  • AmazonCounterfactualClassification
  • AmazonReviewsClassification
  • MassiveIntentClassification
  • MassiveScenarioClassification
  • MTOPDomainClassification
  • MTOPIntentClassification
  • STS17
  • STS22

Other datasets seem to be skipped. The output log is shown here:

Created GritLM: torch.bfloat16 dtype, mean pool, embedding mode, bbcc attn
GritLM-7B instruction for AmazonCounterfactualClassification:  <|user|>
Classify a given Amazon customer review text as either counterfactual or not-counterfactual
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonCounterfactualClassification, s2s, multilingual 1 / 4 Subsets


GritLM-7B instruction for AmazonReviewsClassification:  <|user|>
Classify the given Amazon review into its appropriate rating category
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MasakhaNEWSClassification
GritLM-7B instruction for MassiveIntentClassification:  <|user|>
Given a user utterance as query, find the user intents
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MassiveScenarioClassification:  <|user|>
Given a user utterance as query, find the user scenarios
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MTOPDomainClassification:  <|user|>
Classify the intent domain of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets


GritLM-7B instruction for MTOPIntentClassification:  <|user|>
Classify the intent of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MultiHateClassification
Skipping task: MultilingualSentimentClassification
Skipping task: NusaX-senti
Skipping task: SIB200Classification
Skipping task: SouthAfricanLangClassification
Skipping task: MasakhaNEWSClusteringP2P
Skipping task: MasakhaNEWSClusteringS2S
Skipping task: SIB200ClusteringS2S
Skipping task: BelebeleRetrieval
Skipping task: MIRACLRetrieval
Skipping task: MIRACLRetrievalHardNegatives
Skipping task: MLQARetrieval
Skipping task: MultiLongDocRetrieval
Skipping task: WikipediaRetrievalMultilingual
Skipping task: XMarket
Skipping task: XQuADRetrieval
Skipping task: OpusparcusPC
Skipping task: PawsXPairClassification
Skipping task: RTE3
Skipping task: XNLI
Skipping task: MIRACLReranking
Skipping task: WikipediaRerankingMultilingual
Skipping task: SemRel24STS
GritLM-7B instruction for STS17:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS17, s2s, multilingual 1 / 11 Subsets


Skipping task: STS22.v2
GritLM-7B instruction for STS22:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS22, p2p, multilingual 1 / 18 Subsets


Skipping task: STSBenchmarkMultilingualSTS

The error log also contains warnings such as:

The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
The `task_langs` argument is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(... languages = [...])` to filter tasks instead. Note that this uses 3 letter language codes (ISO 639-3).
Passing task names as strings is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(tasks=[...])` method to get tasks instead.
The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
Dataset 'STS22' is superseeded by 'STS22.v2', you might consider using the newer version of the dataset.

I would really appreciate it if you could help me with this. Thank you so much!

ThisisXXZ changed the title from "Evaluate GritLM-7" to "Evaluate GritLM-7 on MTEB datasets" on Nov 4, 2024
ThisisXXZ changed the title from "Evaluate GritLM-7 on MTEB datasets" to "Evaluate GritLM-7B on MTEB datasets" on Nov 4, 2024
Muennighoff (Collaborator) commented

This is on purpose & happens here:

print('Skipping task: ' + task_name)

It only evaluates the 56 main MTEB English datasets & skips the others.

The warnings are fine.
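
For reference, the skip is just a membership check against a hard-coded list of English task names. A minimal sketch (TASK_LIST_EN and filter_tasks are illustrative names, not the exact identifiers in eval_mteb.py):

# Sketch of the skip logic: only tasks on the English MTEB list are kept.
TASK_LIST_EN = {
    "AmazonCounterfactualClassification",
    "AmazonReviewsClassification",
    "STS17",
    "STS22",
}

def filter_tasks(all_task_names):
    selected = []
    for task_name in all_task_names:
        if task_name not in TASK_LIST_EN:
            print('Skipping task: ' + task_name)
            continue
        selected.append(task_name)
    return selected

print(filter_tasks(["STS17", "MIRACLReranking", "XNLI"]))  # -> ['STS17']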

ThisisXXZ (Author) commented

> This is on purpose & happens here:
>
> print('Skipping task: ' + task_name)
>
> It only evaluates the 56 main MTEB English datasets & skips the others.
> The warnings are fine.

Thank you very much! I noticed that only 8 tasks were evaluated: 6 classification tasks and 2 STS tasks. I'd like to evaluate GritLM-7B on all the tasks mentioned in the paper and compare the results. Could you please guide me on how to proceed?
Here are the results:
[screenshot: classification and STS results]
I want to compare them with the paper, but I don't see any clustering, reranking, or retrieval tasks.
[screenshot: results table from the paper]
Thank you so much! Sorry if I asked something dumb; I'm new to this field 🐱

Muennighoff (Collaborator) commented

Oh sorry, it seems the latest version of MTEB made some changes that render the eval script in this repository outdated.

I just changed the repo's requirements to pin a different mteb version here: #58. Can you try downgrading your mteb to the version in that PR (pip install mteb==1.4.0) and check that it works?

(If you want to use the latest mteb, it should also work via something like the snippet below:

# !pip install mteb gritlm
import mteb
model_name = "GritLM/GritLM-7B"
revision = "13f00a0e36500c80ce12870ea513846a066004af"
model = mteb.get_model(model_name, revision=revision)
benchmark = mteb.get_benchmark("MTEB(eng, classic)")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model)

)
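
One more note: per the deprecation message in your log, recent mteb versions take the batch size through encode_kwargs on run() rather than as a batch_size argument, e.g.:

# Batch size is now passed via encode_kwargs (per the deprecation warning).
results = evaluation.run(model, encode_kwargs={"batch_size": 32})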

ThisisXXZ (Author) commented

> Oh sorry, it seems the latest version of MTEB made some changes that render the eval script in this repository outdated.
>
> I just changed the repo's requirements to pin a different mteb version here: #58. Can you try downgrading your mteb to the version in that PR (pip install mteb==1.4.0) and check that it works?
>
> (If you want to use the latest mteb, it should also work via something like the snippet below:
>
> # !pip install mteb gritlm
> import mteb
> model_name = "GritLM/GritLM-7B"
> revision = "13f00a0e36500c80ce12870ea513846a066004af"
> model = mteb.get_model(model_name, revision=revision)
> benchmark = mteb.get_benchmark("MTEB(eng, classic)")
> evaluation = mteb.MTEB(tasks=benchmark)
> results = evaluation.run(model)
>
> )

It has begun evaluating on the other datasets, thanks! Also, I'd like to know: is a single A100-80GB GPU sufficient for evaluating on MTEB?

Muennighoff (Collaborator) commented

I think that is sufficient; it will just take a while (especially the retrieval datasets).

ThisisXXZ (Author) commented

> I think that is sufficient; it will just take a while (especially the retrieval datasets).

Hi! The evaluation proceeds fine until the MindSmallReranking dataset. I'm using mteb==1.4.0 and datasets==3.0.2, and the complete error message is shown below:

Failed to load JSON from file 'gzip://train.jsonl::/home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small/snapshots/3bdac13927fdc888b903db93b2ffdbd90b295a69/train.jsonl.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Error while evaluating MindSmallReranking: An error occurred while generating the dataset
Traceback (most recent call last):
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 160, in _generate_tables
    df = pandas_read_json(f)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 38, in pandas_read_json
    return pd.read_json(path_or_buf, **kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 815, in read_json
    return json_reader.read()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1025, in read
    obj = self._get_object_parser(self.data)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1187, in parse
    self._parse()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1403, in _parse
    ujson_loads(json, precise_float=self.precise_float), dtype=None
ValueError: Unexpected character found when decoding 'true'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1853, in _prepare_split_single
    for _, table in generator:
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 163, in _generate_tables
    raise e
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 137, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/e/e1347696/unified_encoder_decoder/eval/MTEB/eval_mteb.py", line 1202, in <module>
    evaluation.run(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 336, in run
    raise e
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 302, in run
    task.load_data(eval_splits=task_eval_splits, **kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/abstasks/AbsTask.py", line 37, in load_data
    self.dataset = datasets.load_dataset(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 999, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1740, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1896, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Muennighoff (Collaborator) commented

Looks like a corrupted download. You can try deleting /home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small and letting it re-download, or else directly download the files from https://huggingface.co/datasets/mteb/mind_small/tree/3bdac13927fdc888b903db93b2ffdbd90b295a69
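
For example, something along these lines should force a fresh copy (a sketch: snapshot_download comes from huggingface_hub, and the cache path and revision are taken from your traceback):

import shutil
from huggingface_hub import snapshot_download

# Remove the (possibly corrupted) cached copy; path taken from the traceback.
shutil.rmtree(
    "/home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small",
    ignore_errors=True,
)

# Re-download the dataset files at the pinned revision.
snapshot_download(
    repo_id="mteb/mind_small",
    repo_type="dataset",
    revision="3bdac13927fdc888b903db93b2ffdbd90b295a69",
)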

ThisisXXZ (Author) commented Nov 8, 2024

> Looks like a corrupted download. You can try deleting /home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small and letting it re-download, or else directly download the files from https://huggingface.co/datasets/mteb/mind_small/tree/3bdac13927fdc888b903db93b2ffdbd90b295a69

I've tried clearing the cache, but the error persists. I found a closed issue in the MTEB repo that describes the same problem. Do I need to downgrade datasets to evaluate MindSmallReranking?

Thank you so much!

Muennighoff (Collaborator) commented

Hm, yeah, maybe try downgrading.
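
For example (an assumption on my part: since the failure occurs with datasets==3.0.2, pinning below 3.x is a reasonable first try; the linked MTEB issue should confirm the exact version):

# Assumption: a pre-3.x datasets release still parses this jsonl.gz file.
# !pip install "datasets<3"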
