
Evaluate GritLM-7B on MTEB datasets #57

Open
ThisisXXZ opened this issue Nov 4, 2024 · 9 comments

ThisisXXZ commented Nov 4, 2024

I am trying to evaluate GritLM-7B on MTEB datasets using the provided script:

#!/bin/bash

python /home/e/e1347696/unified_encoder_decoder/src/eval/MTEB/eval_mteb.py \
    --model_name_or_path /home/e/e1347696/unified_encoder_decoder/model/GritLM-7B \
    --output_folder /home/e/e1347696/unified_encoder_decoder/src/results/GritLM-7B-mteb \
    --task_types Classification,Clustering,PairClassification,Reranking,Retrieval,STS,Summarization \
    --batch_size 32

However, the model seems to have been evaluated on only the following datasets:

  • AmazonCounterfactualClassification
  • AmazonReviewsClassification
  • MassiveIntentClassification
  • MassiveScenarioClassification
  • MTOPDomainClassification
  • MTOPIntentClassification
  • STS17
  • STS22

Other datasets seem to be skipped. The output log is shown here:

Created GritLM: torch.bfloat16 dtype, mean pool, embedding mode, bbcc attn
GritLM-7B instruction for AmazonCounterfactualClassification:  <|user|>
Classify a given Amazon customer review text as either counterfactual or not-counterfactual
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonCounterfactualClassification, s2s, multilingual 1 / 4 Subsets


GritLM-7B instruction for AmazonReviewsClassification:  <|user|>
Classify the given Amazon review into its appropriate rating category
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - AmazonReviewsClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MasakhaNEWSClassification
GritLM-7B instruction for MassiveIntentClassification:  <|user|>
Given a user utterance as query, find the user intents
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveIntentClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MassiveScenarioClassification:  <|user|>
Given a user utterance as query, find the user scenarios
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MassiveScenarioClassification, s2s, multilingual 1 / 51 Subsets


GritLM-7B instruction for MTOPDomainClassification:  <|user|>
Classify the intent domain of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPDomainClassification, s2s, multilingual 1 / 6 Subsets


GritLM-7B instruction for MTOPIntentClassification:  <|user|>
Classify the intent of the given utterance in task-oriented conversation
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
Classification
    - MTOPIntentClassification, s2s, multilingual 1 / 6 Subsets


Skipping task: MultiHateClassification
Skipping task: MultilingualSentimentClassification
Skipping task: NusaX-senti
Skipping task: SIB200Classification
Skipping task: SouthAfricanLangClassification
Skipping task: MasakhaNEWSClusteringP2P
Skipping task: MasakhaNEWSClusteringS2S
Skipping task: SIB200ClusteringS2S
Skipping task: BelebeleRetrieval
Skipping task: MIRACLRetrieval
Skipping task: MIRACLRetrievalHardNegatives
Skipping task: MLQARetrieval
Skipping task: MultiLongDocRetrieval
Skipping task: WikipediaRetrievalMultilingual
Skipping task: XMarket
Skipping task: XQuADRetrieval
Skipping task: OpusparcusPC
Skipping task: PawsXPairClassification
Skipping task: RTE3
Skipping task: XNLI
Skipping task: MIRACLReranking
Skipping task: WikipediaRerankingMultilingual
Skipping task: SemRel24STS
GritLM-7B instruction for STS17:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS17, s2s, multilingual 1 / 11 Subsets


Skipping task: STS22.v2
GritLM-7B instruction for STS22:  <|user|>
Retrieve semantically similar text.
<|embed|>

─────────────────────────────── Selected tasks  ────────────────────────────────
STS
    - STS22, p2p, multilingual 1 / 18 Subsets


Skipping task: STSBenchmarkMultilingualSTS

The error log also contains warnings such as:

The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
The `task_langs` argument is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(... languages = [...])` to filter tasks instead. Note that this uses 3 letter language codes (ISO 639-3).
Passing task names as strings is deprecated and will be removed in the next release. Please use `tasks = mteb.get_tasks(tasks=[...])` method to get tasks instead.
The `batch_size` argument is deprecated and will be removed in the next release. Please use `encode_kwargs = {'batch_size': ...}` to set the batch size instead.
Failed to extract metadata from model: 'GritLM' object has no attribute 'model_card_data'. Upgrading to sentence-transformers v3.0.0 or above is recommended.
Dataset 'STS22' is superseeded by 'STS22.v2', you might consider using the newer version of the dataset.

I would really appreciate it if you could help me with this. Thank you so much!

ThisisXXZ changed the title from "Evaluate GritLM-7" to "Evaluate GritLM-7 on MTEB datasets" on Nov 4, 2024
ThisisXXZ changed the title from "Evaluate GritLM-7 on MTEB datasets" to "Evaluate GritLM-7B on MTEB datasets" on Nov 4, 2024
Muennighoff (Collaborator) commented

This is on purpose & happens here:

print('Skipping task: ' + task_name)

It only evaluates the 56 main MTEB English datasets & skips the others.

The warnings are fine.
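
For reference, the skip is just a membership check against a hard-coded list of English task names. A minimal sketch (TASK_LIST_EN and filter_tasks are illustrative names, not the exact identifiers in eval_mteb.py):

# Sketch of the skip logic: only tasks on the English MTEB list are kept.
TASK_LIST_EN = {
    "AmazonCounterfactualClassification",
    "AmazonReviewsClassification",
    "STS17",
    "STS22",
}

def filter_tasks(all_task_names):
    selected = []
    for task_name in all_task_names:
        if task_name not in TASK_LIST_EN:
            print('Skipping task: ' + task_name)
            continue
        selected.append(task_name)
    return selected

print(filter_tasks(["STS17", "MIRACLReranking", "XNLI"]))  # -> ['STS17']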

ThisisXXZ (Author) commented

> This is on purpose & happens here:
>
> print('Skipping task: ' + task_name)
>
> It only evaluates the 56 main MTEB English datasets & skips the others.
> The warnings are fine.

Thank you very much! I noticed that only 8 tasks were evaluated: 6 classification tasks and 2 STS tasks. I'd like to evaluate GritLM-7B on all the tasks mentioned in the paper and compare the results. Could you please guide me on how to proceed?
Here are the results:
[screenshot: classification and STS results]
I want to compare them with the paper, but I don't see any clustering, reranking, or retrieval tasks.
[screenshot: results table from the paper]
Thank you so much! Sorry if I asked something dumb; I'm new to this field 🐱

Muennighoff (Collaborator) commented

Oh sorry, it seems the latest version of MTEB made some changes that render the eval script in this repository outdated.

I just changed the repo's requirements to pin a different mteb version here: #58. Can you try downgrading your mteb to the version in that PR (pip install mteb==1.4.0) and check that it works?

(If you want to use the latest mteb, it should also work via something like the snippet below:

# !pip install mteb gritlm
import mteb
model_name = "GritLM/GritLM-7B"
revision = "13f00a0e36500c80ce12870ea513846a066004af"
model = mteb.get_model(model_name, revision=revision)
benchmark = mteb.get_benchmark("MTEB(eng, classic)")
evaluation = mteb.MTEB(tasks=benchmark)
results = evaluation.run(model)

)
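
One more note: per the deprecation message in your log, recent mteb versions take the batch size through encode_kwargs on run() rather than as a batch_size argument, e.g.:

# Batch size is now passed via encode_kwargs (per the deprecation warning).
results = evaluation.run(model, encode_kwargs={"batch_size": 32})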

ThisisXXZ (Author) commented

> Oh sorry, it seems the latest version of MTEB made some changes that render the eval script in this repository outdated.
>
> I just changed the repo's requirements to pin a different mteb version here: #58. Can you try downgrading your mteb to the version in that PR (pip install mteb==1.4.0) and check that it works?
>
> (If you want to use the latest mteb, it should also work via something like the snippet below:
>
> # !pip install mteb gritlm
> import mteb
> model_name = "GritLM/GritLM-7B"
> revision = "13f00a0e36500c80ce12870ea513846a066004af"
> model = mteb.get_model(model_name, revision=revision)
> benchmark = mteb.get_benchmark("MTEB(eng, classic)")
> evaluation = mteb.MTEB(tasks=benchmark)
> results = evaluation.run(model)
>
> )

It has begun evaluating on the other datasets, thanks! Also, I'd like to know: is a single A100-80GB GPU sufficient for evaluating on MTEB?

Muennighoff (Collaborator) commented

I think that is sufficient; it will just take a while (especially the retrieval datasets).

ThisisXXZ (Author) commented

> I think that is sufficient; it will just take a while (especially the retrieval datasets).

Hi! The evaluation proceeds fine until the MindSmallReranking dataset. I'm using mteb==1.4.0 and datasets==3.0.2, and the complete error message is shown below:

Failed to load JSON from file 'gzip://train.jsonl::/home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small/snapshots/3bdac13927fdc888b903db93b2ffdbd90b295a69/train.jsonl.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
Error while evaluating MindSmallReranking: An error occurred while generating the dataset
Traceback (most recent call last):
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 160, in _generate_tables
    df = pandas_read_json(f)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 38, in pandas_read_json
    return pd.read_json(path_or_buf, **kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 815, in read_json
    return json_reader.read()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1025, in read
    obj = self._get_object_parser(self.data)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1051, in _get_object_parser
    obj = FrameParser(json, **kwargs).parse()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1187, in parse
    self._parse()
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/pandas/io/json/_json.py", line 1403, in _parse
    ujson_loads(json, precise_float=self.precise_float), dtype=None
ValueError: Unexpected character found when decoding 'true'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1853, in _prepare_split_single
    for _, table in generator:
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 163, in _generate_tables
    raise e
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/packaged_modules/json/json.py", line 137, in _generate_tables
    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 308, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/e/e1347696/unified_encoder_decoder/eval/MTEB/eval_mteb.py", line 1202, in <module>
    evaluation.run(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 336, in run
    raise e
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/evaluation/MTEB.py", line 302, in run
    task.load_data(eval_splits=task_eval_splits, **kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/mteb/abstasks/AbsTask.py", line 37, in load_data
    self.dataset = datasets.load_dataset(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/load.py", line 2154, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 999, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1740, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/e/e1347696/miniconda3/envs/grit_eval/lib/python3.10/site-packages/datasets/builder.py", line 1896, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

Muennighoff (Collaborator) commented

Looks like a corrupted download. You can try deleting /home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small and letting it re-download, or else directly download the files from https://huggingface.co/datasets/mteb/mind_small/tree/3bdac13927fdc888b903db93b2ffdbd90b295a69
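
For example, something along these lines should force a fresh copy (a sketch: snapshot_download comes from huggingface_hub, and the cache path and revision are taken from your traceback):

import shutil
from huggingface_hub import snapshot_download

# Remove the (possibly corrupted) cached copy; path taken from the traceback.
shutil.rmtree(
    "/home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small",
    ignore_errors=True,
)

# Re-download the dataset files at the pinned revision.
snapshot_download(
    repo_id="mteb/mind_small",
    repo_type="dataset",
    revision="3bdac13927fdc888b903db93b2ffdbd90b295a69",
)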

ThisisXXZ (Author) commented Nov 8, 2024

> Looks like a corrupted download. You can try deleting /home/e/e1347696/.cache/huggingface/hub/datasets--mteb--mind_small and letting it re-download, or else directly download the files from https://huggingface.co/datasets/mteb/mind_small/tree/3bdac13927fdc888b903db93b2ffdbd90b295a69

I've tried clearing the cache, but the error persists. I found a closed issue in the MTEB repo that describes the same problem. Do I need to downgrade datasets to evaluate MindSmallReranking?

Thank you so much!

Muennighoff (Collaborator) commented

Hm, yeah, maybe try downgrading.
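
For example (an assumption on my part: since the failure occurs with datasets==3.0.2, pinning below 3.x is a reasonable first try; the linked MTEB issue should confirm the exact version):

# Assumption: a pre-3.x datasets release still parses this jsonl.gz file.
# !pip install "datasets<3"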
