Add MEGA-Bench #496

Merged 7 commits on Jan 13, 2025
60 changes: 60 additions & 0 deletions lmms_eval/tasks/megabench/README.md
@@ -0,0 +1,60 @@
# MEGA-Bench: Scaling Multimodal Evaluation to over 500 Real-World Tasks

![image](https://github.com/user-attachments/assets/5fd44fa9-0ec2-4298-ad0c-e883cb1edf7f)

MEGA-Bench contains 505 multimodal tasks with diverse data sources, input/output formats, and skill requirements. The taxonomy tree is derived from the application dimension, which guides and calibrates the annotation process. The benchmark is equipped with a suite of 45 evaluation metrics to handle various output formats beyond multiple-choice questions.


## Step-1: Get the model response files with lmms-eval

```bash
# Core set (440 tasks)
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava_onevision \
    --tasks megabench_core \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_ov_megabench_core \
    --output_path ./logs/ \
    --model_args=pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen

# Open-ended set (65 tasks)
python3 -m accelerate.commands.launch \
    --num_processes=8 \
    -m lmms_eval \
    --model llava_onevision \
    --tasks megabench_open \
    --batch_size 1 \
    --log_samples \
    --log_samples_suffix llava_ov_megabench_open \
    --output_path ./logs/ \
    --model_args=pretrained=lmms-lab/llava-onevision-qwen2-7b-ov,conv_template=qwen_1_5,model_name=llava_qwen
```
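
Once both runs finish, the aggregated submission JSON files consumed in Step-2 should appear under the `--output_path` directory. A quick sanity check, hedged because the exact sub-directory layout depends on your lmms-eval version and logging settings (the glob below mirrors the paths used in the Step-2 examples):

```bash
# List the submission files produced by the task's `submission` metric.
# The directory layout is an assumption based on the Step-2 example paths.
ls logs/*/submissions/megabench_*_all_query_responses.json
```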


## Step-2: Run MEGA-Bench metrics to obtain the evaluation scores


Install the dependencies of MEGA-Bench's evaluation metrics.

```bash
pip install -r requirements.txt
```
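
The `requirements.txt` above is assumed to be the one shipped with the MEGA-Bench task directory; if you run pip from the repository root, point it at that file explicitly (adjust the path if the file lives elsewhere in your checkout):

```bash
# Assumption: the metrics requirements file sits alongside the task code.
pip install -r lmms_eval/tasks/megabench/requirements.txt
```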

Example: evaluate the submission file with the stand-alone evaluator adapted from MEGA-Bench's codebase.

```bash
# Run the metrics for the core set
python lmms_eval/tasks/megabench/evaluator.py --subset_name core --submission_file logs/llava-ov-7b/submissions/megabench_core_all_query_responses.json --output_file logs/llava-ov-7b/megabench_scores/megabench_core_data_with_scores.json

# Run the metrics for the open-ended set
python lmms_eval/tasks/megabench/evaluator.py --subset_name open --submission_file logs/llava-ov-7b/submissions/megabench_open_all_query_responses.json --output_file logs/llava-ov-7b/megabench_scores/megabench_open_data_with_scores.json

# Derive the breakdown results
python lmms_eval/tasks/megabench/breakdown/derive_breakdown_results.py --input_dir logs/llava-ov-7b/megabench_scores

```
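
The files written by `--output_file` are plain JSON, so you can spot-check them before deriving the breakdown. A hedged example using the paths from the commands above:

```bash
# Pretty-print the beginning of the scored core-set file (path follows the
# --output_file argument above; adjust to your own log directory).
python -m json.tool logs/llava-ov-7b/megabench_scores/megabench_core_data_with_scores.json | head -n 40
```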

The results in `logs/llava-ov-7b/megabench_scores/analysis` are what the [MEGA-Bench leaderboard](https://huggingface.co/spaces/TIGER-Lab/MEGA-Bench) uses. The leaderboard can be updated by placing these files in the results directory of the leaderboard's [Hugging Face space](https://huggingface.co/spaces/TIGER-Lab/MEGA-Bench/tree/main/static/eval_results/Default).
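
For reference, a minimal sketch of one way to do this, assuming you have write access to the Space (otherwise, share the files with the Space maintainers); the destination path follows the link above:

```bash
# Sketch only: clone the leaderboard Space and copy the derived analysis files in.
git clone https://huggingface.co/spaces/TIGER-Lab/MEGA-Bench
cp logs/llava-ov-7b/megabench_scores/analysis/* MEGA-Bench/static/eval_results/Default/
cd MEGA-Bench
git add static/eval_results/Default
git commit -m "Add new model results"
git push
```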
31 changes: 31 additions & 0 deletions lmms_eval/tasks/megabench/_default_template_yaml
@@ -0,0 +1,31 @@
dataset_path: TIGER-Lab/MEGA-Bench
test_split: test
dataset_kwargs:
  token: True
  cache_dir: megabench_data
  video: True
  create_link: True
generation_kwargs:
  max_new_tokens: 2048
  temperature: 0
  do_sample: false
  top_p: 1.0
  num_beams: 1

output_type: generate_until
doc_to_visual: !function utils.megabench_doc_to_visual
doc_to_text: !function utils.megabench_doc_to_text
doc_to_target: !function utils.megabench_doc_to_target
process_results: !function utils.megabench_process_results

metric_list:
  - metric: submission
    aggregation: !function utils.megabench_aggregate_results_for_submission
    higher_is_better: true

lmms_eval_specific_kwargs:
  default:
    max_video_subsample_frame: 64

metadata:
  - version: 0.0