
Commit c47c38e

TianhaoLiang2000, kennymckormick, and FangXinyu-0913 authored
[Benchmark] Add MEGA-Bench (#724)
* add MEGA-Bench core dataset support
* add open-ended task
* merge upstream to main
* add README.md
* [Fix and Add Features] fix some bugs in megabench, add support for resuming evaluation, fix the data.zip download problem
* fix bugs of open_ended judge with eval_context
* fix snapshot_download problem
* Update video_dataset_config.py
* fix import problem

---------

Co-authored-by: Haodong Duan <[email protected]>
Co-authored-by: kennymckormick <[email protected]>
Co-authored-by: FangXinyu-0913 <[email protected]>
1 parent 97ce037 commit c47c38e


61 files changed (+5061, -4 lines)

vlmeval/dataset/__init__.py

Lines changed: 3 additions & 1 deletion

```diff
@@ -35,8 +35,10 @@
 from .video_concat_dataset import ConcatVideoDataset
 from .mmgenbench import MMGenBench
 from .cgbench import CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded
+from .megabench import MEGABench
 from .moviechat1k import MovieChat1k
 from .vdc import VDC
+
 from .worldsense import WorldSense
 from .qbench_video import QBench_Video, QBench_Video_MCQ, QBench_Video_VQA
@@ -159,7 +161,7 @@ def evaluate(self, eval_file, **judge_kwargs):
     MLVU, MLVU_MCQ, MLVU_OpenEnded,
     TempCompass, TempCompass_MCQ, TempCompass_Captioning, TempCompass_YorN,
     CGBench_MCQ_Grounding_Mini, CGBench_OpenEnded_Mini, CGBench_MCQ_Grounding, CGBench_OpenEnded,
-    QBench_Video, QBench_Video_MCQ, QBench_Video_VQA
+    MEGABench, WorldSense, QBench_Video, QBench_Video_MCQ, QBench_Video_VQA
 ]

 TEXT_DATASET = [
```

vlmeval/dataset/megabench.py

Lines changed: 435 additions & 0 deletions
Large diffs are not rendered by default.

vlmeval/dataset/utils/ccocr_evaluator/README.md

Lines changed: 1 addition & 1 deletion
```diff
@@ -56,4 +56,4 @@ If you find our work helpful, feel free to give us a cite.
 
 ## Contact Us
 
-If you have any questions, feel free to send an email to: [email protected] or [email protected]
+If you have any questions, feel free to send an email to: [email protected] or [email protected]
```

Lines changed: 81 additions & 0 deletions
# MEGA-Bench metrics

Each task's metrics are specified in `metrics.json` and follow the schema outlined below (the `for field_name in field_names:` lines are schema placeholders rather than literal JSON):

```json
{
    "field_score_function": {
        for field_name in field_names:
            field_name: scoring_function
    },
    "aggregation": {
        "function": aggregation_function,
        "field_weights": {
            for field_name in field_names:
                field_name: field_weight
        }
    },
    "response_parse_function": response_parse_function
}
```

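To make the schema concrete, here is a minimal illustrative sketch, not taken from the benchmark code: the task fields, the `SCORERS` table, and the `score_task` helper are hypothetical. It shows the intended flow: each field is scored by its scoring function, then the field scores are combined by the aggregation function using the field weights.

```python
# Illustrative sketch only: a hypothetical metrics spec driving per-field
# scoring followed by weighted aggregation. Not the MEGA-Bench implementation.
import numpy as np

metrics = {  # hypothetical spec for a task with two answer fields
    "field_score_function": {"answer": "exact_str_match", "rationale": "simple_str_match"},
    "aggregation": {"function": "mean", "field_weights": {"answer": 2, "rationale": 1}},
    "response_parse_function": "json",
}

SCORERS = {  # toy stand-ins for the real scoring functions
    "exact_str_match": lambda resp, ref: float(resp == ref),
    "simple_str_match": lambda resp, ref: float(
        resp.lower().replace(" ", "").replace("-", "")
        == ref.lower().replace(" ", "").replace("-", "")
    ),
}

def score_task(response_fields: dict, reference_fields: dict) -> float:
    """Score every field named in the spec, then take the weighted mean."""
    field_scores = {
        field: SCORERS[fn](response_fields.get(field, ""), reference_fields[field])
        for field, fn in metrics["field_score_function"].items()
    }
    weights = metrics["aggregation"]["field_weights"]
    fields = list(field_scores)
    return float(np.average([field_scores[f] for f in fields],
                            weights=[weights[f] for f in fields]))

print(score_task({"answer": "B", "rationale": "left-hand side"},
                 {"answer": "B", "rationale": "left hand side"}))  # 1.0
```
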
## Scoring Functions

### String Comparisons
These metrics are applied when both the response and the correct field are strings; a rough sketch of two of them follows the list.

- `exact_str_match`: Checks if the field exactly matches the reference response.
- `simple_str_match`: Performs a case-insensitive comparison, ignoring spaces and hyphens, to determine if the response matches the correct field.
- `exact_str_match_case_insensitive`: A case-insensitive version of `exact_str_match`.
- `normalized_similarity_demarau_levenshtein`: Computes the normalized Damerau-Levenshtein similarity between the strings.
- `near_str_match`: Normalizes accented characters to their ASCII equivalents, then performs a case-insensitive fuzzy string match. If the similarity score is below a certain threshold (currently 0.9), the score is set to 0.
- `program_judge`: A custom suite of test cases specifically for the `code_programming_test` task.

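The actual implementations live in the benchmark's scoring code; as an assumed sketch of the normalization involved, `simple_str_match` and the accent folding of `near_str_match` could look like the following. The 0.9 threshold comes from the description above, and `SequenceMatcher` is a standard-library stand-in for the real similarity measure.

```python
import unicodedata
from difflib import SequenceMatcher

def simple_str_match(response: str, correct: str) -> float:
    """Case-insensitive comparison that ignores spaces and hyphens."""
    norm = lambda s: s.lower().replace(" ", "").replace("-", "")
    return float(norm(response) == norm(correct))

def near_str_match(response: str, correct: str, threshold: float = 0.9) -> float:
    """Fold accents to ASCII, then fuzzy-match; scores below the threshold drop to 0."""
    def fold(s: str) -> str:
        return unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode().lower()
    # SequenceMatcher is a stand-in; the benchmark may use a different similarity.
    sim = SequenceMatcher(None, fold(response), fold(correct)).ratio()
    return sim if sim >= threshold else 0.0

print(near_str_match("Café au lait", "cafe au lait"))  # 1.0 after accent folding
```
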
### Set Comparisons
These metrics are used when inputs are iterables or strings that represent sets. Inputs are converted into sets and treated as empty if parsing fails. These metrics are useful when the order doesn't matter; a sketch follows the list.

- `set_equality`: Checks if the sets are identical.
- `jaccard_index`: Calculates the Jaccard index of the two sets.
- `set_precision`: Measures the ratio of the predicted set elements that appear in the correct set.
- `chess_move_list_jaccard_index`: Computes the Jaccard index without requiring the response to specify whether a move results in check or checkmate.

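A rough sketch of the set-based scores above. String-to-set parsing is omitted, and treating two empty sets as a perfect Jaccard match is an assumption rather than something stated in the task specs.

```python
def set_equality(pred: set, ref: set) -> float:
    return float(pred == ref)

def jaccard_index(pred: set, ref: set) -> float:
    # Intersection over union; two empty sets count as a perfect match (assumption).
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)

def set_precision(pred: set, ref: set) -> float:
    # Fraction of predicted elements that appear in the reference set.
    return len(pred & ref) / len(pred) if pred else 0.0

print(jaccard_index({"a", "b"}, {"b", "c"}))   # 1/3
print(set_precision({"a", "b"}, {"b", "c"}))   # 0.5
```
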
### List Comparisons
These metrics apply when inputs are iterables or strings that represent lists. Inputs are converted into lists and treated as empty if parsing fails.

- `longest_common_list_prefix_ratio`: Calculates the ratio of the length of the longest common prefix (as a list) to the length of the correct solution, as sketched below.

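A minimal sketch of the prefix ratio; normalizing by the length of the correct solution follows the description above.

```python
def longest_common_list_prefix_ratio(pred: list, ref: list) -> float:
    """Length of the shared prefix divided by the length of the reference list."""
    prefix = 0
    for p, r in zip(pred, ref):
        if p != r:
            break
        prefix += 1
    return prefix / len(ref) if ref else 0.0

print(longest_common_list_prefix_ratio(["e4", "e5", "Nf3"], ["e4", "e5", "Nc3", "Nf6"]))  # 0.5
```
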
#### Bounding boxes
Bounding boxes are a specialized type of list metric. Each solution consists of a list of 4-tuples, where each tuple is of the form (x1, y1, x2, y2), representing the top-left and bottom-right corners of the bounding box, respectively. Since images are dynamically resized before being sent to the LMM, coordinates are normalized to the range [0, 1].

- `nbbox_iou_tuple`: Matches each predicted bounding box with the reference box that has the highest Intersection over Union (IoU) score, which is then used as the score for that bounding box. The mean score across all predicted bounding boxes is then taken (see the sketch below).

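A rough sketch of the matching scheme described above, on normalized (x1, y1, x2, y2) boxes. The helper names are illustrative, and tie-breaking or matching details of the real metric may differ.

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes with normalized [0, 1] coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nbbox_iou_tuple(pred_boxes, ref_boxes):
    """Score each predicted box by its best IoU against the references, then average."""
    if not pred_boxes or not ref_boxes:
        return 0.0
    return sum(max(iou(p, r) for r in ref_boxes) for p in pred_boxes) / len(pred_boxes)

print(nbbox_iou_tuple([(0.1, 0.1, 0.5, 0.5)],
                      [(0.1, 0.1, 0.5, 0.5), (0.6, 0.6, 0.9, 0.9)]))  # 1.0
```
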
### Dictionary Comparisons
These metrics apply when inputs are dictionaries or strings that encode dictionaries. Inputs are converted into dictionaries and treated as empty if parsing fails.

Generally, these metrics follow a two-step approach (a sketch appears at the end of this subsection):
1. Calculate a metric for values with matching keys, resulting in a mapping of key-score pairs. If a key is missing in the response, its score is set to 0.
2. Aggregate the scores.

This approach is straightforward when the keys in the response and the correct answer match. If they don't, various strategies can be employed.

- `agg_recall`: Computes the mean score for keys that appear in the correct answer.
- `agg_jaccard`: Computes the mean score across all keys, using the size of the union of keys from both the response and the correct answer as the denominator.

Derivative metrics that follow this format include:
- `dict_exact_str_match_agg_recall`
- `dict_set_equality_agg_jaccard`
- `dict_jaccard_agg_jaccard`
- `dict_nbbox_iou_tuple_agg_jaccard`

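A compact sketch of the two-step approach and the two aggregation strategies. The `value_metric` argument, helper names, and example data are hypothetical.

```python
def dict_scores(pred: dict, ref: dict, value_metric) -> dict:
    """Step 1: score values key by key; keys missing from the prediction score 0."""
    return {k: (value_metric(pred[k], v) if k in pred else 0.0) for k, v in ref.items()}

def agg_recall(pred: dict, ref: dict, value_metric) -> float:
    """Step 2a: mean score over the keys of the correct answer."""
    scores = dict_scores(pred, ref, value_metric)
    return sum(scores.values()) / len(scores) if scores else 0.0

def agg_jaccard(pred: dict, ref: dict, value_metric) -> float:
    """Step 2b: same sum, but divided by the size of the union of keys."""
    scores = dict_scores(pred, ref, value_metric)
    union = set(pred) | set(ref)
    return sum(scores.values()) / len(union) if union else 0.0

exact = lambda a, b: float(a == b)
pred = {"name": "Ada", "year": "1843"}
ref = {"name": "Ada", "year": "1842", "field": "mathematics"}
print(agg_recall(pred, ref, exact))   # 1/3: only "name" matches
print(agg_jaccard(pred, ref, exact))  # 1/3: the key union also has 3 elements here
```
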
## Aggregation Functions
The following functions are used to aggregate the field scores (a small worked example follows the list):

- `mean`: Calculates a weighted mean, with weights specified in `aggregation.field_weights`.
- `min`: Returns the minimum field score.

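A tiny worked example with illustrative field names and numbers: with scores of 1.0 and 0.5 and weights of 2 and 1, the weighted mean is (2 * 1.0 + 1 * 0.5) / 3 ≈ 0.83, while `min` would return 0.5.

```python
import numpy as np

field_scores = {"answer": 1.0, "rationale": 0.5}   # illustrative values
field_weights = {"answer": 2, "rationale": 1}

print(np.average(list(field_scores.values()), weights=list(field_weights.values())))  # ~0.833
print(min(field_scores.values()))                                                     # 0.5
```
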
## Response Parsing Functions
These functions are used to parse the model's response (a tolerant JSON parse is sketched below):

- `json`: Parses the response as a JSON object.
- `odd_one_out`: A custom parser for the `logical_reasoning_find_odd_out_one` task.
- `logical_2d_views_3d_shapes`: A custom parser for the `logical_reasoning_2D_views_of_3D_shapes` task.
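
The parsers themselves ship with the benchmark; as an assumed sketch, a tolerant `json` parse of a model reply typically has to strip Markdown code fences before calling `json.loads`. The fallback to an empty dict is an illustrative choice, not necessarily what the benchmark does.

```python
import json
import re

def parse_json_response(text: str):
    """Pull a JSON object out of a model reply, tolerating ```json ... ``` fences."""
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, flags=re.DOTALL)
    candidate = fenced.group(1) if fenced else text
    try:
        return json.loads(candidate.strip())
    except json.JSONDecodeError:
        return {}  # fall back to an empty dict so downstream metrics score it as 0

print(parse_json_response('```json\n{"answer": "B"}\n```'))  # {'answer': 'B'}
```
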
Lines changed: 5 additions & 0 deletions

```python
from .aggregation_type import AggregationType
from .metric_type import MetricType
from .response_parse_type import ResponseParseType

__all__ = [AggregationType, MetricType, ResponseParseType]
```
Lines changed: 22 additions & 0 deletions

```python
from numbers import Number
from typing import Dict
import numpy as np


class MeanAggregation:
    """Take the mean of all valid scores."""

    @staticmethod
    def aggregate(scores: Dict[str, Number], weights: Dict[str, Number]) -> Number:
        """Exact match between targets and responses."""
        filtered_scores = {f: s for f, s in scores.items() if s >= 0}
        if not filtered_scores:
            return -1

        # Align the key order
        flattened_scores = []
        flattened_weights = []
        for field in filtered_scores:
            flattened_scores.append(filtered_scores[field])
            flattened_weights.append(weights[field])
        return np.average(flattened_scores, weights=flattened_weights)
```
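For reference, a quick illustrative call of the class above; judging from the filter in `aggregate`, a score of -1 marks a field that could not be scored and is excluded from the weighted average. The field names and numbers here are made up.

```python
scores = {"answer": 1.0, "rationale": 0.5, "format": -1}   # -1: field could not be scored
weights = {"answer": 2, "rationale": 1, "format": 1}
print(MeanAggregation.aggregate(scores, weights))  # (2*1.0 + 1*0.5) / 3 ≈ 0.8333
```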
Lines changed: 14 additions & 0 deletions

```python
from numbers import Number
from typing import Dict


class MinAggregation:
    """Take the minimum of all valid scores."""

    @staticmethod
    def aggregate(scores: Dict[str, Number], weights: Dict[str, Number]) -> Number:
        """Exact match between targets and responses."""
        filtered_scores = [s for s in scores.values() if s >= 0]
        if not filtered_scores:
            return -1
        return min(filtered_scores)
```
Lines changed: 8 additions & 0 deletions

```python
from numbers import Number
from typing import Dict


class UnsupportedAggregation:
    @staticmethod
    def aggregate(scores: Dict[str, Number], weights: Dict[str, Number]) -> Number:
        return -1
```
Lines changed: 25 additions & 0 deletions

```python
from enum import Enum

class AggregationType(Enum):
    MEAN = 0

    @classmethod
    def from_string(cls, s):
        return cls.MEAN

    def aggregate(self, field_scores, field_weights):
        if not field_scores:
            return 0.0

        total_score = 0.0
        total_weight = 0.0

        for field, score in field_scores.items():
            weight = field_weights.get(field, 1.0)
            try:
                total_score += score * weight
            except:
                total_score += score[0] * weight
            total_weight += weight

        return total_score / total_weight if total_weight > 0 else 0.0
```
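A brief illustrative use of the enum above; the field names and weights are made up. Note that `from_string` currently maps every string to `MEAN`, and fields without an explicit weight default to 1.0.

```python
agg = AggregationType.from_string("mean")
print(agg.aggregate({"answer": 1.0, "rationale": 0.5}, {"answer": 2.0}))
# "rationale" falls back to weight 1.0: (2.0 * 1.0 + 1.0 * 0.5) / 3.0 ≈ 0.8333
```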
