# MEGA-Bench metrics

Each task's metrics are specified in `metrics.json`, following the schema outlined below. `field_score_function` maps each answer field to a scoring function, and `aggregation.field_weights` maps each field to its weight in the aggregated task score:

```json
{
  "field_score_function": {
    "<field_name>": "<scoring_function>"
  },
  "aggregation": {
    "function": "<aggregation_function>",
    "field_weights": {
      "<field_name>": "<field_weight>"
    }
  },
  "response_parse_function": "<response_parse_function>"
}
```
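
For instance, a task with a single `answer` field scored by exact string match might use a file like this (a hypothetical illustration, not copied from any specific task):

```json
{
  "field_score_function": {
    "answer": "exact_str_match"
  },
  "aggregation": {
    "function": "mean",
    "field_weights": {
      "answer": 1
    }
  },
  "response_parse_function": "json"
}
```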

## Scoring Functions

### String Comparisons
These metrics apply when both the response and the correct field are strings.

- `exact_str_match`: Checks whether the field exactly matches the reference response.
- `simple_str_match`: Performs a case-insensitive comparison, ignoring spaces and hyphens, to determine whether the response matches the correct field.
- `exact_str_match_case_insensitive`: A case-insensitive version of `exact_str_match`.
- `normalized_similarity_demarau_levenshtein`: Computes the normalized Damerau-Levenshtein similarity between the two strings.
- `near_str_match`: Normalizes accented characters to their ASCII equivalents, then performs a case-insensitive fuzzy string match (see the sketch after this list). If the similarity score falls below a threshold (currently 0.9), the score is set to 0.
- `program_judge`: A custom suite of test cases specifically for the `code_programming_test` task.
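
A minimal sketch of `near_str_match` using only the standard library; the benchmark's actual normalization and similarity measure may differ:

```python
import unicodedata
from difflib import SequenceMatcher

def near_str_match(response: str, correct: str, threshold: float = 0.9) -> float:
    """Accent-folding, case-insensitive fuzzy match with a score floor."""
    def normalize(s: str) -> str:
        # Fold accented characters to their ASCII equivalents, then lowercase.
        folded = unicodedata.normalize("NFKD", s).encode("ascii", "ignore").decode()
        return folded.lower().strip()

    similarity = SequenceMatcher(None, normalize(response), normalize(correct)).ratio()
    return similarity if similarity >= threshold else 0.0
```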

### Set Comparisons
These metrics are used when the inputs are iterables, or strings that represent sets; inputs are converted into sets and treated as empty if parsing fails. They are useful when order doesn't matter.

- `set_equality`: Checks whether the two sets are identical.
- `jaccard_index`: Calculates the Jaccard index of the two sets.
- `set_precision`: Measures the fraction of predicted elements that appear in the correct set.
- `chess_move_list_jaccard_index`: Computes the Jaccard index without requiring the response to specify whether a move results in check or checkmate.
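
A sketch of the two core set metrics, assuming inputs have already been parsed into Python sets (the empty-input conventions here are assumptions):

```python
def jaccard_index(pred: set, correct: set) -> float:
    # |intersection| / |union|; two empty sets are treated as a perfect match.
    if not pred and not correct:
        return 1.0
    return len(pred & correct) / len(pred | correct)

def set_precision(pred: set, correct: set) -> float:
    # Fraction of predicted elements that appear in the correct set.
    return len(pred & correct) / len(pred) if pred else 0.0
```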

### List Comparisons
These metrics apply when the inputs are iterables, or strings that represent lists; inputs are converted into lists and treated as empty if parsing fails.

- `longest_common_list_prefix_ratio`: Calculates the ratio of the length of the longest common list prefix to the length of the correct solution (see the sketch below).
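
A minimal sketch, assuming parsed lists and scoring 0 for an empty reference (the latter is an assumption):

```python
def longest_common_list_prefix_ratio(pred: list, correct: list) -> float:
    # Count leading elements that match in order.
    prefix_len = 0
    for p, c in zip(pred, correct):
        if p != c:
            break
        prefix_len += 1
    return prefix_len / len(correct) if correct else 0.0
```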

#### Bounding boxes
Bounding boxes are a specialized list metric. Each solution is a list of 4-tuples of the form (x1, y1, x2, y2), representing the top-left and bottom-right corners of a box. Since images are dynamically resized before being sent to the LMM, coordinates are normalized to the range [0, 1].

- `nbbox_iou_tuple`: Matches each predicted bounding box with the reference box giving the highest Intersection over Union (IoU), which becomes that box's score; the final score is the mean over all predicted boxes (see the sketch below).
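
A sketch of this matching scheme; the handling of empty predictions is an assumption:

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2) in normalized [0, 1] coordinates.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nbbox_iou_tuple(pred_boxes, ref_boxes):
    # Each predicted box scores its best IoU against any reference box;
    # the task score is the mean over all predicted boxes.
    if not pred_boxes:
        return 0.0
    return sum(max((iou(p, r) for r in ref_boxes), default=0.0)
               for p in pred_boxes) / len(pred_boxes)
```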

### Dictionary Comparisons
These metrics apply when the inputs are dictionaries, or strings that encode dictionaries; inputs are converted into dictionaries and treated as empty if parsing fails.

Generally, these metrics follow a two-step approach:
1. Score the values under matching keys, producing a mapping of key-score pairs. If a key is missing from the response, its score is set to 0.
2. Aggregate the per-key scores.

This is straightforward when the response and the correct answer share the same keys. When they don't, two aggregation strategies are available (see the sketch after the lists below):

- `agg_recall`: Computes the mean score over the keys that appear in the correct answer.
- `agg_jaccard`: Computes the mean score over all keys, using the size of the union of the response's and the correct answer's key sets as the denominator.

Derivative metrics that follow this format include:
- `dict_exact_str_match_agg_recall`
- `dict_set_equality_agg_jaccard`
- `dict_jaccard_agg_jaccard`
- `dict_nbbox_iou_tuple_agg_jaccard`
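
A sketch of how a per-value metric composes with an aggregation strategy; the function and parameter names here are illustrative, not the benchmark's actual API:

```python
def score_dict(pred: dict, correct: dict, value_metric, agg: str = "jaccard") -> float:
    # Step 1: score values under matching keys; keys missing from the
    # response score 0.
    scores = {k: (value_metric(pred[k], v) if k in pred else 0.0)
              for k, v in correct.items()}
    # Step 2: aggregate. Recall divides by the correct answer's key count;
    # Jaccard divides by the size of the union of both key sets.
    denom = len(correct) if agg == "recall" else len(set(pred) | set(correct))
    return sum(scores.values()) / denom if denom else 0.0

# e.g. dict_exact_str_match_agg_jaccard pairs an exact string match on
# values with Jaccard-style aggregation over keys:
exact = lambda a, b: float(a == b)
# score_dict(parsed_response, correct_answer, exact, agg="jaccard")
```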

## Aggregation Functions
The following functions aggregate the per-field scores into a task score:

- `mean`: Computes a weighted mean, with weights taken from `aggregation.field_weights`.
- `min`: Returns the minimum field score.
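
A minimal sketch of the two aggregators; defaulting missing weights to 1.0 is an assumption:

```python
def aggregate_scores(field_scores: dict, field_weights: dict, how: str = "mean") -> float:
    if how == "min":
        return min(field_scores.values())
    # Weighted mean, with weights from aggregation.field_weights.
    total_weight = sum(field_weights.get(f, 1.0) for f in field_scores)
    weighted_sum = sum(s * field_weights.get(f, 1.0) for f, s in field_scores.items())
    return weighted_sum / total_weight if total_weight else 0.0
```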

## Response Parsing Functions
These functions parse the model's raw response before scoring:

- `json`: Parses the response as a JSON object.
- `odd_one_out`: A custom parser for the `logical_reasoning_find_odd_out_one` task.
- `logical_2d_views_3d_shapes`: A custom parser for the `logical_reasoning_2D_views_of_3D_shapes` task.
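
A sketch of a `json`-style parser; the fenced-block fallback and empty-dict default are assumptions, not necessarily the benchmark's behavior:

```python
import json
import re

# Pattern for a fenced code block (the fence string is built dynamically
# so this snippet doesn't embed a literal fence).
FENCE = "`" * 3
BLOCK_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL)

def parse_json_response(response: str):
    """Parse a model response as JSON, tolerating a fenced code block."""
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        match = BLOCK_RE.search(response)
        if match:
            try:
                return json.loads(match.group(1))
            except json.JSONDecodeError:
                pass
    return {}  # downstream metrics treat an unparseable response as empty
```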