README.md: 5 additions & 5 deletions
@@ -8,7 +8,7 @@ LightEval is an evaluation suite which gathers a selection of features from wide

It is still an early, internal version - it should be nice to use but don't expect 100% stability!

In case of problems or questions, feel free to open an issue!

## How to install and use
### Requirements
@@ -50,11 +50,11 @@ Lastly, create a **line summary** of your evaluation, in `metadata_table.json`.

- `suite` (list), the suite(s) to which your evaluation should belong. This field allows us to compare different task implementations, and is used as a task selector to differentiate the versions to launch. At the moment, you'll find the keywords ["helm", "bigbench", "original", "lighteval"]; you can also add new ones (for tests, we recommend using "custom").
- `prompt_function` (str), the name of the prompt function you defined in the step above
- `hf_repo` (str), the path to your evaluation dataset on the hub
- `hf_subset` (str), the specific subset you want to use for your evaluation (note: when the dataset has no subset, fill this field with `"default"`, not with `None` or `""`)
- `hf_avail_splits` (list), all the splits available for your dataset (train, valid or validation, test, other...)
- `evaluation_splits` (list), the splits you want to use for evaluation
- `few_shots_split` (str, can be `null`), the specific split from which you want to select samples for your few-shot examples. It should be different from the sets included in `evaluation_splits`
- `few_shots_select` (str, can be `null`), the method that you will use to select items for your few-shot examples. Can be `null`, or one of:
  - `balanced` selects examples from the `few_shots_split` with balanced labels, to avoid skewing the few-shot examples (and hence the model generations) towards one specific label
  - `random` selects examples at random from the `few_shots_split`
  - `random_sampling` selects new examples at random from the `few_shots_split` for every new item, but if a sampled item is equal to the current one, it is removed from the available samples
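For illustration, here is a hedged sketch of what a single entry could look like, using only the fields described above. The task name, dataset path, and split names are placeholder values, and real entries in `metadata_table.json` contain additional fields not covered by this excerpt:

```json
{
  "suite": ["custom"],
  "prompt_function": "my_prompt_fn",
  "hf_repo": "my_org/my_eval_dataset",
  "hf_subset": "default",
  "hf_avail_splits": ["train", "validation", "test"],
  "evaluation_splits": ["test"],
  "few_shots_split": "validation",
  "few_shots_select": "balanced"
}
```

Here `hf_subset` is set to `"default"` because the hypothetical dataset has no subsets, and `few_shots_split` deliberately differs from `evaluation_splits`, as required above.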
@@ -102,7 +102,7 @@ These metrics need the model to generate an output. They are therefore slower.

- `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed
- `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
- `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation
- `f1_score_macro`: Corpus level macro F1 score
- `f1_score_micro`: Corpus level micro F1 score
- Summarization:
  - `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/)
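To make "F1 score in terms of word overlap" concrete, here is a generic sketch of the idea. It is not lighteval's exact implementation, and it leaves out the normalisation step applied by the `quasi` variants:

```python
from collections import Counter

def word_overlap_f1(prediction: str, gold: str) -> float:
    """F1 over the multiset of whitespace tokens shared by the prediction and the gold."""
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_common = sum(common.values())
    if num_common == 0:
        return 0.0
    precision = num_common / len(pred_tokens)
    recall = num_common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```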
@@ -141,7 +141,7 @@ These metrics need both the generation and its logprob. They are not working at

- `prediction_perplexity` (HELM): Measure of the logprob of a given input.

## Adding a new metric
If you want to add a new metric, first check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `metrics_sample`. If not, add it to either of these files depending on the level at which it is applied. Then, follow the example in `src.lighteval.metrics.metrics` to register your metric.
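As a rough illustration only, a sample-level metric is essentially a function that scores one prediction against the gold targets of a single document. The function below, with an invented name and signature, sketches that shape; for the actual registration step, follow the existing examples in `src.lighteval.metrics.metrics` rather than this snippet:

```python
def normalized_exact_match(prediction: str, golds: list[str]) -> int:
    """Hypothetical sample-level metric: 1 if the prediction matches any gold after basic normalisation."""

    def normalize(text: str) -> str:
        # Lowercase and collapse whitespace; real metrics may normalise more aggressively.
        return " ".join(text.lower().split())

    return int(any(normalize(prediction) == normalize(gold) for gold in golds))
```
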
## Examples of scripts to launch lighteval on the cluster
"""Tests if at least one of predicted gold targets' log-likelihood is above 0.5.
280
280
281
281
Args:
282
-
target_acc (list[int]): List of scores indicating whether the predictions log-probabilities are above 0.5 aggregated.
282
+
results (list[int]): List of tuples containing, for each gold, the predictions log-probabilities associated with whether they are above 0.5 aggregated.
283
+
formatted_doc (Doc): _description_
283
284
284
285
Returns:
285
286
int: 1 if at least one of the possible golds had a log-likelihood above 0.5.
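Read as a whole, the docstring describes a simple reduction. A minimal sketch of that logic, assuming one 0/1 score per gold and not the actual lighteval implementation, would be:

```python
def at_least_one_gold_above_threshold(results: list[int]) -> int:
    # results holds, per gold, 1 if its aggregated log-probability cleared the 0.5 threshold, else 0.
    return int(any(score == 1 for score in results))
```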