
adding documentation #282

Closed

NathanHB wants to merge 29 commits into main from nathan-add-doc
Changes from 1 commit
Commits
29 commits
a324d63
adding documentation
NathanHB Aug 28, 2024
26d8402
adding documentation nanotron
NathanHB Aug 28, 2024
203045a
commit
NathanHB Sep 3, 2024
cbdcf1b
commit
NathanHB Sep 3, 2024
dd67ce4
Merge branch 'main' into nathan-add-doc
NathanHB Sep 3, 2024
015e924
undo unecessary changes
NathanHB Sep 3, 2024
4e9c30e
Merge branch 'main' into nathan-add-doc
NathanHB Sep 3, 2024
8aabbc8
still working on docs
NathanHB Sep 5, 2024
3a74186
Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int…
NathanHB Sep 5, 2024
db0c06d
Merge remote-tracking branch 'origin/main' into nathan-add-doc
NathanHB Sep 6, 2024
57b0cd4
commit
NathanHB Sep 9, 2024
7e4d56d
commit
NathanHB Sep 11, 2024
e533074
commit
NathanHB Sep 11, 2024
2f1c7f5
Update docs/source/installation.md
NathanHB Sep 17, 2024
0d1da5d
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
7a8782a
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
1c7454b
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
2539035
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
b5f2942
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
9825950
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
fa67cf0
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
f17ce92
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
f3c319d
Update docs/source/adding_new_metric.md
NathanHB Sep 18, 2024
bcd6f50
Update docs/source/adding_new_task.md
NathanHB Sep 18, 2024
33c1e7f
Update docs/source/adding_new_task.md
NathanHB Sep 18, 2024
016cea4
fix
NathanHB Sep 18, 2024
e86912a
Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int…
NathanHB Sep 18, 2024
3aba2a1
fix
NathanHB Sep 18, 2024
af1ad13
commit
NathanHB Sep 18, 2024
commit
NathanHB committed Sep 18, 2024
commit af1ad1302cee6deb08e468973c9bd99f83db8160
2 changes: 2 additions & 0 deletions docs/source/installation.md
@@ -32,6 +32,8 @@ appropriate extras group.
| tensorboardX | To upload your results to tensorboard |
| vllm | To use vllm as backend for inference |
| s3 | To upload results to s3 |
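For example, to install one of these optional dependency groups, a minimal sketch (assuming the `lighteval` package name on PyPI and standard pip extras syntax):

```bash
# Install lighteval together with the vllm extra to enable vLLM-backed inference
pip install lighteval[vllm]

# Several extras can be combined in one install
pip install "lighteval[tensorboardX,s3]"
```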


## Hugging Face login

If you want to push your results to the Hugging Face Hub or evaluate your own
142 changes: 69 additions & 73 deletions docs/source/metric_list.md
@@ -1,78 +1,74 @@
# Metrics

- MetricCategory.TARGET_PERPLEXITY
- acc_golds_likelihood
- target_perplexity
## Metrics for multiple choice tasks
These metrics use the log-likelihood of the different possible targets.
- `loglikelihood_acc` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_single_token`)
- `loglikelihood_acc_norm` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_norm_single_token`)
- `loglikelihood_acc_norm_nospace` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored
- `loglikelihood_f1` (Harness): Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_f1_single_token`)
- `mcc` (Harness): Matthews correlation coefficient (a measure of agreement between statistical distributions).
- `recall_at_1` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_1_single_token`)
- `recall_at_2` (Harness): Fraction of instances where the choice with the 2nd best logprob or better was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_2_single_token`)
- `mrr` (Harness): Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance - also exists in a faster version for tasks where the possible choices include only one token (`mrr_single_token`)
- `target_perplexity` (Harness): Perplexity of the different choices available.
- `acc_golds_likelihood` (Harness): A bit different: it checks whether the average logprob of a single target is above or below 0.5.
- `multi_f1_numeric`: Loglikelihood F1 score for multiple gold targets

- MetricCategory.MULTICHOICE_ONE_TOKEN
- loglikelihood_acc_norm_single_token
- loglikelihood_acc_single_token
- loglikelihood_f1_single_token
- mcc_single_token
- mrr_single_token
- multi_f1_numeric
- recall_at_1_single_token
- recall_at_2_single_token
All these metrics also exist in a "single token" version (`loglikelihood_acc_single_token`, `loglikelihood_acc_norm_single_token`, `loglikelihood_f1_single_token`, `mcc_single_token`, `recall_at_1_single_token`, `recall_at_2_single_token` and `mrr_single_token`). When the multichoice options compare only one token each (e.g. "A" vs "B" vs "C" vs "D", or "yes" vs "no"), using the single token version of these metrics divides the time spent by the number of choices. Single token evals also include:
- `multi_f1_numeric` (Harness, for CB): computes the f1 score of all possible choices and averages it.

- MetricCategory.IGNORED
- prediction_perplexity
## Metrics for perplexity and language modeling
These metrics use the log-likelihood of the prompt.
- `word_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of words of the sequence.
- `byte_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
- `bits_per_byte` (HELM): Average number of bits per byte according to model probabilities.
- `log_prob` (HELM): Predicted output's average log probability (input's log prob for language modeling).

- MetricCategory.PERPLEXITY
- bits_per_byte
- byte_perplexity
- word_perplexity

- MetricCategory.GENERATIVE
- bert_score
- bleu
- bleu_1
- bleu_4
- bleurt
- chrf
- copyright
- drop
- exact_match
- extractiveness
- f1_score_quasi
- f1_score
- f1_score_macro
- f1_score_micro
- faithfulness
- perfect_exact_match
- prefix_exact_match
- prefix_quasi_exact_match
- quasi_exact_match
- quasi_exact_match_math
- quasi_exact_match_triviaqa
- quasi_exact_match_gsm8k
- rouge_t5
- rouge1
- rouge2
- rougeL
- rougeLsum
- ter

- MetricCategory.GENERATIVE_SAMPLING
- maj_at_4_math
- maj_at_5
- maj_at_8
- maj_at_8_gsm8k

- MetricCategory.LLM_AS_JUDGE_MULTI_TURN
- llm_judge_multi_turn_gpt3p5
- llm_judge_multi_turn_llama_3_405b

- MetricCategory.LLM_AS_JUDGE
- llm_judge_gpt3p5
- llm_judge_llama_3_405b

- MetricCategory.MULTICHOICE
- loglikelihood_acc
- loglikelihood_acc_norm
- loglikelihood_acc_norm_nospace
- loglikelihood_f1
- mcc
- mrr
- recall_at_1
- recall_at_2
- truthfulqa_mc_metrics
## Metrics for generative tasks
These metrics need the model to generate an output. They are therefore slower.
- Base:
  - `perfect_exact_match` (Harness): Fraction of instances where the prediction matches the gold exactly.
  - `exact_match` (HELM): Fraction of instances where the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
  - `quasi_exact_match` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...). Other variations exist, with other normalizers, such as `quasi_exact_match_triviaqa`, which only normalizes the predictions after applying a strip to all sentences.
  - `prefix_exact_match` (HELM): Fraction of instances where the beginning of the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
  - `prefix_quasi_exact_match` (HELM): Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...).
  - `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed.
  - `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first.
  - `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation.
  - `f1_score_macro`: Corpus level macro F1 score.
  - `f1_score_micro`: Corpus level micro F1 score.
  - `maj_at_5` and `maj_at_8`: Model majority vote. Takes n (5 or 8) generations from the model and assumes the most frequent is the actual prediction.
- Summarization:
  - `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/).
  - `rouge1` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 1-gram overlap.
  - `rouge2` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 2-gram overlap.
  - `rougeL` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
  - `rougeLsum` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap, computed at the summary level.
  - `rouge_t5` (BigBench): Corpus level ROUGE score for all available ROUGE metrics.
  - `faithfulness` (HELM): Faithfulness scores based on the SummaC method of [Laban et al. (2022)](https://aclanthology.org/2022.tacl-1.10/).
  - `extractiveness` (HELM): Reports the following extractiveness statistics, based on [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/):
    - `summarization_coverage`: Extent to which the model-generated summaries are extractive fragments from the source document,
    - `summarization_density`: Extent to which the model-generated summaries are extractive summaries based on the source document,
    - `summarization_compression`: Extent to which the model-generated summaries are compressed relative to the source document.
  - `bert_score` (HELM): Reports the average BERTScore precision, recall, and f1 score [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and gold summary.
- Translation:
  - `bleu`: Corpus level BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) - uses the sacrebleu implementation.
  - `bleu_1` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 1-gram overlap - uses the nltk implementation.
  - `bleu_4` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 4-gram overlap - uses the nltk implementation.
  - `chrf` (Harness): Character n-gram matches f-score.
  - `ter` (Harness): Translation edit/error rate.
- Copyright:
  - `copyright` (HELM): Reports:
    - `longest_common_prefix_length`: average length of longest common prefix between model generation and reference,
    - `edit_distance`: average Levenshtein edit distance between model generation and reference,
    - `edit_similarity`: average Levenshtein edit similarity (normalized by length of longer sequence) between model generation and reference.
- Math:
  - `quasi_exact_match_math` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where latex symbols, units, etc are removed).
  - `maj_at_4_math` (Lighteval): Majority choice evaluation, using the math normalisation for the predictions and gold.
  - `quasi_exact_match_gsm8k` (Harness): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed).
  - `maj_at_8_gsm8k` (Lighteval): Majority choice evaluation, using the gsm8k normalisation for the predictions and gold.
- LLM-as-Judge (see the note after this list):
  - `llm_judge_gpt3p5`: Can be used for any generative task; the model is scored by a GPT-3.5 model through the OpenAI API.
  - `llm_judge_llama_3_405b`: Can be used for any generative task; the model is scored by a Llama 3 405B model through the OpenAI API.
  - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task; the model is scored by a GPT-3.5 model through the OpenAI API. Used for multi-turn tasks like MT-Bench.
  - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task; the model is scored by a Llama 3 405B model through the OpenAI API. Used for multi-turn tasks like MT-Bench.
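Since these judge metrics call out to the OpenAI API, the judge needs credentials at runtime. A minimal sketch, assuming the standard `OPENAI_API_KEY` environment variable (the value is a placeholder):

```bash
# Export the OpenAI API key before launching an evaluation that uses an
# llm_judge_* metric (the value below is a placeholder).
export OPENAI_API_KEY="sk-..."
```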
63 changes: 62 additions & 1 deletion docs/source/quicktour.md
@@ -60,6 +60,67 @@ accelerate launch --multi_gpu --num_processes=8 -m \
Here, `--override_batch_size` defines the batch size per device, so the effective
batch size will be `override_batch_size * num_gpus`.
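As a quick worked example of that rule (the numbers are illustrative):

```bash
# With the command above:
#   --num_processes=8        -> 8 GPUs, one process per device
#   --override_batch_size 4  -> batch size of 4 per device
# Effective batch size: 4 * 8 = 32
```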

### Model Arguments

The `--model_args` argument takes a string representing a comma-separated list of model
arguments. The allowed arguments vary depending on the backend you use (vllm or
accelerate); a sketch of an example argument string follows each list below.

#### Accelerate

- **pretrained** (str):
HuggingFace Hub model ID name or the path to a pre-trained
model to load. This is effectively the `pretrained_model_name_or_path`
argument of `from_pretrained` in the HuggingFace `transformers` API.
- **tokenizer** (Optional[str]): HuggingFace Hub tokenizer ID that will be
used for tokenization.
- **multichoice_continuations_start_space** (Optional[bool]): Whether to add a
space at the start of each continuation in multichoice generation.
For example, context: "What is the capital of France?" and choices: "Paris", "London".
Will be tokenized as: "What is the capital of France? Paris" and "What is the capital of France? London".
True adds a space, False strips a space, None does nothing
- **subfolder** (Optional[str]): The subfolder within the model repository.
- **revision** (str): The revision of the model.
- **max_gen_toks** (Optional[int]): The maximum number of tokens to generate.
- **max_length** (Optional[int]): The maximum length of the generated output.
- **add_special_tokens** (Optional[bool], defaults to None): Whether to add special tokens to the input sequences.
If `None`, the default value will be set to `True` for seq2seq models (e.g. T5) and
`False` for causal models.
- **model_parallel** (Optional[bool], defaults to None):
Whether to force use of the `accelerate` library to load a large
model across multiple devices.
If `None`, the number of processes is compared with the number of GPUs:
if it is smaller, model parallelism is used; otherwise it is not.
- **dtype** (Union[str, torch.dtype], optional, defaults to None):
Converts the model weights to `dtype`, if specified. Strings get
converted to `torch.dtype` objects (e.g. `float16` -> `torch.float16`).
Use `dtype="auto"` to derive the type from the model's weights.
- **device** (Union[int, str]): Device on which to load and run the model.
- **quantization_config** (Optional[BitsAndBytesConfig]): quantization
configuration for the model, manually provided to load a normally floating point
model at a quantized precision. Needed for 4-bit and 8-bit precision.
- **trust_remote_code** (bool): Whether to trust remote code during model
loading.
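A sketch of how these options combine into a single `--model_args` string. The model ID and values are illustrative, and the `lighteval accelerate` entry point with the `--tasks`/`--output_dir` flags is assumed from the rest of this quicktour and may differ in your version:

```bash
# Hypothetical example: evaluate a Hub model in float16 with the accelerate backend
lighteval accelerate \
    --model_args "pretrained=openai-community/gpt2,dtype=float16,trust_remote_code=True" \
    --override_batch_size 1 \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir ./results
```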

#### VLLM

- **pretrained** (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
- **gpu_memory_utilisation** (float): The fraction of GPU memory to use.
- **batch_size** (int): The batch size to use for evaluation.
- **revision** (str): The revision of the model.
- **dtype** (str, None): The data type to use for the model.
- **tensor_parallel_size** (int): The number of tensor parallel units to use.
- **data_parallel_size** (int): The number of data parallel units to use.
- **max_model_length** (int): The maximum sequence length of the model.
- **swap_space** (int): The CPU swap space size (GiB) per GPU.
- **seed** (int): The seed to use for the model.
- **trust_remote_code** (bool): Whether to trust remote code during model loading.
- **use_chat_template** (bool): Whether to use the chat template or not.
- **add_special_tokens** (bool): Whether to add special tokens to the input sequences.
- **multichoice_continuations_start_space** (bool): Whether to add a space at the start of each continuation in multichoice generation.
- **subfolder** (Optional[str]): The subfolder within the model repository.
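And a corresponding sketch for the vllm backend (the values are illustrative; pass the string to the lighteval command in the same way as above):

```bash
# Hypothetical --model_args string for a model sharded over 2 GPUs with vLLM
MODEL_ARGS="pretrained=openai-community/gpt2,dtype=float16,tensor_parallel_size=2,gpu_memory_utilisation=0.8,max_model_length=4096"
```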


#### Pipeline parallelism

To evaluate a model using pipeline parallelism on 2 or more GPUs, run:
@@ -96,6 +157,6 @@ Nanotron models cannot be evaluated without torchrun.
```

The `nproc-per-node` argument should match the data, tensor and pipeline
parallelism configured in the `lighteval_config_template.yaml` file.
That is: `nproc-per-node = data_parallelism * tensor_parallelism *
pipeline_parallelism`.
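As a worked example of that product (the config values and the trailing arguments are placeholders):

```bash
# Assume lighteval_config_template.yaml sets:
#   data_parallelism=2, tensor_parallelism=2, pipeline_parallelism=1
# Then nproc-per-node must be 2 * 2 * 1 = 4:
torchrun --standalone --nnodes=1 --nproc-per-node=4 ...  # rest of the nanotron command as above
```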