
adding documentation #282

Closed

NathanHB wants to merge 29 commits into main from nathan-add-doc
Changes from 1 commit
Commits
29 commits
a324d63
adding documentation
NathanHB Aug 28, 2024
26d8402
adding documentation nanotron
NathanHB Aug 28, 2024
203045a
commit
NathanHB Sep 3, 2024
cbdcf1b
commit
NathanHB Sep 3, 2024
dd67ce4
Merge branch 'main' into nathan-add-doc
NathanHB Sep 3, 2024
015e924
undo unecessary changes
NathanHB Sep 3, 2024
4e9c30e
Merge branch 'main' into nathan-add-doc
NathanHB Sep 3, 2024
8aabbc8
still working on docs
NathanHB Sep 5, 2024
3a74186
Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int…
NathanHB Sep 5, 2024
db0c06d
Merge remote-tracking branch 'origin/main' into nathan-add-doc
NathanHB Sep 6, 2024
57b0cd4
commit
NathanHB Sep 9, 2024
7e4d56d
commit
NathanHB Sep 11, 2024
e533074
commit
NathanHB Sep 11, 2024
2f1c7f5
Update docs/source/installation.md
NathanHB Sep 17, 2024
0d1da5d
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
7a8782a
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
1c7454b
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
2539035
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
b5f2942
Update docs/source/saving_results.md
NathanHB Sep 17, 2024
9825950
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
fa67cf0
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
f17ce92
Update docs/source/adding_new_metric.md
NathanHB Sep 17, 2024
f3c319d
Update docs/source/adding_new_metric.md
NathanHB Sep 18, 2024
bcd6f50
Update docs/source/adding_new_task.md
NathanHB Sep 18, 2024
33c1e7f
Update docs/source/adding_new_task.md
NathanHB Sep 18, 2024
016cea4
fix
NathanHB Sep 18, 2024
e86912a
Merge branch 'nathan-add-doc' of github.com:huggingface/lighteval int…
NathanHB Sep 18, 2024
3aba2a1
fix
NathanHB Sep 18, 2024
af1ad13
commit
NathanHB Sep 18, 2024
commit
NathanHB committed Sep 18, 2024
commit af1ad1302cee6deb08e468973c9bd99f83db8160
2 changes: 2 additions & 0 deletions docs/source/installation.md
@@ -32,6 +32,8 @@ appropriate extras group.
| tensorboardX | To upload your results to tensorboard |
| vllm | To use vllm as backend for inference |
| s3 | To upload results to s3 |
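For example, to install one of these optional dependency groups, a minimal sketch (assuming the `lighteval` package name on PyPI and standard pip extras syntax):

```bash
# Install lighteval together with the vllm extra to enable vLLM-backed inference
pip install lighteval[vllm]

# Several extras can be combined in one install
pip install "lighteval[tensorboardX,s3]"
```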


## Hugging Face login

If you want to push your results to the Hugging Face Hub or evaluate your own
142 changes: 69 additions & 73 deletions docs/source/metric_list.md
@@ -1,78 +1,74 @@
# Metrics

- MetricCategory.TARGET_PERPLEXITY
- acc_golds_likelihood
- target_perplexity
## Metrics for multiple choice tasks
These metrics use the log-likelihood of the different possible targets.
- `loglikelihood_acc` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_single_token`)
- `loglikelihood_acc_norm` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_norm_single_token`)
- `loglikelihood_acc_norm_nospace` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored
- `loglikelihood_f1` (Harness): Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_f1_single_token`)
- `mcc` (Harness): Matthews correlation coefficient (a measure of agreement between statistical distributions).
- `recall_at_1` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_1_single_token`)
- `recall_at_2` (Harness): Fraction of instances where the choice with the 2nd best logprob or better was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_2_single_token`)
- `mrr` (Harness): Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance - also exists in a faster version for tasks where the possible choices include only one token (`mrr_single_token`)
- `target_perplexity` (Harness): Perplexity of the different choices available.
- `acc_golds_likelihood` (Harness): A bit different: it checks whether the average logprob of a single target is above or below 0.5.
- `multi_f1_numeric`: Loglikelihood F1 score for multiple gold targets

- MetricCategory.MULTICHOICE_ONE_TOKEN
- loglikelihood_acc_norm_single_token
- loglikelihood_acc_single_token
- loglikelihood_f1_single_token
- mcc_single_token
- mrr_single_token
- multi_f1_numeric
- recall_at_1_single_token
- recall_at_2_single_token
All these metrics also exist in a "single token" version (`loglikelihood_acc_single_token`, `loglikelihood_acc_norm_single_token`, `loglikelihood_f1_single_token`, `mcc_single_token`, `recall_at_1_single_token`, `recall_at_2_single_token` and `mrr_single_token`). When the multichoice options compare only one token each (e.g. "A" vs "B" vs "C" vs "D", or "yes" vs "no"), using the single token version of these metrics divides the time spent by the number of choices. Single token evals also include:
- `multi_f1_numeric` (Harness, for CB): computes the f1 score of all possible choices and averages it.

- MetricCategory.IGNORED
- prediction_perplexity
## Metrics for perplexity and language modeling
These metrics use the log-likelihood of the prompt.
- `word_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of words of the sequence.
- `byte_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
- `bits_per_byte` (HELM): Average number of bits per byte according to model probabilities.
- `log_prob` (HELM): Predicted output's average log probability (input's log prob for language modeling).

- MetricCategory.PERPLEXITY
- bits_per_byte
- byte_perplexity
- word_perplexity

- MetricCategory.GENERATIVE
- bert_score
- bleu
- bleu_1
- bleu_4
- bleurt
- chrf
- copyright
- drop
- exact_match
- extractiveness
- f1_score_quasi
- f1_score
- f1_score_macro
- f1_score_micro
- faithfulness
- perfect_exact_match
- prefix_exact_match
- prefix_quasi_exact_match
- quasi_exact_match
- quasi_exact_match_math
- quasi_exact_match_triviaqa
- quasi_exact_match_gsm8k
- rouge_t5
- rouge1
- rouge2
- rougeL
- rougeLsum
- ter

- MetricCategory.GENERATIVE_SAMPLING
- maj_at_4_math
- maj_at_5
- maj_at_8
- maj_at_8_gsm8k

- MetricCategory.LLM_AS_JUDGE_MULTI_TURN
- llm_judge_multi_turn_gpt3p5
- llm_judge_multi_turn_llama_3_405b

- MetricCategory.LLM_AS_JUDGE
- llm_judge_gpt3p5
- llm_judge_llama_3_405b

- MetricCategory.MULTICHOICE
- loglikelihood_acc
- loglikelihood_acc_norm
- loglikelihood_acc_norm_nospace
- loglikelihood_f1
- mcc
- mrr
- recall_at_1
- recall_at_2
- truthfulqa_mc_metrics
## Metrics for generative tasks
These metrics need the model to generate an output. They are therefore slower.
- Base:
  - `perfect_exact_match` (Harness): Fraction of instances where the prediction matches the gold exactly.
  - `exact_match` (HELM): Fraction of instances where the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
  - `quasi_exact_match` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...). Other variations exist, with other normalizers, such as `quasi_exact_match_triviaqa`, which only normalizes the predictions after applying a strip to all sentences.
  - `prefix_exact_match` (HELM): Fraction of instances where the beginning of the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
  - `prefix_quasi_exact_match` (HELM): Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...).
  - `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed.
  - `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first.
  - `f1_score`: Average F1 score in terms of word overlap between the model output and gold without normalisation.
  - `f1_score_macro`: Corpus level macro F1 score.
  - `f1_score_micro`: Corpus level micro F1 score.
  - `maj_at_5` and `maj_at_8`: Model majority vote. Takes n (5 or 8) generations from the model and assumes the most frequent is the actual prediction.
- Summarization:
  - `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/).
  - `rouge1` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 1-gram overlap.
  - `rouge2` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 2-gram overlap.
  - `rougeL` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
  - `rougeLsum` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap, computed at the summary level.
  - `rouge_t5` (BigBench): Corpus level ROUGE score for all available ROUGE metrics.
  - `faithfulness` (HELM): Faithfulness scores based on the SummaC method of [Laban et al. (2022)](https://aclanthology.org/2022.tacl-1.10/).
  - `extractiveness` (HELM): Reports the following extractiveness statistics, based on [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/):
    - `summarization_coverage`: Extent to which the model-generated summaries are extractive fragments from the source document,
    - `summarization_density`: Extent to which the model-generated summaries are extractive summaries based on the source document,
    - `summarization_compression`: Extent to which the model-generated summaries are compressed relative to the source document.
  - `bert_score` (HELM): Reports the average BERTScore precision, recall, and f1 score [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and gold summary.
- Translation:
  - `bleu`: Corpus level BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) - uses the sacrebleu implementation.
  - `bleu_1` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 1-gram overlap - uses the nltk implementation.
  - `bleu_4` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 4-gram overlap - uses the nltk implementation.
  - `chrf` (Harness): Character n-gram matches f-score.
  - `ter` (Harness): Translation edit/error rate.
- Copyright:
  - `copyright` (HELM): Reports:
    - `longest_common_prefix_length`: average length of longest common prefix between model generation and reference,
    - `edit_distance`: average Levenshtein edit distance between model generation and reference,
    - `edit_similarity`: average Levenshtein edit similarity (normalized by length of longer sequence) between model generation and reference.
- Math:
  - `quasi_exact_match_math` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where latex symbols, units, etc are removed).
  - `maj_at_4_math` (Lighteval): Majority choice evaluation, using the math normalisation for the predictions and gold.
  - `quasi_exact_match_gsm8k` (Harness): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed).
  - `maj_at_8_gsm8k` (Lighteval): Majority choice evaluation, using the gsm8k normalisation for the predictions and gold.
- LLM-as-Judge (see the note after this list):
  - `llm_judge_gpt3p5`: Can be used for any generative task; the model is scored by a GPT-3.5 model through the OpenAI API.
  - `llm_judge_llama_3_405b`: Can be used for any generative task; the model is scored by a Llama 3 405B model through the OpenAI API.
  - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task; the model is scored by a GPT-3.5 model through the OpenAI API. Used for multi-turn tasks like MT-Bench.
  - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task; the model is scored by a Llama 3 405B model through the OpenAI API. Used for multi-turn tasks like MT-Bench.
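Since these judge metrics call out to the OpenAI API, the judge needs credentials at runtime. A minimal sketch, assuming the standard `OPENAI_API_KEY` environment variable (the value is a placeholder):

```bash
# Export the OpenAI API key before launching an evaluation that uses an
# llm_judge_* metric (the value below is a placeholder).
export OPENAI_API_KEY="sk-..."
```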
63 changes: 62 additions & 1 deletion docs/source/quicktour.md
@@ -60,6 +60,67 @@ accelerate launch --multi_gpu --num_processes=8 -m \
Here, `--override_batch_size` defines the batch size per device, so the effective
batch size will be `override_batch_size * num_gpus`.
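As a quick worked example of that rule (the numbers are illustrative):

```bash
# With the command above:
#   --num_processes=8        -> 8 GPUs, one process per device
#   --override_batch_size 4  -> batch size of 4 per device
# Effective batch size: 4 * 8 = 32
```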

### Model Arguments

The `--model_args` argument takes a string representing a comma-separated list of model
arguments. The allowed arguments vary depending on the backend you use (vllm or
accelerate); a sketch of an example argument string follows each list below.

#### Accelerate

- **pretrained** (str):
HuggingFace Hub model ID name or the path to a pre-trained
model to load. This is effectively the `pretrained_model_name_or_path`
argument of `from_pretrained` in the HuggingFace `transformers` API.
- **tokenizer** (Optional[str]): HuggingFace Hub tokenizer ID that will be
used for tokenization.
- **multichoice_continuations_start_space** (Optional[bool]): Whether to add a
space at the start of each continuation in multichoice generation.
For example, context: "What is the capital of France?" and choices: "Paris", "London".
Will be tokenized as: "What is the capital of France? Paris" and "What is the capital of France? London".
True adds a space, False strips a space, None does nothing
- **subfolder** (Optional[str]): The subfolder within the model repository.
- **revision** (str): The revision of the model.
- **max_gen_toks** (Optional[int]): The maximum number of tokens to generate.
- **max_length** (Optional[int]): The maximum length of the generated output.
- **add_special_tokens** (Optional[bool], defaults to None): Whether to add special tokens to the input sequences.
If `None`, the default value will be set to `True` for seq2seq models (e.g. T5) and
`False` for causal models.
- **model_parallel** (Optional[bool], defaults to None):
Whether to force use of the `accelerate` library to load a large
model across multiple devices.
If `None`, the number of processes is compared with the number of GPUs:
if it is smaller, model parallelism is used; otherwise it is not.
- **dtype** (Union[str, torch.dtype], optional, defaults to None):
Converts the model weights to `dtype`, if specified. Strings get
converted to `torch.dtype` objects (e.g. `float16` -> `torch.float16`).
Use `dtype="auto"` to derive the type from the model's weights.
- **device** (Union[int, str]): Device on which to load and run the model.
- **quantization_config** (Optional[BitsAndBytesConfig]): quantization
configuration for the model, manually provided to load a normally floating point
model at a quantized precision. Needed for 4-bit and 8-bit precision.
- **trust_remote_code** (bool): Whether to trust remote code during model
loading.
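A sketch of how these options combine into a single `--model_args` string. The model ID and values are illustrative, and the `lighteval accelerate` entry point with the `--tasks`/`--output_dir` flags is assumed from the rest of this quicktour and may differ in your version:

```bash
# Hypothetical example: evaluate a Hub model in float16 with the accelerate backend
lighteval accelerate \
    --model_args "pretrained=openai-community/gpt2,dtype=float16,trust_remote_code=True" \
    --override_batch_size 1 \
    --tasks "leaderboard|truthfulqa:mc|0|0" \
    --output_dir ./results
```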

#### VLLM

- **pretrained** (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
- **gpu_memory_utilisation** (float): The fraction of GPU memory to use.
- **batch_size** (int): The batch size to use for evaluation.
- **revision** (str): The revision of the model.
- **dtype** (str, None): The data type to use for the model.
- **tensor_parallel_size** (int): The number of tensor parallel units to use.
- **data_parallel_size** (int): The number of data parallel units to use.
- **max_model_length** (int): The maximum sequence length of the model.
- **swap_space** (int): The CPU swap space size (GiB) per GPU.
- **seed** (int): The seed to use for the model.
- **trust_remote_code** (bool): Whether to trust remote code during model loading.
- **use_chat_template** (bool): Whether to use the chat template or not.
- **add_special_tokens** (bool): Whether to add special tokens to the input sequences.
- **multichoice_continuations_start_space** (bool): Whether to add a space at the start of each continuation in multichoice generation.
- **subfolder** (Optional[str]): The subfolder within the model repository.
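And a corresponding sketch for the vllm backend (the values are illustrative; pass the string to the lighteval command in the same way as above):

```bash
# Hypothetical --model_args string for a model sharded over 2 GPUs with vLLM
MODEL_ARGS="pretrained=openai-community/gpt2,dtype=float16,tensor_parallel_size=2,gpu_memory_utilisation=0.8,max_model_length=4096"
```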


#### Pipeline parallelism

To evaluate a model using pipeline parallelism on 2 or more GPUs, run:
@@ -96,6 +157,6 @@ Nanotron models cannot be evaluated without torchrun.
```

The `nproc-per-node` argument should match the data, tensor and pipeline
parallelism configured in the `lighteval_config_template.yaml` file.
That is: `nproc-per-node = data_parallelism * tensor_parallelism *
pipeline_parallelism`.
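As a worked example of that product (the config values and the trailing arguments are placeholders):

```bash
# Assume lighteval_config_template.yaml sets:
#   data_parallelism=2, tensor_parallelism=2, pipeline_parallelism=1
# Then nproc-per-node must be 2 * 2 * 1 = 4:
torchrun --standalone --nnodes=1 --nproc-per-node=4 ...  # rest of the nanotron command as above
```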