From a324d63ef9a72dbfe3ce786417ef586d69da19a6 Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Wed, 28 Aug 2024 14:06:29 +0200
Subject: [PATCH 01/24] adding documentation

---
 docs/source/_toctree.yml         | 20 ++++++++++
 docs/source/adding_new_metric.md | 33 ++++++++++++++++
 docs/source/adding_new_task.md   | 22 +++++++++++
 docs/source/index.md             | 12 ++++++
 docs/source/installation.md      | 42 +++++++++++++++++++++
 docs/source/metric_list.md       |  0
 docs/source/quicktour.md         | 64 ++++++++++++++++++++++++++++++++
 docs/source/task_list.md         |  0
 8 files changed, 193 insertions(+)
 create mode 100644 docs/source/_toctree.yml
 create mode 100644 docs/source/adding_new_metric.md
 create mode 100644 docs/source/adding_new_task.md
 create mode 100644 docs/source/index.md
 create mode 100644 docs/source/installation.md
 create mode 100644 docs/source/metric_list.md
 create mode 100644 docs/source/quicktour.md
 create mode 100644 docs/source/task_list.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
new file mode 100644
index 000000000..f110dafb3
--- /dev/null
+++ b/docs/source/_toctree.yml
@@ -0,0 +1,20 @@
+- sections:
+  - title: Getting Started
+  - local: index
+    title: 🌤️ Lighteval
+  - local: installation
+    title: Installation
+  - local: quicktour
+    title: Quicktour
+- sections:
+  - title: Guides
+  - local: adding_new_task
+    title: Adding a Custom Task
+  - local: adding_new_metric
+    title: Adding a Custom Metric
+- sections:
+  - title: API Reference
+  - local: metric_list
+    title: Available Metrics
+  - local: task_list
+    title: Available Tasks
diff --git a/docs/source/adding_new_metric.md b/docs/source/adding_new_metric.md
new file mode 100644
index 000000000..71bb039d9
--- /dev/null
+++ b/docs/source/adding_new_metric.md
@@ -0,0 +1,33 @@
+# Adding a New Metric
+
+First, check if you can use one of the parametrized functions in
+``src.lighteval.metrics.metrics_corpus`` or ``src.lighteval.metrics.metrics_sample``.
+
+If not, you can use the `custom_task` system to register your new metric:
+
+- Create a new Python file which should contain the full logic of your metric.
+- The file also needs to start with these imports
+
+```python
+from aenum import extend_enum
+from lighteval.metrics import Metrics
+
+# And any other class you might need to redefine your specific metric,
+# depending on whether it's a sample or corpus metric.
+```
+
+- And to end with the following, so that it adds your metric to our metrics
+  list when loaded as a module.
+
+```python
+# Adds the metric to the metric list!
+extend_enum(Metrics, "metric_name", metric_function)
+if __name__ == "__main__":
+    print("Imported metric")
+```
+
+You can then give your custom metric to lighteval by using `--custom_tasks
+path_to_your_file` when launching it.
+
+To see an example of a custom metric added along with a custom task, look at
+``examples/tasks/custom_tasks_with_custom_metrics/ifeval/ifeval.py.``
diff --git a/docs/source/adding_new_task.md b/docs/source/adding_new_task.md
new file mode 100644
index 000000000..2e8f2edaa
--- /dev/null
+++ b/docs/source/adding_new_task.md
@@ -0,0 +1,22 @@
+# Adding a Custom Task
+
+To add a new task, first open an issue to determine whether it will be
+integrated in the core evaluations of lighteval, in the extended tasks, or in
+the community tasks, and add its dataset on the hub.
+
+- Core evaluations are evaluations that only require standard logic in their
+  metrics and processing, and that we will add to our test suite to ensure
+  non-regression over time. They already see high usage in the community.
+- Extended evaluations are evaluations that require custom logic in their
+  metrics (complex normalisation, an LLM as a judge, ...), which we added to
+  make users' lives easier. They already see high usage in the community.
+- Community evaluations are new tasks submitted by the community.
+
+A popular community evaluation can move to become an extended or core evaluation over time.
+
+[`lighteval.metrics.utils.CorpusLevelMetric`]
+
+TODO: Add code snippet to show how to add a new task to lighteval.
+
+```python
+```
diff --git a/docs/source/index.md b/docs/source/index.md
new file mode 100644
index 000000000..55b374b36
--- /dev/null
+++ b/docs/source/index.md
@@ -0,0 +1,12 @@
+# 🌤️ Lighteval
+
+A lightweight framework for LLM evaluation
+
+LightEval is a lightweight LLM evaluation suite that Hugging Face has been
+using internally with the recently released LLM data processing library
+datatrove and LLM training library nanotron.
+
+We're releasing it with the community in the spirit of building in the open.
+
+Note that it is still very much in its early days, so don't expect 100%
+stability ^^'. In case of problems or questions, feel free to open an issue!
diff --git a/docs/source/installation.md b/docs/source/installation.md
new file mode 100644
index 000000000..cdbb87b8c
--- /dev/null
+++ b/docs/source/installation.md
@@ -0,0 +1,42 @@
+# Installation
+
+You can install Lighteval either from PyPI or from source.
+
+## From PyPI
+
+```bash
+pip install lighteval
+```
+
+## From source
+
+```bash
+git clone https://github.com/huggingface/lighteval.git
+cd lighteval
+pip install -e .
+```
+
+### Extras
+
+Lighteval has optional dependencies that you can install by specifying the
+appropriate extras group with `pip install lighteval[<group>]` or `pip install
+-e .[<group>]`, for example `pip install lighteval[accelerate]`.
+
+| extra name   | description                                                               |
+|--------------|---------------------------------------------------------------------------|
+| accelerate   | To use accelerate for model and data parallelism with transformers models |
+| tgi          | To use Text Generation Inference API to evaluate your model               |
+| nanotron     | To evaluate nanotron models                                               |
+| quantization | To evaluate quantized models                                              |
+| adapters     | To evaluate adapter models (delta and PEFT)                               |
+| tensorboardX | To upload your results to tensorboard                                     |
+
+## Hugging Face login
+
+If you want to push your results to the Hugging Face Hub or evaluate your own
+private models, don't forget to add your access token to the environment
+variable `HF_TOKEN`. You can do this by running:
+
+```bash
+huggingface-cli login
+```
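+
+If you prefer to do this from Python, here is a minimal sketch using the
+`huggingface_hub` library (a dependency of lighteval):
+
+```python
+from huggingface_hub import login
+
+# Programmatic alternative to `huggingface-cli login`;
+# replace the placeholder with your own access token.
+login(token="hf_your_access_token")
+```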
diff --git a/docs/source/metric_list.md b/docs/source/metric_list.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
new file mode 100644
index 000000000..dc5b37ae0
--- /dev/null
+++ b/docs/source/quicktour.md
@@ -0,0 +1,64 @@
+# Quicktour
+
+We provide two main entry points to evaluate models:
+
+- `lighteval accelerate` : evaluate models on CPU or one or more GPUs using 🤗 Accelerate.
+- `lighteval nanotron`: evaluate models in distributed settings using ⚡️ Nanotron.
+
+## Accelerate
+
+### Evaluate a model on one or more GPUs
+
+To evaluate a model on one or more GPUs, first create a multi-gpu config by running.
+
+```bash
+accelerate config
+```
+
+You can then evaluate a model using data parallelism as follows:
+
+```bash
+accelerate launch --multi_gpu --num_processes=<num_gpus> -m \
+    lighteval accelerate \
+    --model_args="pretrained=<path to model on the hub>" \
+    --tasks <task parameters> \
+    --output_dir output_dir
+```
+
+Here, `--tasks` refers to either a comma-separated list of supported tasks from
+the tasks_list, in the following format (task details can also be found in the
+file implementing them):
+
+```bash
+suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}
+```
+
+or a file path like ``examples/tasks/recommended_set.txt`` which specifies
+multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA
+benchmark run:
+
+```bash
+accelerate launch --multi_gpu --num_processes=8 -m \
+    lighteval accelerate \
+    --model_args "pretrained=gpt2" \
+    --tasks "leaderboard|truthfulqa:mc|0|0" \
+    --override_batch_size 1 \
+    --output_dir="./evals/"
+```
+
+Here, --override_batch_size defines the batch size per device, so the effective
+batch size will be override_batch_size x num_gpus. To evaluate on multiple
+benchmarks, separate each task configuration with a comma, e.g.
+
+```bash
+accelerate launch --multi_gpu --num_processes=8 -m \
+    lighteval accelerate \
+    --model_args "pretrained=gpt2" \
+    --tasks "leaderboard|truthfulqa:mc|0|0,leaderboard|gsm8k|0|0" \
+    --override_batch_size 1 \
+    --output_dir="./evals/"
+```
+
+## Nanotron
+
+...
diff --git a/docs/source/task_list.md b/docs/source/task_list.md
new file mode 100644
index 000000000..e69de29bb

From 26d84023e0e78c7e295baa45b1d21b869f46a1ff Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Wed, 28 Aug 2024 17:35:50 +0200
Subject: [PATCH 02/24] adding documentation nanotron

---
 docs/source/quicktour.md | 18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
index dc5b37ae0..56fb106fa 100644
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@@ -61,4 +61,20 @@ accelerate launch --multi_gpu --num_processes=8 -m \
 
 ## Nanotron
 
-...
+To evaluate a model trained with nanotron on a single GPU:
+
+<Tip warning={true}>
+Nanotron models cannot be evaluated without torchrun.
+</Tip>
+
+```bash
+torchrun --standalone --nnodes=1 --nproc-per-node=1 \
+    src/lighteval/__main__.py nanotron \
+    --checkpoint-config-path ../nanotron/checkpoints/10/config.yaml \
+    --lighteval-override examples/nanotron/lighteval_config_override_template.yaml
+```
+
+The `nproc-per-node` argument should match the data, tensor and pipeline
+parallelism configured in the `lighteval_config_override_template.yaml` file.
+That is: `nproc-per-node = data_parallelism * tensor_parallelism *
+pipeline_parallelism`. For example, with `dp=2`, `tp=2` and `pp=1`, you would
+pass `--nproc-per-node=4`.

From 203045a8431bc9b77245c9998e05fc54509ea07f Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Tue, 3 Sep 2024 14:09:23 +0200
Subject: [PATCH 03/24] commit

---
 docs/source/_toctree.yml              |  44 ++++-----
 docs/source/adding_new_metric.md      |  70 +++++++++++++--
 docs/source/adding_new_task.md        | 125 +++++++++++++++++++++++++-
 docs/source/metric_list.md            |  12 +++
 docs/source/quicktour.md              |  62 ++++++++-----
 docs/source/task_list.md              |   0
 src/lighteval/metrics/metrics.py      |   8 ++
 src/lighteval/metrics/utils.py        |  23 +++++
 src/lighteval/tasks/default_tasks.py  |   4 +
 src/lighteval/tasks/lighteval_task.py |   3 +
 10 files changed, 299 insertions(+), 52 deletions(-)
 delete mode 100644 docs/source/task_list.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index f110dafb3..6c41e0777 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -1,20 +1,24 @@
-- sections:
-  - title: Getting Started
-  - local: index
-    title: 🌤️ Lighteval
-  - local: installation
-    title: Installation
-  - local: quicktour
-    title: Quicktour
-- sections:
-  - title: Guides
-  - local: adding_new_task
-    title: Adding a Custom Task
-  - local: adding_new_metric
-    title: Adding a Custom Metric
-- sections:
-  - title: API Reference
-  - local: metric_list
-    title: Available Metrics
-  - local: task_list
-    title: Available Tasks
+- local: index
+  title: 🌤️ Lighteval
+- title: "Getting Started"
+  sections:
+    - local: installation
+      title: Installation
+    - local: quicktour
+      title: Quicktour
+- title: "Guides"
+  sections:
+    - local: use_vllm
+      title: Using VLLM as backend
+    - local: adding_new_task
+      title: Adding a Custom Task
+    - local: adding_new_metric
+      title: Adding a Custom Metric
+    - local: saving_results
+      title: Saving Results
+- title: "API Reference"
+  sections:
+    - local: metric_list
+      title: Available Metrics
+    - local: tasks
+      title: Available Tasks
diff --git a/docs/source/adding_new_metric.md b/docs/source/adding_new_metric.md
index 71bb039d9..16281815c 100644
--- a/docs/source/adding_new_metric.md
+++ b/docs/source/adding_new_metric.md
@@ -1,23 +1,80 @@
 # Adding a New Metric
 
 First, check if you can use one of the parametrized functions in
-``src.lighteval.metrics.metrics_corpus`` or ``src.lighteval.metrics.metrics_sample``.
+[src.lighteval.metrics.metrics_corpus]() or
+[src.lighteval.metrics.metrics_sample]().
 
 If not, you can use the `custom_task` system to register your new metric:
 
+<Tip>
+To see an example of a custom metric added along with a custom task, look at
+<a href="">the IFEval custom task</a>.
+</Tip>
+
 - Create a new Python file which should contain the full logic of your metric.
 - The file also needs to start with these imports
 
 ```python
 from aenum import extend_enum
 from lighteval.metrics import Metrics
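+# Doc is used by the sample-level metric below; this import path is an
+# assumption, adjust it to your lighteval version
+from lighteval.tasks.requests import Doc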
+```
+
+You need to define a sample-level metric:
+
+```python
+def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
+    response = predictions[0]
+    return response == formatted_doc.choices[formatted_doc.gold_index]
+```
+
+Here the sample-level metric only returns one value. If you want to return multiple metrics per sample, return a dictionary with the metric names as keys and their values as values.
 
-# And any other class you might need to redefine your specific metric,
-# depending on whether it's a sample or corpus metric.
+```python
+def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
+    response = predictions[0]
+    return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
 ```
 
-- And to end with the following, so that it adds your metric to our metrics
-  list when loaded as a module.
+Then, you can define an aggregation function if needed; a common aggregation function is `np.mean`.
+
+```python
+def agg_function(items):
+    flat_items = [item for sublist in items for item in sublist]
+    score = sum(flat_items) / len(flat_items)
+    return score
+```
+
+Finally, you can define your metric. If it's a sample level metric, you can use the following code:
+
+```python
+my_custom_metric = SampleLevelMetric(
+    metric_name={custom_metric_name},
+    higher_is_better={either True or False},
+    category={MetricCategory},
+    use_case={MetricUseCase},
+    sample_level_fn=custom_metric,
+    corpus_level_fn=agg_function,
+)
+```
+
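+For instance, filling in the template above for the exact-match `custom_metric`
+defined earlier might look like this (a sketch; the import path and the enum
+values chosen here are assumptions to adapt to your setup):
+
+```python
+import numpy as np
+
+# Import path is an assumption; adjust it to your lighteval version.
+from lighteval.metrics.utils import (
+    MetricCategory,
+    MetricUseCase,
+    SampleLevelMetric,
+)
+
+my_custom_metric = SampleLevelMetric(
+    metric_name="my_exact_match",        # hypothetical metric name
+    higher_is_better=True,
+    category=MetricCategory.GENERATIVE,  # pick the category matching your task
+    use_case=MetricUseCase.ACCURACY,     # pick the closest use case
+    sample_level_fn=custom_metric,       # the sample-level function defined above
+    corpus_level_fn=np.mean,             # average the per-sample scores
+)
+```
+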
+If your metric defines multiple metrics per sample, you can use the following code:
+
+```python
+custom_metric = SampleLevelMetricGrouping(
+    metric_name={submetric_names},
+    higher_is_better={n: {True or False} for n in submetric_names},
+    category={MetricCategory},
+    use_case={MetricUseCase},
+    sample_level_fn=custom_metric,
+    corpus_level_fn={
+        "accuracy": np.mean,
+        "other_metric": agg_function,
+    },
+)
+```
+
+Finally, end your file with the following, so that your metric is added to our
+metrics list when the file is loaded as a module.
 
 ```python
 # Adds the metric to the metric list!
@@ -28,6 +85,3 @@ if __name__ == "__main__":
 
 You can then give your custom metric to lighteval by using `--custom_tasks
 path_to_your_file` when launching it.
-
-To see an example of a custom metric added along with a custom task, look at
-``examples/tasks/custom_tasks_with_custom_metrics/ifeval/ifeval.py.``
diff --git a/docs/source/adding_new_task.md b/docs/source/adding_new_task.md
index 2e8f2edaa..650881f3b 100644
--- a/docs/source/adding_new_task.md
+++ b/docs/source/adding_new_task.md
@@ -14,9 +14,130 @@ community tasks, and add its dataset on the hub.
 
 A popular community evaluation can move to become an extended or core evaluation over time.
 
-[`lighteval.metrics.utils.CorpusLevelMetric`]
+<Tip>
+You can find examples of custom tasks in the <a
+href="https://github.com/huggingface/lighteval/tree/main/community_tasks">community_task</a>
+directory.
+</Tip>
 
-TODO: Add code snippet to show how to add a new task to lighteval.
+## Step-by-step creation of a custom task
+
+First, create a Python file under the `community_tasks` directory.
+
+You need to define a prompt function that will convert a line from your
+dataset to a document to be used for evaluation.
+
+```python
+# Define as many as you need for your different tasks
+def prompt_fn(line, task_name: str = None):
+    """Defines how to go from a dataset line to a doc object.
+    Follow examples in src/lighteval/tasks/tasks_prompt_formatting.py, or get more info
+    about what this function should do in the README.
+    """
+    return Doc(
+        task_name=task_name,
+        query="",
+        choices="",
+        gold_index=0,
+        instruction="",
+    )
+```
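+
+As a concrete illustration, a hypothetical prompt function for a multiple-choice
+dataset whose rows contain `question`, `choices` and `answer` columns (the
+column names are assumptions) could look like:
+
+```python
+# The import path of Doc is an assumption; adjust to your lighteval version.
+from lighteval.tasks.requests import Doc
+
+
+def mc_prompt_fn(line, task_name: str = None):
+    """Builds a Doc for a hypothetical multiple-choice dataset."""
+    return Doc(
+        task_name=task_name,
+        query=f"Question: {line['question']}\nAnswer:",
+        choices=[f" {choice}" for choice in line["choices"]],
+        gold_index=line["answer"],  # index of the correct choice
+        instruction="",
+    )
+```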
+
+Then, you need to choose a metric: you can either use an existing one (defined
+in `lighteval/metrics/metrics.py`) or [create a custom one](./adding_new_metric).
 
 ```python
+custom_metric = SampleLevelMetric(
+    metric_name="my_custom_metric_name",
+    higher_is_better=True,
+    category=MetricCategory.IGNORED,
+    use_case=MetricUseCase.NONE,
+    sample_level_fn=lambda x: x,  # how to compute score for one sample
+    corpus_level_fn=np.mean,  # How to aggregate the sample metrics
+)
+```
+
+Then, you need to define your task. You can define a task with or without subsets.
+To define a task with no subsets:
+
+```python
+# This is how you create a simple task (like hellaswag) which has one single subset
+# attached to it, and one evaluation possible.
+task = LightevalTaskConfig(
+    name="myothertask",
+    prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
+    suite=["community"],
+    hf_repo="",
+    hf_subset="default",
+    hf_avail_splits=[],
+    evaluation_splits=[],
+    few_shots_split=None,
+    few_shots_select=None,
+    metric=[],  # select your metric in Metrics
+)
+```
+
+If you want to create a task with multiple subsets, add them to the
+`SAMPLE_SUBSETS` list and create a task for each subset.
+
+```python
+SAMPLE_SUBSETS = []  # list of all the subsets to use for this eval
+
+
+class CustomSubsetTask(LightevalTaskConfig):
+    def __init__(
+        self,
+        name,
+        hf_subset,
+    ):
+        super().__init__(
+            name=name,
+            hf_subset=hf_subset,
+            prompt_function=prompt_fn,  # must be defined in the file or imported from src/lighteval/tasks/tasks_prompt_formatting.py
+            hf_repo="",
+            metric=[custom_metric],  # select your metric in Metrics or use your custom_metric
+            hf_avail_splits=[],
+            evaluation_splits=[],
+            few_shots_split=None,
+            few_shots_select=None,
+            suite=["community"],
+            generation_size=-1,
+            stop_sequence=None,
+            output_regex=None,
+            frozen=False,
+        )
+SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
+```
+
+Then you need to add your task to the `TASKS_TABLE` list.
+
+```python
+# STORE YOUR EVALS
+
+# tasks with subset:
+TASKS_TABLE = SUBSET_TASKS
+
+# tasks without subset:
+# TASKS_TABLE = [task]
+```
+
+Finally, add the following module logic, which lets you check your task
+definitions by running the file directly.
+
+```python
+# MODULE LOGIC
+# You should not need to touch this
+if __name__ == "__main__":
+    print([t.name for t in TASKS_TABLE])
+    print(len(TASKS_TABLE))
+```
+
+Once your file is created, you can run the evaluation with the following command:
+
+```bash
+lighteval accelerate \
+    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
+    --tasks "community|{custom_task}|{fewshots}|{truncate_few_shot}" \
+    --custom_tasks {path_to_your_custom_task_file} \
+    --output_dir "./evals"
 ```
diff --git a/docs/source/metric_list.md b/docs/source/metric_list.md
index e69de29bb..d47fb65bb 100644
--- a/docs/source/metric_list.md
+++ b/docs/source/metric_list.md
@@ -0,0 +1,12 @@
+
+
+
+[[autodoc]] lighteval.metrics.stderr.get_stderr_function
+
+[[autodoc]] lighteval.logging.hierarchical_logger.hlog
+
+[[autodoc]] lighteval.metrics.utils.Metric
+
+`[[autodoc]] lighteval.models.model_output.ModelResponse`
+
+`[[autodoc]] lighteval.tasks.lighteval_task.LightevalTask`
diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
index 56fb106fa..73099ec5a 100644
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@@ -2,31 +2,27 @@
 
 We provide two main entry points to evaluate models:
 
-- `lighteval accelerate` : evaluate models on CPU or one or more GPUs using 🤗 Accelerate.
-- `lighteval nanotron`: evaluate models in distributed settings using ⚡️ Nanotron.
+- `lighteval accelerate` : evaluate models on CPU or one or more GPUs using [🤗
+  Accelerate](https://github.com/huggingface/accelerate)
+- `lighteval nanotron`: evaluate models in distributed settings using [⚡️
+  Nanotron](https://github.com/huggingface/nanotron)
 
 ## Accelerate
 
-### Evaluate a model on one or more GPUs
+### Evaluate a model on a GPU
 
-To evaluate a model on one or more GPUs, first create a multi-gpu config by running.
+To evaluate `GPT-2` on the Truthful QA benchmark, run:
 
 ```bash
-accelerate config
-```
-
-You can then evaluate a model using data parallelism as follows:
-
-```bash
-accelerate launch --multi_gpu --num_processes=<num_gpus> -m \
-    lighteval accelerate \
-    --model_args="pretrained=<path to model on the hub>" \
-    --tasks <task parameters> \
-    --output_dir output_dir
+lighteval accelerate \
+     --model_args "pretrained=gpt2" \
+     --tasks "leaderboard|truthfulqa:mc|0|0" \
+     --override_batch_size 1 \
+     --output_dir="./evals/"
 ```
 
 Here, `--tasks` refers to either a comma-separated list of supported tasks from
-the tasks_list, in the following format (task details can also be found in the
+the `tasks_list`, in the following format (task details can also be found in the
 file implementing them):
 
 ```bash
@@ -37,6 +33,18 @@ or a file path like ``examples/tasks/recommended_set.txt`` which specifies
 multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA
 benchmark run:
 
+### Evaluate a model on one or more GPUs
+
+#### Data parallelism
+
+To evaluate a model on one or more GPUs, first create a multi-GPU config by running:
+
+```bash
+accelerate config
+```
+
+You can then evaluate a model using data parallelism on 8 GPUs as follows:
+
 ```bash
 accelerate launch --multi_gpu --num_processes=8 -m \
     lighteval accelerate \
@@ -46,19 +54,29 @@ accelerate launch --multi_gpu --num_processes=8 -m \
     --output_dir="./evals/"
 ```
 
-Here, --override_batch_size defines the batch size per device, so the effective
-batch size will be override_batch_size x num_gpus. To evaluate on multiple
-benchmarks, separate each task configuration with a comma, e.g.
+Here, `--override_batch_size` defines the batch size per device, so the effective
+batch size will be `override_batch_size * num_gpus`.
+
+#### Pipeline parallelism
+
+To evaluate a model using pipeline parallelism on 2 or more GPUs, run:
 
 ```bash
-accelerate launch --multi_gpu --num_processes=8 -m \
     lighteval accelerate \
-    --model_args "pretrained=gpt2" \
-    --tasks "leaderboard|truthfulqa:mc|0|0,leaderboard|gsm8k|0|0" \
+    --model_args "pretrained=gpt2,model_parallel=True" \
+    --tasks "leaderboard|truthfulqa:mc|0|0" \
     --override_batch_size 1 \
     --output_dir="./evals/"
 ```
 
+This will automatically use accelerate to distribute the model across the GPUs.
+
+<Tip>
+Both data and pipeline parallelism can be combined by setting
+`model_parallel=True` and using accelerate to distribute the data across the
+GPUs.
+</Tip>
+
 ## Nanotron
 
 To evaluate a model trained with nanotron on a single GPU:
diff --git a/docs/source/task_list.md b/docs/source/task_list.md
deleted file mode 100644
index e69de29bb..000000000
diff --git a/src/lighteval/metrics/metrics.py b/src/lighteval/metrics/metrics.py
index 04b86ded5..9b4203475 100644
--- a/src/lighteval/metrics/metrics.py
+++ b/src/lighteval/metrics/metrics.py
@@ -74,6 +74,8 @@
 
 
 class Metrics(Enum):
+    """hello"""
+
     acc_golds_likelihood = SampleLevelMetric(  # todo: we need a better name for this!
         metric_name="acc",
         sample_level_fn=acc_golds_likelihood,
@@ -607,6 +609,12 @@ class Metrics(Enum):
     def __str__(self):
         return self.name.replace("_at_", "@")
 
+    def __call__(self, **kwargs):
+        pass
+
+    def __name__(self):
+        return self.name
+
     @staticmethod
     def higher_is_better():
         res = {}
diff --git a/src/lighteval/metrics/utils.py b/src/lighteval/metrics/utils.py
index c20da5399..bea484bc7 100644
--- a/src/lighteval/metrics/utils.py
+++ b/src/lighteval/metrics/utils.py
@@ -54,6 +54,25 @@ class MetricUseCase(str, Enum):
 
 @dataclass
 class Metric:
+    """
+    Array with associated photographic information.
+
+    ...
+
+    Attributes
+    ----------
+    exposure : float
+        Exposure in seconds.
+
+    Methods
+    -------
+    colorspace(c='rgb')
+        Represent the photo in the given colorspace.
+    gamma(n=1.0)
+        Change the photo's gamma exposure.
+
+    """
+
     metric_name: str
     higher_is_better: bool
     category: MetricCategory
@@ -111,3 +130,7 @@ class SampleLevelMetricGrouping(MetricGrouping):
     """MetricGrouping are computed per sample, then aggregated over the corpus"""
 
     pass
+
+
+def hello(hi):
+    pass
diff --git a/src/lighteval/tasks/default_tasks.py b/src/lighteval/tasks/default_tasks.py
index 96799e7d0..a563b3bde 100644
--- a/src/lighteval/tasks/default_tasks.py
+++ b/src/lighteval/tasks/default_tasks.py
@@ -24,6 +24,10 @@
 from lighteval.tasks.lighteval_task import LightevalTaskConfig
 
 
+"""
+default tasks
+"""
+
 abstract_narrative_understanding_bigbench = LightevalTaskConfig(
     name="abstract_narrative_understanding",
     suite=["bigbench", "bigbench_json"],
diff --git a/src/lighteval/tasks/lighteval_task.py b/src/lighteval/tasks/lighteval_task.py
index 0e7f06df9..edf925792 100644
--- a/src/lighteval/tasks/lighteval_task.py
+++ b/src/lighteval/tasks/lighteval_task.py
@@ -116,6 +116,9 @@ class LightevalTaskConfig:
 
     version: int = 0
 
+    def __doc__(self):
+        return """DOCUMENTAION"""
+
     def __post_init__(self):
         if self.suite is None:
             self.suite = ["custom"]

From cbdcf1b640bd819e4e5a072c572b32be0a724324 Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Tue, 3 Sep 2024 15:40:54 +0200
Subject: [PATCH 04/24] commit

---
 docs/source/_toctree.yml      |   2 +
 docs/source/metric_list.md    |  12 +-
 docs/source/saving_results.md | 211 ++++++++++++++++++++++++++++++++++
 docs/source/tasks.md          |   0
 docs/source/use_tgi.md        |   3 +
 docs/source/use_vllm.md       |   4 +
 6 files changed, 221 insertions(+), 11 deletions(-)
 create mode 100644 docs/source/saving_results.md
 create mode 100644 docs/source/tasks.md
 create mode 100644 docs/source/use_tgi.md
 create mode 100644 docs/source/use_vllm.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 6c41e0777..3c2a86540 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -10,6 +10,8 @@
   sections:
     - local: use_vllm
       title: Using VLLM as backend
+    - local: use_tgi
+      title: Using TGI as backend
     - local: adding_new_task
       title: Adding a Custom Task
     - local: adding_new_metric
diff --git a/docs/source/metric_list.md b/docs/source/metric_list.md
index d47fb65bb..9d72a2cac 100644
--- a/docs/source/metric_list.md
+++ b/docs/source/metric_list.md
@@ -1,12 +1,2 @@
 
-
-
-[[autodoc]] lighteval.metrics.stderr.get_stderr_function
-
-[[autodoc]] lighteval.logging.hierarchical_logger.hlog
-
-[[autodoc]] lighteval.metrics.utils.Metric
-
-`[[autodoc]] lighteval.models.model_output.ModelResponse`
-
-`[[autodoc]] lighteval.tasks.lighteval_task.LightevalTask`
+# Metrics
diff --git a/docs/source/saving_results.md b/docs/source/saving_results.md
new file mode 100644
index 000000000..be553155e
--- /dev/null
+++ b/docs/source/saving_results.md
@@ -0,0 +1,211 @@
+# Saving results
+
+## Saving results locally
+
+Lighteval will automatically save results and evaluation details in the directory
+set with the `--output_dir` argument. The results will be saved in
+`{output_dir}/results/{model_org}/{model_name}/results_{timestamp}.json`.
+[Here is an example of a result file](#example-of-a-result-file).
+
+To save the details of the evaluation, you can use the `--save_details`
+argument. The details will be saved in a parquet file
+`{output_dir}/details/{model_org}/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet`.
+
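+As an illustration of this layout, here is a minimal sketch that loads the most
+recent result file for a model (the `output_dir` and model names below are
+placeholders):
+
+```python
+import json
+from pathlib import Path
+
+output_dir = "evals_doc"  # value passed to --output_dir
+model_org, model_name = "HuggingFaceH4", "zephyr-7b-beta"
+
+# Timestamps sort lexicographically, so the last file is the most recent one
+results_dir = Path(output_dir) / "results" / model_org / model_name
+latest = sorted(results_dir.glob("results_*.json"))[-1]
+
+with latest.open() as f:
+    results = json.load(f)
+
+print(results["results"])
+```
+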
+## Pushing results to the Hugging Face Hub
+
+You can push the results and evaluation details to the Hugging Face Hub. To do
+so, you need to set the `--push_results_to_hub` as well as the `--results_org`
+argument. The results will be saved in a dataset named
+`{results_org}/{model_org}/{model_name}`. To push the details, you need to set
+the `--push_details_to_hub` argument.
+The dataset created will be private by default; you can make it public by
+setting the `--public_run` argument.
+
+
+## Pushing results to TensorBoard
+
+You can push the results to TensorBoard by setting the `--push_results_to_tensorboard` argument.
+
+
+## How to load and investigate details
+
+### Load from local detail files
+
+```python
+from datasets import load_dataset
+import os
+
+output_dir = "evals_doc"
+model = "HuggingFaceH4/zephyr-7b-beta"
+model_org = model.split("/")[0]
+model_name = model.split("/")[1]
+timestamp = "2024-09-03T15-06-11.234678"
+task = "lighteval|gsm8k|0"
+
+details_path = f"{output_dir}/details/{model_org}/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"
+
+# Load the details
+details = load_dataset("parquet", data_files=details_path, split="train")
+
+for detail in details:
+    print(detail)
+```
+
+### Load from the HuggingFace hub
+
+```python
+from datasets import load_dataset
+
+output_dir = "evals_doc"
+results_org = "SaylorTwift"
+model = "HuggingFaceH4/zephyr-7b-beta"
+model_org = model.split("/")[0]
+model_name = model.split("/")[1]
+timestamp = "2024-09-03T15-06-11.234678"
+task = "lighteval|gsm8k|0"
+public_run = False
+
+dataset_path = f"{results_org}/details_{model_name}{'_private' if not public_run else ''}"
+details = load_dataset(dataset_path, task.replace("|", "_"), split="latest")
+
+for detail in details:
+    print(detail)
+```
+
+
+The detail file contains the following columns (a snippet for inspecting them follows the list):
+- `choices`: The choices presented to the model in the case of multichoice tasks.
+- `gold`: The gold answer.
+- `gold_index`: The index of the gold answer in the choices list.
+- `cont_tokens`: The continuation tokens.
+- `example`: The input in text form.
+- `full_prompt`: The full prompt that is fed to the model.
+- `input_tokens`: The tokens of the full prompt.
+- `instruction`: The instruction given to the model.
+- `metrics`: The metrics computed for the example.
+- `num_asked_few_shots`: The number of few-shot examples requested.
+- `num_effective_few_shots`: The number of few-shot examples actually used.
+- `padded`: Whether the input was padded.
+- `pred_logits`: The logits of the model.
+- `predictions`: The predictions of the model.
+- `specifics`: The specifics of the task.
+- `truncated`: Whether the input was truncated.
+
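+If you prefer a tabular view of these columns, you can convert the loaded
+details to a pandas DataFrame (a minimal sketch):
+
+```python
+# `details` is the dataset loaded in one of the snippets above
+df = details.to_pandas()
+print(df[["example", "predictions", "metrics"]].head())
+```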
+
+## Example of a result file
+
+```json
+{
+  "config_general": {
+    "lighteval_sha": "203045a8431bc9b77245c9998e05fc54509ea07f",
+    "num_fewshot_seeds": 1,
+    "override_batch_size": 1,
+    "max_samples": 1,
+    "job_id": "",
+    "start_time": 620979.879320166,
+    "end_time": 621004.632108041,
+    "total_evaluation_time_secondes": "24.752787875011563",
+    "model_name": "gpt2",
+    "model_sha": "607a30d783dfa663caf39e06633721c8d4cfcd7e",
+    "model_dtype": null,
+    "model_size": "476.2 MB"
+  },
+  "results": {
+    "lighteval|gsm8k|0": {
+      "qem": 0.0,
+      "qem_stderr": 0.0,
+      "maj@8": 0.0,
+      "maj@8_stderr": 0.0
+    },
+    "all": {
+      "qem": 0.0,
+      "qem_stderr": 0.0,
+      "maj@8": 0.0,
+      "maj@8_stderr": 0.0
+    }
+  },
+  "versions": {
+    "lighteval|gsm8k|0": 0
+  },
+  "config_tasks": {
+    "lighteval|gsm8k": {
+      "name": "gsm8k",
+      "prompt_function": "gsm8k",
+      "hf_repo": "gsm8k",
+      "hf_subset": "main",
+      "metric": [
+        {
+          "metric_name": "qem",
+          "higher_is_better": true,
+          "category": "3",
+          "use_case": "5",
+          "sample_level_fn": "compute",
+          "corpus_level_fn": "mean"
+        },
+        {
+          "metric_name": "maj@8",
+          "higher_is_better": true,
+          "category": "5",
+          "use_case": "5",
+          "sample_level_fn": "compute",
+          "corpus_level_fn": "mean"
+        }
+      ],
+      "hf_avail_splits": [
+        "train",
+        "test"
+      ],
+      "evaluation_splits": [
+        "test"
+      ],
+      "few_shots_split": null,
+      "few_shots_select": "random_sampling_from_train",
+      "generation_size": 256,
+      "generation_grammar": null,
+      "stop_sequence": [
+        "Question="
+      ],
+      "output_regex": null,
+      "num_samples": null,
+      "frozen": false,
+      "suite": [
+        "lighteval"
+      ],
+      "original_num_docs": 1319,
+      "effective_num_docs": 1,
+      "trust_dataset": true,
+      "must_remove_duplicate_docs": null,
+      "version": 0
+    }
+  },
+  "summary_tasks": {
+    "lighteval|gsm8k|0": {
+      "hashes": {
+        "hash_examples": "8517d5bf7e880086",
+        "hash_full_prompts": "8517d5bf7e880086",
+        "hash_input_tokens": "29916e7afe5cb51d",
+        "hash_cont_tokens": "37f91ce23ef6d435"
+      },
+      "truncated": 2,
+      "non_truncated": 0,
+      "padded": 0,
+      "non_padded": 2,
+      "effective_few_shots": 0.0,
+      "num_truncated_few_shots": 0
+    }
+  },
+  "summary_general": {
+    "hashes": {
+      "hash_examples": "5f383c395f01096e",
+      "hash_full_prompts": "5f383c395f01096e",
+      "hash_input_tokens": "ac933feb14f96d7b",
+      "hash_cont_tokens": "9d03fb26f8da7277"
+    },
+    "truncated": 2,
+    "non_truncated": 0,
+    "padded": 0,
+    "non_padded": 2,
+    "num_truncated_few_shots": 0
+  }
+}
+```
diff --git a/docs/source/tasks.md b/docs/source/tasks.md
new file mode 100644
index 000000000..e69de29bb
diff --git a/docs/source/use_tgi.md b/docs/source/use_tgi.md
new file mode 100644
index 000000000..7ae6b000b
--- /dev/null
+++ b/docs/source/use_tgi.md
@@ -0,0 +1,3 @@
+# Use TGI
+
+blabla
diff --git a/docs/source/use_vllm.md b/docs/source/use_vllm.md
new file mode 100644
index 000000000..10919f413
--- /dev/null
+++ b/docs/source/use_vllm.md
@@ -0,0 +1,4 @@
+# Use VLLM as backend
+
+
+blablablal

From 015e924cbe303ead7c0d355c41c6c63c57906331 Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Tue, 3 Sep 2024 15:42:49 +0200
Subject: [PATCH 05/24] undo unecessary changes

---
 src/lighteval/metrics/metrics.py      |  8 --------
 src/lighteval/metrics/utils.py        | 23 -----------------------
 src/lighteval/tasks/default_tasks.py  |  4 ----
 src/lighteval/tasks/lighteval_task.py |  3 ---
 4 files changed, 38 deletions(-)

diff --git a/src/lighteval/metrics/metrics.py b/src/lighteval/metrics/metrics.py
index 37213082a..399bb3f0e 100644
--- a/src/lighteval/metrics/metrics.py
+++ b/src/lighteval/metrics/metrics.py
@@ -75,8 +75,6 @@
 
 
 class Metrics(Enum):
-    """hello"""
-
     acc_golds_likelihood = SampleLevelMetric(  # todo: we need a better name for this!
         metric_name="acc",
         sample_level_fn=acc_golds_likelihood,
@@ -610,12 +608,6 @@ class Metrics(Enum):
     def __str__(self):
         return self.name.replace("_at_", "@")
 
-    def __call__(self, **kwargs):
-        pass
-
-    def __name__(self):
-        return self.name
-
     @staticmethod
     def higher_is_better():
         res = {}
diff --git a/src/lighteval/metrics/utils.py b/src/lighteval/metrics/utils.py
index e71306725..cb9f5e744 100644
--- a/src/lighteval/metrics/utils.py
+++ b/src/lighteval/metrics/utils.py
@@ -55,25 +55,6 @@ class MetricUseCase(str, Enum):
 
 @dataclass
 class Metric:
-    """
-    Array with associated photographic information.
-
-    ...
-
-    Attributes
-    ----------
-    exposure : float
-        Exposure in seconds.
-
-    Methods
-    -------
-    colorspace(c='rgb')
-        Represent the photo in the given colorspace.
-    gamma(n=1.0)
-        Change the photo's gamma exposure.
-
-    """
-
     metric_name: str
     higher_is_better: bool
     category: MetricCategory
@@ -131,7 +112,3 @@ class SampleLevelMetricGrouping(MetricGrouping):
     """MetricGrouping are computed per sample, then aggregated over the corpus"""
 
     pass
-
-
-def hello(hi):
-    pass
diff --git a/src/lighteval/tasks/default_tasks.py b/src/lighteval/tasks/default_tasks.py
index a563b3bde..96799e7d0 100644
--- a/src/lighteval/tasks/default_tasks.py
+++ b/src/lighteval/tasks/default_tasks.py
@@ -24,10 +24,6 @@
 from lighteval.tasks.lighteval_task import LightevalTaskConfig
 
 
-"""
-default tasks
-"""
-
 abstract_narrative_understanding_bigbench = LightevalTaskConfig(
     name="abstract_narrative_understanding",
     suite=["bigbench", "bigbench_json"],
diff --git a/src/lighteval/tasks/lighteval_task.py b/src/lighteval/tasks/lighteval_task.py
index 72d3eac3c..d4a04d9ae 100644
--- a/src/lighteval/tasks/lighteval_task.py
+++ b/src/lighteval/tasks/lighteval_task.py
@@ -115,9 +115,6 @@ class LightevalTaskConfig:
 
     version: int = 0
 
-    def __doc__(self):
-        return """DOCUMENTAION"""
-
     def __post_init__(self):
         if self.suite is None:
             self.suite = ["custom"]

From 8aabbc8a91614fef05bcc5284fda75c11a511d65 Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Thu, 5 Sep 2024 13:07:54 +0200
Subject: [PATCH 06/24] still working on docs

---
 docs/source/_toctree.yml       |    4 +-
 docs/source/adding_new_task.md |   51 ++
 docs/source/metric_list.md     |   78 +-
 docs/source/tasks.md           | 1250 ++++++++++++++++++++++++++++++++
 docs/source/use_tgi.md         |   68 +-
 docs/source/use_vllm.md        |    4 +-
 6 files changed, 1449 insertions(+), 6 deletions(-)

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 3c2a86540..8c17aee40 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -11,13 +11,15 @@
     - local: use_vllm
       title: Using VLLM as backend
     - local: use_tgi
-      title: Using TGI as backend
+      title: Evaluate on Server
     - local: adding_new_task
       title: Adding a Custom Task
     - local: adding_new_metric
       title: Adding a Custom Metric
     - local: saving_results
       title: Saving Results
+    - local: training_and_eval_loop
+      title: Training and Evaluation Loop
 - title: "API Reference"
   sections:
     - local: metric_list
diff --git a/docs/source/adding_new_task.md b/docs/source/adding_new_task.md
index 650881f3b..03f5eb7bf 100644
--- a/docs/source/adding_new_task.md
+++ b/docs/source/adding_new_task.md
@@ -109,6 +109,57 @@ class CustomSubsetTask(LightevalTaskConfig):
 SUBSET_TASKS = [CustomSubsetTask(name=f"mytask:{subset}", hf_subset=subset) for subset in SAMPLE_SUBSETS]
 ```
 
+Here is a list of the parameters and their meaning (a worked example follows the list):
+
+- `name` (str), your evaluation name
+- `suite` (list), the suite(s) to which your evaluation should belong. This
+  field allows us to compare different task implementations and is used as a
+  task selection to differentiate the versions to launch. At the moment, you'll
+  find the keywords ["helm", "bigbench", "original", "lighteval", "community",
+  "custom"]; for core evals, please choose `lighteval`.
+- `prompt_function` (Callable), the prompt function you defined in the step
+  above
+- `hf_repo` (str), the path to your evaluation dataset on the hub
+- `hf_subset` (str), the specific subset you want to use for your evaluation
+  (note: when the dataset has no subset, fill this field with `"default"`, not
+  with `None` or `""`)
+- `hf_avail_splits` (list), all the splits available for your dataset (train,
+  valid or validation, test, other...)
+- `evaluation_splits` (list), the splits you want to use for evaluation
+- `few_shots_split` (str, can be `null`), the specific split from which you
+  want to select samples for your few-shot examples. It should be different
+  from the sets included in `evaluation_splits`
+- `few_shots_select` (str, can be `null`), the method that you will use to
+  select items for your few-shot examples. Can be `null`, or one of:
+    - `balanced` select examples from the `few_shots_split` with balanced
+      labels, to avoid skewing the few shot examples (hence the model
+      generations) toward one specific label
+    - `random` selects examples at random from the `few_shots_split`
+    - `random_sampling` selects new examples at random from the
+      `few_shots_split` for every new item, but if a sampled item is equal to
+      the current one, it is removed from the available samples
+    - `random_sampling_from_train` selects new examples at random from the
+      `few_shots_split` for every new item, but if a sampled item is equal to
+      the current one, it is kept! Only use this if you know what you are
+      doing.
+    - `sequential` selects the first `n` examples of the `few_shots_split`
+- `generation_size` (int), the maximum number of tokens allowed for a
+  generative evaluation. If your evaluation is a log likelihood evaluation
+  (multi-choice), this value should be -1
+- `stop_sequence` (list), a list of strings acting as end of sentence tokens
+  for your generation
+- `metric` (list), the metrics you want to use for your evaluation (see next
+  section for a detailed explanation)
+- `output_regex` (str), a regex string that will be used to filter your
+  generation. (Generative metrics will only select tokens that are between the
+  first and the second sequence matched by the regex. For example, for a regex
+  matching `\n` and a generation `\nModel generation output\nSome other text`
+  the metric will only be fed with `Model generation output`)
+- `frozen` (bool), for now set to False, but we will progressively set it to
+  True for stable tasks.
+- `trust_dataset` (bool), set to True if you trust the dataset.
+
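+Putting a few of these parameters together, a hypothetical generative task
+definition (the dataset path, splits and metric below are placeholders to
+replace with your own) could look like:
+
+```python
+from lighteval.tasks.lighteval_task import LightevalTaskConfig
+
+task = LightevalTaskConfig(
+    name="mytask",
+    prompt_function=prompt_fn,          # defined earlier in the file
+    suite=["community"],
+    hf_repo="my_org/my_dataset",        # placeholder dataset on the hub
+    hf_subset="default",
+    hf_avail_splits=["train", "test"],
+    evaluation_splits=["test"],
+    few_shots_split="train",
+    few_shots_select="random_sampling",
+    generation_size=256,                # generative task, so a positive value
+    stop_sequence=["\n"],
+    metric=[custom_metric],             # or any metric from Metrics
+    trust_dataset=True,
+)
+```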
+
 Then you need to add your task to the `TASKS_TABLE` list.
 
 ```python
diff --git a/docs/source/metric_list.md b/docs/source/metric_list.md
index 9d72a2cac..e961f1a2c 100644
--- a/docs/source/metric_list.md
+++ b/docs/source/metric_list.md
@@ -1,2 +1,78 @@
-
 # Metrics
+
+- MetricCategory.TARGET_PERPLEXITY
+	- acc_golds_likelihood
+	- target_perplexity
+
+- MetricCategory.MULTICHOICE_ONE_TOKEN
+	- loglikelihood_acc_norm_single_token
+	- loglikelihood_acc_single_token
+	- loglikelihood_f1_single_token
+	- mcc_single_token
+	- mrr_single_token
+	- multi_f1_numeric
+	- recall_at_1_single_token
+	- recall_at_2_single_token
+
+- MetricCategory.IGNORED
+	- prediction_perplexity
+
+- MetricCategory.PERPLEXITY
+	- bits_per_byte
+	- byte_perplexity
+	- word_perplexity
+
+- MetricCategory.GENERATIVE
+	- bert_score
+	- bleu
+	- bleu_1
+	- bleu_4
+	- bleurt
+	- chrf
+	- copyright
+	- drop
+	- exact_match
+	- extractiveness
+	- f1_score_quasi
+	- f1_score
+	- f1_score_macro
+	- f1_score_micro
+	- faithfulness
+	- perfect_exact_match
+	- prefix_exact_match
+	- prefix_quasi_exact_match
+	- quasi_exact_match
+	- quasi_exact_match_math
+	- quasi_exact_match_triviaqa
+	- quasi_exact_match_gsm8k
+	- rouge_t5
+	- rouge1
+	- rouge2
+	- rougeL
+	- rougeLsum
+	- ter
+
+- MetricCategory.GENERATIVE_SAMPLING
+	- maj_at_4_math
+	- maj_at_5
+	- maj_at_8
+	- maj_at_8_gsm8k
+
+- MetricCategory.LLM_AS_JUDGE_MULTI_TURN
+	- llm_judge_multi_turn_gpt3p5
+	- llm_judge_multi_turn_llama_3_405b
+
+- MetricCategory.LLM_AS_JUDGE
+	- llm_judge_gpt3p5
+	- llm_judge_llama_3_405b
+
+- MetricCategory.MULTICHOICE
+	- loglikelihood_acc
+	- loglikelihood_acc_norm
+	- loglikelihood_acc_norm_nospace
+	- loglikelihood_f1
+	- mcc
+	- mrr
+	- recall_at_1
+	- recall_at_2
+	- truthfulqa_mc_metrics
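+
+To use one of these metrics in a task definition, reference the corresponding
+member of the `Metrics` enum, for example (a minimal sketch):
+
+```python
+from lighteval.metrics import Metrics
+
+# Metrics are referenced by the enum member names listed above
+metrics = [Metrics.exact_match, Metrics.loglikelihood_acc]
+```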
diff --git a/docs/source/tasks.md b/docs/source/tasks.md
index e69de29bb..0ac5ef14b 100644
--- a/docs/source/tasks.md
+++ b/docs/source/tasks.md
@@ -0,0 +1,1250 @@
+# Tasks
+
+You can get a list of all the available tasks by running:
+
+```bash
+lighteval tasks --list
+```
+
+## List of tasks
+
+- bigbench:
+  - bigbench|abstract_narrative_understanding
+  - bigbench|anachronisms
+  - bigbench|analogical_similarity
+  - bigbench|analytic_entailment
+  - bigbench|arithmetic_bb
+  - bigbench|ascii_word_recognition
+  - bigbench|authorship_verification
+  - bigbench|auto_categorization
+  - bigbench|auto_debugging
+  - bigbench|bbq_lite_json
+  - bigbench|bridging_anaphora_resolution_barqa
+  - bigbench|causal_judgment
+  - bigbench|cause_and_effect
+  - bigbench|checkmate_in_one
+  - bigbench|chess_state_tracking
+  - bigbench|chinese_remainder_theorem
+  - bigbench|cifar10_classification
+  - bigbench|code_line_description
+  - bigbench|codenames
+  - bigbench|color
+  - bigbench|common_morpheme
+  - bigbench|conceptual_combinations
+  - bigbench|conlang_translation
+  - bigbench|contextual_parametric_knowledge_conflicts
+  - bigbench|coqa_bb
+  - bigbench|crash_blossom
+  - bigbench|crass_ai
+  - bigbench|cryobiology_spanish
+  - bigbench|cryptonite
+  - bigbench|cs_algorithms
+  - bigbench|dark_humor_detection
+  - bigbench|date_understanding
+  - bigbench|disambiguation_qa
+  - bigbench|discourse_marker_prediction
+  - bigbench|disfl_qa
+  - bigbench|dyck_languages
+  - bigbench|elementary_math_qa
+  - bigbench|emoji_movie
+  - bigbench|emojis_emotion_prediction
+  - bigbench|empirical_judgments
+  - bigbench|english_proverbs
+  - bigbench|english_russian_proverbs
+  - bigbench|entailed_polarity
+  - bigbench|entailed_polarity_hindi
+  - bigbench|epistemic_reasoning
+  - bigbench|evaluating_information_essentiality
+  - bigbench|fact_checker
+  - bigbench|fantasy_reasoning
+  - bigbench|few_shot_nlg
+  - bigbench|figure_of_speech_detection
+  - bigbench|formal_fallacies_syllogisms_negation
+  - bigbench|gem
+  - bigbench|gender_inclusive_sentences_german
+  - bigbench|general_knowledge
+  - bigbench|geometric_shapes
+  - bigbench|goal_step_wikihow
+  - bigbench|gre_reading_comprehension
+  - bigbench|hhh_alignment
+  - bigbench|hindi_question_answering
+  - bigbench|hindu_knowledge
+  - bigbench|hinglish_toxicity
+  - bigbench|human_organs_senses
+  - bigbench|hyperbaton
+  - bigbench|identify_math_theorems
+  - bigbench|identify_odd_metaphor
+  - bigbench|implicatures
+  - bigbench|implicit_relations
+  - bigbench|intent_recognition
+  - bigbench|international_phonetic_alphabet_nli
+  - bigbench|international_phonetic_alphabet_transliterate
+  - bigbench|intersect_geometry
+  - bigbench|irony_identification
+  - bigbench|kanji_ascii
+  - bigbench|kannada
+  - bigbench|key_value_maps
+  - bigbench|known_unknowns
+  - bigbench|language_games
+  - bigbench|language_identification
+  - bigbench|linguistic_mappings
+  - bigbench|linguistics_puzzles
+  - bigbench|logic_grid_puzzle
+  - bigbench|logical_args
+  - bigbench|logical_deduction
+  - bigbench|logical_fallacy_detection
+  - bigbench|logical_sequence
+  - bigbench|mathematical_induction
+  - bigbench|matrixshapes
+  - bigbench|metaphor_boolean
+  - bigbench|metaphor_understanding
+  - bigbench|minute_mysteries_qa
+  - bigbench|misconceptions
+  - bigbench|misconceptions_russian
+  - bigbench|mnist_ascii
+  - bigbench|modified_arithmetic
+  - bigbench|moral_permissibility
+  - bigbench|movie_dialog_same_or_different
+  - bigbench|movie_recommendation
+  - bigbench|mult_data_wrangling
+  - bigbench|multiemo
+  - bigbench|natural_instructions
+  - bigbench|navigate
+  - bigbench|nonsense_words_grammar
+  - bigbench|novel_concepts
+  - bigbench|object_counting
+  - bigbench|odd_one_out
+  - bigbench|operators
+  - bigbench|paragraph_segmentation
+  - bigbench|parsinlu_qa
+  - bigbench|parsinlu_reading_comprehension
+  - bigbench|penguins_in_a_table
+  - bigbench|periodic_elements
+  - bigbench|persian_idioms
+  - bigbench|phrase_relatedness
+  - bigbench|physical_intuition
+  - bigbench|physics
+  - bigbench|physics_questions
+  - bigbench|play_dialog_same_or_different
+  - bigbench|polish_sequence_labeling
+  - bigbench|presuppositions_as_nli
+  - bigbench|qa_wikidata
+  - bigbench|question_selection
+  - bigbench|real_or_fake_text
+  - bigbench|reasoning_about_colored_objects
+  - bigbench|repeat_copy_logic
+  - bigbench|rephrase
+  - bigbench|rhyming
+  - bigbench|riddle_sense
+  - bigbench|ruin_names
+  - bigbench|salient_translation_error_detection
+  - bigbench|scientific_press_release
+  - bigbench|semantic_parsing_in_context_sparc
+  - bigbench|semantic_parsing_spider
+  - bigbench|sentence_ambiguity
+  - bigbench|similarities_abstraction
+  - bigbench|simp_turing_concept
+  - bigbench|simple_arithmetic_json
+  - bigbench|simple_arithmetic_json_multiple_choice
+  - bigbench|simple_arithmetic_json_subtasks
+  - bigbench|simple_arithmetic_multiple_targets_json
+  - bigbench|simple_ethical_questions
+  - bigbench|simple_text_editing
+  - bigbench|snarks
+  - bigbench|social_iqa
+  - bigbench|social_support
+  - bigbench|sports_understanding
+  - bigbench|strange_stories
+  - bigbench|strategyqa
+  - bigbench|sufficient_information
+  - bigbench|suicide_risk
+  - bigbench|swahili_english_proverbs
+  - bigbench|swedish_to_german_proverbs
+  - bigbench|symbol_interpretation
+  - bigbench|tellmewhy
+  - bigbench|temporal_sequences
+  - bigbench|tense
+  - bigbench|timedial
+  - bigbench|topical_chat
+  - bigbench|tracking_shuffled_objects
+  - bigbench|understanding_fables
+  - bigbench|undo_permutation
+  - bigbench|unit_conversion
+  - bigbench|unit_interpretation
+  - bigbench|unnatural_in_context_learning
+  - bigbench|vitaminc_fact_verification
+  - bigbench|what_is_the_tao
+  - bigbench|which_wiki_edit
+  - bigbench|wino_x_german
+  - bigbench|winowhy
+  - bigbench|word_sorting
+  - bigbench|word_unscrambling
+
+- harness:
+  - harness|bbh:boolean_expressions
+  - harness|bbh:causal_judgment
+  - harness|bbh:date_understanding
+  - harness|bbh:disambiguation_qa
+  - harness|bbh:dyck_languages
+  - harness|bbh:formal_fallacies
+  - harness|bbh:geometric_shapes
+  - harness|bbh:hyperbaton
+  - harness|bbh:logical_deduction_five_objects
+  - harness|bbh:logical_deduction_seven_objects
+  - harness|bbh:logical_deduction_three_objects
+  - harness|bbh:movie_recommendation
+  - harness|bbh:multistep_arithmetic_two
+  - harness|bbh:navigate
+  - harness|bbh:object_counting
+  - harness|bbh:penguins_in_a_table
+  - harness|bbh:reasoning_about_colored_objects
+  - harness|bbh:ruin_names
+  - harness|bbh:salient_translation_error_detection
+  - harness|bbh:snarks
+  - harness|bbh:sports_understanding
+  - harness|bbh:temporal_sequences
+  - harness|bbh:tracking_shuffled_objects_five_objects
+  - harness|bbh:tracking_shuffled_objects_seven_objects
+  - harness|bbh:tracking_shuffled_objects_three_objects
+  - harness|bbh:web_of_lies
+  - harness|bbh:word_sorting
+  - harness|bigbench:causal_judgment
+  - harness|bigbench:date_understanding
+  - harness|bigbench:disambiguation_qa
+  - harness|bigbench:geometric_shapes
+  - harness|bigbench:logical_deduction_five_objects
+  - harness|bigbench:logical_deduction_seven_objects
+  - harness|bigbench:logical_deduction_three_objects
+  - harness|bigbench:movie_recommendation
+  - harness|bigbench:navigate
+  - harness|bigbench:reasoning_about_colored_objects
+  - harness|bigbench:ruin_names
+  - harness|bigbench:salient_translation_error_detection
+  - harness|bigbench:snarks
+  - harness|bigbench:sports_understanding
+  - harness|bigbench:temporal_sequences
+  - harness|bigbench:tracking_shuffled_objects_five_objects
+  - harness|bigbench:tracking_shuffled_objects_seven_objects
+  - harness|bigbench:tracking_shuffled_objects_three_objects
+  - harness|wikitext:103:document_level
+
+- helm:
+  - helm|babi_qa
+  - helm|bbq
+  - helm|bbq:Age
+  - helm|bbq:Disability_status
+  - helm|bbq:Gender_identity
+  - helm|bbq:Physical_appearance
+  - helm|bbq:Race_ethnicity
+  - helm|bbq:Race_x_SES
+  - helm|bbq:Race_x_gender
+  - helm|bbq:Religion
+  - helm|bbq:SES
+  - helm|bbq:Sexual_orientation
+  - helm|bbq=Nationality
+  - helm|bigbench:auto_debugging
+  - helm|bigbench:bbq_lite_json:age_ambig
+  - helm|bigbench:bbq_lite_json:age_disambig
+  - helm|bigbench:bbq_lite_json:disability_status_ambig
+  - helm|bigbench:bbq_lite_json:disability_status_disambig
+  - helm|bigbench:bbq_lite_json:gender_identity_ambig
+  - helm|bigbench:bbq_lite_json:gender_identity_disambig
+  - helm|bigbench:bbq_lite_json:nationality_ambig
+  - helm|bigbench:bbq_lite_json:nationality_disambig
+  - helm|bigbench:bbq_lite_json:physical_appearance_ambig
+  - helm|bigbench:bbq_lite_json:physical_appearance_disambig
+  - helm|bigbench:bbq_lite_json:race_ethnicity_ambig
+  - helm|bigbench:bbq_lite_json:race_ethnicity_disambig
+  - helm|bigbench:bbq_lite_json:religion_ambig
+  - helm|bigbench:bbq_lite_json:religion_disambig
+  - helm|bigbench:bbq_lite_json:ses_ambig
+  - helm|bigbench:bbq_lite_json:ses_disambig
+  - helm|bigbench:bbq_lite_json:sexual_orientation_ambig
+  - helm|bigbench:bbq_lite_json:sexual_orientation_disambig
+  - helm|bigbench:code_line_description
+  - helm|bigbench:conceptual_combinations:contradictions
+  - helm|bigbench:conceptual_combinations:emergent_properties
+  - helm|bigbench:conceptual_combinations:fanciful_fictional_combinations
+  - helm|bigbench:conceptual_combinations:homonyms
+  - helm|bigbench:conceptual_combinations:invented_words
+  - helm|bigbench:conlang_translation:adna_from
+  - helm|bigbench:conlang_translation:adna_to
+  - helm|bigbench:conlang_translation:atikampe_from
+  - helm|bigbench:conlang_translation:atikampe_to
+  - helm|bigbench:conlang_translation:gornam_from
+  - helm|bigbench:conlang_translation:gornam_to
+  - helm|bigbench:conlang_translation:holuan_from
+  - helm|bigbench:conlang_translation:holuan_to
+  - helm|bigbench:conlang_translation:mkafala_from
+  - helm|bigbench:conlang_translation:mkafala_to
+  - helm|bigbench:conlang_translation:postpositive_english_from
+  - helm|bigbench:conlang_translation:postpositive_english_to
+  - helm|bigbench:conlang_translation:unapuri_from
+  - helm|bigbench:conlang_translation:unapuri_to
+  - helm|bigbench:conlang_translation:vaomi_from
+  - helm|bigbench:conlang_translation:vaomi_to
+  - helm|bigbench:emoji_movie
+  - helm|bigbench:formal_fallacies_syllogisms_negation
+  - helm|bigbench:hindu_knowledge
+  - helm|bigbench:known_unknowns
+  - helm|bigbench:language_identification
+  - helm|bigbench:linguistics_puzzles
+  - helm|bigbench:logic_grid_puzzle
+  - helm|bigbench:logical_deduction-five_objects
+  - helm|bigbench:logical_deduction-seven_objects
+  - helm|bigbench:logical_deduction-three_objects
+  - helm|bigbench:misconceptions_russian
+  - helm|bigbench:novel_concepts
+  - helm|bigbench:operators
+  - helm|bigbench:parsinlu_reading_comprehension
+  - helm|bigbench:play_dialog_same_or_different
+  - helm|bigbench:repeat_copy_logic
+  - helm|bigbench:strange_stories-boolean
+  - helm|bigbench:strange_stories-multiple_choice
+  - helm|bigbench:strategyqa
+  - helm|bigbench:symbol_interpretation-adversarial
+  - helm|bigbench:symbol_interpretation-emoji_agnostic
+  - helm|bigbench:symbol_interpretation-name_agnostic
+  - helm|bigbench:symbol_interpretation-plain
+  - helm|bigbench:symbol_interpretation-tricky
+  - helm|bigbench:vitaminc_fact_verification
+  - helm|bigbench:winowhy
+  - helm|blimp:adjunct_island
+  - helm|blimp:anaphor_gender_agreement
+  - helm|blimp:anaphor_number_agreement
+  - helm|blimp:animate_subject_passive
+  - helm|blimp:animate_subject_trans
+  - helm|blimp:causative
+  - helm|blimp:complex_NP_island
+  - helm|blimp:coordinate_structure_constraint_complex_left_branch
+  - helm|blimp:coordinate_structure_constraint_object_extraction
+  - helm|blimp:determiner_noun_agreement_1
+  - helm|blimp:determiner_noun_agreement_2
+  - helm|blimp:determiner_noun_agreement_irregular_1
+  - helm|blimp:determiner_noun_agreement_irregular_2
+  - helm|blimp:determiner_noun_agreement_with_adj_2
+  - helm|blimp:determiner_noun_agreement_with_adj_irregular_1
+  - helm|blimp:determiner_noun_agreement_with_adj_irregular_2
+  - helm|blimp:determiner_noun_agreement_with_adjective_1
+  - helm|blimp:distractor_agreement_relational_noun
+  - helm|blimp:distractor_agreement_relative_clause
+  - helm|blimp:drop_argument
+  - helm|blimp:ellipsis_n_bar_1
+  - helm|blimp:ellipsis_n_bar_2
+  - helm|blimp:existential_there_object_raising
+  - helm|blimp:existential_there_quantifiers_1
+  - helm|blimp:existential_there_quantifiers_2
+  - helm|blimp:existential_there_subject_raising
+  - helm|blimp:expletive_it_object_raising
+  - helm|blimp:inchoative
+  - helm|blimp:intransitive
+  - helm|blimp:irregular_past_participle_adjectives
+  - helm|blimp:irregular_past_participle_verbs
+  - helm|blimp:irregular_plural_subject_verb_agreement_1
+  - helm|blimp:irregular_plural_subject_verb_agreement_2
+  - helm|blimp:left_branch_island_echo_question
+  - helm|blimp:left_branch_island_simple_question
+  - helm|blimp:matrix_question_npi_licensor_present
+  - helm|blimp:npi_present_1
+  - helm|blimp:npi_present_2
+  - helm|blimp:only_npi_licensor_present
+  - helm|blimp:only_npi_scope
+  - helm|blimp:passive_1
+  - helm|blimp:passive_2
+  - helm|blimp:principle_A_c_command
+  - helm|blimp:principle_A_case_1
+  - helm|blimp:principle_A_case_2
+  - helm|blimp:principle_A_domain_1
+  - helm|blimp:principle_A_domain_2
+  - helm|blimp:principle_A_domain_3
+  - helm|blimp:principle_A_reconstruction
+  - helm|blimp:regular_plural_subject_verb_agreement_1
+  - helm|blimp:regular_plural_subject_verb_agreement_2
+  - helm|blimp:sentential_negation_npi_licensor_present
+  - helm|blimp:sentential_negation_npi_scope
+  - helm|blimp:sentential_subject_island
+  - helm|blimp:superlative_quantifiers_1
+  - helm|blimp:superlative_quantifiers_2
+  - helm|blimp:tough_vs_raising_1
+  - helm|blimp:tough_vs_raising_2
+  - helm|blimp:transitive
+  - helm|blimp:wh_island
+  - helm|blimp:wh_questions_object_gap
+  - helm|blimp:wh_questions_subject_gap
+  - helm|blimp:wh_questions_subject_gap_long_distance
+  - helm|blimp:wh_vs_that_no_gap
+  - helm|blimp:wh_vs_that_no_gap_long_distance
+  - helm|blimp:wh_vs_that_with_gap
+  - helm|blimp:wh_vs_that_with_gap_long_distance
+  - helm|bold
+  - helm|bold:gender
+  - helm|bold:political_ideology
+  - helm|bold:profession
+  - helm|bold:race
+  - helm|bold:religious_ideology
+  - helm|boolq
+  - helm|boolq:contrastset
+  - helm|civil_comments
+  - helm|civil_comments:LGBTQ
+  - helm|civil_comments:black
+  - helm|civil_comments:christian
+  - helm|civil_comments:female
+  - helm|civil_comments:male
+  - helm|civil_comments:muslim
+  - helm|civil_comments:other_religions
+  - helm|civil_comments:white
+  - helm|commonsenseqa
+  - helm|copyright:n_books_1000-extractions_per_book_1-prefix_length_125
+  - helm|copyright:n_books_1000-extractions_per_book_1-prefix_length_25
+  - helm|copyright:n_books_1000-extractions_per_book_1-prefix_length_5
+  - helm|copyright:n_books_1000-extractions_per_book_3-prefix_length_125
+  - helm|copyright:n_books_1000-extractions_per_book_3-prefix_length_25
+  - helm|copyright:n_books_1000-extractions_per_book_3-prefix_length_5
+  - helm|copyright:oh_the_places
+  - helm|copyright:pilot
+  - helm|copyright:popular_books-prefix_length_10
+  - helm|copyright:popular_books-prefix_length_125
+  - helm|copyright:popular_books-prefix_length_25
+  - helm|copyright:popular_books-prefix_length_250
+  - helm|copyright:popular_books-prefix_length_5
+  - helm|copyright:popular_books-prefix_length_50
+  - helm|copyright:prompt_num_line_1-min_lines_20
+  - helm|copyright:prompt_num_line_10-min_lines_20
+  - helm|copyright:prompt_num_line_5-min_lines_20
+  - helm|covid_dialogue
+  - helm|dyck_language:2
+  - helm|dyck_language:3
+  - helm|dyck_language:4
+  - helm|entity_data_imputation:Buy
+  - helm|entity_data_imputation:Restaurant
+  - helm|entity_matching:Abt_Buy
+  - helm|entity_matching:Amazon_Google
+  - helm|entity_matching:Beer
+  - helm|entity_matching:Company
+  - helm|entity_matching:DBLP_ACM
+  - helm|entity_matching:DBLP_GoogleScholar
+  - helm|entity_matching:Dirty_DBLP_ACM
+  - helm|entity_matching:Dirty_DBLP_GoogleScholar
+  - helm|entity_matching:Dirty_Walmart_Amazon
+  - helm|entity_matching:Dirty_iTunes_Amazon
+  - helm|entity_matching:Walmart_Amazon
+  - helm|entity_matching:iTunes_Amazon
+  - helm|entity_matching=Fodors_Zagats
+  - helm|hellaswag
+  - helm|imdb
+  - helm|imdb:contrastset
+  - helm|interactive_qa_mmlu:abstract_algebra
+  - helm|interactive_qa_mmlu:college_chemistry
+  - helm|interactive_qa_mmlu:global_facts
+  - helm|interactive_qa_mmlu:miscellaneous
+  - helm|interactive_qa_mmlu:nutrition
+  - helm|interactive_qa_mmlu:us_foreign_policy
+  - helm|legal_summarization:billsum
+  - helm|legal_summarization:eurlexsum
+  - helm|legal_summarization:multilexsum
+  - helm|legalsupport
+  - helm|lexglue:case_hold
+  - helm|lexglue:ecthr_a
+  - helm|lexglue:ecthr_b
+  - helm|lexglue:eurlex
+  - helm|lexglue:ledgar
+  - helm|lexglue:scotus
+  - helm|lexglue:unfair_tos
+  - helm|lextreme:brazilian_court_decisions_judgment
+  - helm|lextreme:brazilian_court_decisions_unanimity
+  - helm|lextreme:covid19_emergency_event
+  - helm|lextreme:german_argument_mining
+  - helm|lextreme:greek_legal_code_chapter
+  - helm|lextreme:greek_legal_code_subject
+  - helm|lextreme:greek_legal_code_volume
+  - helm|lextreme:greek_legal_ner
+  - helm|lextreme:legalnero
+  - helm|lextreme:lener_br
+  - helm|lextreme:mapa_coarse
+  - helm|lextreme:mapa_fine
+  - helm|lextreme:multi_eurlex_level_1
+  - helm|lextreme:multi_eurlex_level_2
+  - helm|lextreme:multi_eurlex_level_3
+  - helm|lextreme:online_terms_of_service_clause_topics
+  - helm|lextreme:online_terms_of_service_unfairness_levels
+  - helm|lextreme:swiss_judgment_prediction
+  - helm|lsat_qa
+  - helm|lsat_qa:assignment
+  - helm|lsat_qa:grouping
+  - helm|lsat_qa:miscellaneous
+  - helm|lsat_qa:ordering
+  - helm|me_q_sum
+  - helm|med_dialog:healthcaremagic
+  - helm|med_dialog:icliniq
+  - helm|med_mcqa
+  - helm|med_paragraph_simplification
+  - helm|med_qa
+  - helm|mmlu
+  - helm|mmlu:abstract_algebra
+  - helm|mmlu:anatomy
+  - helm|mmlu:astronomy
+  - helm|mmlu:business_ethics
+  - helm|mmlu:clinical_knowledge
+  - helm|mmlu:college_biology
+  - helm|mmlu:college_chemistry
+  - helm|mmlu:college_computer_science
+  - helm|mmlu:college_mathematics
+  - helm|mmlu:college_medicine
+  - helm|mmlu:college_physics
+  - helm|mmlu:computer_security
+  - helm|mmlu:conceptual_physics
+  - helm|mmlu:econometrics
+  - helm|mmlu:electrical_engineering
+  - helm|mmlu:elementary_mathematics
+  - helm|mmlu:formal_logic
+  - helm|mmlu:global_facts
+  - helm|mmlu:high_school_biology
+  - helm|mmlu:high_school_chemistry
+  - helm|mmlu:high_school_computer_science
+  - helm|mmlu:high_school_european_history
+  - helm|mmlu:high_school_geography
+  - helm|mmlu:high_school_government_and_politics
+  - helm|mmlu:high_school_macroeconomics
+  - helm|mmlu:high_school_mathematics
+  - helm|mmlu:high_school_microeconomics
+  - helm|mmlu:high_school_physics
+  - helm|mmlu:high_school_psychology
+  - helm|mmlu:high_school_statistics
+  - helm|mmlu:high_school_us_history
+  - helm|mmlu:high_school_world_history
+  - helm|mmlu:human_aging
+  - helm|mmlu:human_sexuality
+  - helm|mmlu:international_law
+  - helm|mmlu:jurisprudence
+  - helm|mmlu:logical_fallacies
+  - helm|mmlu:machine_learning
+  - helm|mmlu:management
+  - helm|mmlu:marketing
+  - helm|mmlu:medical_genetics
+  - helm|mmlu:miscellaneous
+  - helm|mmlu:moral_disputes
+  - helm|mmlu:moral_scenarios
+  - helm|mmlu:nutrition
+  - helm|mmlu:philosophy
+  - helm|mmlu:prehistory
+  - helm|mmlu:professional_accounting
+  - helm|mmlu:professional_law
+  - helm|mmlu:professional_medicine
+  - helm|mmlu:professional_psychology
+  - helm|mmlu:public_relations
+  - helm|mmlu:security_studies
+  - helm|mmlu:sociology
+  - helm|mmlu:us_foreign_policy
+  - helm|mmlu:virology
+  - helm|mmlu:world_religions
+  - helm|narrativeqa
+  - helm|numeracy:linear_example
+  - helm|numeracy:linear_standard
+  - helm|numeracy:parabola_example
+  - helm|numeracy:parabola_standard
+  - helm|numeracy:paraboloid_example
+  - helm|numeracy:paraboloid_standard
+  - helm|numeracy:plane_example
+  - helm|numeracy:plane_standard
+  - helm|openbookqa
+  - helm|piqa
+  - helm|pubmedqa
+  - helm|quac
+  - helm|raft:ade_corpus_v2
+  - helm|raft:banking_77
+  - helm|raft:neurips_impact_statement_risks
+  - helm|raft:one_stop_english
+  - helm|raft:overruling
+  - helm|raft:semiconductor_org_types
+  - helm|raft:systematic_review_inclusion
+  - helm|raft:tai_safety_research
+  - helm|raft:terms_of_service
+  - helm|raft:tweet_eval_hate
+  - helm|raft:twitter_complaints
+  - helm|real_toxicity_prompts
+  - helm|siqa
+  - helm|summarization:cnn-dm
+  - helm|summarization:xsum
+  - helm|summarization:xsum-sampled
+  - helm|synthetic_reasoning:induction
+  - helm|synthetic_reasoning:natural_easy
+  - helm|synthetic_reasoning:natural_hard
+  - helm|synthetic_reasoning:pattern_match
+  - helm|synthetic_reasoning:variable_substitution
+  - helm|the_pile:arxiv
+  - helm|the_pile:bibliotik
+  - helm|the_pile:commoncrawl
+  - helm|the_pile:dm-mathematics
+  - helm|the_pile:enron
+  - helm|the_pile:europarl
+  - helm|the_pile:freelaw
+  - helm|the_pile:github
+  - helm|the_pile:gutenberg
+  - helm|the_pile:hackernews
+  - helm|the_pile:nih-exporter
+  - helm|the_pile:opensubtitles
+  - helm|the_pile:openwebtext2
+  - helm|the_pile:pubmed-abstracts
+  - helm|the_pile:pubmed-central
+  - helm|the_pile:stackexchange
+  - helm|the_pile:upsto
+  - helm|the_pile:wikipedia
+  - helm|the_pile:youtubesubtitles
+  - helm|truthfulqa
+  - helm|twitterAAE:aa
+  - helm|twitterAAE:white
+  - helm|wikifact:applies_to_jurisdiction
+  - helm|wikifact:atomic_number
+  - helm|wikifact:author
+  - helm|wikifact:award_received
+  - helm|wikifact:basic_form_of_government
+  - helm|wikifact:capital
+  - helm|wikifact:capital_of
+  - helm|wikifact:central_bank
+  - helm|wikifact:composer
+  - helm|wikifact:continent
+  - helm|wikifact:country
+  - helm|wikifact:country_of_citizenship
+  - helm|wikifact:country_of_origin
+  - helm|wikifact:creator
+  - helm|wikifact:currency
+  - helm|wikifact:defendant
+  - helm|wikifact:developer
+  - helm|wikifact:diplomatic_relation
+  - helm|wikifact:director
+  - helm|wikifact:discoverer_or_inventor
+  - helm|wikifact:drug_or_therapy_used_for_treatment
+  - helm|wikifact:educated_at
+  - helm|wikifact:electron_configuration
+  - helm|wikifact:employer
+  - helm|wikifact:field_of_work
+  - helm|wikifact:file_extension
+  - helm|wikifact:genetic_association
+  - helm|wikifact:genre
+  - helm|wikifact:has_part
+  - helm|wikifact:head_of_government
+  - helm|wikifact:head_of_state
+  - helm|wikifact:headquarters_location
+  - helm|wikifact:industry
+  - helm|wikifact:influenced_by
+  - helm|wikifact:instance_of
+  - helm|wikifact:instrument
+  - helm|wikifact:language_of_work_or_name
+  - helm|wikifact:languages_spoken_written_or_signed
+  - helm|wikifact:laws_applied
+  - helm|wikifact:located_in_the_administrative_territorial_entity
+  - helm|wikifact:location
+  - helm|wikifact:location_of_discovery
+  - helm|wikifact:location_of_formation
+  - helm|wikifact:majority_opinion_by
+  - helm|wikifact:manufacturer
+  - helm|wikifact:measured_physical_quantity
+  - helm|wikifact:medical_condition_treated
+  - helm|wikifact:member_of
+  - helm|wikifact:member_of_political_party
+  - helm|wikifact:member_of_sports_team
+  - helm|wikifact:movement
+  - helm|wikifact:named_after
+  - helm|wikifact:native_language
+  - helm|wikifact:number_of_processor_cores
+  - helm|wikifact:occupation
+  - helm|wikifact:office_held_by_head_of_government
+  - helm|wikifact:office_held_by_head_of_state
+  - helm|wikifact:official_language
+  - helm|wikifact:operating_system
+  - helm|wikifact:original_language_of_film_or_TV_show
+  - helm|wikifact:original_network
+  - helm|wikifact:overrules
+  - helm|wikifact:owned_by
+  - helm|wikifact:part_of
+  - helm|wikifact:participating_team
+  - helm|wikifact:place_of_birth
+  - helm|wikifact:place_of_death
+  - helm|wikifact:plaintiff
+  - helm|wikifact:position_held
+  - helm|wikifact:position_played_on_team
+  - helm|wikifact:programming_language
+  - helm|wikifact:recommended_unit_of_measurement
+  - helm|wikifact:record_label
+  - helm|wikifact:religion
+  - helm|wikifact:repealed_by
+  - helm|wikifact:shares_border_with
+  - helm|wikifact:solved_by
+  - helm|wikifact:statement_describes
+  - helm|wikifact:stock_exchange
+  - helm|wikifact:subclass_of
+  - helm|wikifact:subsidiary
+  - helm|wikifact:symptoms_and_signs
+  - helm|wikifact:therapeutic_area
+  - helm|wikifact:time_of_discovery_or_invention
+  - helm|wikifact:twinned_administrative_body
+  - helm|wikifact:work_location
+  - helm|wikitext:103:document_level
+  - helm|wmt14:cs-en
+  - helm|wmt14:de-en
+  - helm|wmt14:fr-en
+  - helm|wmt14:hi-en
+  - helm|wmt14:ru-en
+
+- leaderboard:
+  - leaderboard|arc:challenge
+  - leaderboard|gsm8k
+  - leaderboard|hellaswag
+  - leaderboard|mmlu:abstract_algebra
+  - leaderboard|mmlu:anatomy
+  - leaderboard|mmlu:astronomy
+  - leaderboard|mmlu:business_ethics
+  - leaderboard|mmlu:clinical_knowledge
+  - leaderboard|mmlu:college_biology
+  - leaderboard|mmlu:college_chemistry
+  - leaderboard|mmlu:college_computer_science
+  - leaderboard|mmlu:college_mathematics
+  - leaderboard|mmlu:college_medicine
+  - leaderboard|mmlu:college_physics
+  - leaderboard|mmlu:computer_security
+  - leaderboard|mmlu:conceptual_physics
+  - leaderboard|mmlu:econometrics
+  - leaderboard|mmlu:electrical_engineering
+  - leaderboard|mmlu:elementary_mathematics
+  - leaderboard|mmlu:formal_logic
+  - leaderboard|mmlu:global_facts
+  - leaderboard|mmlu:high_school_biology
+  - leaderboard|mmlu:high_school_chemistry
+  - leaderboard|mmlu:high_school_computer_science
+  - leaderboard|mmlu:high_school_european_history
+  - leaderboard|mmlu:high_school_geography
+  - leaderboard|mmlu:high_school_government_and_politics
+  - leaderboard|mmlu:high_school_macroeconomics
+  - leaderboard|mmlu:high_school_mathematics
+  - leaderboard|mmlu:high_school_microeconomics
+  - leaderboard|mmlu:high_school_physics
+  - leaderboard|mmlu:high_school_psychology
+  - leaderboard|mmlu:high_school_statistics
+  - leaderboard|mmlu:high_school_us_history
+  - leaderboard|mmlu:high_school_world_history
+  - leaderboard|mmlu:human_aging
+  - leaderboard|mmlu:human_sexuality
+  - leaderboard|mmlu:international_law
+  - leaderboard|mmlu:jurisprudence
+  - leaderboard|mmlu:logical_fallacies
+  - leaderboard|mmlu:machine_learning
+  - leaderboard|mmlu:management
+  - leaderboard|mmlu:marketing
+  - leaderboard|mmlu:medical_genetics
+  - leaderboard|mmlu:miscellaneous
+  - leaderboard|mmlu:moral_disputes
+  - leaderboard|mmlu:moral_scenarios
+  - leaderboard|mmlu:nutrition
+  - leaderboard|mmlu:philosophy
+  - leaderboard|mmlu:prehistory
+  - leaderboard|mmlu:professional_accounting
+  - leaderboard|mmlu:professional_law
+  - leaderboard|mmlu:professional_medicine
+  - leaderboard|mmlu:professional_psychology
+  - leaderboard|mmlu:public_relations
+  - leaderboard|mmlu:security_studies
+  - leaderboard|mmlu:sociology
+  - leaderboard|mmlu:us_foreign_policy
+  - leaderboard|mmlu:virology
+  - leaderboard|mmlu:world_religions
+  - leaderboard|truthfulqa:mc
+  - leaderboard|winogrande
+
+- lighteval:
+  - lighteval|agieval:aqua-rat
+  - lighteval|agieval:gaokao-biology
+  - lighteval|agieval:gaokao-chemistry
+  - lighteval|agieval:gaokao-chinese
+  - lighteval|agieval:gaokao-english
+  - lighteval|agieval:gaokao-geography
+  - lighteval|agieval:gaokao-history
+  - lighteval|agieval:gaokao-mathqa
+  - lighteval|agieval:gaokao-physics
+  - lighteval|agieval:logiqa-en
+  - lighteval|agieval:logiqa-zh
+  - lighteval|agieval:lsat-ar
+  - lighteval|agieval:lsat-lr
+  - lighteval|agieval:lsat-rc
+  - lighteval|agieval:sat-en
+  - lighteval|agieval:sat-en-without-passage
+  - lighteval|agieval:sat-math
+  - lighteval|anli
+  - lighteval|anli:r1
+  - lighteval|anli:r2
+  - lighteval|anli:r3
+  - lighteval|arc:easy
+  - lighteval|arithmetic:1dc
+  - lighteval|arithmetic:2da
+  - lighteval|arithmetic:2dm
+  - lighteval|arithmetic:2ds
+  - lighteval|arithmetic:3da
+  - lighteval|arithmetic:3ds
+  - lighteval|arithmetic:4da
+  - lighteval|arithmetic:4ds
+  - lighteval|arithmetic:5da
+  - lighteval|arithmetic:5ds
+  - lighteval|asdiv
+  - lighteval|bigbench:causal_judgment
+  - lighteval|bigbench:date_understanding
+  - lighteval|bigbench:disambiguation_qa
+  - lighteval|bigbench:geometric_shapes
+  - lighteval|bigbench:logical_deduction_five_objects
+  - lighteval|bigbench:logical_deduction_seven_objects
+  - lighteval|bigbench:logical_deduction_three_objects
+  - lighteval|bigbench:movie_recommendation
+  - lighteval|bigbench:navigate
+  - lighteval|bigbench:reasoning_about_colored_objects
+  - lighteval|bigbench:ruin_names
+  - lighteval|bigbench:salient_translation_error_detection
+  - lighteval|bigbench:snarks
+  - lighteval|bigbench:sports_understanding
+  - lighteval|bigbench:temporal_sequences
+  - lighteval|bigbench:tracking_shuffled_objects_five_objects
+  - lighteval|bigbench:tracking_shuffled_objects_seven_objects
+  - lighteval|bigbench:tracking_shuffled_objects_three_objects
+  - lighteval|blimp:adjunct_island
+  - lighteval|blimp:anaphor_gender_agreement
+  - lighteval|blimp:anaphor_number_agreement
+  - lighteval|blimp:animate_subject_passive
+  - lighteval|blimp:animate_subject_trans
+  - lighteval|blimp:causative
+  - lighteval|blimp:complex_NP_island
+  - lighteval|blimp:coordinate_structure_constraint_complex_left_branch
+  - lighteval|blimp:coordinate_structure_constraint_object_extraction
+  - lighteval|blimp:determiner_noun_agreement_1
+  - lighteval|blimp:determiner_noun_agreement_2
+  - lighteval|blimp:determiner_noun_agreement_irregular_1
+  - lighteval|blimp:determiner_noun_agreement_irregular_2
+  - lighteval|blimp:determiner_noun_agreement_with_adj_2
+  - lighteval|blimp:determiner_noun_agreement_with_adj_irregular_1
+  - lighteval|blimp:determiner_noun_agreement_with_adj_irregular_2
+  - lighteval|blimp:determiner_noun_agreement_with_adjective_1
+  - lighteval|blimp:distractor_agreement_relational_noun
+  - lighteval|blimp:distractor_agreement_relative_clause
+  - lighteval|blimp:drop_argument
+  - lighteval|blimp:ellipsis_n_bar_1
+  - lighteval|blimp:ellipsis_n_bar_2
+  - lighteval|blimp:existential_there_object_raising
+  - lighteval|blimp:existential_there_quantifiers_1
+  - lighteval|blimp:existential_there_quantifiers_2
+  - lighteval|blimp:existential_there_subject_raising
+  - lighteval|blimp:expletive_it_object_raising
+  - lighteval|blimp:inchoative
+  - lighteval|blimp:intransitive
+  - lighteval|blimp:irregular_past_participle_adjectives
+  - lighteval|blimp:irregular_past_participle_verbs
+  - lighteval|blimp:irregular_plural_subject_verb_agreement_1
+  - lighteval|blimp:irregular_plural_subject_verb_agreement_2
+  - lighteval|blimp:left_branch_island_echo_question
+  - lighteval|blimp:left_branch_island_simple_question
+  - lighteval|blimp:matrix_question_npi_licensor_present
+  - lighteval|blimp:npi_present_1
+  - lighteval|blimp:npi_present_2
+  - lighteval|blimp:only_npi_licensor_present
+  - lighteval|blimp:only_npi_scope
+  - lighteval|blimp:passive_1
+  - lighteval|blimp:passive_2
+  - lighteval|blimp:principle_A_c_command
+  - lighteval|blimp:principle_A_case_1
+  - lighteval|blimp:principle_A_case_2
+  - lighteval|blimp:principle_A_domain_1
+  - lighteval|blimp:principle_A_domain_2
+  - lighteval|blimp:principle_A_domain_3
+  - lighteval|blimp:principle_A_reconstruction
+  - lighteval|blimp:regular_plural_subject_verb_agreement_1
+  - lighteval|blimp:regular_plural_subject_verb_agreement_2
+  - lighteval|blimp:sentential_negation_npi_licensor_present
+  - lighteval|blimp:sentential_negation_npi_scope
+  - lighteval|blimp:sentential_subject_island
+  - lighteval|blimp:superlative_quantifiers_1
+  - lighteval|blimp:superlative_quantifiers_2
+  - lighteval|blimp:tough_vs_raising_1
+  - lighteval|blimp:tough_vs_raising_2
+  - lighteval|blimp:transitive
+  - lighteval|blimp:wh_island
+  - lighteval|blimp:wh_questions_object_gap
+  - lighteval|blimp:wh_questions_subject_gap
+  - lighteval|blimp:wh_questions_subject_gap_long_distance
+  - lighteval|blimp:wh_vs_that_no_gap
+  - lighteval|blimp:wh_vs_that_no_gap_long_distance
+  - lighteval|blimp:wh_vs_that_with_gap
+  - lighteval|blimp:wh_vs_that_with_gap_long_distance
+  - lighteval|coqa
+  - lighteval|coqa_bb
+  - lighteval|drop
+  - lighteval|ethics:commonsense
+  - lighteval|ethics:deontology
+  - lighteval|ethics:justice
+  - lighteval|ethics:utilitarianism
+  - lighteval|ethics:virtue
+  - lighteval|glue:cola
+  - lighteval|glue:mnli
+  - lighteval|glue:mnli_mismatched
+  - lighteval|glue:mrpc
+  - lighteval|glue:qnli
+  - lighteval|glue:qqp
+  - lighteval|glue:rte
+  - lighteval|glue:sst2
+  - lighteval|glue:stsb
+  - lighteval|glue:wnli
+  - lighteval|gpqa
+  - lighteval|gsm8k
+  - lighteval|headqa:en
+  - lighteval|headqa:es
+  - lighteval|iwslt17:ar-en
+  - lighteval|iwslt17:de-en
+  - lighteval|iwslt17:en-ar
+  - lighteval|iwslt17:en-de
+  - lighteval|iwslt17:en-fr
+  - lighteval|iwslt17:en-ja
+  - lighteval|iwslt17:en-ko
+  - lighteval|iwslt17:en-zh
+  - lighteval|iwslt17:fr-en
+  - lighteval|iwslt17:ja-en
+  - lighteval|iwslt17:ko-en
+  - lighteval|iwslt17:zh-en
+  - lighteval|lambada:openai
+  - lighteval|lambada:openai:de
+  - lighteval|lambada:openai:en
+  - lighteval|lambada:openai:es
+  - lighteval|lambada:openai:fr
+  - lighteval|lambada:openai:it
+  - lighteval|lambada:openai_cloze
+  - lighteval|lambada:standard
+  - lighteval|lambada:standard_cloze
+  - lighteval|logiqa
+  - lighteval|math:algebra
+  - lighteval|math:counting_and_probability
+  - lighteval|math:geometry
+  - lighteval|math:intermediate_algebra
+  - lighteval|math:number_theory
+  - lighteval|math:prealgebra
+  - lighteval|math:precalculus
+  - lighteval|math_cot:algebra
+  - lighteval|math_cot:counting_and_probability
+  - lighteval|math_cot:geometry
+  - lighteval|math_cot:intermediate_algebra
+  - lighteval|math_cot:number_theory
+  - lighteval|math_cot:prealgebra
+  - lighteval|math_cot:precalculus
+  - lighteval|mathqa
+  - lighteval|mgsm:bn
+  - lighteval|mgsm:de
+  - lighteval|mgsm:en
+  - lighteval|mgsm:es
+  - lighteval|mgsm:fr
+  - lighteval|mgsm:ja
+  - lighteval|mgsm:ru
+  - lighteval|mgsm:sw
+  - lighteval|mgsm:te
+  - lighteval|mgsm:th
+  - lighteval|mgsm:zh
+  - lighteval|mtnt2019:en-fr
+  - lighteval|mtnt2019:en-ja
+  - lighteval|mtnt2019:fr-en
+  - lighteval|mtnt2019:ja-en
+  - lighteval|mutual
+  - lighteval|mutual_plus
+  - lighteval|openbookqa
+  - lighteval|piqa
+  - lighteval|prost
+  - lighteval|pubmedqa
+  - lighteval|qa4mre:2011
+  - lighteval|qa4mre:2012
+  - lighteval|qa4mre:2013
+  - lighteval|qasper
+  - lighteval|qasper_ll
+  - lighteval|race:high
+  - lighteval|sciq
+  - lighteval|storycloze:2016
+  - lighteval|storycloze:2018
+  - lighteval|super_glue:boolq
+  - lighteval|super_glue:cb
+  - lighteval|super_glue:copa
+  - lighteval|super_glue:multirc
+  - lighteval|super_glue:rte
+  - lighteval|super_glue:wic
+  - lighteval|super_glue:wsc
+  - lighteval|swag
+  - lighteval|the_pile:arxiv
+  - lighteval|the_pile:bookcorpus2
+  - lighteval|the_pile:books3
+  - lighteval|the_pile:dm-mathematics
+  - lighteval|the_pile:enron
+  - lighteval|the_pile:europarl
+  - lighteval|the_pile:freelaw
+  - lighteval|the_pile:github
+  - lighteval|the_pile:gutenberg
+  - lighteval|the_pile:hackernews
+  - lighteval|the_pile:nih-exporter
+  - lighteval|the_pile:opensubtitles
+  - lighteval|the_pile:openwebtext2
+  - lighteval|the_pile:philpapers
+  - lighteval|the_pile:pile-cc
+  - lighteval|the_pile:pubmed-abstracts
+  - lighteval|the_pile:pubmed-central
+  - lighteval|the_pile:stackexchange
+  - lighteval|the_pile:ubuntu-irc
+  - lighteval|the_pile:uspto
+  - lighteval|the_pile:wikipedia
+  - lighteval|the_pile:youtubesubtitles
+  - lighteval|toxigen
+  - lighteval|triviaqa
+  - lighteval|truthfulqa:gen
+  - lighteval|unscramble:anagrams1
+  - lighteval|unscramble:anagrams2
+  - lighteval|unscramble:cycle_letters
+  - lighteval|unscramble:random_insertion
+  - lighteval|unscramble:reversed_words
+  - lighteval|webqs
+  - lighteval|wikitext:2
+  - lighteval|wmt08:cs-en
+  - lighteval|wmt08:de-en
+  - lighteval|wmt08:en-cs
+  - lighteval|wmt08:en-de
+  - lighteval|wmt08:en-es
+  - lighteval|wmt08:en-fr
+  - lighteval|wmt08:en-hu
+  - lighteval|wmt08:es-en
+  - lighteval|wmt08:fr-en
+  - lighteval|wmt08:hu-en
+  - lighteval|wmt09:cs-en
+  - lighteval|wmt09:de-en
+  - lighteval|wmt09:en-cs
+  - lighteval|wmt09:en-de
+  - lighteval|wmt09:en-es
+  - lighteval|wmt09:en-fr
+  - lighteval|wmt09:en-hu
+  - lighteval|wmt09:en-it
+  - lighteval|wmt09:es-en
+  - lighteval|wmt09:fr-en
+  - lighteval|wmt09:hu-en
+  - lighteval|wmt09:it-en
+  - lighteval|wmt10:cs-en
+  - lighteval|wmt10:de-en
+  - lighteval|wmt10:en-cs
+  - lighteval|wmt10:en-de
+  - lighteval|wmt10:en-es
+  - lighteval|wmt10:en-fr
+  - lighteval|wmt10:es-en
+  - lighteval|wmt10:fr-en
+  - lighteval|wmt11:cs-en
+  - lighteval|wmt11:de-en
+  - lighteval|wmt11:en-cs
+  - lighteval|wmt11:en-de
+  - lighteval|wmt11:en-es
+  - lighteval|wmt11:en-fr
+  - lighteval|wmt11:es-en
+  - lighteval|wmt11:fr-en
+  - lighteval|wmt12:cs-en
+  - lighteval|wmt12:de-en
+  - lighteval|wmt12:en-cs
+  - lighteval|wmt12:en-de
+  - lighteval|wmt12:en-es
+  - lighteval|wmt12:en-fr
+  - lighteval|wmt12:es-en
+  - lighteval|wmt12:fr-en
+  - lighteval|wmt13:cs-en
+  - lighteval|wmt13:de-en
+  - lighteval|wmt13:en-cs
+  - lighteval|wmt13:en-de
+  - lighteval|wmt13:en-es
+  - lighteval|wmt13:en-fr
+  - lighteval|wmt13:en-ru
+  - lighteval|wmt13:es-en
+  - lighteval|wmt13:fr-en
+  - lighteval|wmt13:ru-en
+  - lighteval|wmt14:cs-en
+  - lighteval|wmt14:de-en
+  - lighteval|wmt14:en-cs
+  - lighteval|wmt14:en-de
+  - lighteval|wmt14:en-fr
+  - lighteval|wmt14:en-hi
+  - lighteval|wmt14:en-ru
+  - lighteval|wmt14:fr-en
+  - lighteval|wmt14:hi-en
+  - lighteval|wmt14:ru-en
+  - lighteval|wmt15:cs-en
+  - lighteval|wmt15:de-en
+  - lighteval|wmt15:en-cs
+  - lighteval|wmt15:en-de
+  - lighteval|wmt15:en-fi
+  - lighteval|wmt15:en-fr
+  - lighteval|wmt15:en-ru
+  - lighteval|wmt15:fi-en
+  - lighteval|wmt15:fr-en
+  - lighteval|wmt15:ru-en
+  - lighteval|wmt16:cs-en
+  - lighteval|wmt16:de-en
+  - lighteval|wmt16:en-cs
+  - lighteval|wmt16:en-de
+  - lighteval|wmt16:en-fi
+  - lighteval|wmt16:en-ro
+  - lighteval|wmt16:en-ru
+  - lighteval|wmt16:en-tr
+  - lighteval|wmt16:fi-en
+  - lighteval|wmt16:ro-en
+  - lighteval|wmt16:ru-en
+  - lighteval|wmt16:tr-en
+  - lighteval|wmt17:cs-en
+  - lighteval|wmt17:de-en
+  - lighteval|wmt17:en-cs
+  - lighteval|wmt17:en-de
+  - lighteval|wmt17:en-fi
+  - lighteval|wmt17:en-lv
+  - lighteval|wmt17:en-ru
+  - lighteval|wmt17:en-tr
+  - lighteval|wmt17:en-zh
+  - lighteval|wmt17:fi-en
+  - lighteval|wmt17:lv-en
+  - lighteval|wmt17:ru-en
+  - lighteval|wmt17:tr-en
+  - lighteval|wmt17:zh-en
+  - lighteval|wmt18:cs-en
+  - lighteval|wmt18:de-en
+  - lighteval|wmt18:en-cs
+  - lighteval|wmt18:en-de
+  - lighteval|wmt18:en-et
+  - lighteval|wmt18:en-fi
+  - lighteval|wmt18:en-ru
+  - lighteval|wmt18:en-tr
+  - lighteval|wmt18:en-zh
+  - lighteval|wmt18:et-en
+  - lighteval|wmt18:fi-en
+  - lighteval|wmt18:ru-en
+  - lighteval|wmt18:tr-en
+  - lighteval|wmt18:zh-en
+  - lighteval|wmt19:cs-de
+  - lighteval|wmt19:de-cs
+  - lighteval|wmt19:de-en
+  - lighteval|wmt19:de-fr
+  - lighteval|wmt19:en-cs
+  - lighteval|wmt19:en-de
+  - lighteval|wmt19:en-fi
+  - lighteval|wmt19:en-gu
+  - lighteval|wmt19:en-kk
+  - lighteval|wmt19:en-lt
+  - lighteval|wmt19:en-ru
+  - lighteval|wmt19:en-zh
+  - lighteval|wmt19:fi-en
+  - lighteval|wmt19:fr-de
+  - lighteval|wmt19:gu-en
+  - lighteval|wmt19:kk-en
+  - lighteval|wmt19:lt-en
+  - lighteval|wmt19:ru-en
+  - lighteval|wmt19:zh-en
+  - lighteval|wmt20:cs-en
+  - lighteval|wmt20:de-en
+  - lighteval|wmt20:de-fr
+  - lighteval|wmt20:en-cs
+  - lighteval|wmt20:en-de
+  - lighteval|wmt20:en-iu
+  - lighteval|wmt20:en-ja
+  - lighteval|wmt20:en-km
+  - lighteval|wmt20:en-pl
+  - lighteval|wmt20:en-ps
+  - lighteval|wmt20:en-ru
+  - lighteval|wmt20:en-ta
+  - lighteval|wmt20:en-zh
+  - lighteval|wmt20:fr-de
+  - lighteval|wmt20:iu-en
+  - lighteval|wmt20:ja-en
+  - lighteval|wmt20:km-en
+  - lighteval|wmt20:pl-en
+  - lighteval|wmt20:ps-en
+  - lighteval|wmt20:ru-en
+  - lighteval|wmt20:ta-en
+  - lighteval|wmt20:zh-en
+  - lighteval|wsc273
+  - lighteval|xcopa:en
+  - lighteval|xcopa:et
+  - lighteval|xcopa:ht
+  - lighteval|xcopa:id
+  - lighteval|xcopa:it
+  - lighteval|xcopa:qu
+  - lighteval|xcopa:sw
+  - lighteval|xcopa:ta
+  - lighteval|xcopa:th
+  - lighteval|xcopa:tr
+  - lighteval|xcopa:vi
+  - lighteval|xcopa:zh
+  - lighteval|xstory_cloze:ar
+  - lighteval|xstory_cloze:en
+  - lighteval|xstory_cloze:es
+  - lighteval|xstory_cloze:eu
+  - lighteval|xstory_cloze:hi
+  - lighteval|xstory_cloze:id
+  - lighteval|xstory_cloze:my
+  - lighteval|xstory_cloze:ru
+  - lighteval|xstory_cloze:sw
+  - lighteval|xstory_cloze:te
+  - lighteval|xstory_cloze:zh
+  - lighteval|xwinograd:en
+  - lighteval|xwinograd:fr
+  - lighteval|xwinograd:jp
+  - lighteval|xwinograd:pt
+  - lighteval|xwinograd:ru
+  - lighteval|xwinograd:zh
+
+- original:
+  - original|arc:c:letters
+  - original|arc:c:options
+  - original|arc:c:simple
+  - original|mmlu
+  - original|mmlu:abstract_algebra
+  - original|mmlu:anatomy
+  - original|mmlu:astronomy
+  - original|mmlu:business_ethics
+  - original|mmlu:clinical_knowledge
+  - original|mmlu:college_biology
+  - original|mmlu:college_chemistry
+  - original|mmlu:college_computer_science
+  - original|mmlu:college_mathematics
+  - original|mmlu:college_medicine
+  - original|mmlu:college_physics
+  - original|mmlu:computer_security
+  - original|mmlu:conceptual_physics
+  - original|mmlu:econometrics
+  - original|mmlu:electrical_engineering
+  - original|mmlu:elementary_mathematics
+  - original|mmlu:formal_logic
+  - original|mmlu:global_facts
+  - original|mmlu:high_school_biology
+  - original|mmlu:high_school_chemistry
+  - original|mmlu:high_school_computer_science
+  - original|mmlu:high_school_european_history
+  - original|mmlu:high_school_geography
+  - original|mmlu:high_school_government_and_politics
+  - original|mmlu:high_school_macroeconomics
+  - original|mmlu:high_school_mathematics
+  - original|mmlu:high_school_microeconomics
+  - original|mmlu:high_school_physics
+  - original|mmlu:high_school_psychology
+  - original|mmlu:high_school_statistics
+  - original|mmlu:high_school_us_history
+  - original|mmlu:high_school_world_history
+  - original|mmlu:human_aging
+  - original|mmlu:human_sexuality
+  - original|mmlu:international_law
+  - original|mmlu:jurisprudence
+  - original|mmlu:logical_fallacies
+  - original|mmlu:machine_learning
+  - original|mmlu:management
+  - original|mmlu:marketing
+  - original|mmlu:medical_genetics
+  - original|mmlu:miscellaneous
+  - original|mmlu:moral_disputes
+  - original|mmlu:moral_scenarios
+  - original|mmlu:nutrition
+  - original|mmlu:philosophy
+  - original|mmlu:prehistory
+  - original|mmlu:professional_accounting
+  - original|mmlu:professional_law
+  - original|mmlu:professional_medicine
+  - original|mmlu:professional_psychology
+  - original|mmlu:public_relations
+  - original|mmlu:security_studies
+  - original|mmlu:sociology
+  - original|mmlu:us_foreign_policy
+  - original|mmlu:virology
+  - original|mmlu:world_religions
diff --git a/docs/source/use_tgi.md b/docs/source/use_tgi.md
index 7ae6b000b..805420734 100644
--- a/docs/source/use_tgi.md
+++ b/docs/source/use_tgi.md
@@ -1,3 +1,67 @@
-# Use TGI
+# Evaluate a model on a server or container
 
-blabla
+An alternative to launching the evaluation locally is to serve the model on a
+TGI-compatible server/container and then run the evaluation by sending requests
+to the server. The command is the same as before, except you specify a path to
+a yaml config file (detailed below):
+
+```bash
+python run_evals_accelerate.py \
+    --model_config_path="/path/to/config/file" \
+    --tasks <task parameters> \
+    --output_dir output_dir
+```
+
+There are two types of configuration files that can be provided for running on
+the server:
+
+### Hugging Face Inference Endpoints
+
+To launch a model using Hugging Face Inference Endpoints, you need to provide a
+configuration file, such as the `endpoint_model.yaml` example below. Lighteval will automatically deploy
+the endpoint, run the evaluation, and finally delete the endpoint (unless you
+specify an endpoint that was already launched, in which case the endpoint won't
+be deleted afterwards).
+
+__configuration file example:__
+
+```yaml
+model:
+  type: "endpoint"
+  base_params:
+    endpoint_name: "llama-2-7B-lighteval" # needs to be lower case without special characters
+    model: "meta-llama/Llama-2-7b-hf"
+    revision: "main"
+    dtype: "float16" # can be any of "awq", "eetq", "gptq", "4bit' or "8bit" (will use bitsandbytes), "bfloat16" or "float16"
+    reuse_existing: false # if true, ignore all params in instance, and don't delete the endpoint after evaluation
+  instance:
+    accelerator: "gpu"
+    region: "eu-west-1"
+    vendor: "aws"
+    instance_size: "medium"
+    instance_type: "g5.2xlarge"
+    framework: "pytorch"
+    endpoint_type: "protected"
+    namespace: null # The namespace under which to launch the endpoint. Defaults to the current user's namespace
+    image_url: null # Optionally specify the Docker image to use when launching the endpoint model, e.g. a more recent TGI container release with support for newer models.
+    env_vars:
+      null # Optional environment variables to include when launching the endpoint. e.g., `MAX_INPUT_LENGTH: 2048`
+  generation:
+    add_special_tokens: true
+```
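+
+Assuming the file above is saved as `endpoint_model.yaml` (the filename and the task below are only illustrative), the evaluation is then launched with the same command as shown earlier:
+
+```bash
+python run_evals_accelerate.py \
+    --model_config_path="endpoint_model.yaml" \
+    --tasks "leaderboard|truthfulqa:mc|0|0" \
+    --output_dir output_dir
+```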
+
+### Text Generation Inference (TGI)
+
+Use this configuration to evaluate a model already deployed on a TGI server, for
+example one running on Hugging Face's serverless inference.
+
+__configuration file example:__
+
+```yaml
+model:
+  type: "tgi"
+  instance:
+    inference_server_address: ""
+    inference_server_auth: null
+    model_id: null # Optional, only required if the TGI container was launched with model_id pointing to a local directory
+```
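+
+For instance, `inference_server_address` would typically look like `"http://localhost:8080"` for a TGI container running on your own machine (the address is purely illustrative).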
diff --git a/docs/source/use_vllm.md b/docs/source/use_vllm.md
index 10919f413..c3c49e010 100644
--- a/docs/source/use_vllm.md
+++ b/docs/source/use_vllm.md
@@ -1,4 +1,4 @@
 # Use VLLM as backend
 
-
-blablablal
+```bash
+```

From 57b0cd42669b69d2ed5b54e9eabea1c250615e5b Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Mon, 9 Sep 2024 11:52:12 +0200
Subject: [PATCH 07/24] commit

---
 docs/source/_toctree.yml    |  2 --
 docs/source/installation.md |  5 +++--
 docs/source/quicktour.md    | 19 +++++++++++--------
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 8c17aee40..5c8771113 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -18,8 +18,6 @@
       title: Adding a Custom Metric
     - local: saving_results
       title: Saving Results
-    - local: training_and_eval_loop
-      title: Training and Evaluation Loop
 - title: "API Reference"
   sections:
     - local: metric_list
diff --git a/docs/source/installation.md b/docs/source/installation.md
index cdbb87b8c..d30d4d64c 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -19,8 +19,8 @@ pip install -e .
 ### Extras
 
 Lighteval has optional dependencies that you can install by specifying the
-appropriate extras group. `pip install lighteval[<group>]` or `pip install -e
-.[<group>]`.
+appropriate extras group.
+`pip install lighteval[<group>]` or `pip install -e .[<group>]`.
 
 | extra name   | description                                                               |
 |--------------|---------------------------------------------------------------------------|
@@ -30,6 +30,7 @@ appropriate extras group. `pip install lighteval[<group>]` or `pip install -e
 | quantization | To evaluate quantized models                                              |
 | adapters     | To evaluate adapters models (delta and peft)                              |
 | tensorboardX | To upload your results to tensorboard                                     |
+| vllm         | To use vllm as backend for inference                                      |
 
 ## Hugging Face login
 
diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
index 73099ec5a..5d0c0ac83 100644
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@@ -21,17 +21,20 @@ lighteval accelerate \
      --output_dir="./evals/"
 ```
 
-Here, --tasks refers to either a comma-separated list of supported tasks from
-the `tasks_list` in the format: Tasks details can also be found in the file
-implementing them.
+Here, `--tasks` refers to either a comma-separated list of supported tasks from
+the [tasks_list](tasks) in the format:
 
 ```bash
 suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}
 ```
 
-or a file path like ``examples/tasks/recommended_set.txt`` which specifies
-multiple task configurations. For example, to evaluate GPT-2 on the Truthful QA
-benchmark run:
+or a file path like
+[examples/tasks/recommended_set.txt](https://github.com/huggingface/lighteval/blob/main/examples/tasks/recommended_set.txt)
+which specifies multiple task configurations.
+
+Task details can be found in the
+[file](https://github.com/huggingface/lighteval/blob/main/src/lighteval/tasks/default_tasks.py)
+implementing them.
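+
+For example, `leaderboard|truthfulqa:mc|0|0` runs the TruthfulQA multiple-choice task from the `leaderboard` suite with 0 few-shot examples (the final 0 disables automatic few-shot reduction).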
 
 ### Evaluate a model on one or more GPUs
 
@@ -88,8 +91,8 @@ Nanotron models cannot be evaluated without torchrun.
 ```bash
  torchrun --standalone --nnodes=1 --nproc-per-node=1  \
  src/lighteval/__main__.py nanotron \
- --checkpoint-config-path ../nanotron/checkpoints/10/config.yaml \
- --lighteval-override examples/nanotron/lighteval_config_override_template.yaml
+ --checkpoint_config_path ../nanotron/checkpoints/10/config.yaml \
+ --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
  ```
 
 The `nproc-per-node` argument should match the data, tensor and pipeline

From 7e4d56d36fd9e13439ca02777e592970d7675e13 Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Wed, 11 Sep 2024 13:08:32 +0200
Subject: [PATCH 08/24] commit

---
 docs/source/_toctree.yml             | 14 +++++++-----
 docs/source/quicktour.md             |  4 ++--
 docs/source/use_python_api.md        |  7 ++++++
 docs/source/use_vllm.md              | 33 ++++++++++++++++++++++++++++
 src/lighteval/models/model_config.py | 21 ++++++++++++++++++
 5 files changed, 71 insertions(+), 8 deletions(-)
 create mode 100644 docs/source/use_python_api.md

diff --git a/docs/source/_toctree.yml b/docs/source/_toctree.yml
index 5c8771113..5e3226c2c 100644
--- a/docs/source/_toctree.yml
+++ b/docs/source/_toctree.yml
@@ -8,16 +8,18 @@
       title: Quicktour
 - title: "Guides"
   sections:
-    - local: use_vllm
-      title: Using VLLM as backend
-    - local: use_tgi
-      title: Evaluate on Server
+    - local: saving_results
+      title: Saving Results
+    - local: use_python_api
+      title: Use The Python API
     - local: adding_new_task
       title: Adding a Custom Task
     - local: adding_new_metric
       title: Adding a Custom Metric
-    - local: saving_results
-      title: Saving Results
+    - local: use_vllm
+      title: Using VLLM as backend
+    - local: use_tgi
+      title: Evaluate on Server
 - title: "API Reference"
   sections:
     - local: metric_list
diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
index 5d0c0ac83..54c988263 100644
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@@ -25,7 +25,7 @@ Here, `--tasks` refers to either a comma-separated list of supported tasks from
 the [tasks_list](tasks) in the format:
 
 ```bash
-suite|task|num_few_shot|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}
+{suite}|{task}|{num_few_shot}|{0 or 1 to automatically reduce `num_few_shot` if prompt is too long}
 ```
 
 or a file path like
@@ -65,7 +65,7 @@ batch size will be `override_batch_size * num_gpus`.
 To evaluate a model using pipeline parallelism on 2 or more GPUs, run:
 
 ```bash
-    lighteval accelerate \
+lighteval accelerate \
     --model_args "pretrained=gpt2,model_parallel=True" \
     --tasks "leaderboard|truthfulqa:mc|0|0" \
     --override_batch_size 1 \
diff --git a/docs/source/use_python_api.md b/docs/source/use_python_api.md
new file mode 100644
index 000000000..60ab98e9d
--- /dev/null
+++ b/docs/source/use_python_api.md
@@ -0,0 +1,7 @@
+# Use the Python API
+
+
+Hello World.
+
+```python
+```
diff --git a/docs/source/use_vllm.md b/docs/source/use_vllm.md
index c3c49e010..107f78534 100644
--- a/docs/source/use_vllm.md
+++ b/docs/source/use_vllm.md
@@ -1,4 +1,37 @@
 # Use VLLM as backend
 
+Lighteval allows you to use `vllm` as a backend, allowing great speedups.
+To use it, simply change the `model_args` to reflect the arguments you want to pass to vllm.
+
+```bash
+lighteval accelerate \
+    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16" \
+    --tasks "leaderboard|truthfulqa:mc|0|0" \
+    --output_dir="./evals/"
+```
+
+`vllm` is able to distribute the model across multiple GPUs using data
+parallelism, pipeline parallelism or tensor parallelism.
+You can choose the parallelism method by setting it in the `model_args`.
+
+For example, if you have 4 GPUs, you can split the model across them using `tensor_parallelism`:
+
 ```bash
+export VLLM_WORKER_MULTIPROC_METHOD=spawn && lighteval accelerate \
+    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,tensor_parallel_size=4" \
+    --tasks "leaderboard|truthfulqa:mc|0|0" \
+    --output_dir="./evals/"
 ```
+
+Or, if your model fits on a single GPU, you can use `data_parallelism` to speed up the evaluation:
+
+```bash
+lighteval accelerate \
+    --model_args="vllm,pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=float16,data_parallel_size=4" \
+    --tasks "leaderboard|truthfulqa:mc|0|0" \
+    --output_dir="./evals/"
+```
+
+Available arguments for `vllm` can be found in the `VLLMModelConfig`:
+
+[[autodoc]] lighteval.models.model_config.VLLMModelConfig
diff --git a/src/lighteval/models/model_config.py b/src/lighteval/models/model_config.py
index 51185912d..ec0a9610d 100644
--- a/src/lighteval/models/model_config.py
+++ b/src/lighteval/models/model_config.py
@@ -206,6 +206,27 @@ def init_configs(self, env_config: EnvConfig):
 
 @dataclass
 class VLLMModelConfig:
+    r"""
+    Configuration class for models evaluated with the vLLM backend.
+
+    **Attributes**:
+        - **pretrained** (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
+        - **gpu_memory_utilisation** (float): The fraction of GPU memory to use.
+        - **batch_size** (int): The batch size to use during evaluation.
+        - **revision** (str): The revision of the model.
+        - **dtype** (str, None): The data type to use for the model.
+        - **tensor_parallel_size** (int): The number of tensor parallel units to use.
+        - **data_parallel_size** (int): The number of data parallel units to use.
+        - **max_model_length** (int): The maximum length of the model.
+        - **swap_space** (int): The CPU swap space size (GiB) per GPU.
+        - **seed** (int): The seed to use for the model.
+        - **trust_remote_code** (bool): Whether to trust remote code during model loading.
+        - **use_chat_template** (bool): Whether to use the chat template or not.
+        - **add_special_tokens** (bool): Whether to add special tokens to the input sequences.
+        - **multichoice_continuations_start_space** (bool): Whether to add a space at the start of each continuation in multichoice generation.
+        - **subfolder** (Optional[str]): The subfolder within the model repository.
+    """
+
     pretrained: str
     gpu_memory_utilisation: float = 0.8
     batch_size: int = -1

From e533074c2e9bfd4f66b5741a8cb70fc8c7cf8038 Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Wed, 11 Sep 2024 16:06:46 +0200
Subject: [PATCH 09/24] commit

---
 .../workflows/build_main_documentation.yml    | 19 ++++++++
 .github/workflows/build_pr_documentation.yml  | 17 +++++++
 .../workflows/delete_doc_comment_trigger.yml  |  0
 .github/workflows/upload_pr_documentation.yml | 16 +++++++
 docs/source/use_python_api.md                 | 46 ++++++++++++++++++-
 5 files changed, 97 insertions(+), 1 deletion(-)
 create mode 100644 .github/workflows/build_main_documentation.yml
 create mode 100644 .github/workflows/build_pr_documentation.yml
 create mode 100644 .github/workflows/delete_doc_comment_trigger.yml
 create mode 100644 .github/workflows/upload_pr_documentation.yml

diff --git a/.github/workflows/build_main_documentation.yml b/.github/workflows/build_main_documentation.yml
new file mode 100644
index 000000000..f6633a23c
--- /dev/null
+++ b/.github/workflows/build_main_documentation.yml
@@ -0,0 +1,19 @@
+name: Build documentation
+
+on:
+  push:
+    branches:
+      - main
+      - doc-builder*
+      - v*-release
+      - v*-alpha
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/build_main_documentation.yml@main
+    with:
+      commit_sha: ${{ github.sha }}
+      package: lighteval
+      custom_container: huggingface/transformers-doc-builder
+    secrets:
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
diff --git a/.github/workflows/build_pr_documentation.yml b/.github/workflows/build_pr_documentation.yml
new file mode 100644
index 000000000..ee4298293
--- /dev/null
+++ b/.github/workflows/build_pr_documentation.yml
@@ -0,0 +1,17 @@
+name: Build PR Documentation
+
+on:
+  pull_request:
+
+concurrency:
+  group: ${{ github.workflow }}-${{ github.head_ref || github.run_id }}
+  cancel-in-progress: true
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/build_pr_documentation.yml@main
+    with:
+      commit_sha: ${{ github.event.pull_request.head.sha }}
+      pr_number: ${{ github.event.number }}
+      package: lighteval
+      custom_container: huggingface/transformers-doc-builder
diff --git a/.github/workflows/delete_doc_comment_trigger.yml b/.github/workflows/delete_doc_comment_trigger.yml
new file mode 100644
index 000000000..e69de29bb
diff --git a/.github/workflows/upload_pr_documentation.yml b/.github/workflows/upload_pr_documentation.yml
new file mode 100644
index 000000000..ab6f32d7a
--- /dev/null
+++ b/.github/workflows/upload_pr_documentation.yml
@@ -0,0 +1,16 @@
+name: Upload PR Documentation
+
+on:
+  workflow_run:
+    workflows: ["Build PR Documentation"]
+    types:
+      - completed
+
+jobs:
+  build:
+    uses: huggingface/doc-builder/.github/workflows/upload_pr_documentation.yml@main
+    with:
+      package_name: lighteval
+    secrets:
+      hf_token: ${{ secrets.HF_DOC_BUILD_PUSH }}
+      comment_bot_token: ${{ secrets.COMMENT_BOT_TOKEN }}
diff --git a/docs/source/use_python_api.md b/docs/source/use_python_api.md
index 60ab98e9d..06b4a0f57 100644
--- a/docs/source/use_python_api.md
+++ b/docs/source/use_python_api.md
@@ -1,7 +1,51 @@
 # Use the Python API
 
+Lighteval can be used from a custom Python script. To evaluate a model, you will
+need to set up an `evaluation_tracker`, `pipeline_parameters`, `model_config`
+and a `pipeline`.
+
+After that, simply run the pipeline and save the results.
 
-Hello World.
 
 ```python
+import lighteval
+from lighteval.logging.evaluation_tracker import EvaluationTracker
+from lighteval.models.model_config import VLLMModelConfig
+from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters
+
+
+def main():
+    evaluation_tracker = EvaluationTracker(
+        output_dir="./results",
+        save_details=True,
+        push_to_hub=True,
+        hub_results_org="SaylorTwift",
+    )
+
+    pipeline_params = PipelineParameters(
+        launcher_type=ParallelismManager.ACCELERATE,
+    )
+
+    model_config = VLLMModelConfig(
+        pretrained="HuggingFaceH4/zephyr-7b-beta",
+        dtype="float16",
+        use_chat_template=True,
+    )
+
+    task = "helm|mmlu|5|1"
+
+    pipeline = Pipeline(
+        tasks=task,
+        pipeline_parameters=pipeline_params,
+        evaluation_tracker=evaluation_tracker,
+        model_config=model_config,
+        custom_task_directory=None, # if using a custom task
+    )
+
+    pipeline.evaluate()
+    pipeline.save_and_push_results()
+    pipeline.show_results()
+
+if __name__ == "__main__":
+    main()
 ```

From 2f1c7f595c31ea90b34a0d12991a0232d9f2e5ae Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:21:09 +0200
Subject: [PATCH 10/24] Update docs/source/installation.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/installation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/installation.md b/docs/source/installation.md
index d30d4d64c..ac7bdd1f3 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -31,7 +31,7 @@ appropriate extras group.
 | adapters     | To evaluate adapters models (delta and peft)                              |
 | tensorboardX | To upload your results to tensorboard                                     |
 | vllm         | To use vllm as backend for inference                                      |
-
+| s3           | To upload results to s3                                                   |
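+
+For example, `pip install lighteval[vllm]` installs lighteval together with the dependencies needed for the vllm backend.
+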
 ## Hugging Face login
 
 If you want to push your results to the Hugging Face Hub or evaluate your own

From 0d1da5d8423ce12ab21afb02e9f57835387db888 Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:22:24 +0200
Subject: [PATCH 11/24] Update docs/source/saving_results.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/saving_results.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/source/saving_results.md b/docs/source/saving_results.md
index be553155e..2cf7d2dc1 100644
--- a/docs/source/saving_results.md
+++ b/docs/source/saving_results.md
@@ -14,10 +14,10 @@ argument. The details will be saved in a parquet file
 ## Pushing results to the HuggingFace hub
 
 You can push the results and evaluation details to the HuggingFace hub. To do
-so, you need to set the `--push_results_to_hub` as well as the `--results_org`
+so, you need to set the `--push_to_hub` as well as the `--results_org`
 argument. The results will be saved in a dataset with the name at
 `{results_org}/{model_org}/{model_name}`. To push the details, you need to set
-the `--push_details_to_hub` argument.
+the `--save_details` argument.
 The dataset created will be private by default, you can make it public by
 setting the `--public_run` argument.
 

From 7a8782acb3eeeacb38956b117f003b106fc9ab67 Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:23:40 +0200
Subject: [PATCH 12/24] Update docs/source/saving_results.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/saving_results.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/source/saving_results.md b/docs/source/saving_results.md
index 2cf7d2dc1..522fb85f0 100644
--- a/docs/source/saving_results.md
+++ b/docs/source/saving_results.md
@@ -1,15 +1,15 @@
 # Saving results
 
-## Saving results locally
+## Saving results elsewhere
 
 Lighteval will automatically save results and evaluation details in the directory
 set with the `--output_dir` argument. The results will be saved in
-`{output_dir}/results/{model_org}/{model_name}/results_{timestamp}.json`.
-[Here is an example of a result file](#example-of-a-result-file).
+`{output_dir}/results/{model_name}/results_{timestamp}.json`.
+[Here is an example of a result file](#example-of-a-result-file). The output path can be any [fsspec](https://filesystem-spec.readthedocs.io/en/latest/index.html) compliant path (local, s3, hf hub, gdrive, ftp, etc).
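+For example, `--output_dir="s3://my-lighteval-bucket/evals/"` (an illustrative bucket name) writes the results directly to S3.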
 
 To save the details of the evaluation, you can use the `--save_details`
 argument. The details will be saved in a parquet file
-`{output_dir}/details/{model_org}/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet`.
+`{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet`.
 
 ## Pushing results to the HuggingFace hub
 

From 1c7454b9ea0ee5b4129c593d843f666dd4c7a99f Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:23:54 +0200
Subject: [PATCH 13/24] Update docs/source/saving_results.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/saving_results.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/saving_results.md b/docs/source/saving_results.md
index 522fb85f0..a7925ad50 100644
--- a/docs/source/saving_results.md
+++ b/docs/source/saving_results.md
@@ -24,7 +24,7 @@ setting the `--public_run` argument.
 
 ## Pushing results to Tensorboard
 
-You can push the results to Tensorboard by setting the `--push_results_to_tensorboard`.
+You can push the results to Tensorboard by setting `--push_to_tensorboard`.
 
 
 ## How to load and investigate details

From 25390355989e0c306e5268f6afcca30d8900f4f9 Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:24:15 +0200
Subject: [PATCH 14/24] Update docs/source/saving_results.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/saving_results.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/docs/source/saving_results.md b/docs/source/saving_results.md
index a7925ad50..2d3a44999 100644
--- a/docs/source/saving_results.md
+++ b/docs/source/saving_results.md
@@ -36,13 +36,11 @@ from datasets import load_dataset
 import os
 
 output_dir = "evals_doc"
-model = "HuggingFaceH4/zephyr-7b-beta"
-model_org = model.split("/")[0]
-model_name = model.split("/")[1]
+model_name = "HuggingFaceH4/zephyr-7b-beta"
 timestamp = "2024-09-03T15-06-11.234678"
 task = "lighteval|gsm8k|0"
 
-details_path = f"{output_dir}/details/{model_org}/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"
+details_path = f"{output_dir}/details/{model_name}/{timestamp}/details_{task}_{timestamp}.parquet"
 
 # Load the details
 details = load_dataset("parquet", data_files=details_path, split="train")

From b5f29425903dfb7106efe3f8162d578a9c98796c Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:28:54 +0200
Subject: [PATCH 15/24] Update docs/source/saving_results.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/saving_results.md | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/docs/source/saving_results.md b/docs/source/saving_results.md
index 2d3a44999..df9b9f6bc 100644
--- a/docs/source/saving_results.md
+++ b/docs/source/saving_results.md
@@ -56,14 +56,13 @@ from datasets import load_dataset
 
 output_dir = "evals_doc"
 results_org = "SaylorTwift"
-model = "HuggingFaceH4/zephyr-7b-beta"
-model_org = model.split("/")[0]
-model_name = model.split("/")[1]
+model_name = "HuggingFaceH4/zephyr-7b-beta"
+sanitized_model_name = model_name.replace("/", "__")
 timestamp = "2024-09-03T15-06-11.234678"
 task = "lighteval|gsm8k|0"
 public_run = False
 
-dataset_path = f"{results_org}/details_{model_name}{'_private' if not public_run else ''}"
+dataset_path = f"{results_org}/details_{sanitized_model_name}{'_private' if not public_run else ''}"
 details = load_dataset(dataset_path, task.replace("|", "_"), split="latest")
 
 for detail in details:

From 9825950fa1fbadd208ad0b32d4fc6c3a79b993fc Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:30:02 +0200
Subject: [PATCH 16/24] Update docs/source/adding_new_metric.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/adding_new_metric.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/adding_new_metric.md b/docs/source/adding_new_metric.md
index 16281815c..f1b74f068 100644
--- a/docs/source/adding_new_metric.md
+++ b/docs/source/adding_new_metric.md
@@ -19,7 +19,7 @@ from aenum import extend_enum
 from lighteval.metrics import Metrics
 ```
 
-You need to define sample level metric:
+You need to define a sample level metric:
 
 ```python
 def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:

From fa67cf06a56f6705f7d11a05cf0a7aae6f1d4d4a Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:30:13 +0200
Subject: [PATCH 17/24] Update docs/source/adding_new_metric.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/adding_new_metric.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/adding_new_metric.md b/docs/source/adding_new_metric.md
index f1b74f068..c6327e082 100644
--- a/docs/source/adding_new_metric.md
+++ b/docs/source/adding_new_metric.md
@@ -27,7 +27,7 @@ def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> bool:
     return response == formatted_doc.choices[formatted_doc.gold_index]
 ```
 
-Here the sample level metric only return one metric, if you want to return multiple metrics per sample you need to return a dictionary with the metrics as keys and the values as values.
+Here the sample level metric only returns one metric. If you want to return multiple metrics per sample, you need to return a dictionary with the metric names as keys and their values as values.
 
 ```python
 def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:

From f17ce922e844f4bcc9735708569fe9c857079db3 Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Tue, 17 Sep 2024 17:30:30 +0200
Subject: [PATCH 18/24] Update docs/source/adding_new_metric.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/adding_new_metric.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/adding_new_metric.md b/docs/source/adding_new_metric.md
index c6327e082..c5a26416a 100644
--- a/docs/source/adding_new_metric.md
+++ b/docs/source/adding_new_metric.md
@@ -35,7 +35,7 @@ def custom_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
     return {"accuracy": response == formatted_doc.choices[formatted_doc.gold_index], "other_metric": 0.5}
 ```
 
-Then, you can define an aggreagtion function if needed, a comon aggregation function is `np.mean`.
+Then, you can define an aggregation function if needed; a common aggregation function is `np.mean`.
 
 ```python
 def agg_function(items):

From f3c319d030423ee8fe481efda228ef9fa301e3ad Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Wed, 18 Sep 2024 11:19:39 +0200
Subject: [PATCH 19/24] Update docs/source/adding_new_metric.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/adding_new_metric.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/adding_new_metric.md b/docs/source/adding_new_metric.md
index c5a26416a..bc598f0b1 100644
--- a/docs/source/adding_new_metric.md
+++ b/docs/source/adding_new_metric.md
@@ -73,7 +73,7 @@ custom_metric = SampleLevelMetricGrouping(
 )
 ```
 
-And to end with the following, so that it adds your metric to our metrics list
+To finish, add the following, so that it adds your metric to our metrics list
 when loaded as a module.
 
 ```python

From bcd6f5006248545f403499bae87206c33c1e2f38 Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Wed, 18 Sep 2024 11:20:17 +0200
Subject: [PATCH 20/24] Update docs/source/adding_new_task.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/adding_new_task.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/adding_new_task.md b/docs/source/adding_new_task.md
index 03f5eb7bf..7ee23ca3f 100644
--- a/docs/source/adding_new_task.md
+++ b/docs/source/adding_new_task.md
@@ -31,7 +31,7 @@ dataset to a document to be used for evaluation.
 # Define as many as you need for your different tasks
 def prompt_fn(line, task_name: str = None):
     """Defines how to go from a dataset line to a doc object.
-    Follow examples in src/lighteval/tasks/tasks_prompt_formatting.py, or get more info
+    Follow examples in src/lighteval/tasks/default_prompts.py, or get more info
     about what this function should do in the README.
     """
     return Doc(

From 33c1e7f006841ee4b71558edd74e6adc741bddbd Mon Sep 17 00:00:00 2001
From: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Date: Wed, 18 Sep 2024 11:20:45 +0200
Subject: [PATCH 21/24] Update docs/source/adding_new_task.md

Co-authored-by: Guilherme Penedo <nostrumg@gmail.com>
---
 docs/source/adding_new_task.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/source/adding_new_task.md b/docs/source/adding_new_task.md
index 7ee23ca3f..4151df902 100644
--- a/docs/source/adding_new_task.md
+++ b/docs/source/adding_new_task.md
@@ -36,9 +36,9 @@ def prompt_fn(line, task_name: str = None):
     """
     return Doc(
         task_name=task_name,
-        query="",
-        choices="",
-        gold_index=0,
+        query=line["question"],
+        choices=[f" {c}" for c in line["choices"]],
+        gold_index=line["gold"],
         instruction="",
     )
 ```
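+
+With the dataset fields used above (`question`, `choices` and `gold`), a dataset
+line such as the following (the values are purely illustrative) would be mapped
+to a `Doc` by this `prompt_fn`:
+
+```python
+# Illustrative sample only; the actual content depends on your dataset.
+line = {
+    "question": "What is the capital of France?",
+    "choices": ["Madrid", "Paris", "Rome"],
+    "gold": 1,  # index of the correct answer in `choices`
+}
+```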

From 016cea4144c960f3e34fea26a5bad4ba315553ac Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Wed, 18 Sep 2024 10:24:17 +0100
Subject: [PATCH 22/24] fix

---
 docs/source/index.md                 |  5 ++--
 src/lighteval/models/model_config.py | 39 ++++++++++++----------------
 2 files changed, 19 insertions(+), 25 deletions(-)

diff --git a/docs/source/index.md b/docs/source/index.md
index 55b374b36..13a663dc5 100644
--- a/docs/source/index.md
+++ b/docs/source/index.md
@@ -8,5 +8,6 @@ datatrove and LLM training library nanotron.
 
 We're releasing it with the community in the spirit of building in the open.
 
-Note that it is still very much early so don't expect 100% stability ^^' In
-case of problems or questions, feel free to open an issue!
+Even though it has been used in a variety of projects, keep in mind that parts
+of lighteval are still unstable and might break! In case of any problem or
+question, feel free to open an issue.
diff --git a/src/lighteval/models/model_config.py b/src/lighteval/models/model_config.py
index ec0a9610d..0b64db8a4 100644
--- a/src/lighteval/models/model_config.py
+++ b/src/lighteval/models/model_config.py
@@ -50,49 +50,42 @@ class BaseModelConfig:
     """
     Base configuration class for models.
 
-    Attributes:
-        pretrained (str):
+    **Attributes**:
+        - **pretrained** (str):
             HuggingFace Hub model ID name or the path to a pre-trained
             model to load. This is effectively the `pretrained_model_name_or_path`
             argument of `from_pretrained` in the HuggingFace `transformers` API.
-        accelerator (Accelerator): accelerator to use for model training.
-        tokenizer (Optional[str]): HuggingFace Hub tokenizer ID that will be
+        - **accelerator** (Accelerator): accelerator to use for model training.
+        - **tokenizer** (Optional[str]): HuggingFace Hub tokenizer ID that will be
             used for tokenization.
-        multichoice_continuations_start_space (Optional[bool]): Whether to add a
+        - **multichoice_continuations_start_space** (Optional[bool]): Whether to add a
             space at the start of each continuation in multichoice generation.
             For example, context: "What is the capital of France?" and choices: "Paris", "London".
             Will be tokenized as: "What is the capital of France? Paris" and "What is the capital of France? London".
             True adds a space, False strips a space, None does nothing
-        subfolder (Optional[str]): The subfolder within the model repository.
-        revision (str): The revision of the model.
-        batch_size (int): The batch size for model training.
-        max_gen_toks (Optional[int]): The maximum number of tokens to generate.
-        max_length (Optional[int]): The maximum length of the generated output.
-        add_special_tokens (bool, optional, defaults to True): Whether to add special tokens to the input sequences.
+        - **subfolder** (Optional[str]): The subfolder within the model repository.
+        - **revision** (str): The revision of the model.
+        - **batch_size** (int): The batch size for model training.
+        - **max_gen_toks** (Optional[int]): The maximum number of tokens to generate.
+        - **max_length** (Optional[int]): The maximum length of the generated output.
+        - **add_special_tokens** (bool, optional, defaults to True): Whether to add special tokens to the input sequences.
            If `None`, the default value will be set to `True` for seq2seq models (e.g. T5) and
             `False` for causal models.
-        model_parallel (bool, optional, defaults to False):
+        - **model_parallel** (bool, optional, defaults to False):
             True/False: force to use or not the `accelerate` library to load a large
             model across multiple devices.
             Default: None which corresponds to comparing the number of processes with
                 the number of GPUs. If it's smaller => model-parallelism, else not.
-        dtype (Union[str, torch.dtype], optional, defaults to None):):
+        - **dtype** (Union[str, torch.dtype], optional, defaults to None):
             Converts the model weights to `dtype`, if specified. Strings get
             converted to `torch.dtype` objects (e.g. `float16` -> `torch.float16`).
             Use `dtype="auto"` to derive the type from the model's weights.
-        device (Union[int, str]): device to use for model training.
-        quantization_config (Optional[BitsAndBytesConfig]): quantization
+        - **device** (Union[int, str]): device to use for model training.
+        - **quantization_config** (Optional[BitsAndBytesConfig]): quantization
             configuration for the model, manually provided to load a normally floating point
             model at a quantized precision. Needed for 4-bit and 8-bit precision.
-        trust_remote_code (bool): Whether to trust remote code during model
+        - **trust_remote_code** (bool): Whether to trust remote code during model
             loading.
-
-    Methods:
-        __post_init__(): Performs post-initialization checks on the configuration.
-        _init_configs(model_name, env_config): Initializes the model configuration.
-        init_configs(env_config): Initializes the model configuration using the environment configuration.
-        get_model_sha(): Retrieves the SHA of the model.
-
     """
 
     pretrained: str

From 3aba2a196fae4f76bbe8fd1e52546ddf7bbd9f1a Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Wed, 18 Sep 2024 10:25:02 +0100
Subject: [PATCH 23/24] fix

---
 docs/source/installation.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/installation.md b/docs/source/installation.md
index ac7bdd1f3..8ed510b99 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -31,7 +31,7 @@ appropriate extras group.
 | adapters     | To evaluate adapters models (delta and peft)                              |
 | tensorboardX | To upload your results to tensorboard                                     |
 | vllm         | To use vllm as backend for inference                                      |
-| s3         | To upload results to s3                                      |
+| s3           | To upload results to s3                                                   |
 ## Hugging Face login
 
 If you want to push your results to the Hugging Face Hub or evaluate your own

From af1ad1302cee6deb08e468973c9bd99f83db8160 Mon Sep 17 00:00:00 2001
From: Nathan Habib <nathan.habib19@gmail.com>
Date: Wed, 18 Sep 2024 10:37:40 +0100
Subject: [PATCH 24/24] commit

---
 docs/source/installation.md |   2 +
 docs/source/metric_list.md  | 142 ++++++++++++++++++------------------
 docs/source/quicktour.md    |  63 +++++++++++++++-
 3 files changed, 133 insertions(+), 74 deletions(-)

diff --git a/docs/source/installation.md b/docs/source/installation.md
index 8ed510b99..183c06e58 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -32,6 +32,8 @@ appropriate extras group.
 | tensorboardX | To upload your results to tensorboard                                     |
 | vllm         | To use vllm as backend for inference                                      |
 | s3           | To upload results to s3                                                   |
+
+
 ## Hugging Face login
 
 If you want to push your results to the Hugging Face Hub or evaluate your own
diff --git a/docs/source/metric_list.md b/docs/source/metric_list.md
index e961f1a2c..1c5ae5984 100644
--- a/docs/source/metric_list.md
+++ b/docs/source/metric_list.md
@@ -1,78 +1,74 @@
 # Metrics
 
-- MetricCategory.TARGET_PERPLEXITY
-	- acc_golds_likelihood
-	- target_perplexity
+## Metrics for multiple choice tasks
+These metrics use the log-likelihood of the different possible targets; a short sketch contrasting the two accuracy variants is given at the end of this section.
+- `loglikelihood_acc` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_single_token`)
+- `loglikelihood_acc_norm` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_acc_norm_single_token`)
+- `loglikelihood_acc_norm_nospace` (Harness): Fraction of instances where the choice with the best logprob, normalized by sequence length, was correct, with the first space ignored
+- `loglikelihood_f1` (Harness): Corpus level F1 score of the multichoice selection - also exists in a faster version for tasks where the possible choices include only one token (`loglikelihood_f1_single_token`)
+- `mcc` (Harness): Matthews correlation coefficient (a measure of agreement between statistical distributions).
+- `recall_at_1` (Harness): Fraction of instances where the choice with the best logprob was correct - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_1_single_token`)
+- `recall_at_2` (Harness): Fraction of instances where the choice with the 2nd best logprob or better was correct  - also exists in a faster version for tasks where the possible choices include only one token per choice (`recall_at_2_single_token`)
+- `mrr` (Harness): Mean reciprocal rank, a measure of the quality of a ranking of choices ordered by correctness/relevance  - also exists in a faster version for tasks where the possible choices include only one token (`mrr_single_token`)
+- `target_perplexity` (Harness): Perplexity of the different choices available.
+- `acc_golds_likelihood` (Harness): A bit different: it checks whether the average logprob of a single target is above or below 0.5
+- `multi_f1_numeric`: Loglikelihood F1 score for multiple gold targets
 
-- MetricCategory.MULTICHOICE_ONE_TOKEN
-	- loglikelihood_acc_norm_single_token
-	- loglikelihood_acc_single_token
-	- loglikelihood_f1_single_token
-	- mcc_single_token
-	- mrr_single_token
-	- multi_f1_numeric
-	- recall_at_1_single_token
-	- recall_at_2_single_token
+All these metrics also exist in a "single token" version (`loglikelihood_acc_single_token`, `loglikelihood_acc_norm_single_token`, `loglikelihood_f1_single_token`, `mcc_single_token`, `recall_at_1_single_token`, `recall_at_2_single_token` and `mrr_single_token`). When the multichoice options each compare only one token (e.g. "A" vs "B" vs "C" vs "D", or "yes" vs "no"), using the single token version of these metrics divides the time spent by the number of choices. Single token evals also include:
+- `multi_f1_numeric` (Harness, for CB): computes the f1 score of all possible choices and averages it.
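+
+As an illustration (a sketch with placeholder values, not the lighteval
+implementation), the difference between `loglikelihood_acc` and
+`loglikelihood_acc_norm` can be summarized as follows:
+
+```python
+logprobs = [-12.3, -8.1, -9.4]  # summed logprob of each choice (placeholder values)
+lengths = [4, 9, 5]             # length of each choice, e.g. in tokens (placeholder values)
+gold_index = 1
+
+# loglikelihood_acc: the choice with the best raw logprob must be the gold one
+loglikelihood_acc = int(max(range(len(logprobs)), key=lambda i: logprobs[i]) == gold_index)
+
+# loglikelihood_acc_norm: same, but logprobs are first normalized by sequence length
+normalized = [lp / length for lp, length in zip(logprobs, lengths)]
+loglikelihood_acc_norm = int(max(range(len(normalized)), key=lambda i: normalized[i]) == gold_index)
+```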
 
-- MetricCategory.IGNORED
-	- prediction_perplexity
+## Metrics for perplexity and language modeling
+These metrics use the log-likelihood of the prompt; a short sketch relating them to the summed log-probability is given after the list.
+- `word_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of words of the sequence.
+- `byte_perplexity` (Harness): Perplexity (log probability of the input) weighted by the number of bytes of the sequence.
+- `bits_per_byte` (HELM): Average number of bits per byte according to model probabilities.
+- `log_prob` (HELM): Predicted output's average log probability (input's log prob for language modeling).
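+
+As a sketch (placeholder values, not the lighteval implementation), these
+quantities relate to the summed natural-log probability of a sequence as
+follows:
+
+```python
+import math
+
+logprob_sum = -42.7            # total natural-log probability of the sequence (placeholder)
+num_words, num_bytes = 18, 95  # size of the sequence (placeholder)
+
+word_perplexity = math.exp(-logprob_sum / num_words)
+byte_perplexity = math.exp(-logprob_sum / num_bytes)
+bits_per_byte = -logprob_sum / (num_bytes * math.log(2))
+```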
 
-- MetricCategory.PERPLEXITY
-	- bits_per_byte
-	- byte_perplexity
-	- word_perplexity
-
-- MetricCategory.GENERATIVE
-	- bert_score
-	- bleu
-	- bleu_1
-	- bleu_4
-	- bleurt
-	- chrf
-	- copyright
-	- drop
-	- exact_match
-	- extractiveness
-	- f1_score_quasi
-	- f1_score
-	- f1_score_macro
-	- f1_score_micro
-	- faithfulness
-	- perfect_exact_match
-	- prefix_exact_match
-	- prefix_quasi_exact_match
-	- quasi_exact_match
-	- quasi_exact_match_math
-	- quasi_exact_match_triviaqa
-	- quasi_exact_match_gsm8k
-	- rouge_t5
-	- rouge1
-	- rouge2
-	- rougeL
-	- rougeLsum
-	- ter
-
-- MetricCategory.GENERATIVE_SAMPLING
-	- maj_at_4_math
-	- maj_at_5
-	- maj_at_8
-	- maj_at_8_gsm8k
-
-- MetricCategory.LLM_AS_JUDGE_MULTI_TURN
-	- llm_judge_multi_turn_gpt3p5
-	- llm_judge_multi_turn_llama_3_405b
-
-- MetricCategory.LLM_AS_JUDGE
-	- llm_judge_gpt3p5
-	- llm_judge_llama_3_405b
-
-- MetricCategory.MULTICHOICE
-	- loglikelihood_acc
-	- loglikelihood_acc_norm
-	- loglikelihood_acc_norm_nospace
-	- loglikelihood_f1
-	- mcc
-	- mrr
-	- recall_at_1
-	- recall_at_2
-	- truthfulqa_mc_metrics
+## Metrics for generative tasks
+These metrics need the model to generate an output. They are therefore slower.
+- Base:
+    - `perfect_exact_match` (Harness): Fraction of instances where the prediction matches the gold exactly.
+    - `exact_match` (HELM): Fraction of instances where the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
+    - `quasi_exact_match` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...). Other variations exist, with other normalizers, such as `quasi_exact_match_triviaqa`, which only normalizes the predictions after applying a strip to all sentences. A sketch of this kind of normalization is given at the end of this section.
+    - `prefix_exact_match` (HELM): Fraction of instances where the beginning of the prediction matches the gold with the exception of the border whitespaces (= after a `strip` has been applied to both).
+    - `prefix_quasi_exact_match` (HELM): Fraction of instances where the normalized beginning of the prediction matches the normalized gold (normalization done on whitespace, articles, capitalization, ...)
+    - `exact_match_indicator`: Exact match with some preceding context (before an indicator) removed
+    - `f1_score_quasi` (HELM): Average F1 score in terms of word overlap between the model output and gold, with both being normalized first
+    - `f1_score`:  Average F1 score in terms of word overlap between the model output and gold without normalisation
+    - `f1_score_macro`: Corpus level macro F1 score
+    - `f1_score_micro`: Corpus level micro F1 score
+    - `maj_at_5` and `maj_at_8`: Model majority vote. Takes n (5 or 8) generations from the model and assumes the most frequent is the actual prediction.
+- Summarization:
+    - `rouge` (Harness): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/)
+    - `rouge1` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 1-gram overlap.
+    - `rouge2` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on 2-gram overlap.
+    - `rougeL` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
+    - `rougeLsum` (HELM): Average ROUGE score [(Lin, 2004)](https://aclanthology.org/W04-1013/) based on longest common subsequence overlap.
+    - `rouge_t5` (BigBench): Corpus level ROUGE score for all available ROUGE metrics
+    - `faithfulness` (HELM): Faithfulness scores based on the SummaC method of [Laban et al. (2022)](https://aclanthology.org/2022.tacl-1.10/).
+    - `extractiveness` (HELM): Reports the following, based on [(Grusky et al., 2018)](https://aclanthology.org/N18-1065/):
+        - `summarization_coverage`: Extent to which the model-generated summaries are extractive fragments from the source document,
+        - `summarization_density`: Extent to which the model-generated summaries are extractive summaries based on the source document,
+        - `summarization_compression`: Extent to which the model-generated summaries are compressed relative to the source document.
+    - `bert_score` (HELM): Reports the average BERTScore precision, recall, and f1 score [(Zhang et al., 2020)](https://openreview.net/pdf?id=SkeHuCVFDr) between model generation and gold summary.
+- Translation:
+    - `bleu`: Corpus level BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) - uses the sacrebleu implementation.
+    - `bleu_1` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 1-gram overlap - uses the nltk implementation.
+    - `bleu_4` (HELM): Average sample BLEU score [(Papineni et al., 2002)](https://aclanthology.org/P02-1040/) based on 4-gram overlap - uses the nltk implementation.
+    - `chrf` (Harness): Character n-gram matches f-score.
+    - `ter` (Harness): Translation edit/error rate.
+- Copyright:
+    - `copyright` (HELM): Reports:
+        - `longest_common_prefix_length`: average length of longest common prefix between model generation and reference,
+        - `edit_distance`: average Levenshtein edit distance between model generation and reference,
+        - `edit_similarity`: average Levenshtein edit similarity (normalized by length of longer sequence) between model generation and reference.
+- Math:
+    - `quasi_exact_match_math` (HELM): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for math, where latex symbols, units, etc are removed)
+    - `maj_at_4_math` (Lighteval): Majority choice evaluation, using the math normalisation for the predictions and gold
+    - `quasi_exact_match_gsm8k` (Harness): Fraction of instances where the normalized prediction matches the normalized gold (normalization done for gsm8k, where latex symbols, units, etc are removed)
+    - `maj_at_8_gsm8k` (Lighteval): Majority choice evaluation, using the gsm8k normalisation for the predictions and gold
+- LLM-as-Judge:
+    - `llm_judge_gpt3p5`: Can be used for any generative task; the model will be scored by a GPT-3.5 model using the openai API
+    - `llm_judge_llama_3_405b`: Can be used for any generative task; the model will be scored by a Llama 3 405B model using the openai API
+    - `llm_judge_multi_turn_gpt3p5`: Can be used for any generative task; the model will be scored by a GPT-3.5 model using the openai API. It is used for multi-turn tasks like mt-bench.
+    - `llm_judge_multi_turn_llama_3_405b`: Can be used for any generative task; the model will be scored by a Llama 3 405B model using the openai API. It is used for multi-turn tasks like mt-bench.
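+
+As a sketch of the kind of normalization used by the `quasi_*` metrics above
+(lowercasing, removing punctuation and articles, collapsing whitespace), not
+the exact lighteval normalizer:
+
+```python
+import re
+import string
+
+
+def normalize(text: str) -> str:
+    """Lowercase, drop punctuation and articles, and collapse whitespace."""
+    text = text.lower()
+    text = "".join(ch for ch in text if ch not in string.punctuation)
+    text = re.sub(r"\b(a|an|the)\b", " ", text)
+    return " ".join(text.split())
+
+
+def quasi_exact_match(prediction: str, gold: str) -> bool:
+    return normalize(prediction) == normalize(gold)
+```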
diff --git a/docs/source/quicktour.md b/docs/source/quicktour.md
index 54c988263..5d9d49c30 100644
--- a/docs/source/quicktour.md
+++ b/docs/source/quicktour.md
@@ -60,6 +60,67 @@ accelerate launch --multi_gpu --num_processes=8 -m \
 Here, `--override_batch_size` defines the batch size per device, so the effective
 batch size will be `override_batch_size * num_gpus`.
 
+### Model Arguments
+
+The `--model_args` argument takes a string representing a list of model
+arguments. The allowed arguments vary depending on the backend you use (vllm
+or accelerate).
+
+#### Accelerate
+
+- **pretrained** (str):
+    HuggingFace Hub model ID name or the path to a pre-trained
+    model to load. This is effectively the `pretrained_model_name_or_path`
+    argument of `from_pretrained` in the HuggingFace `transformers` API.
+- **tokenizer** (Optional[str]): HuggingFace Hub tokenizer ID that will be
+    used for tokenization.
+- **multichoice_continuations_start_space** (Optional[bool]): Whether to add a
+    space at the start of each continuation in multichoice generation.
+    For example, context: "What is the capital of France?" and choices: "Paris", "London".
+    Will be tokenized as: "What is the capital of France? Paris" and "What is the capital of France? London".
+    True adds a space, False strips a space, None does nothing
+- **subfolder** (Optional[str]): The subfolder within the model repository.
+- **revision** (str): The revision of the model.
+- **max_gen_toks** (Optional[int]): The maximum number of tokens to generate.
+- **max_length** (Optional[int]): The maximum length of the generated output.
+- **add_special_tokens** (bool, optional, defaults to True): Whether to add special tokens to the input sequences.
+   If `None`, the default value will be set to `True` for seq2seq models (e.g. T5) and
+    `False` for causal models.
+- **model_parallel** (bool, optional, defaults to False):
+    True/False: force to use or not the `accelerate` library to load a large
+    model across multiple devices.
+    Default: None which corresponds to comparing the number of processes with
+        the number of GPUs. If it's smaller => model-parallelism, else not.
+- **dtype** (Union[str, torch.dtype], optional, defaults to None):
+    Converts the model weights to `dtype`, if specified. Strings get
+    converted to `torch.dtype` objects (e.g. `float16` -> `torch.float16`).
+    Use `dtype="auto"` to derive the type from the model's weights.
+- **device** (Union[int, str]): device to use for model training.
+- **quantization_config** (Optional[BitsAndBytesConfig]): quantization
+    configuration for the model, manually provided to load a normally floating point
+    model at a quantized precision. Needed for 4-bit and 8-bit precision.
+- **trust_remote_code** (bool): Whether to trust remote code during model
+    loading.
+
+#### VLLM
+
+- **pretrained** (str): HuggingFace Hub model ID name or the path to a pre-trained model to load.
+- **gpu_memory_utilisation** (float): The fraction of GPU memory to use.
+- **batch_size** (int): The batch size for model training.
+- **revision** (str): The revision of the model.
+- **dtype** (str, None): The data type to use for the model.
+- **tensor_parallel_size** (int): The number of tensor parallel units to use.
+- **data_parallel_size** (int): The number of data parallel units to use.
+- **max_model_length** (int): The maximum length of the model.
+- **swap_space** (int): The CPU swap space size (GiB) per GPU.
+- **seed** (int): The seed to use for the model.
+- **trust_remote_code** (bool): Whether to trust remote code during model loading.
+- **use_chat_template** (bool): Whether to use the chat template or not.
+- **add_special_tokens** (bool): Whether to add special tokens to the input sequences.
+- **multichoice_continuations_start_space** (bool): Whether to add a space at the start of each continuation in multichoice generation.
+- **subfolder** (Optional[str]): The subfolder within the model repository.
+
+
 #### Pipeline parallelism
 
 To evaluate a model using pipeline parallelism on 2 or more GPUs, run:
@@ -96,6 +157,6 @@ Nanotron models cannot be evaluated without torchrun.
  ```
 
 The `nproc-per-node` argument should match the data, tensor and pipeline
-parallelism confidured in the `lighteval_config_override_template.yaml` file.
+parallelism configured in the `lighteval_config_template.yaml` file.
 That is: `nproc-per-node = data_parallelism * tensor_parallelism *
 pipeline_parallelism`.