Commit

fix: evaluation docs fixes (#229)
agola11 committed May 7, 2024
2 parents 7aaa1fe + 249229f commit db880ec
Showing 7 changed files with 85 additions and 63 deletions.
@@ -0,0 +1,48 @@
---
sidebar_position: 2
---

# Bind an evaluator to a dataset in the UI

While you can specify evaluators to grade the results of your experiments programmatically (see [this guide](./evaluate_llm_application) for more information), you can also bind evaluators to a dataset in the UI.
This allows you to configure automatic evaluators that grade your experiment results without having to write any code. Currently, only LLM-based evaluators are supported.

The process for configuring this is very similar to the process for configuring an [online evaluator](../monitoring/online_evaluations) for traces.

:::note Only affects subsequent experiment runs

When you configure an evaluator for a dataset, it only affects experiment runs created after the evaluator is configured; runs created before that point are not evaluated retroactively.

:::

1. **Navigate to the dataset details page** by clicking **Datasets and Testing** in the sidebar and selecting the dataset you want to configure the evaluator for.
2. **Click on the `Add Evaluator` button** to add an evaluator to the dataset. This will open a modal you can use to configure the evaluator.

![Add Evaluator](../static/add_evaluator.png)

3. **Give your evaluator a name** and **set an inline prompt or load a prompt from the prompt hub** that will be used to evaluate the results of the runs in the experiment.

![Add evaluator name and prompt](../static/create_evaluator.png)

Importantly, evaluator prompts can only contain the following input variables:

- `input` (required): the input to the target you are evaluating
- `output` (required): the output of the target you are evaluating
- `reference`: the reference output, taken from the dataset
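
For example, a correctness-grading prompt might reference these variables along the following lines (an illustrative sketch; the exact templating syntax depends on how you author the prompt):

```
You are grading the correctness of an answer.
Question: {input}
Submitted answer: {output}
Reference answer: {reference}
Respond with 1 if the submitted answer matches the reference answer, otherwise respond with 0.
```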

:::note

Automatic evaluators you configure in the application will only work if the `inputs` to your evaluation target, `outputs` from your evaluation target, and `examples` in your dataset are all single-key dictionaries.
LangSmith will automatically extract the values from the dictionaries and pass them to the evaluator.

LangSmith currently doesn't support setting up evaluators in the application that act on multiple keys in the `inputs`, `outputs`, or `examples` dictionaries.

:::
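
For example, a dataset whose examples use single-key `inputs` and `outputs` dictionaries could be created with the SDK as follows (the dataset name and key names here are illustrative):

```python
from langsmith import Client

client = Client()
dataset = client.create_dataset("toxic-queries-ui-evaluator-demo")  # illustrative name

# Single-key dictionaries: one key in `inputs`, one key in `outputs`.
client.create_examples(
    inputs=[{"text": "Shut up, idiot"}, {"text": "You're a wonderful person"}],
    outputs=[{"label": "Toxic"}, {"label": "Not toxic"}],
    dataset_id=dataset.id,
)
```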

You can specify the scoring criteria in the "schema" field. In this example, we are asking the LLM to grade the "correctness" of the output with respect to the reference, with a boolean output of 0 or 1. The name of the field in the schema is interpreted as the feedback key, and its type determines the type of the score.

![Evaluator prompt](../static/evaluator_prompt.png)

4. **Save the evaluator** and navigate back to the dataset details page. Each **subsequent** experiment run from the dataset will now be evaluated by the evaluator you configured. Note that in the image below, each run in the experiment has a "correctness" score.

![Playground evaluator results](../static/playground_evaluator_results.png)
@@ -6,6 +6,7 @@ import {
CodeTabs,
python,
typescript,
PythonBlock,
} from "@site/src/components/InstructionsWithCode";

# Evaluate an LLM Application
@@ -37,6 +38,7 @@ The following example involves evaluating a very simple LLM pipeline as classifi

In this case, we are defining a simple evaluation target consisting of an LLM pipeline that classifies text as toxic or non-toxic.
We've optionally enabled tracing to capture the inputs and outputs of each step in the pipeline.

To understand how to annotate your code for tracing, please refer to [this guide](../tracing/annotate_code).
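
As a rough sketch of what such a target can look like (the model and prompt here are illustrative assumptions; the full example follows below):

```python
from langsmith import traceable
from openai import OpenAI

openai_client = OpenAI()

@traceable  # each call becomes a child run in the trace
def label_text(text: str) -> str:
    # Model and prompt are illustrative.
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Classify the user's text as 'Toxic' or 'Not toxic'."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

@traceable  # root run for the pipeline
def toxicity_classifier(inputs: dict) -> dict:
    # `evaluate` passes the example's `inputs` dict and expects a dict of outputs back.
    return {"output": label_text(inputs["text"])}
```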

<CodeTabs
@@ -168,14 +170,10 @@ Writing evaluators is discussed in more detail in the [following section](#custo
<CodeTabs
groupId="client-language"
tabs={[
python`
from langsmith.schemas import Example, Run
# Row-level evaluator
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score)}
`,
PythonBlock(`from langsmith.schemas import Example, Run\n
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score), "key": "correct_label"}`),
typescript`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
@@ -213,7 +211,7 @@ At its simplest, the `evaluate` method takes the following arguments:
data=dataset_name,
evaluators=[correct_label],
experiment_prefix="Toxic Queries",
descriptionn="Testing the baseline system.",
description="Testing the baseline system.", # optional
)
`,
typescript`
@@ -233,15 +231,22 @@ At its simplest, the `evaluate` method takes the following arguments:
Each invocation of `evaluate` produces an experiment which is bound to the dataset, and can be viewed in the LangSmith UI.
Evaluation scores are stored against each individual output produced by the target task as feedback, with the name and score configured in the evaluator.

![](../static/view_experiment.gif)
_If you've annotated your code for tracing, you can open the trace of each row in a side panel view._

With tracing enabled, you can open the trace of each row in a side panel view.
![](../static/view_experiment.gif)

## Use custom evaluators

Evaluators are functions that take in a `Run` and an `Example` and return a dictionary or object with the keys `score` (numeric) and `key` (string).
At a high level, evaluators are functions that take in a `Run` and an `Example` and return a dictionary or object with the keys `score` (numeric) and `key` (string).
The `key` will be associated with the score in the LangSmith UI.

:::tip advanced use-cases

- Configure more feedback fields: you can configure other fields in the dictionary as well (see the sketch after this tip). Please see the [feedback reference](../../reference/data_formats/feedback_data_format) for more information.
- Evaluate on intermediate steps: to view a more advanced example that traverses the `root_run` / `rootRun` object, please refer to [this guide](./evaluate_on_intermediate_steps) on evaluating on intermediate steps.

:::
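
For instance, an evaluator can return additional feedback fields alongside `score` and `key`, such as a `comment` explaining the grade (a sketch; see the feedback reference for the full list of supported fields):

```python
from langsmith.schemas import Example, Run

def correct_label_with_comment(root_run: Run, example: Example) -> dict:
    predicted = root_run.outputs.get("output")
    expected = example.outputs.get("label")
    return {
        "key": "correct_label",
        "score": int(predicted == expected),
        # A free-form explanation stored alongside the score.
        "comment": f"Predicted {predicted!r}, expected {expected!r}",
    }
```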

To learn more about the `Run` format, you can read the following [reference](../../reference/data_formats/run_data_format). However, many of the fields are neither relevant nor required for writing evaluators.
The `root_run` / `rootRun` is always available and contains the inputs and outputs of the target task. If tracing is enabled, the `root_run` / `rootRun` will also contain child runs for each step in the pipeline.
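
For example, an evaluator can look up a specific step of the pipeline by name, assuming tracing is enabled and a child run with that (illustrative) name exists:

```python
from langsmith.schemas import Example, Run

def label_step_ran(root_run: Run, example: Example) -> dict:
    # Find a specific pipeline step among the child runs (the run name is illustrative).
    label_step = next(run for run in root_run.child_runs if run.name == "label_text")
    return {"key": "label_step_ran", "score": int(label_step.outputs is not None)}
```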

Expand All @@ -250,14 +255,10 @@ Here is an example of a very simple custom evaluator that compares the output of
<CodeTabs
groupId="client-language"
tabs={[
python`
from langsmith.schemas import Example, Run
# Row-level evaluator
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score)}
`,
PythonBlock(`from langsmith.schemas import Example, Run\n
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score), "key": "correct_label"}`),
typescript`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
@@ -271,9 +272,9 @@ Here is an example of a very simple custom evaluator that compares the output of
]}
/>

:::tip Advanced Example
:::note default feedback key

To view a more advanced example that traverses the `root_run` / `rootRun` object, please refer to [this guide](./evaluate_on_intermediate_steps) on evaluating on intermediate steps.
If the "key" field is not provided, the default key name will be the name of the evaluator function.

:::
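
For instance, with the following evaluator the feedback would be recorded under the key `exact_match`:

```python
from langsmith.schemas import Example, Run

def exact_match(root_run: Run, example: Example) -> dict:
    # No "key" is returned, so the feedback key defaults to the function name, "exact_match".
    return {"score": int(root_run.outputs.get("output") == example.outputs.get("label"))}
```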

@@ -430,7 +431,7 @@ In the LangSmith UI, you'll see the summary evaluator's score displayed with the cor
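
For reference, a summary evaluator operates over the full lists of runs and examples and returns a single experiment-level score; the metric below is an illustrative sketch:

```python
from typing import List

from langsmith.schemas import Example, Run

def toxic_precision(runs: List[Run], examples: List[Example]) -> dict:
    # Experiment-level metric computed across all rows at once.
    predicted_toxic = [
        (run, example)
        for run, example in zip(runs, examples)
        if run.outputs.get("output") == "Toxic"
    ]
    if not predicted_toxic:
        return {"key": "toxic_precision", "score": None}
    correct = sum(1 for _, example in predicted_toxic if example.outputs.get("label") == "Toxic")
    return {"key": "toxic_precision", "score": correct / len(predicted_toxic)}
```

It would be supplied to `evaluate` via the `summary_evaluators` argument, e.g. `evaluate(..., summary_evaluators=[toxic_precision])`.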

## Evaluate a LangChain runnable

You can configure a `LangChain` runnable to be evaluated by passing `runnable.invoke` to the `evaluate` method.
You can configure a `LangChain` runnable to be evaluated by passing `runnable.invoke` to the `evaluate` method in Python, or just the `runnable` in TypeScript.

First, define your `LangChain` runnable:
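
A minimal sketch of such a runnable (the prompt and model below are illustrative assumptions):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Prompt and model are illustrative.
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Classify the user's text as 'Toxic' or 'Not toxic'."),
        ("user", "{text}"),
    ]
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()
```

The chain can then be evaluated with `evaluate(chain.invoke, data=dataset_name, evaluators=[correct_label])` in Python, or by passing `chain` directly in TypeScript.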

@@ -4,7 +4,7 @@ sidebar_position: 2

# Run an evaluation from the prompt playground

While you can kick off experiments easily using the SDK, as outlined [here](../../#5-create-your-first-evaluation), it's often useful to run experiments directly in the [prompt playground].
While you can kick off experiments easily using the SDK, as outlined [here](../../#5-create-your-first-evaluation), it's often useful to run experiments directly in the prompt playground.

This allows you to test your prompt / model configuration over a series of inputs to see how well it generalizes across different contexts or scenarios, without having to write any code.

@@ -21,41 +21,6 @@ This allows you to test your prompt / model configuration over a series of input

## Add evaluation scores to the experiment

Kicking off an experiment is no fun without actually running evaluations on the results. You can add evaluation scores to the experiment by configuring an automation rule for the dataset, again without writing any code. This will allow you to add evaluation scores to the experiment and compare the results across different experiments.
It's also possible to add human annotations to the runs of any experiment.
You can add evaluation scores to experiments by [binding an evaluator to the dataset](./bind_evaluator_to_dataset), again without writing any code.

We currently support configuring LLM-as-a-judge evaluators on datasets that will evaluate the results of each run in each experiment kicked off from that dataset.

The process for configuring this is very similar to the process for configuring an [online evaluator] for your tracing projects.

1. **Navigate to the dataset details page** by clicking "Datasets and Testing" in the sidebar and selecting the dataset you want to configure the evaluator for.
2. **Click on the "Add Evaluator" button** to add an evaluator to the dataset. This will open a modal you can use to configure the evaluator.

![Add Evaluator](../static/add_evaluator.png)

3. **Give your evaluator a name** and **set an inline prompt or load a prompt from the prompt hub** that will be used to evaluate the results of the runs in the experiment.

![Add evaluator name and prompt](../static/create_evaluator.png)

Importantly, evaluator prompts can only contain the following input variables:

- `input` (required): the input to the target you are evaluating
- `output` (required): the output of the target you are evaluating
- `reference`: the reference output, taken from the dataset

:::note

Automatic evaluators you configure in the application will only work if the `inputs` to your evaluation target, `outputs` from your evaluation target, and `examples` in your dataset are all single-key dictionaries.
LangSmith will automatically extract the values from the dictionaries and pass them to the evaluator.

LangSmith currently doesn't support setting up evaluators in the application that act on multiple keys in the `inputs` or `outputs` or `examples` dictionaries.

:::

You can specify the scoring criteria in the "schema" field. In this example, we are asking the LLM to grade on "correctness" of the output with respect to the reference, with a boolean output of 0 or 1. The name of the field in the schema will be interpreted as the feedback key and the type will be the type of the score.

![Evaluator prompt](../static/evaluator_prompt.png)

4. **Save the evaluator** and navigate back to the dataset details page. Each **subsequent** experiment run from the dataset will now be evaluated by the evaluator you configured. Note that in the below image, each run in the experiment has a "correctness" score.

![Playground evaluator results](../static/playground_evaluator_results.png)
You can also programmatically [evaluate an existing experiment](./evaluate_existing_experiment) using the SDK.
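
For example, a sketch of re-scoring a finished experiment with the SDK (the experiment name and evaluator are illustrative):

```python
from langsmith.evaluation import evaluate_existing
from langsmith.schemas import Example, Run

def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score), "key": "correct_label"}

# Attach new feedback to the runs of an experiment that has already completed.
evaluate_existing("Toxic Queries-1a2b3c", evaluators=[correct_label])
```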
1 change: 1 addition & 0 deletions versioned_docs/version-2.0/how_to_guides/index.md
@@ -89,6 +89,7 @@ Evaluate your LLM applications to measure their performance over time.
- [Evaluate on a subset of a dataset](./how_to_guides/evaluation/evaluate_llm_application#evaluate-on-a-subset-of-a-dataset)
- [Use a summary evaluator](./how_to_guides/evaluation/evaluate_llm_application#use-a-summary-evaluator)
- [Evaluate a LangChain runnable](./how_to_guides/evaluation/evaluate_llm_application#evaluate-a-langchain-runnable)
- [Bind an evaluator to a dataset in the UI](./how_to_guides/evaluation/bind_evaluator_to_dataset)
- [Run an evaluation from the prompt playground](./how_to_guides/evaluation/run_evaluation_from_prompt_playground)
- [Evaluate on intermediate steps](./how_to_guides/evaluation/evaluate_on_intermediate_steps)
- [Use LangChain off-the-shelf evaluators (Python only)](./how_to_guides/evaluation/use_langchain_off_the_shelf_evaluators)
4 changes: 4 additions & 0 deletions versioned_docs/version-2.0/tutorials/evaluation.mdx
@@ -1,3 +1,7 @@
---
sidebar_position: 2
---

# Evaluate your LLM application

It can be hard to measure the performance of your application with respect to criteria important to you or your users.
4 changes: 4 additions & 0 deletions versioned_docs/version-2.0/tutorials/observability.mdx
@@ -1,3 +1,7 @@
---
sidebar_position: 1
---

import {
CodeTabs,
python,
3 changes: 1 addition & 2 deletions versioned_docs/version-2.0/tutorials/optimize_classifier.mdx
@@ -1,7 +1,6 @@
---
sidebar_label: Optimize a classifier
sidebar_position: 2
table_of_contents: true
sidebar_position: 3
---

# Optimize a classifier