Commit

fix: evaluation docs fixes (#229)
agola11 committed May 7, 2024
2 parents 7aaa1fe + 249229f commit db880ec
Showing 7 changed files with 85 additions and 63 deletions.
@@ -0,0 +1,48 @@
---
sidebar_position: 2
---

# Bind an evaluator to a dataset in the UI

While you can specify evaluators to grade the results of your experiments programmatically (see [this guide](./evaluate_llm_application) for more information), you can also bind evaluators to a dataset in the UI.
This allows you to configure automatic evaluators that grade your experiment results without having to write any code. Currently, only LLM-based evaluators are supported.

The process for configuring this is very similar to the process for configuring an [online evaluator](../monitoring/online_evaluations) for traces.

:::note Only affects subsequent experiment runs

When you configure an evaluator for a dataset, it only affects experiment runs created after the evaluator is configured; runs created before that point are not evaluated retroactively.

:::

1. **Navigate to the dataset details page** by clicking **Datasets and Testing** in the sidebar and selecting the dataset you want to configure the evaluator for.
2. **Click on the `Add Evaluator` button** to add an evaluator to the dataset. This will open a modal you can use to configure the evaluator.

![Add Evaluator](../static/add_evaluator.png)

3. **Give your evaluator a name** and **set an inline prompt or load a prompt from the prompt hub** that will be used to evaluate the results of the runs in the experiment.

![Add evaluator name and prompt](../static/create_evaluator.png)

Importantly, evaluator prompts can only contain the following input variables:

- `input` (required): the input to the target you are evaluating
- `output` (required): the output of the target you are evaluating
- `reference`: the reference output, taken from the dataset
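
For example, a correctness-grading prompt might reference these variables along the following lines (an illustrative sketch; the exact templating syntax depends on how you author the prompt):

```
You are grading the correctness of an answer.
Question: {input}
Submitted answer: {output}
Reference answer: {reference}
Respond with 1 if the submitted answer matches the reference answer, otherwise respond with 0.
```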

:::note

Automatic evaluators you configure in the application will only work if the `inputs` to your evaluation target, `outputs` from your evaluation target, and `examples` in your dataset are all single-key dictionaries.
LangSmith will automatically extract the values from the dictionaries and pass them to the evaluator.

LangSmith currently doesn't support setting up evaluators in the application that act on multiple keys in the `inputs`, `outputs`, or `examples` dictionaries.

:::
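
For example, a dataset whose examples use single-key `inputs` and `outputs` dictionaries could be created with the SDK as follows (the dataset name and key names here are illustrative):

```python
from langsmith import Client

client = Client()
dataset = client.create_dataset("toxic-queries-ui-evaluator-demo")  # illustrative name

# Single-key dictionaries: one key in `inputs`, one key in `outputs`.
client.create_examples(
    inputs=[{"text": "Shut up, idiot"}, {"text": "You're a wonderful person"}],
    outputs=[{"label": "Toxic"}, {"label": "Not toxic"}],
    dataset_id=dataset.id,
)
```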

You can specify the scoring criteria in the "schema" field. In this example, we are asking the LLM to grade the "correctness" of the output with respect to the reference, with a boolean output of 0 or 1. The name of the field in the schema is interpreted as the feedback key, and its type determines the type of the score.

![Evaluator prompt](../static/evaluator_prompt.png)

4. **Save the evaluator** and navigate back to the dataset details page. Each **subsequent** experiment run from the dataset will now be evaluated by the evaluator you configured. Note that in the image below, each run in the experiment has a "correctness" score.

![Playground evaluator results](../static/playground_evaluator_results.png)
@@ -6,6 +6,7 @@ import {
CodeTabs,
python,
typescript,
PythonBlock,
} from "@site/src/components/InstructionsWithCode";

# Evaluate an LLM Application
@@ -37,6 +38,7 @@ The following example involves evaluating a very simple LLM pipeline as classifi

In this case, we are defining a simple evaluation target consisting of an LLM pipeline that classifies text as toxic or non-toxic.
We've optionally enabled tracing to capture the inputs and outputs of each step in the pipeline.

To understand how to annotate your code for tracing, please refer to [this guide](../tracing/annotate_code).
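
As a rough sketch of what such a target can look like (the model and prompt here are illustrative assumptions; the full example follows below):

```python
from langsmith import traceable
from openai import OpenAI

openai_client = OpenAI()

@traceable  # each call becomes a child run in the trace
def label_text(text: str) -> str:
    # Model and prompt are illustrative.
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[
            {"role": "system", "content": "Classify the user's text as 'Toxic' or 'Not toxic'."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

@traceable  # root run for the pipeline
def toxicity_classifier(inputs: dict) -> dict:
    # `evaluate` passes the example's `inputs` dict and expects a dict of outputs back.
    return {"output": label_text(inputs["text"])}
```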

<CodeTabs
@@ -168,14 +170,10 @@ Writing evaluators is discussed in more detail in the [following section](#custo
<CodeTabs
groupId="client-language"
tabs={[
python`
from langsmith.schemas import Example, Run
# Row-level evaluator
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score)}
`,
PythonBlock(`from langsmith.schemas import Example, Run\n
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score), "key": "correct_label"}`),
typescript`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
@@ -213,7 +211,7 @@ At its simplest, the `evaluate` method takes the following arguments:
data=dataset_name,
evaluators=[correct_label],
experiment_prefix="Toxic Queries",
descriptionn="Testing the baseline system.",
description="Testing the baseline system.", # optional
)
`,
typescript`
@@ -233,15 +231,22 @@ At its simplest, the `evaluate` method takes the following arguments:
Each invocation of `evaluate` produces an experiment which is bound to the dataset, and can be viewed in the LangSmith UI.
Evaluation scores are stored against each individual output produced by the target task as feedback, with the name and score configured in the evaluator.

![](../static/view_experiment.gif)
_If you've annotated your code for tracing, you can open the trace of each row in a side panel view._

With tracing enabled, you can open the trace of each row in a side panel view.
![](../static/view_experiment.gif)

## Use custom evaluators

Evaluators are functions that take in a `Run` and an `Example` and return a dictionary or object with the keys `score` (numeric) and `key` (string).
At a high level, evaluators are functions that take in a `Run` and an `Example` and return a dictionary or object with the keys `score` (numeric) and `key` (string).
The `key` will be associated with the score in the LangSmith UI.

:::tip advanced use-cases

- Configure more feedback fields: you can configure other fields in the dictionary as well (see the sketch after this tip). Please see the [feedback reference](../../reference/data_formats/feedback_data_format) for more information.
- Evaluate on intermediate steps: to view a more advanced example that traverses the `root_run` / `rootRun` object, please refer to [this guide](./evaluate_on_intermediate_steps) on evaluating on intermediate steps.

:::
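
For instance, an evaluator can return additional feedback fields alongside `score` and `key`, such as a `comment` explaining the grade (a sketch; see the feedback reference for the full list of supported fields):

```python
from langsmith.schemas import Example, Run

def correct_label_with_comment(root_run: Run, example: Example) -> dict:
    predicted = root_run.outputs.get("output")
    expected = example.outputs.get("label")
    return {
        "key": "correct_label",
        "score": int(predicted == expected),
        # A free-form explanation stored alongside the score.
        "comment": f"Predicted {predicted!r}, expected {expected!r}",
    }
```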

To learn more about the `Run` format, you can read the following [reference](../../reference/data_formats/run_data_format). However, many of the fields are neither relevant nor required for writing evaluators.
The `root_run` / `rootRun` is always available and contains the inputs and outputs of the target task. If tracing is enabled, the `root_run` / `rootRun` will also contain child runs for each step in the pipeline.
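
For example, an evaluator can look up a specific step of the pipeline by name, assuming tracing is enabled and a child run with that (illustrative) name exists:

```python
from langsmith.schemas import Example, Run

def label_step_ran(root_run: Run, example: Example) -> dict:
    # Find a specific pipeline step among the child runs (the run name is illustrative).
    label_step = next(run for run in root_run.child_runs if run.name == "label_text")
    return {"key": "label_step_ran", "score": int(label_step.outputs is not None)}
```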

Expand All @@ -250,14 +255,10 @@ Here is an example of a very simple custom evaluator that compares the output of
<CodeTabs
groupId="client-language"
tabs={[
python`
from langsmith.schemas import Example, Run
# Row-level evaluator
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score)}
`,
PythonBlock(`from langsmith.schemas import Example, Run\n
def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score), "key": "correct_label"}`),
typescript`
import type { EvaluationResult } from "langsmith/evaluation";
import type { Run, Example } from "langsmith/schemas";
@@ -271,9 +272,9 @@ Here is an example of a very simple custom evaluator that compares the output of
]}
/>

:::tip Advanced Example
:::note default feedback key

To view a more advanced example that traverses the `root_run` / `rootRun` object, please refer to [this guide](./evaluate_on_intermediate_steps) on evaluating on intermediate steps.
If the "key" field is not provided, the default key name will be the name of the evaluator function.

:::
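
For instance, with the following evaluator the feedback would be recorded under the key `exact_match`:

```python
from langsmith.schemas import Example, Run

def exact_match(root_run: Run, example: Example) -> dict:
    # No "key" is returned, so the feedback key defaults to the function name, "exact_match".
    return {"score": int(root_run.outputs.get("output") == example.outputs.get("label"))}
```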

@@ -430,7 +431,7 @@ In the LangSmith UI, you'll see the summary evaluator's score displayed with the cor
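
For reference, a summary evaluator operates over the full lists of runs and examples and returns a single experiment-level score; the metric below is an illustrative sketch:

```python
from typing import List

from langsmith.schemas import Example, Run

def toxic_precision(runs: List[Run], examples: List[Example]) -> dict:
    # Experiment-level metric computed across all rows at once.
    predicted_toxic = [
        (run, example)
        for run, example in zip(runs, examples)
        if run.outputs.get("output") == "Toxic"
    ]
    if not predicted_toxic:
        return {"key": "toxic_precision", "score": None}
    correct = sum(1 for _, example in predicted_toxic if example.outputs.get("label") == "Toxic")
    return {"key": "toxic_precision", "score": correct / len(predicted_toxic)}
```

It would be supplied to `evaluate` via the `summary_evaluators` argument, e.g. `evaluate(..., summary_evaluators=[toxic_precision])`.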

## Evaluate a LangChain runnable

You can configure a `LangChain` runnable to be evaluated by passing `runnable.invoke` to the `evaluate` method.
You can configure a `LangChain` runnable to be evaluated by passing `runnable.invoke` to the `evaluate` method in Python, or just the `runnable` in TypeScript.

First, define your `LangChain` runnable:
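
A minimal sketch of such a runnable (the prompt and model below are illustrative assumptions):

```python
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# Prompt and model are illustrative.
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "Classify the user's text as 'Toxic' or 'Not toxic'."),
        ("user", "{text}"),
    ]
)
chain = prompt | ChatOpenAI(model="gpt-4o-mini", temperature=0) | StrOutputParser()
```

The chain can then be evaluated with `evaluate(chain.invoke, data=dataset_name, evaluators=[correct_label])` in Python, or by passing `chain` directly in TypeScript.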

@@ -4,7 +4,7 @@ sidebar_position: 2

# Run an evaluation from the prompt playground

While you can kick off experiments easily using the SDK, as outlined [here](../../#5-create-your-first-evaluation), it's often useful to run experiments directly in the [prompt playground].
While you can kick off experiments easily using the SDK, as outlined [here](../../#5-create-your-first-evaluation), it's often useful to run experiments directly in the prompt playground.

This allows you to test your prompt / model configuration over a series of inputs to see how well it generalizes across different contexts or scenarios, without having to write any code.

@@ -21,41 +21,6 @@ This allows you to test your prompt / model configuration over a series of input

## Add evaluation scores to the experiment

Kicking off an experiment is no fun without actually running evaluations on the results. You can add evaluation scores to the experiment by configuring an automation rule for the dataset, again without writing any code. This will allow you to add evaluation scores to the experiment and compare the results across different experiments.
It's also possible to add human annotations to the runs of any experiment.
You can add evaluation scores to experiments by [binding an evaluator to the dataset](./bind_evaluator_to_dataset), again without writing any code.

We currently support configuring LLM-as-a-judge evaluators on datasets that will evaluate the results of each run in each experiment kicked off from that dataset.

The process for configuring this is very similar to the process for configuring an [online evaluator] for your tracing projects.

1. **Navigate to the dataset details page** by clicking "Datasets and Testing" in the sidebar and selecting the dataset you want to configure the evaluator for.
2. **Click on the "Add Evaluator" button** to add an evaluator to the dataset. This will open a modal you can use to configure the evaluator.

![Add Evaluator](../static/add_evaluator.png)

3. **Give your evaluator a name** and **set an inline prompt or load a prompt from the prompt hub** that will be used to evaluate the results of the runs in the experiment.

![Add evaluator name and prompt](../static/create_evaluator.png)

Importantly, evaluator prompts can only contain the following input variables:

- `input` (required): the input to the target you are evaluating
- `output` (required): the output of the target you are evaluating
- `reference`: the reference output, taken from the dataset

:::note

Automatic evaluators you configure in the application will only work if the `inputs` to your evaluation target, `outputs` from your evaluation target, and `examples` in your dataset are all single-key dictionaries.
LangSmith will automatically extract the values from the dictionaries and pass them to the evaluator.

LangSmith currently doesn't support setting up evaluators in the application that act on multiple keys in the `inputs` or `outputs` or `examples` dictionaries.

:::

You can specify the scoring criteria in the "schema" field. In this example, we are asking the LLM to grade on "correctness" of the output with respect to the reference, with a boolean output of 0 or 1. The name of the field in the schema will be interpreted as the feedback key and the type will be the type of the score.

![Evaluator prompt](../static/evaluator_prompt.png)

4. **Save the evaluator** and navigate back to the dataset details page. Each **subsequent** experiment run from the dataset will now be evaluated by the evaluator you configured. Note that in the below image, each run in the experiment has a "correctness" score.

![Playground evaluator results](../static/playground_evaluator_results.png)
You can also programmatically [evaluate an existing experiment](./evaluate_existing_experiment) using the SDK.
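
For example, a sketch of re-scoring a finished experiment with the SDK (the experiment name and evaluator are illustrative):

```python
from langsmith.evaluation import evaluate_existing
from langsmith.schemas import Example, Run

def correct_label(root_run: Run, example: Example) -> dict:
    score = root_run.outputs.get("output") == example.outputs.get("label")
    return {"score": int(score), "key": "correct_label"}

# Attach new feedback to the runs of an experiment that has already completed.
evaluate_existing("Toxic Queries-1a2b3c", evaluators=[correct_label])
```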
1 change: 1 addition & 0 deletions versioned_docs/version-2.0/how_to_guides/index.md
@@ -89,6 +89,7 @@ Evaluate your LLM applications to measure their performance over time.
- [Evaluate on a subset of a dataset](./how_to_guides/evaluation/evaluate_llm_application#evaluate-on-a-subset-of-a-dataset)
- [Use a summary evaluator](./how_to_guides/evaluation/evaluate_llm_application#use-a-summary-evaluator)
- [Evaluate a LangChain runnable](./how_to_guides/evaluation/evaluate_llm_application#evaluate-a-langchain-runnable)
- [Bind an evaluator to a dataset in the UI](./how_to_guides/evaluation/bind_evaluator_to_dataset)
- [Run an evaluation from the prompt playground](./how_to_guides/evaluation/run_evaluation_from_prompt_playground)
- [Evaluate on intermediate steps](./how_to_guides/evaluation/evaluate_on_intermediate_steps)
- [Use LangChain off-the-shelf evaluators (Python only)](./how_to_guides/evaluation/use_langchain_off_the_shelf_evaluators)
4 changes: 4 additions & 0 deletions versioned_docs/version-2.0/tutorials/evaluation.mdx
@@ -1,3 +1,7 @@
---
sidebar_position: 2
---

# Evaluate your LLM application

It can be hard to measure the performance of your application with respect to criteria important to you or your users.
4 changes: 4 additions & 0 deletions versioned_docs/version-2.0/tutorials/observability.mdx
@@ -1,3 +1,7 @@
---
sidebar_position: 1
---

import {
CodeTabs,
python,
3 changes: 1 addition & 2 deletions versioned_docs/version-2.0/tutorials/optimize_classifier.mdx
@@ -1,7 +1,6 @@
---
sidebar_label: Optimize a classifier
sidebar_position: 2
table_of_contents: true
sidebar_position: 3
---

# Optimize a classifier