[Python] evaluate()
#542
Conversation
Also auto-trace in the run evaluator if possible. Related to: #542
reference_example_id = langsmith_extra.get("reference_example_id")
id_ = langsmith_extra.get("run_id")
if (
    not project_cv
Still trace if you're manually providing example, project, etc. via the context var
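A hypothetical reconstruction of the guard being discussed, reusing `langsmith_extra` and `project_cv` from the hunk above; the fall-through call is assumed for illustration and is not the PR's actual code:

```python
# Skip tracing only when *nothing* is supplied, so a manually provided
# reference example, run ID, or project (via the context var) still
# produces a trace.
reference_example_id = langsmith_extra.get("reference_example_id")
id_ = langsmith_extra.get("run_id")
if not project_cv and reference_example_id is None and id_ is None:
    return func(*args, **kwargs)  # assumed: call through without tracing
```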
Compare: 1c03f7f to 18bfc35
if isinstance(project, uuid.UUID) or _is_uuid(project):
    runs = client.list_runs(project_id=project)
else:
    runs = client.list_runs(project_name=project)

treemap: DefaultDict[uuid.UUID, List[schemas.Run]] = collections.defaultdict(list)
results = []
all_runs = {}
for run in runs:
    if run.parent_run_id is not None:
        treemap[run.parent_run_id].append(run)
    else:
        results.append(run)
    all_runs[run.id] = run
for run_id, child_runs in treemap.items():
    all_runs[run_id].child_runs = sorted(child_runs, key=lambda r: r.dotted_order)
return results
Why don't we just map trace_id to a list of runs, then sort each entry by dotted order?
Still need to reconstruct the tree, right? It can be nested at an arbitrary depth.
I could also just default to returning only the roots, which is all people need in ~90% of usage right now.
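For reference, a sketch of the trace_id-based alternative being suggested, not the code that was merged. It reuses `schemas.Run` from the hunk above and assumes runs carry `trace_id` and `dotted_order` fields:

```python
import collections
import uuid
from typing import DefaultDict, List

from langsmith import schemas  # Run model, as in the hunk above


def group_by_trace(
    runs: List[schemas.Run],
) -> DefaultDict[uuid.UUID, List[schemas.Run]]:
    # Bucket every run under its trace, then sort each bucket by
    # dotted_order. dotted_order encodes the position in the call tree,
    # so the sorted bucket is a depth-first walk of the trace without
    # rebuilding explicit parent/child links.
    trace_map: DefaultDict[uuid.UUID, List[schemas.Run]] = collections.defaultdict(list)
    for run in runs:
        trace_map[run.trace_id].append(run)
    for trace_runs in trace_map.values():
        trace_runs.sort(key=lambda r: r.dotted_order)
    return trace_map
```

As the reply above notes, this yields a flat per-trace ordering rather than a nested tree, so callers that need child access at arbitrary depth would still have to reconstruct parent/child links.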
TARGET_T = PIPELINE_T
# dataset-name, dataset_id, or examples
DATA_T = Union[str, uuid.UUID, Iterable[schemas.Example]]
SUMMARY_EVALUATOR_T = Callable[
What is a summary evaluator?
It evaluates at an aggregate level, over the whole experiment.
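For concreteness, a summary evaluator in this sense might look like the sketch below. The `SUMMARY_EVALUATOR_T` definition is truncated in the hunk above, so the `(runs, examples) -> dict` shape and the feedback-dict keys here are assumptions, not the PR's actual contract:

```python
from typing import List

from langsmith import schemas


def exact_match_rate(
    runs: List[schemas.Run], examples: List[schemas.Example]
) -> dict:
    # Assumed contract: called once with every run and its reference
    # example, returning a single experiment-level feedback score
    # instead of one score per row.
    matches = sum(
        run.outputs == example.outputs for run, example in zip(runs, examples)
    )
    return {"key": "exact_match_rate", "score": matches / len(examples)}
```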
evaluate API

Feedback I'd love (all welcome): (predict, evaluate vs. 1...)

To add in a second PR: