Skip to content

langchain-ai/langsmith-cookbook

Repository files navigation

LangSmith Cookbook

Release Notes Python Downloads NPM Version JS Downloads

Welcome to the LangSmith Cookbook — your practical guide to mastering LangSmith. While our standard documentation covers the basics, this repository delves into common patterns and some real-world use-cases, empowering you to optimize your LLM applications further.

This repository is your practical guide to maximizing LangSmith. As a tool, LangSmith empowers you to debug, evaluate, test, and improve your LLM applications continuously. These recipes present real-world scenarios for you to adapt and implement.

Your Input Matters

Help us make the cookbook better! If there's a use-case we missed, or if you have insights to share, please raise a GitHub issue (feel free to tag Will) or contact the LangChain development team. Your expertise shapes this community.

Tracing your code

Tracing allows for seamless debugging and improvement of your LLM applications. Here's how:

  • Tracing without LangChain: learn to trace applications independent of LangChain using the Python SDK's @traceable decorator.
  • REST API: get acquainted with the REST API's features for logging LLM and chat model runs, and understand nested runs. The run logging spec can be found in the LangSmith SDK repository.
  • Customizing Run Names: improve UI clarity by assigning bespoke names to LangSmith chain runs—includes examples for chains, lambda functions, and agents.
  • Tracing Nested Calls within Tools: include all nested tool subcalls in a single trace by using run_manager.get_child() and passing to the child callbacks
  • Display Trace Links: add trace links to your app to speed up development. This is useful when prototyping your application in its unique UI, since it lets you quickly see its execution flow, add feedback to a run, or add the run to a dataset.

LangChain Hub

Efficiently manage your LLM components with the LangChain Hub. For dedicated documentation, please see the hub docs.

  • RetrievalQA Chain: use prompts from the hub in an example RAG pipeline.
  • Prompt Versioning: ensure deployment stability by selecting specific prompt versions over the 'latest'.
  • Runnable PromptTemplate: streamline the process of saving prompts to the hub from the playground and integrating them into runnable chains.

Testing & Evaluation

Test and benchmark your LLM systems using methods in these evaluation recipes:

Python Examples

Retrieval Augmented Generation (RAG)

  • Q&A System Correctness: evaluate your retrieval-augmented Q&A pipeline end-to-end on a dataset. Iterate, improve, and keep testing.
  • Evaluating Q&A Systems with Dynamic Data: use evaluators that dereference a labels to handle data that changes over time.
  • RAG Evaluation using Fixed Sources: evaluate the response component of a RAG (retrieval-augmented generation) pipeline by providing retrieved documents in the dataset
  • RAG evaluation with RAGAS: evaluate RAG pipelines using the RAGAS framework. Covers metrics for both the generator AND retriever in both labeled and reference-free contexts (answer correctness, faithfulness, context relevancy, recall and precision).

Chat Bots

  • Chat Bot Evals using Simulated Users: evaluate your chat bot using a simulated user. The user is given a task, and you score your assistant based on how well it helps without being breaking its instructions.
  • Single-turn evals: Evaluate chatbots within multi-turn conversations by treating each data point as an individual dialogue turn. This guide shows how to set up a multi-turn conversation dataset and evaluate a simple chat bot on it.

Extraction

  • Evaluating an Extraction Chain: measure the similarity between the extracted structured content and structured labels using LangChain's json evaluators.
  • Exact Match: deterministic comparison of your system output against a reference label.

Agents

  • Evaluating an Agent's intermediate steps: compare the sequence of actions taken by an agent to an expected trajectory to grade effective tool use.
  • Tool Selection: Evaluate the precision of selected tools. Include an automated prompt writer to improve the tool descriptions based on failure cases.

Multimodel

Fundamentals

  • Backtesting: benchmark new versions of your production app using real inputs. Convert production runs to a test dataset, then compare your new system's performance against the baseline.
  • Adding Metrics to Existing Tests: Apply new evaluators to existing test results without re-running your model, using the compute_test_metrics utility function. This lets you evaluate "post-hoc" and backfill metrics as you define new evaluators.
  • Naming Test Projects: manually name your tests with run_on_dataset(..., project_name='my-project-name')
  • Exporting Tests to CSV: Use the get_test_results beta utility to easily export your test results to a CSV file. This allows you to analyze and report on the performance metrics, errors, runtime, inputs, outputs, and other details of your tests outside of the Langsmith platform.
  • How to download feedback and examples from a test project: goes beyond the utility described above to query and export the predictions, evaluation results, and other information to programmatically add to your reports.

TypeScript / JavaScript Testing Examples

Incorporate LangSmith into your TS/JS testing and evaluation workflow:

We are working to add more JS examples soon. In the meantime, check out the JS eval quickstart the following guides:

Using Feedback

Harness user feedback, "ai-assisted" feedback, and other signals to improve, monitor, and personalize your applications. Feedback can be user-generated or "automated" using functions or even calls to an LLM:

Optimization

Use LangSmith to help optimize your LLM systems, so they can continuously learn and improve.

  • Prompt Bootstrapping: Optimize your prompt over a set of examples by incorporating human feedback and an LLM prompt optimizer. Works by rewriting an optimized system prompt based on feedback.
    • Prompt Bootstrapping for style transfer: Elvis-Bot: Extend prompt bootstrapping to generate outputs in the style of a specific persona. This notebook demonstrates how to create an "Elvis-bot" that mimics the tweet style of @omarsar0 by iteratively refining a prompt using Claude's exceptional prompt engineering capabilities and feedback collected through LangSmith's annotation queue.
  • Automated Few-shot Prompt Bootstrapping: Automatically curate the most informative few-shot examples based on performance metrics, removing the need for manual example engineering. Applied to an entailment task on the SCONE dataset.
  • Iterative Prompt Optimization: Streamlit app demonstrating real-time prompt optimization based on user feedback and dialog, leveraging few-shot learning and a separate "optimizer" model to dynamically improve a tweet-generating system.
  • Online Few-shot Examples Configure online evaluators to add good examples to a dataset. Review, then use them as few-shot examples to boost performance.

Exporting data for fine-tuning

Fine-tune an LLM on collected run data using these recipes:

  • OpenAI Fine-Tuning: list LLM runs and convert them to OpenAI's fine-tuning format efficiently.
  • Lilac Dataset Curation: further curate your LangSmith datasets using Lilac to detect near-duplicates, check for PII, and more.

Exploratory Data Analysis

Turn your trace data into actionable insights:

  • Exporting LLM Runs and Feedback: extract and interpret LangSmith LLM run data, making them ready for various analytical platforms.
  • Lilac: enrich datasets using the open-source analytics tool, Lilac, to better label and organize your data.