From 7f928c656701a0fa1dfff52290b2810a66b2efaa Mon Sep 17 00:00:00 2001 From: Kathryn May Date: Tue, 18 Nov 2025 08:44:24 -0500 Subject: [PATCH 1/4] WIP update eval intro docs for clarity online offline user journey --- src/docs.json | 1 + src/langsmith/evaluation-concepts.mdx | 234 +++++++++++------- src/langsmith/evaluation.mdx | 141 ++++++++++- .../offline-to-online-evaluations.mdx | 160 ++++++++++++ src/langsmith/pytest.mdx | 12 +- src/langsmith/vitest-jest.mdx | 6 +- 6 files changed, 450 insertions(+), 104 deletions(-) create mode 100644 src/langsmith/offline-to-online-evaluations.mdx diff --git a/src/docs.json b/src/docs.json index 1943a5fdb0..076655f90b 100644 --- a/src/docs.json +++ b/src/docs.json @@ -1046,6 +1046,7 @@ "langsmith/evaluation-quickstart", "langsmith/evaluation-concepts", "langsmith/evaluation-approaches", + "langsmith/offline-to-online-evaluations", { "group": "Datasets", "pages": [ diff --git a/src/langsmith/evaluation-concepts.mdx b/src/langsmith/evaluation-concepts.mdx index 4f9dd8b235..eed0472951 100644 --- a/src/langsmith/evaluation-concepts.mdx +++ b/src/langsmith/evaluation-concepts.mdx @@ -3,14 +3,47 @@ title: Evaluation Concepts sidebarTitle: Concepts --- -LangSmith makes building high-quality evaluations easy. This guide explains the key concepts of the LangSmith evaluation framework. The building blocks of the LangSmith framework are: +This page explains the key concepts of LangSmith Evaluation. -* [**Datasets**:](/langsmith/evaluation-concepts#datasets) Collections of test inputs and reference outputs. -* [**Evaluators**](/langsmith/evaluation-concepts#evaluators): Functions for scoring outputs. These can be [online evaluators](/langsmith/evaluation-concepts#online-evaluation) that run on traces in real time or [offline evaluators](/langsmith/evaluation-concepts#offline-evaluation) that run on a dataset. + +**Understand online vs offline evaluations** + +LangSmith supports two types of evaluations based on **what objects they run on**: + +- [Offline evaluations](#offline-evaluation) run on _examples_ from _datasets_. Examples include reference outputs for comparison. Use offline evaluations for pre-deployment testing, benchmarking, and regression testing. +- [Online evaluations](#online-evaluation) run on _runs_ or _threads_ from [tracing](/langsmith/observability-quickstart) projects. These are real production traces without reference outputs. Use online evaluations for production monitoring and real-time performance tracking. + +**The key distinction:** offline evaluations have access to reference outputs (what you expect), while online evaluations only have access to actual inputs and outputs (what actually happened). + + +The building blocks of LangSmith Evaluation are: + +- [Datasets & examples:](#datasets) Collections of test inputs and reference outputs used for offline evaluation. +- [Traces & runs:](#runs-for-online-evaluation) Production execution logs used for online evaluation. +- [Evaluators](#evaluators): Functions for scoring outputs. These can be [online evaluators](#online-evaluation) that run on traces in real time or [offline evaluators](#offline-evaluation) that run on a dataset. + +## Core evaluation objects + +The core objects that evaluations run on: + +- [Examples](#examples) (for offline evaluation) are curated test cases used for pre-deployment testing. Because they include reference outputs, offline evaluators can compare actual vs expected results. 
+- [Runs](#runs-for-online-evaluation) (for online evaluation) are real production traces. Online evaluators assess these runs in near real-time without reference outputs, focusing on detecting issues, anomalies, or quality degradation. + +## Runs for online evaluation + +A _run_ is a single execution trace from your [deployed application](/langsmith/deployments). Each run contains: +- **Inputs**: The actual user inputs your application received. +- **Outputs**: What your application actually returned. +- **Intermediate steps**: All the child runs (tool calls, LLM calls, and so on). +- **Metadata**: Tags, user feedback, latency metrics, etc. + +_Threads_ are collections of related runs (multi-turn conversations). Online evaluators can also run at the thread level to evaluate entire conversations. + +Learn more about [runs and traces in the Observability concepts](/langsmith/observability-concepts#runs). ## Datasets -A dataset is a collection of examples used for evaluating an application. An example is a test input, reference output pair. +A dataset is a _collection of examples_ used for evaluating an application. An example is a test input, reference output pair. ![Dataset](/langsmith/images/dataset-concept.png) @@ -18,9 +51,9 @@ A dataset is a collection of examples used for evaluating an application. An exa Each example consists of: -* **Inputs**: a dictionary of input variables to pass to your application. -* **Reference outputs** (optional): a dictionary of reference outputs. These do not get passed to your application, they are only used in evaluators. -* **Metadata** (optional): a dictionary of additional information that can be used to create filtered views of a dataset. +- **Inputs**: a dictionary of input variables to pass to your application. +- **Reference outputs** (optional): a dictionary of reference outputs. These do not get passed to your application, they are only used in evaluators. +- **Metadata** (optional): a dictionary of additional information that can be used to create filtered views of a dataset. ![Example](/langsmith/images/example-concept.png) @@ -28,184 +61,230 @@ Each example consists of: There are various ways to build datasets for evaluation, including: +- [Manually curated examples](#manually-curated-examples) +- [Historical traces](#historical-traces) +- [Synthetic data](#synthetic-data) + #### Manually curated examples -This is how we typically recommend people get started creating datasets. From building your application, you probably have some idea of what types of inputs you expect your application to be able to handle, and what "good" responses may be. You probably want to cover a few different common edge cases or situations you can imagine. Even 10-20 high-quality, manually-curated examples can go a long way. +This is the recommended starting point for creating datasets. When building an application, you'll have some idea of what types of inputs the application should handle, and what "good" responses should be. Start by covering a few different common edge cases or situations. Even 10–20 high-quality, manually curated examples can be sufficient for initial testing. #### Historical traces -Once you have an application in production, you start getting valuable information: how are users actually using it? These real-world runs make for great examples because they're, well, the most realistic! +Once an application is in production, it collects valuable information about real-world usage patterns. 
These production runs make excellent examples because they reflect actual user behavior. -If you're getting a lot of traffic, how can you determine which runs are valuable to add to a dataset? There are a few techniques you can use: +For high-traffic applications, several techniques help identify valuable runs to add to a dataset: -* **User feedback**: If possible - try to collect end user feedback. You can then see which datapoints got negative feedback. That is super valuable! These are spots where your application did not perform well. You should add these to your dataset to test against in the future. -* **Heuristics**: You can also use other heuristics to identify "interesting" datapoints. For example, runs that took a long time to complete could be interesting to look at and add to a dataset. -* **LLM feedback**: You can use another LLM to detect noteworthy runs. For example, you could use an LLM to label chatbot conversations where the user had to rephrase their question or correct the model in some way, indicating the chatbot did not initially respond correctly. +- **User feedback**: Collect end user feedback to identify datapoints that received negative feedback. These represent cases where the application did not perform well and should be added to the dataset to test against in the future. +- **Heuristics**: Use other heuristics to identify interesting datapoints. For example, runs that took a long time to complete are worth examining and potentially adding to a dataset. +- **LLM feedback**: Use another LLM to detect noteworthy runs. For example, an LLM can label chatbot conversations where the user had to rephrase their question or correct the model, indicating the chatbot did not initially respond correctly. #### Synthetic data -Once you have a few examples, you can try to artificially generate some more. It's generally advised to have a few good hand-crafted examples before this, as this synthetic data will often resemble them in some way. This can be a useful way to get a lot of datapoints, quickly. +_Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size. ### Splits -When setting up your evaluation, you may want to partition your dataset into different splits. For example, you might use a smaller split for many rapid and cheap iterations and a larger split for your final evaluation. In addition, splits can be important for the interpretability of your experiments. For example, if you have a RAG application, you may want your dataset splits to focus on different types of questions (e.g., factual, opinion, etc) and to evaluate your application on each split separately. +_Splits_ are partitions of a dataset that enable targeted evaluation on different subsets of examples. Splits serve two primary purposes: performance optimization (using a smaller split for rapid iterations and a larger split for final evaluation) and interpretability (evaluating different types of inputs separately). For example, a RAG application might use splits to focus on different question types (factual, opinion, etc.) and evaluate performance on each type separately. Learn how to [create and manage dataset splits](/langsmith/manage-datasets-in-application#create-and-manage-dataset-splits). 
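To make the dataset concepts above concrete, the following is a minimal sketch of building a small, manually curated dataset with the Python SDK. It assumes a recent `langsmith` SDK with `LANGSMITH_API_KEY` set in the environment; the dataset name and Q&A pairs are purely illustrative.

```python
from langsmith import Client

client = Client()

# Create a dataset to hold test input / reference output pairs.
dataset = client.create_dataset(
    dataset_name="qa-example-dataset",
    description="Hand-curated Q&A examples for offline evaluation.",
)

# Each example pairs an inputs dict with an (optional) reference outputs dict.
client.create_examples(
    inputs=[
        {"question": "What is LangSmith used for?"},
        {"question": "Can evaluations run on production traces?"},
    ],
    outputs=[
        {"answer": "LangSmith is used to trace, evaluate, and monitor LLM applications."},
        {"answer": "Yes, online evaluations run on traces from production traffic."},
    ],
    dataset_id=dataset.id,
)
```

Once examples exist, they can be grouped into splits (for instance, a small split for rapid iteration and a larger one for final runs) from the dataset page, as described in the guide linked above.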
### Versions -Datasets are [versioned](/langsmith/manage-datasets#version-a-dataset) such that every time you add, update, or delete examples in your dataset, a new version of the dataset is created. This makes it easy to inspect and revert changes to your dataset in case you make a mistake. You can also [tag versions](/langsmith/manage-datasets#tag-a-version) of your dataset to give them a more human-readable name. This can be useful for marking important milestones in your dataset's history. +_Dataset versions_ track changes to datasets over time. When you add, update or delete an example, LangSmith creates a new [version](/langsmith/manage-datasets#version-a-dataset) automatically. This enables inspection and reversal of changes when needed. You can [tag](/langsmith/manage-datasets#tag-a-version) versions with human-readable names to mark important milestones in a dataset's history. -You can run evaluations on specific versions of a dataset. This can be useful when running evaluations in CI, to make sure that a dataset update doesn't accidentally break your CI pipelines. +Evaluations can target specific dataset versions. This is particularly useful in CI pipelines to ensure dataset updates do not break existing evaluation workflows. ## Evaluators -Evaluators are functions that score how well your application performs on a particular example. +_Evaluators_ are functions that score how well your application performs on a particular example. -#### Evaluator inputs +There are a number of ways to define and run evaluators: -Evaluators receive these inputs: +- **Custom code**: Define [custom evaluators](/langsmith/code-evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI. +- **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI. -* [Example](/langsmith/evaluation-concepts#examples): The example(s) from your [Dataset](/langsmith/evaluation-concepts#datasets). Contains inputs, (reference) outputs, and metadata. -* [Run](/langsmith/observability-concepts#runs): The actual outputs and intermediate steps (child runs) from passing the example inputs to the application. +You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground), or by configuring [rules](/langsmith/rules) to automatically run them on particular tracing projects or datasets. -#### Evaluator outputs +Evaluators have the following components: -An evaluator returns one or more metrics. These should be returned as a dictionary or list of dictionaries of the form: +- [Evaluator inputs](#evaluator-inputs) +- [Evaluator outputs](#evaluator-outputs) -* `key`: The name of the metric. -* `score` | `value`: The value of the metric. Use `score` if it's a numerical metric and `value` if it's categorical. -* `comment` (optional): The reasoning or additional string information justifying the score. +### Evaluator inputs -#### Defining evaluators +Evaluator inputs differ based on evaluation type: -There are a number of ways to define and run evaluators: +**Offline evaluators** receive: +- [Example](#examples): The example from your [dataset](#datasets), containing inputs, reference outputs, and metadata. 
+- [Run](/langsmith/observability-concepts#runs): The actual outputs and intermediate steps from running the application on the example inputs. + +**Online evaluators** receive: +- [Run](/langsmith/observability-concepts#runs): The production trace containing inputs, outputs, and intermediate steps (no reference outputs available). + +### Evaluator outputs -* **Custom code**: Define [custom evaluators](/langsmith/code-evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI. -* **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI. +Evaluators return one or more metrics as a dictionary or list of dictionaries. Each dictionary contains: -You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground), or by configuring [Rules](/langsmith/rules) to automatically run them on particular tracing projects or datasets. +- `key`: The metric name. +- `score` | `value`: The metric value (`score` for numerical metrics, `value` for categorical metrics). +- `comment` (optional): Additional reasoning or explanation for the score. -#### Evaluation techniques +## Evaluation techniques -There are a few high-level approaches to LLM evaluation: +LangSmith supports several evaluation approaches: + +- [Human](#human) +- [Heuristic](#heuristic) +- [LLM-as-judge](#llm-as-judge) +- [Pairwise](#pairwise) ### Human -Human evaluation is [often a great starting point for evaluation](https://hamel.dev/blog/posts/evals/#looking-at-your-traces). LangSmith makes it easy to review your LLM application outputs as well as the traces (all intermediate steps). +_Human evaluation_ involves manual review of application outputs and execution traces. This approach is [often an effective starting point for evaluation](https://hamel.dev/blog/posts/evals/#looking-at-your-traces). LangSmith provides tools to review application outputs and traces (all intermediate steps). -LangSmith's [annotation queues](/langsmith/evaluation-concepts#annotation-queues) make it easy to get human feedback on your application's outputs. +[Annotation queues](#annotation-queues) streamline the process of collecting human feedback on application outputs. ### Heuristic -Heuristic evaluators are deterministic, rule-based functions. These are good for simple checks like making sure that a chatbot's response isn't empty, that a snippet of generated code can be compiled, or that a classification is exactly correct. +_Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly. ### LLM-as-judge -LLM-as-judge evaluators use LLMs to score the application's output. To use them, you typically encode the grading rules / criteria in the LLM prompt. They can be reference-free (e.g., check if system output contains offensive content or adheres to specific criteria). Or, they can compare task output to a reference output (e.g., check if the output is factually accurate relative to the reference). +_LLM-as-judge evaluators_ use LLMs to score application outputs. The grading rules and criteria are typically encoded in the LLM prompt. 
These evaluators can be: + +- **Reference-free**: Check if output contains offensive content or adheres to specific criteria. +- **Reference-based**: Compare output to a reference (e.g., check factual accuracy relative to the reference). -With LLM-as-judge evaluators, it is important to carefully review the resulting scores and tune the grader prompt if needed. Often it is helpful to write these as few-shot evaluators, where you provide examples of inputs, outputs, and expected grades as part of the grader prompt. +LLM-as-judge evaluators require careful review of scores and prompt tuning. Few-shot evaluators, which include examples of inputs, outputs, and expected grades in the grader prompt, often improve performance. Learn about [how to define an LLM-as-a-judge evaluator](/langsmith/llm-as-judge). ### Pairwise -Pairwise evaluators allow you to compare the outputs of two versions of an application. This can use either a heuristic ("which response is longer"), an LLM (with a specific pairwise prompt), or human (asking them to manually annotate examples). - -**When should you use pairwise evaluation?** +_Pairwise evaluators_ compare outputs from two application versions using heuristics (e.g., which response is longer), LLMs (with pairwise prompts), or human reviewers. -Pairwise evaluation is helpful when it is difficult to directly score an LLM output, but easier to compare two outputs. This can be the case for tasks like summarization - it may be hard to give a summary an absolute score, but easy to choose which of two summaries is more informative. +Pairwise evaluation works well when directly scoring an output is difficult but comparing two outputs is straightforward. For example, in summarization tasks, choosing the more informative of two summaries is often easier than assigning an absolute score to a single summary. Learn [how to run pairwise evaluations](/langsmith/evaluate-pairwise). ## Experiment -Each time we evaluate an application on a dataset, we are conducting an experiment. An experiment contains the results of running a specific version of your application on the dataset. To understand how to use the LangSmith experiment view, see [how to analyze experiment results](/langsmith/analyze-an-experiment). +An _experiment_ represents the results of evaluating a specific application version on a dataset. Each experiment captures outputs, evaluator scores, and execution traces for every example in the dataset. ![Experiment view](/langsmith/images/experiment-view.png) -Typically, we will run multiple experiments on a given dataset, testing different configurations of our application (e.g., different prompts or LLMs). In LangSmith, you can easily view all the experiments associated with your dataset. Additionally, you can [compare multiple experiments in a comparison view](/langsmith/compare-experiment-results). +Multiple experiments typically run on a given dataset to test different application configurations (e.g., different prompts or LLMs). LangSmith displays all experiments associated with a dataset and supports [comparing multiple experiments](/langsmith/compare-experiment-results) side-by-side. ![Comparison view](/langsmith/images/comparison-view.png) +Learn [how to analyze experiment results](/langsmith/analyze-an-experiment). ## Experiment configuration -LangSmith supports a number of experiment configurations which make it easier to run your evals in the manner you want. 
+LangSmith supports several configuration options for experiments: + +- [Repetitions](#repetitions) +- [Concurrency](#concurrency) +- [Caching](#caching) ### Repetitions -Running an experiment multiple times can be helpful since LLM outputs are not deterministic and can differ from one repetition to the next. By running multiple repetitions, you can get a more accurate estimate of the performance of your system. +_Repetitions_ run an experiment multiple times to account for LLM output variability. Since LLM outputs are non-deterministic, multiple repetitions provide a more accurate performance estimate. -Repetitions can be configured by passing the `num_repetitions` argument to `evaluate` / `aevaluate` ([Python](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate), [TypeScript](https://docs.smith.langchain.com/reference/js/interfaces/evaluation.EvaluateOptions#numrepetitions)). Repeating the experiment involves both re-running the target function to generate outputs and re-running the evaluators. +Configure repetitions by passing the `num_repetitions` argument to `evaluate` / `aevaluate` ([Python](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate), [TypeScript](https://docs.smith.langchain.com/reference/js/interfaces/evaluation.EvaluateOptions#numrepetitions)). Each repetition re-runs both the target function and all evaluators. -To learn more about running repetitions on experiments, read the [how-to-guide](/langsmith/repetition). +Learn more in the [repetitions how-to guide](/langsmith/repetition). ### Concurrency -By passing the `max_concurrency` argument to `evaluate` / `aevaluate`, you can specify the concurrency of your experiment. The `max_concurrency` argument has slightly different semantics depending on whether you are using `evaluate` or `aevaluate`. +_Concurrency_ controls how many examples run simultaneously during an experiment. Configure it by passing the `max_concurrency` argument to `evaluate` / `aevaluate`. The semantics differ between the two functions: #### `evaluate` -The `max_concurrency` argument to `evaluate` specifies the maximum number of concurrent threads to use when running the experiment. This is both for when running your target function as well as your evaluators. +The `max_concurrency` argument specifies the maximum number of concurrent threads for running both the target function and evaluators. #### `aevaluate` -The `max_concurrency` argument to `aevaluate` is fairly similar to `evaluate`, but instead uses a semaphore to limit the number of concurrent tasks that can run at once. `aevaluate` works by creating a task for each example in the dataset. Each task consists of running the target function as well as all of the evaluators on that specific example. The `max_concurrency` argument specifies the maximum number of concurrent tasks, or put another way - examples, to run at once. +The `max_concurrency` argument uses a semaphore to limit concurrent tasks. `aevaluate` creates a task for each example, where each task runs the target function and all evaluators for that example. The `max_concurrency` argument specifies the maximum number of concurrent examples to process. ### Caching -Lastly, you can also cache the API calls made in your experiment by setting the `LANGSMITH_TEST_CACHE` to a valid folder on your device with write access. 
This will cause the API calls made in your experiment to be cached to disk, meaning future experiments that make the same API calls will be greatly sped up. +_Caching_ stores API call results to disk to speed up future experiments. Set the `LANGSMITH_TEST_CACHE` environment variable to a valid folder path with write access. Future experiments that make identical API calls will reuse cached results instead of making new requests. ## Annotation queues -Human feedback is often the most valuable feedback you can gather on your application. With [annotation queues](/langsmith/annotation-queues) you can flag runs of your application for annotation. Human annotators then have a streamlined view to review and provide feedback on the runs in a queue. Often (some subset of) these annotated runs are then transferred to a [dataset](/langsmith/evaluation-concepts#datasets) for future evaluations. While you can always [annotate runs inline](/langsmith/annotate-traces-inline), annotation queues provide another option to group runs together, specify annotation criteria, and configure permissions. +Human feedback often provides the most valuable assessment of application quality. _Annotation queues_ enable structured collection of this feedback by organizing runs for human review. + +[Annotation queues](/langsmith/annotation-queues) flag specific application runs for annotation. Annotators review these runs in a streamlined interface to provide feedback. Annotated runs can then be transferred to a [dataset](#datasets) for future evaluations. + +Annotation queues complement [inline annotation](/langsmith/annotate-traces-inline) by offering additional capabilities: grouping runs together, specifying annotation criteria, and configuring reviewer permissions. Learn more about [annotation queues and human feedback](/langsmith/annotation-queues). +## Offline and online evaluation comparison + +The following table summarizes the key differences between offline and online evaluations: + +| | **Offline Evaluation** | **Online Evaluation** | +|---|---|---| +| **Runs on** | Dataset (Examples) | Tracing Project (Runs/Threads) | +| **Data access** | Inputs, Outputs, Reference Outputs | Inputs, Outputs only | +| **When to use** | Pre-deployment, during development | Production, post-deployment | +| **Primary use cases** | Benchmarking, unit testing, regression testing, backtesting | Real-time monitoring, production feedback, anomaly detection | +| **Evaluation timing** | Batch processing on curated test sets | Real-time or near real-time on live traffic | +| **Setup location** | Evaluation tab (SDK, UI, Prompt Playground) | [Observability tab](/langsmith/online-evaluations) (automated rules) | +| **Data requirements** | Requires dataset curation | No dataset needed, evaluates live traces | + ## Offline evaluation -Evaluating an application on a dataset is what we call "offline" evaluation. It is offline because we're evaluating on a pre-compiled set of data. An online evaluation, on the other hand, is one in which we evaluate a deployed application's outputs on real traffic, in near realtime. Offline evaluations are used for testing a version(s) of your application pre-deployment. +_Offline evaluation_ tests applications on pre-compiled datasets before deployment. The term "offline" refers to evaluating on curated data rather than live traffic. In contrast, online evaluation assesses deployed applications on real traffic in near real-time. Offline evaluation enables pre-deployment testing and version comparison. 
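As a rough sketch of the client-side flow, assuming a recent Python SDK where `Client.evaluate` and dict-style evaluator arguments are available, an offline evaluation ties together a target function, a dataset, and one or more evaluators. The dataset name, target function, and evaluator below are illustrative placeholders.

```python
from langsmith import Client

client = Client()


def target(inputs: dict) -> dict:
    # Placeholder for your application: receives an example's inputs
    # and returns the outputs to be scored.
    return {"answer": "LangSmith is a platform for tracing and evaluating LLM apps."}


def correctness(outputs: dict, reference_outputs: dict) -> dict:
    # Offline evaluators can compare actual outputs against reference outputs.
    matches = reference_outputs["answer"].lower() in outputs["answer"].lower()
    return {"key": "correctness", "score": int(matches)}


results = client.evaluate(
    target,
    data="qa-example-dataset",  # illustrative dataset name
    evaluators=[correctness],
    experiment_prefix="baseline",
)
```

Each call like this produces an experiment whose outputs, scores, and traces can be inspected and compared in the UI.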
-You can run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)). You can run them server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or by configuring [automations](/langsmith/rules) to run certain evaluators on every new experiment against a specific dataset. +Offline evaluations run client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)) or server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or [automations](/langsmith/rules). ![Offline](/langsmith/images/offline.png) ### Benchmarking -Perhaps the most common type of offline evaluation is one in which we curate a dataset of representative inputs, define the key performance metrics, and benchmark multiple versions of our application to find the best one. Benchmarking can be laborious because for many use cases you have to curate a dataset with gold-standard reference outputs and design good metrics for comparing experimental outputs to them. For a RAG Q\&A bot this might look like a dataset of questions and reference answers, and an LLM-as-judge evaluator that determines if the actual answer is semantically equivalent to the reference answer. For a ReACT agent this might look like a dataset of user requests and a reference set of all the tool calls the model is supposed to make, and a heuristic evaluator that checks if all of the reference tool calls were made. +_Benchmarking_ compares multiple application versions on a curated dataset to identify the best performer. This process involves creating a dataset of representative inputs, defining performance metrics, and testing each version. + +Benchmarking requires dataset curation with gold-standard reference outputs and well-designed comparison metrics. Examples: +- **RAG Q&A bot**: Dataset of questions and reference answers, with an LLM-as-judge evaluator checking semantic equivalence between actual and reference answers +- **ReACT agent**: Dataset of user requests and reference tool calls, with a heuristic evaluator verifying all expected tool calls were made ### Unit tests -Unit tests are used in software development to verify the correctness of individual system components. [Unit tests in the context of LLMs are often rule-based assertions](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) on LLM inputs or outputs (e.g., checking that LLM-generated code can be compiled, JSON can be loaded, etc.) that validate basic functionality. +_Unit tests_ verify the correctness of individual system components. In LLM contexts, [unit tests are often rule-based assertions](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) on inputs or outputs (e.g., verifying LLM-generated code compiles, JSON loads successfully) that validate basic functionality. -Unit tests are often written with the expectation that they should always pass. These types of tests are nice to run as part of CI. Note that when doing so it is useful to set up a cache to minimize LLM calls (because those can quickly rack up!). +Unit tests typically expect consistent passing results, making them suitable for CI pipelines. When running in CI, configure caching to minimize LLM API calls and associated costs. 
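As an illustration, a minimal rule-based unit test written with the beta [pytest plugin](/langsmith/pytest) might look like the following sketch. The `@pytest.mark.langsmith` marker comes from that plugin, and `generate_config` is a stand-in for a real application function.

```python
import json

import pytest


def generate_config(request: str) -> str:
    # Placeholder for the real, LLM-backed application call.
    return '{"retries": true, "max_attempts": 3}'


@pytest.mark.langsmith  # records this test case in LangSmith
def test_output_is_valid_json():
    output = generate_config("Create a config with retries enabled")

    # Rule-based assertions: the generated text must parse as JSON
    # and include the required field.
    parsed = json.loads(output)
    assert parsed.get("retries") is True
```

Pointing the `LANGSMITH_TEST_CACHE` environment variable at a writable folder before running the suite caches the underlying API calls, as described in the [Caching](#caching) section above.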
### Regression tests -Regression tests are used to measure performance across versions of your application over time. They are used to, at the very least, ensure that a new app version does not regress on examples that your current version correctly handles, and ideally to measure how much better your new version is relative to the current. Often these are triggered when you are making app updates (e.g. updating models or architectures) that are expected to influence the user experience. +_Regression tests_ measure performance consistency across application versions over time. They ensure new versions do not degrade performance on cases the current version handles correctly, and ideally demonstrate improvements over the baseline. These tests typically run when making updates expected to affect user experience (e.g., model or architecture changes). -LangSmith's comparison view has native support for regression testing, allowing you to quickly see examples that have changed relative to the baseline. Regressions are highlighted red, improvements green. +LangSmith's comparison view highlights regressions (red) and improvements (green) relative to the baseline, enabling quick identification of changes. ![Comparison view](/langsmith/images/comparison-view.png) ### Backtesting -Backtesting is an approach that combines dataset creation (discussed above) with evaluation. If you have a collection of production logs, you can turn them into a dataset. Then, you can re-run those production examples with newer application versions. This allows you to assess performance on past and realistic user inputs. +_Backtesting_ evaluates new application versions against historical production data. Production logs are converted into a dataset, then newer versions process these examples to assess performance on past, realistic user inputs. -This is commonly used to evaluate new model versions. Anthropic dropped a new model? No problem! Grab the 1000 most recent runs through your application and pass them through the new model. Then compare those results to what actually happened in production. +This approach is commonly used for evaluating new model releases. For example, when a new model becomes available, test it on the most recent production runs and compare results to actual production outcomes. ### Pairwise evaluation -For some tasks [it is easier](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) for a human or LLM grader to determine if "version A is better than B" than to assign an absolute score to either A or B. Pairwise evaluations are just this — a scoring of the outputs of two versions against each other as opposed to against some reference output or absolute criteria. Pairwise evaluations are often useful when using LLM-as-judge evaluators on more general tasks. For example, if you have a summarizer application, it may be easier for an LLM-as-judge to determine "Which of these two summaries is more clear and concise?" than to give an absolute score like "Give this summary a score of 1-10 in terms of clarity and concision." +_Pairwise evaluation_ compares outputs from two versions by determining relative quality rather than assigning absolute scores. For some tasks, [determining "version A is better than B"](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) is easier than scoring each version independently. + +This approach proves particularly useful for LLM-as-judge evaluations on subjective tasks. 
For example, in summarization, determining "Which summary is clearer and more concise?" is often simpler than assigning numeric clarity scores. Learn [how run pairwise evaluations](/langsmith/evaluate-pairwise). ## Online evaluation -Evaluating a deployed application's outputs in (roughly) realtime is what we call "online" evaluation. In this case there is no dataset involved and no possibility of reference outputs — we're running evaluators on real inputs and real outputs as they're produced. This is useful for monitoring your application and flagging unintended behavior. Online evaluation can also work hand-in-hand with offline evaluation: for example, an online evaluator can be used to classify input questions into a set of categories that can later be used to curate a dataset for offline evaluation. +_Online evaluation_ assesses deployed application outputs in near real-time on production traffic. No dataset or reference outputs are involved—evaluators process actual inputs and outputs as the application produces them. This enables production monitoring and detection of unintended behavior. Online evaluation complements offline evaluation: an online evaluator can classify input questions into categories that later inform dataset curation for offline evaluation. -Online evaluators are generally intended to be run server-side. LangSmith has built-in [LLM-as-judge evaluators](/langsmith/llm-as-judge) that you can configure, or you can define custom code evaluators that are also run within LangSmith. +Online evaluators typically run server-side. LangSmith provides built-in [LLM-as-judge evaluators](/langsmith/llm-as-judge) for configuration, and supports custom code evaluators that run within LangSmith. ![Online](/langsmith/images/online.png) @@ -213,28 +292,15 @@ Online evaluators are generally intended to be run server-side. LangSmith has bu ### Evaluations vs testing -Testing and evaluation are very similar and overlapping concepts that often get confused. +Testing and evaluation are similar and overlapping concepts that are frequently conflated. -**An evaluation measures performance according to a metric(s).** Evaluation metrics can be fuzzy or subjective, and are more useful in relative terms than absolute ones. That is, they're often used to compare two systems against each other rather than to assert something about an individual system. +**Evaluation measures performance according to metrics.** Evaluation metrics can be fuzzy or subjective, and prove more useful in relative terms than absolute ones. They typically compare two systems against each other rather than assert properties about an individual system. **Testing asserts correctness.** A system can only be deployed if it passes all tests. -Evaluation metrics can be *turned into* tests. For example, you can write regression tests to assert that any new version of a system must outperform some baseline version of the system on the relevant evaluation metrics. - -It can also be more resource efficient to run tests and evaluations together if your system is expensive to run and you have overlapping datasets for your tests and evaluations. - -You can also choose to write evaluations using standard software testing tools like `pytest` or `vitest/jest` out of convenience. - -### Using `pytest` and `Vitest/Jest` - -The LangSmith SDKs come with integrations for [pytest](/langsmith/pytest) and [`Vitest/Jest`](/langsmith/vitest-jest). 
These make it easy to: - -* Track test results in LangSmith -* Write evaluations as tests - -Tracking test results in LangSmith makes it easy to share results, compare systems, and debug failing tests. +Evaluation metrics can be converted into tests. For example, regression tests can assert that any new system version must outperform a baseline version on relevant evaluation metrics. -Writing evaluations as tests can be useful when each example you want to evaluate on requires custom logic for running the application and/or evaluators. The standard evaluation flows assume that you can run your application and evaluators in the same way on every example in a dataset. But for more complex systems or comprehensive evals, you may want to evaluate specific subsets of your system with specific types of inputs and metrics. These types of heterogenous evals are much easier to write as a suite of distinct test cases that all get tracked together rather than using the standard evaluate flow. +Running tests and evaluations together can be more resource efficient when a system is expensive to run and has overlapping datasets for tests and evaluations. -Using testing tools is also helpful when you want to *both* evaluate your system's outputs *and* assert some basic things about them. +Evaluations can also be written using standard software testing tools like [pytest](/langsmith/pytest) or [Vitest/Jest](/langsmith/vitest-jest) for convenience. diff --git a/src/langsmith/evaluation.mdx b/src/langsmith/evaluation.mdx index 6e15f5db35..fd41a50ba7 100644 --- a/src/langsmith/evaluation.mdx +++ b/src/langsmith/evaluation.mdx @@ -1,22 +1,139 @@ --- -title: LangSmith Evaluations +title: LangSmith Evaluation sidebarTitle: Overview mode: wide --- import HostingSetup from '/snippets/langsmith/platform-setup-note.mdx'; -The following sections help you create datasets, run evaluations, and analyze results: +LangSmith supports two types of evaluations based on when and where they run: + + + + **Test before you ship** + + Run evaluations on curated datasets during development to compare versions, benchmark performance, and catch regressions. + + + + **Monitor in production** + + Evaluate real user interactions in real-time to detect issues and measure quality on live traffic. + + + + +## Evaluation workflow + + + + + + + Build a collection of test cases called a [dataset](/langsmith/manage-datasets). Each dataset contains [examples](/langsmith/evaluation-concepts#examples). + + You can create datasets from: + - Manually curated examples + - Historical production traces + - Synthetic data generation + + Organize with [splits](/langsmith/evaluation-concepts#splits) for different test scenarios and track changes with [versions](/langsmith/evaluation-concepts#versions). + + + + Create [evaluators](/langsmith/evaluation-concepts#evaluators) that score your application's performance. Choose from: + - [Human](/langsmith/evaluation-concepts#human) review + - [Heuristic](/langsmith/evaluation-concepts#heuristic) rules + - [LLM-as-judge](/langsmith/llm-as-judge) for nuanced quality assessment + - [Pairwise](/langsmith/evaluate-pairwise) comparison between versions + + Evaluators receive both your application's output and the reference output from the dataset example. + + + + Execute your application on the dataset to create an [experiment](/langsmith/evaluation-concepts#experiment). Each experiment: + - Runs your application on every example in the dataset. + - Applies your evaluators to score the outputs. 
+ - Captures all results for analysis. + + Configure [repetitions, concurrency, and caching](/langsmith/evaluation-concepts#experiment-configuration) to optimize your evaluation runs. + + + + Compare experiments to find the best version. Use offline evaluation for: + - [Benchmarking](/langsmith/evaluation-concepts#benchmarking) + - [Unit tests](/langsmith/evaluation-concepts#unit-tests) + - [Regression tests](/langsmith/evaluation-concepts#regression-tests) + - [Backtesting](/langsmith/evaluation-concepts#backtesting) + + + + + + + + + + Your application is running in production, handling real user traffic. Each interaction creates a [run](/langsmith/evaluation-concepts#runs-for-online-evaluation). + + Unlike offline evaluation, there are no reference outputs; you're evaluating real production behavior. + + + + Set up [evaluators](/langsmith/online-evaluations) to run automatically on production traces: + - Safety checks + - Format validation + - Quality heuristics + - Reference-free LLM-as-judge + + Apply [filters and sampling rates](/langsmith/online-evaluations#4-optional-configure-a-sampling-rate) to control costs. + + + + Online evaluators run automatically on live traffic, providing: + - Real-time quality monitoring + - Anomaly detection + - Production feedback for each run + - Alerting on critical issues + + Evaluators can run on individual [runs](/langsmith/evaluation-concepts#runs-for-online-evaluation) or entire [threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators). + + + + Use insights from online evaluations to improve offline testing: + - Add failing production traces to your [dataset](/langsmith/manage-datasets). + - Create targeted evaluators for discovered issues. + - Run offline experiments to validate fixes. + - Deploy and monitor improvements with online evaluations. + + + + + + + +For more on the differences between offline and online evaluation, refer to the [Evaluation concepts](/langsmith/evaluation-concepts#offline-vs-online-evaluation-quick-comparison) page. + + +## Get started - Review core terminology and concepts to understand how evaluations work in LangSmith. + Get started with offline evaluation. - Evaluate your applications with different evaluators and techniques to measure quality. + Explore evaluation types, techniques, and frameworks for comprehensive testing. - Gather human feedback through annotation queues and inline annotation on outputs. + Monitor production quality in real-time from the Observability tab. B[Testing] + B --> C[Deployment] + C --> D[Monitoring] + D --> E[Iteration] + + A -.-> F[Offline] + B -.-> F + C -.-> G[Online] + D -.-> G + E -.-> H[Both] + + style F fill:#8b5cf6,stroke:#7c3aed,color:#fff + style G fill:#8b5cf6,stroke:#7c3aed,color:#fff + style H fill:#8b5cf6,stroke:#7c3aed,color:#fff +``` + +### 1. Development with offline evaluation + +When building your application before any production deployment, use offline evaluations to validate basic functionality with unit tests, benchmark different models, prompts, or architectures, and build confidence before deploying to users. + +1. Create a [dataset](/langsmith/manage-datasets) with representative test cases. +1. Run [offline evaluations](/langsmith/evaluate-llm-application) to measure performance. +1. Iterate on your application based on results. +1. Compare experiments to find the best configuration. + + + Follow the quickstart to run your first offline evaluation. + + +### 2. 
Initial deployment with online evaluation + +When your application is deployed and handling real user traffic, use online evaluations to monitor production quality in real-time, detect unexpected issues or edge cases, validate that production performance matches offline testing, and collect real-world data for future offline evaluations. + +1. Set up [online evaluators](/langsmith/online-evaluations) in your tracing project. +1. Start with basic checks (e.g., no errors, response format validation). +1. Configure [alerts](/langsmith/alerts) for critical quality metrics. +1. Review traces that fail online evaluations. + + + Learn how to configure online evaluations for production monitoring. + + +### 3. Continuous improvement + +During ongoing operation and optimization, use both offline and online evaluations together in an iterative feedback loop. Online evaluations surface real-world issues and edge cases, which are then converted into dataset examples for offline testing. Offline evaluations validate fixes before redeploying, and online evaluations confirm improvements in production. + +1. Online evaluations detect an issue in production. +1. Add the failing trace to a dataset as a [new example](/langsmith/manage-datasets-in-application#add-runs-to-a-dataset). +1. Reproduce and fix the issue locally. +1. Run offline evaluation to verify the fix. +1. Deploy the updated application. +1. Confirm the fix with online evaluations. + +## Choose an evaluator + +### Reference-free evaluators (work for both) + +These evaluators don't need reference outputs and work for both offline and online evaluation: + +- **Safety checks:** Toxicity, PII detection, content policy violations +- **Format validation:** JSON structure, required fields, schema compliance +- **Heuristics:** Response length, latency, specific keywords +- **Reference-free LLM-as-judge:** Clarity, coherence, helpfulness, tone + +### Reference-based evaluators (offline only) + +These require reference outputs and only work for offline evaluation: + +- **Correctness:** Semantic similarity to reference answer +- **Factual accuracy:** Fact-checking against ground truth +- **Exact match:** Classification tasks with known labels +- **Reference-based LLM-as-judge:** Comparing output quality to a reference + +## Set up online evaluators + +For applications already running offline evaluations, follow these steps to add online monitoring: + +### Step 1. Identify critical quality metrics + +Review your offline evaluators and identify which metrics are most critical for production: +- What failures would impact users most? +- Which metrics have the tightest acceptable ranges? +- What safety checks must always pass? + +### Step 2. Adapt evaluators for online + +Convert critical offline evaluators to work without reference outputs: + +**Example:** Converting a correctness evaluator + +**Offline version (reference-based):** +```python +def evaluate_correctness(run, example): + # Compare output to reference output + actual = run["outputs"]["answer"] + expected = example["outputs"]["answer"] + return llm_judge_compare(actual, expected) +``` + +**Online version (reference-free):** +```python +def evaluate_correctness(run): + # Check output quality without reference + actual = run["outputs"]["answer"] + question = run["inputs"]["question"] + return llm_judge_quality(question, actual, criteria=[ + "Answers the question directly", + "Is factually coherent", + "Provides adequate detail" + ]) +``` + +### Step 3. Configure online evaluators + +1. 
Go to your [tracing project](/langsmith/observability-concepts#tracing-projects). +1. Create [new online evaluators](/langsmith/online-evaluations#configure-online-evaluators). +1. Start with a [sampling rate](/langsmith/online-evaluations#4-optional-configure-a-sampling-rate) of 10% to control costs. +1. Monitor results and adjust sampling rate based on value vs cost. + +### Step 4. Close the loop + +Create a process to convert online evaluation failures into offline test cases: + +1. Set up [alerts](/langsmith/alerts) for online evaluation failures. +1. Review failing traces regularly. +1. Add representative failures to your [offline datasets](/langsmith/manage-datasets-in-application#add-runs-to-a-dataset). +1. Fix issues and validate with offline evaluation before redeploying. + +## Learn more + + + + Detailed guide on configuring online evaluators + + + + Core concepts and comparison of evaluation types + + + + Learn how to backtest new versions against production data + + + + Collect human feedback on production traces + + diff --git a/src/langsmith/pytest.mdx b/src/langsmith/pytest.mdx index bba71385cf..3defb7c50f 100644 --- a/src/langsmith/pytest.mdx +++ b/src/langsmith/pytest.mdx @@ -3,12 +3,14 @@ title: How to run evaluations with pytest (beta) sidebarTitle: Run evaluations with pytest --- -The LangSmith pytest plugin lets Python developers define their datasets and evaluations as pytest test cases. Compared to the standard evaluation flow, this is useful when: +The LangSmith pytest plugin lets Python developers define their datasets and evaluations as pytest test cases. -* Each example requires different evaluation logic -* You want to assert binary expectations, and both track these assertions in LangSmith and raise assertion errors locally (e.g. in CI pipelines) -* You want pytest-like terminal outputs -* You already use pytest to test your app and want to add LangSmith tracking +Compared to the standard evaluation flow, this is useful when: + +* **Each example requires different evaluation logic**: Standard evaluation flows assume consistent application and evaluator execution across all dataset examples. For more complex systems or comprehensive evaluations, specific system subsets may require evaluation with particular input types and metrics. These heterogeneous evaluations are simpler to write as distinct test case suites that track together. +* **You want to assert binary expectations**: Track assertions in LangSmith and raise assertion errors locally (e.g. in CI pipelines). Testing tools help when both evaluating system outputs and asserting basic properties about them. +* **You want pytest-like terminal outputs**: Get familiar pytest output formatting +* **You already use pytest to test your app**: Add LangSmith tracking to existing pytest workflows The pytest integration is in beta and is subject to change in upcoming releases. diff --git a/src/langsmith/vitest-jest.mdx b/src/langsmith/vitest-jest.mdx index d61de35221..704a5c0aad 100644 --- a/src/langsmith/vitest-jest.mdx +++ b/src/langsmith/vitest-jest.mdx @@ -9,9 +9,9 @@ LangSmith provides integrations with Vitest and Jest that allow JavaScript and T Compared to the `evaluate()` evaluation flow, this is useful when: -* Each example requires different evaluation logic -* You want to assert binary expectations, and both track these assertions in LangSmith and raise assertion errors locally (e.g. 
in CI pipelines) -* You want to take advantage of mocks, watch mode, local results, or other features of the Vitest/Jest ecosystems +* **Each example requires different evaluation logic**: Standard evaluation flows assume consistent application and evaluator execution across all dataset examples. For more complex systems or comprehensive evaluations, specific system subsets may require evaluation with particular input types and metrics. These heterogeneous evaluations are simpler to write as distinct test case suites that track together. +* **You want to assert binary expectations**: Track assertions in LangSmith and raise assertion errors locally (e.g. in CI pipelines). Testing tools help when both evaluating system outputs and asserting basic properties about them. +* **You want to take advantage of mocks, watch mode, local results, or other features of the Vitest/Jest ecosystems** Requires JS/TS SDK version `langsmith>=0.3.1`. From 7e5bd9429eb08efd2e26f5b01083919acad0db67 Mon Sep 17 00:00:00 2001 From: Kathryn May Date: Thu, 20 Nov 2025 14:43:00 -0500 Subject: [PATCH 2/4] Updates to concepts page --- src/langsmith/evaluation-concepts.mdx | 215 ++++++++++++++++++-------- src/langsmith/evaluation.mdx | 4 +- 2 files changed, 151 insertions(+), 68 deletions(-) diff --git a/src/langsmith/evaluation-concepts.mdx b/src/langsmith/evaluation-concepts.mdx index eed0472951..cf18c106dc 100644 --- a/src/langsmith/evaluation-concepts.mdx +++ b/src/langsmith/evaluation-concepts.mdx @@ -3,11 +3,9 @@ title: Evaluation Concepts sidebarTitle: Concepts --- -This page explains the key concepts of LangSmith Evaluation. +LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring. -**Understand online vs offline evaluations** - LangSmith supports two types of evaluations based on **what objects they run on**: - [Offline evaluations](#offline-evaluation) run on _examples_ from _datasets_. Examples include reference outputs for comparison. Use offline evaluations for pre-deployment testing, benchmarking, and regression testing. @@ -16,38 +14,76 @@ LangSmith supports two types of evaluations based on **what objects they run on* **The key distinction:** offline evaluations have access to reference outputs (what you expect), while online evaluations only have access to actual inputs and outputs (what actually happened). -The building blocks of LangSmith Evaluation are: +## Evaluation lifecycle -- [Datasets & examples:](#datasets) Collections of test inputs and reference outputs used for offline evaluation. -- [Traces & runs:](#runs-for-online-evaluation) Production execution logs used for online evaluation. -- [Evaluators](#evaluators): Functions for scoring outputs. These can be [online evaluators](#online-evaluation) that run on traces in real time or [offline evaluators](#offline-evaluation) that run on a dataset. +As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy evolves from pre-deployment testing to production monitoring. LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. 
As applications mature, both evaluation types work together in an iterative feedback loop to improve quality continuously. -## Core evaluation objects +```mermaid +graph LR + A[Development] --> B[Testing] + B --> C[Deployment] + C --> D[Monitoring] + D --> E[Iteration] -The core objects that evaluations run on: + A -.-> F[Offline] + B -.-> F + C -.-> G[Online] + D -.-> G + E -.-> H[Both] -- [Examples](#examples) (for offline evaluation) are curated test cases used for pre-deployment testing. Because they include reference outputs, offline evaluators can compare actual vs expected results. -- [Runs](#runs-for-online-evaluation) (for online evaluation) are real production traces. Online evaluators assess these runs in near real-time without reference outputs, focusing on detecting issues, anomalies, or quality degradation. + style F fill:#8b5cf6,stroke:#7c3aed,color:#fff + style G fill:#8b5cf6,stroke:#7c3aed,color:#fff + style H fill:#8b5cf6,stroke:#7c3aed,color:#fff +``` -## Runs for online evaluation +### 1. Development with offline evaluation -A _run_ is a single execution trace from your [deployed application](/langsmith/deployments). Each run contains: -- **Inputs**: The actual user inputs your application received. -- **Outputs**: What your application actually returned. -- **Intermediate steps**: All the child runs (tool calls, LLM calls, and so on). -- **Metadata**: Tags, user feedback, latency metrics, etc. +Before production deployment, use offline evaluations to validate functionality, benchmark different approaches, and build confidence. -_Threads_ are collections of related runs (multi-turn conversations). Online evaluators can also run at the thread level to evaluate entire conversations. +1. Create a [dataset](/langsmith/manage-datasets) with representative test cases. +1. Run [offline evaluations](/langsmith/evaluate-llm-application) to measure performance. +1. Iterate on your application based on results. +1. Compare experiments to find the best configuration. -Learn more about [runs and traces in the Observability concepts](/langsmith/observability-concepts#runs). +Follow the [quickstart](/langsmith/evaluation-quickstart) to run your first offline evaluation. + +### 2. Initial deployment with online evaluation + +After deployment, use online evaluations to monitor production quality, detect unexpected issues, and collect real-world data. + +1. Set up [online evaluators](/langsmith/online-evaluations) in your tracing project. +1. Start with basic checks (e.g., no errors, response format validation). +1. Configure [alerts](/langsmith/alerts) for critical quality metrics. +1. Review traces that fail online evaluations. + +Learn how to [configure online evaluations](/langsmith/online-evaluations) for production monitoring. -## Datasets +### 3. Continuous improvement + +Use both evaluation types together in an iterative feedback loop. Online evaluations surface issues that become offline test cases, offline evaluations validate fixes, and online evaluations confirm production improvements. + +1. Online evaluations detect an issue in production. +1. Add the failing trace to a dataset as a [new example](/langsmith/manage-datasets-in-application#add-runs-to-a-dataset). +1. Reproduce and fix the issue locally. +1. Run offline evaluation to verify the fix. +1. Deploy the updated application. +1. Confirm the fix with online evaluations. + +## Core evaluation objects + +Evaluations run on different objects depending on whether they are offline or online. 
Understanding these objects is essential for choosing the right evaluation approach. + +### Objects for offline evaluation + +Offline evaluations run on datasets and examples. The presence of reference outputs enables comparison between expected and actual results. + +#### Datasets A dataset is a _collection of examples_ used for evaluating an application. An example is a test input, reference output pair. ![Dataset](/langsmith/images/dataset-concept.png) -### Examples +#### Examples Each example consists of: @@ -57,7 +93,7 @@ Each example consists of: ![Example](/langsmith/images/example-concept.png) -### Dataset curation +#### Dataset curation There are various ways to build datasets for evaluation, including: @@ -83,33 +119,50 @@ For high-traffic applications, several techniques help identify valuable runs to _Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size. -### Splits +#### Splits _Splits_ are partitions of a dataset that enable targeted evaluation on different subsets of examples. Splits serve two primary purposes: performance optimization (using a smaller split for rapid iterations and a larger split for final evaluation) and interpretability (evaluating different types of inputs separately). For example, a RAG application might use splits to focus on different question types (factual, opinion, etc.) and evaluate performance on each type separately. Learn how to [create and manage dataset splits](/langsmith/manage-datasets-in-application#create-and-manage-dataset-splits). -### Versions +#### Versions _Dataset versions_ track changes to datasets over time. When you add, update or delete an example, LangSmith creates a new [version](/langsmith/manage-datasets#version-a-dataset) automatically. This enables inspection and reversal of changes when needed. You can [tag](/langsmith/manage-datasets#tag-a-version) versions with human-readable names to mark important milestones in a dataset's history. Evaluations can target specific dataset versions. This is particularly useful in CI pipelines to ensure dataset updates do not break existing evaluation workflows. -## Evaluators +### Objects for online evaluation -_Evaluators_ are functions that score how well your application performs on a particular example. +Online evaluations run on runs and threads from production traffic. Without reference outputs, evaluators focus on detecting issues, anomalies, and quality degradation in real-time. -There are a number of ways to define and run evaluators: +#### Runs -- **Custom code**: Define [custom evaluators](/langsmith/code-evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI. -- **Built-in evaluators**: LangSmith has a number of built-in evaluators that you can configure and run via the UI. +A _run_ is a single execution trace from your [deployed application](/langsmith/deployments). Each run contains: +- **Inputs**: The actual user inputs your application received. +- **Outputs**: What your application actually returned. +- **Intermediate steps**: All the child runs (tool calls, LLM calls, and so on). +- **Metadata**: Tags, user feedback, latency metrics, etc. + +Unlike examples in datasets, runs do not include reference outputs. 
Online evaluators must assess quality without knowing what the "correct" answer should be, relying instead on quality heuristics, safety checks, and reference-free evaluation techniques. + +Learn more about [runs and traces in the Observability concepts](/langsmith/observability-concepts#runs). + +#### Threads + +_Threads_ are collections of related runs representing multi-turn conversations. Online evaluators can run at the thread level to evaluate entire conversations rather than individual turns. This enables assessment of conversation-level properties like coherence across turns, topic maintenance, and user satisfaction throughout an interaction. + +## Evaluators -You can run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground), or by configuring [rules](/langsmith/rules) to automatically run them on particular tracing projects or datasets. +_Evaluators_ are functions that score application performance. They provide the measurement layer for both offline and online evaluation, adapting their inputs based on what data is available. -Evaluators have the following components: +### Defining and running evaluators -- [Evaluator inputs](#evaluator-inputs) -- [Evaluator outputs](#evaluator-outputs) +Evaluators can be defined and executed in multiple ways: + +- **Custom code**: Define [custom evaluators](/langsmith/code-evaluator) as Python or TypeScript functions and run them client-side using the SDKs or server-side via the UI. +- **Built-in evaluators**: LangSmith provides built-in evaluators that can be configured and run via the UI. + +Run evaluators using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)), via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground), or by configuring [rules](/langsmith/rules) to automatically run them on tracing projects or datasets. ### Evaluator inputs @@ -130,7 +183,7 @@ Evaluators return one or more metrics as a dictionary or list of dictionaries. E - `score` | `value`: The metric value (`score` for numerical metrics, `value` for categorical metrics). - `comment` (optional): Additional reasoning or explanation for the score. -## Evaluation techniques +### Evaluation techniques LangSmith supports several evaluation approaches: @@ -139,17 +192,17 @@ LangSmith supports several evaluation approaches: - [LLM-as-judge](#llm-as-judge) - [Pairwise](#pairwise) -### Human +#### Human _Human evaluation_ involves manual review of application outputs and execution traces. This approach is [often an effective starting point for evaluation](https://hamel.dev/blog/posts/evals/#looking-at-your-traces). LangSmith provides tools to review application outputs and traces (all intermediate steps). [Annotation queues](#annotation-queues) streamline the process of collecting human feedback on application outputs. -### Heuristic +#### Heuristic _Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly. -### LLM-as-judge +#### LLM-as-judge _LLM-as-judge evaluators_ use LLMs to score application outputs. The grading rules and criteria are typically encoded in the LLM prompt. 
These evaluators can be: @@ -160,7 +213,7 @@ LLM-as-judge evaluators require careful review of scores and prompt tuning. Few- Learn about [how to define an LLM-as-a-judge evaluator](/langsmith/llm-as-judge). -### Pairwise +#### Pairwise _Pairwise evaluators_ compare outputs from two application versions using heuristics (e.g., which response is longer), LLMs (with pairwise prompts), or human reviewers. @@ -168,6 +221,24 @@ Pairwise evaluation works well when directly scoring an output is difficult but Learn [how run pairwise evaluations](/langsmith/evaluate-pairwise). +### Reference-free vs reference-based evaluators + +Understanding whether an evaluator requires reference outputs is essential for determining when it can be used. + +**Reference-free evaluators** assess quality without comparing to expected outputs. These work for both offline and online evaluation: +- **Safety checks**: Toxicity detection, PII detection, content policy violations +- **Format validation**: JSON structure, required fields, schema compliance +- **Quality heuristics**: Response length, latency, specific keywords +- **Reference-free LLM-as-judge**: Clarity, coherence, helpfulness, tone + +**Reference-based evaluators** require reference outputs and only work for offline evaluation: +- **Correctness**: Semantic similarity to reference answer +- **Factual accuracy**: Fact-checking against ground truth +- **Exact match**: Classification tasks with known labels +- **Reference-based LLM-as-judge**: Comparing output quality to a reference + +When designing an evaluation strategy, reference-free evaluators provide consistency across both offline testing and online monitoring, while reference-based evaluators enable more precise correctness checks during development. + ## Experiment An _experiment_ represents the results of evaluating a specific application version on a dataset. Each experiment captures outputs, evaluator scores, and execution traces for every example in the dataset. @@ -212,35 +283,11 @@ The `max_concurrency` argument uses a semaphore to limit concurrent tasks. `aeva _Caching_ stores API call results to disk to speed up future experiments. Set the `LANGSMITH_TEST_CACHE` environment variable to a valid folder path with write access. Future experiments that make identical API calls will reuse cached results instead of making new requests. -## Annotation queues - -Human feedback often provides the most valuable assessment of application quality. _Annotation queues_ enable structured collection of this feedback by organizing runs for human review. - -[Annotation queues](/langsmith/annotation-queues) flag specific application runs for annotation. Annotators review these runs in a streamlined interface to provide feedback. Annotated runs can then be transferred to a [dataset](#datasets) for future evaluations. - -Annotation queues complement [inline annotation](/langsmith/annotate-traces-inline) by offering additional capabilities: grouping runs together, specifying annotation criteria, and configuring reviewer permissions. - -Learn more about [annotation queues and human feedback](/langsmith/annotation-queues). 
- -## Offline and online evaluation comparison - -The following table summarizes the key differences between offline and online evaluations: - -| | **Offline Evaluation** | **Online Evaluation** | -|---|---|---| -| **Runs on** | Dataset (Examples) | Tracing Project (Runs/Threads) | -| **Data access** | Inputs, Outputs, Reference Outputs | Inputs, Outputs only | -| **When to use** | Pre-deployment, during development | Production, post-deployment | -| **Primary use cases** | Benchmarking, unit testing, regression testing, backtesting | Real-time monitoring, production feedback, anomaly detection | -| **Evaluation timing** | Batch processing on curated test sets | Real-time or near real-time on live traffic | -| **Setup location** | Evaluation tab (SDK, UI, Prompt Playground) | [Observability tab](/langsmith/online-evaluations) (automated rules) | -| **Data requirements** | Requires dataset curation | No dataset needed, evaluates live traces | - ## Offline evaluation -_Offline evaluation_ tests applications on pre-compiled datasets before deployment. The term "offline" refers to evaluating on curated data rather than live traffic. In contrast, online evaluation assesses deployed applications on real traffic in near real-time. Offline evaluation enables pre-deployment testing and version comparison. +Offline evaluation tests applications on curated datasets before deployment. By running evaluations on examples with reference outputs, teams can compare versions, validate functionality, and build confidence before exposing changes to users. -Offline evaluations run client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)) or server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or [automations](/langsmith/rules). +Run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)) or server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or [automations](/langsmith/rules). ![Offline](/langsmith/images/offline.png) @@ -282,12 +329,36 @@ Learn [how run pairwise evaluations](/langsmith/evaluate-pairwise). ## Online evaluation -_Online evaluation_ assesses deployed application outputs in near real-time on production traffic. No dataset or reference outputs are involved—evaluators process actual inputs and outputs as the application produces them. This enables production monitoring and detection of unintended behavior. Online evaluation complements offline evaluation: an online evaluator can classify input questions into categories that later inform dataset curation for offline evaluation. +Online evaluation assesses production application outputs in near real-time. Without reference outputs, these evaluations focus on detecting issues, monitoring quality trends, and identifying edge cases that inform future offline testing. Online evaluators typically run server-side. LangSmith provides built-in [LLM-as-judge evaluators](/langsmith/llm-as-judge) for configuration, and supports custom code evaluators that run within LangSmith. ![Online](/langsmith/images/online.png) +### Real-time monitoring + +Monitor application quality continuously as users interact with the system. Online evaluations run automatically on production traffic, providing immediate feedback on each interaction. 
This enables detection of quality degradation, unusual patterns, or unexpected behaviors before they impact significant user populations. + +### Anomaly detection + +Identify outliers and edge cases that deviate from expected patterns. Online evaluators can flag runs with unusual characteristics—extremely long or short responses, unexpected error rates, or outputs that fail safety checks—for human review and potential addition to offline datasets. + +### Production feedback loop + +Use insights from production to improve offline evaluation. Online evaluations surface real-world issues and usage patterns that may not appear in curated datasets. Failed production runs become candidates for dataset examples, creating an iterative cycle where production experience continuously refines testing coverage. + +## Human feedback mechanisms + +Human feedback often provides the most valuable assessment of application quality, particularly for subjective dimensions that automated evaluators struggle to capture. + +### Annotation queues + +_Annotation queues_ enable structured collection of human feedback by organizing runs for review. [Annotation queues](/langsmith/annotation-queues) flag specific application runs for annotation. Annotators review these runs in a streamlined interface to provide feedback. Annotated runs can then be transferred to a [dataset](#datasets) for future evaluations. + +Annotation queues complement [inline annotation](/langsmith/annotate-traces-inline) by offering additional capabilities: grouping runs together, specifying annotation criteria, and configuring reviewer permissions. + +Learn more about [annotation queues and human feedback](/langsmith/annotation-queues). + ## Testing ### Evaluations vs testing @@ -304,3 +375,17 @@ Running tests and evaluations together can be more resource efficient when a sys Evaluations can also be written using standard software testing tools like [pytest](/langsmith/pytest) or [Vitest/Jest](/langsmith/vitest-jest) for convenience. 
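+As a concrete illustration, the sketch below shows how an evaluation written with a standard testing tool might assert binary expectations about an application's output. `generate_summary` and `my_app` are hypothetical names used only for illustration; the assertions mirror the kind of deterministic, rule-based checks discussed above (valid JSON, required fields, a simple length bound).
+
+```python
+import json
+
+import pytest
+
+from my_app import generate_summary  # hypothetical application code
+
+
+@pytest.mark.parametrize(
+    "ticket",
+    ["Order #123 arrived damaged", "Cannot reset my password"],
+)
+def test_summary_is_well_formed(ticket: str) -> None:
+    """Binary assertions on the output: valid JSON with a non-empty summary."""
+    raw = generate_summary(ticket)
+    payload = json.loads(raw)  # fails the test if the output is not valid JSON
+    assert payload.get("summary")  # summary must be present and non-empty
+    assert len(payload["summary"]) < 500  # simple heuristic bound on length
+```
+
+Because these assertions are deterministic, a test like this can run in CI alongside fuzzier evaluation metrics.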
+## Quick reference: Offline vs online evaluation + +The following table summarizes the key differences between offline and online evaluations: + +| | **Offline Evaluation** | **Online Evaluation** | +|---|---|---| +| **Runs on** | Dataset (Examples) | Tracing Project (Runs/Threads) | +| **Data access** | Inputs, Outputs, Reference Outputs | Inputs, Outputs only | +| **When to use** | Pre-deployment, during development | Production, post-deployment | +| **Primary use cases** | Benchmarking, unit testing, regression testing, backtesting | Real-time monitoring, production feedback, anomaly detection | +| **Evaluation timing** | Batch processing on curated test sets | Real-time or near real-time on live traffic | +| **Setup location** | Evaluation tab (SDK, UI, Prompt Playground) | [Observability tab](/langsmith/online-evaluations) (automated rules) | +| **Data requirements** | Requires dataset curation | No dataset needed, evaluates live traces | + diff --git a/src/langsmith/evaluation.mdx b/src/langsmith/evaluation.mdx index fd41a50ba7..ca2c1e04bf 100644 --- a/src/langsmith/evaluation.mdx +++ b/src/langsmith/evaluation.mdx @@ -12,7 +12,6 @@ LangSmith supports two types of evaluations based on when and where they run: **Test before you ship** @@ -22,7 +21,6 @@ LangSmith supports two types of evaluations based on when and where they run: **Monitor in production** @@ -107,7 +105,7 @@ LangSmith supports two types of evaluations based on when and where they run: Evaluators can run on individual [runs](/langsmith/evaluation-concepts#runs-for-online-evaluation) or entire [threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators). - + Use insights from online evaluations to improve offline testing: - Add failing production traces to your [dataset](/langsmith/manage-datasets). - Create targeted evaluators for discovered issues. From 39bb829ed21a21d11907d3210b97e8ee870513ba Mon Sep 17 00:00:00 2001 From: Kathryn May Date: Thu, 20 Nov 2025 14:57:40 -0500 Subject: [PATCH 3/4] Remove test page and update overview --- src/docs.json | 1 - src/langsmith/evaluation.mdx | 60 ++----- .../offline-to-online-evaluations.mdx | 160 ------------------ 3 files changed, 12 insertions(+), 209 deletions(-) delete mode 100644 src/langsmith/offline-to-online-evaluations.mdx diff --git a/src/docs.json b/src/docs.json index 076655f90b..1943a5fdb0 100644 --- a/src/docs.json +++ b/src/docs.json @@ -1046,7 +1046,6 @@ "langsmith/evaluation-quickstart", "langsmith/evaluation-concepts", "langsmith/evaluation-approaches", - "langsmith/offline-to-online-evaluations", { "group": "Datasets", "pages": [ diff --git a/src/langsmith/evaluation.mdx b/src/langsmith/evaluation.mdx index ca2c1e04bf..4be9428915 100644 --- a/src/langsmith/evaluation.mdx +++ b/src/langsmith/evaluation.mdx @@ -36,41 +36,23 @@ LangSmith supports two types of evaluations based on when and where they run: - Build a collection of test cases called a [dataset](/langsmith/manage-datasets). Each dataset contains [examples](/langsmith/evaluation-concepts#examples). - - You can create datasets from: - - Manually curated examples - - Historical production traces - - Synthetic data generation - - Organize with [splits](/langsmith/evaluation-concepts#splits) for different test scenarios and track changes with [versions](/langsmith/evaluation-concepts#versions). 
+ Create a [dataset](/langsmith/manage-datasets) with [examples](/langsmith/evaluation-concepts#examples) from manually curated test cases, historical production traces, or synthetic data generation. - Create [evaluators](/langsmith/evaluation-concepts#evaluators) that score your application's performance. Choose from: - - [Human](/langsmith/evaluation-concepts#human) review - - [Heuristic](/langsmith/evaluation-concepts#heuristic) rules - - [LLM-as-judge](/langsmith/llm-as-judge) for nuanced quality assessment - - [Pairwise](/langsmith/evaluate-pairwise) comparison between versions - - Evaluators receive both your application's output and the reference output from the dataset example. + Create [evaluators](/langsmith/evaluation-concepts#evaluators) to score performance: + - [Human](/langsmith/evaluation-concepts#human) review + - [Heuristic](/langsmith/evaluation-concepts#heuristic) rules + - [LLM-as-judge](/langsmith/llm-as-judge) + - [Pairwise](/langsmith/evaluate-pairwise) comparison - Execute your application on the dataset to create an [experiment](/langsmith/evaluation-concepts#experiment). Each experiment: - - Runs your application on every example in the dataset. - - Applies your evaluators to score the outputs. - - Captures all results for analysis. - - Configure [repetitions, concurrency, and caching](/langsmith/evaluation-concepts#experiment-configuration) to optimize your evaluation runs. + Execute your application on the dataset to create an [experiment](/langsmith/evaluation-concepts#experiment). Configure [repetitions, concurrency, and caching](/langsmith/evaluation-concepts#experiment-configuration) to optimize runs. - Compare experiments to find the best version. Use offline evaluation for: - - [Benchmarking](/langsmith/evaluation-concepts#benchmarking) - - [Unit tests](/langsmith/evaluation-concepts#unit-tests) - - [Regression tests](/langsmith/evaluation-concepts#regression-tests) - - [Backtesting](/langsmith/evaluation-concepts#backtesting) + Compare experiments for [benchmarking](/langsmith/evaluation-concepts#benchmarking), [unit tests](/langsmith/evaluation-concepts#unit-tests), [regression tests](/langsmith/evaluation-concepts#regression-tests), or [backtesting](/langsmith/evaluation-concepts#backtesting). @@ -80,37 +62,19 @@ LangSmith supports two types of evaluations based on when and where they run: - Your application is running in production, handling real user traffic. Each interaction creates a [run](/langsmith/evaluation-concepts#runs-for-online-evaluation). - - Unlike offline evaluation, there are no reference outputs; you're evaluating real production behavior. + Each interaction creates a [run](/langsmith/evaluation-concepts#runs-for-online-evaluation) without reference outputs. - Set up [evaluators](/langsmith/online-evaluations) to run automatically on production traces: - - Safety checks - - Format validation - - Quality heuristics - - Reference-free LLM-as-judge - - Apply [filters and sampling rates](/langsmith/online-evaluations#4-optional-configure-a-sampling-rate) to control costs. + Set up [evaluators](/langsmith/online-evaluations) to run automatically on production traces: safety checks, format validation, quality heuristics, and reference-free LLM-as-judge. Apply [filters and sampling rates](/langsmith/online-evaluations#4-optional-configure-a-sampling-rate) to control costs. 
- Online evaluators run automatically on live traffic, providing: - - Real-time quality monitoring - - Anomaly detection - - Production feedback for each run - - Alerting on critical issues - - Evaluators can run on individual [runs](/langsmith/evaluation-concepts#runs-for-online-evaluation) or entire [threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators). + Evaluators run automatically on [runs](/langsmith/evaluation-concepts#runs-for-online-evaluation) or [threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators), providing real-time monitoring, anomaly detection, and alerting. - Use insights from online evaluations to improve offline testing: - - Add failing production traces to your [dataset](/langsmith/manage-datasets). - - Create targeted evaluators for discovered issues. - - Run offline experiments to validate fixes. - - Deploy and monitor improvements with online evaluations. + Add failing production traces to your [dataset](/langsmith/manage-datasets), create targeted evaluators, validate fixes with offline experiments, and redeploy. diff --git a/src/langsmith/offline-to-online-evaluations.mdx b/src/langsmith/offline-to-online-evaluations.mdx deleted file mode 100644 index fb18497821..0000000000 --- a/src/langsmith/offline-to-online-evaluations.mdx +++ /dev/null @@ -1,160 +0,0 @@ ---- -title: From offline to online evaluations -sidebarTitle: Offline to online ---- - -As you develop and [deploy your application](/langsmith/deployments), your evaluation strategy will evolve from pre-deployment testing (offline) to production monitoring (online). This page explains when and how to transition between these evaluation approaches. - -## Evaluation lifecycle - -LLM applications progress through distinct phases, each requiring different evaluation approaches. During development and testing, offline evaluations validate functionality against curated datasets. After deployment, online evaluations monitor production behavior on live traffic. As applications mature, both evaluation types work together in an iterative feedback loop to continuously improve quality. - -```mermaid -graph LR - A[Development] --> B[Testing] - B --> C[Deployment] - C --> D[Monitoring] - D --> E[Iteration] - - A -.-> F[Offline] - B -.-> F - C -.-> G[Online] - D -.-> G - E -.-> H[Both] - - style F fill:#8b5cf6,stroke:#7c3aed,color:#fff - style G fill:#8b5cf6,stroke:#7c3aed,color:#fff - style H fill:#8b5cf6,stroke:#7c3aed,color:#fff -``` - -### 1. Development with offline evaluation - -When building your application before any production deployment, use offline evaluations to validate basic functionality with unit tests, benchmark different models, prompts, or architectures, and build confidence before deploying to users. - -1. Create a [dataset](/langsmith/manage-datasets) with representative test cases. -1. Run [offline evaluations](/langsmith/evaluate-llm-application) to measure performance. -1. Iterate on your application based on results. -1. Compare experiments to find the best configuration. - - - Follow the quickstart to run your first offline evaluation. - - -### 2. Initial deployment with online evaluation - -When your application is deployed and handling real user traffic, use online evaluations to monitor production quality in real-time, detect unexpected issues or edge cases, validate that production performance matches offline testing, and collect real-world data for future offline evaluations. - -1. 
Set up [online evaluators](/langsmith/online-evaluations) in your tracing project. -1. Start with basic checks (e.g., no errors, response format validation). -1. Configure [alerts](/langsmith/alerts) for critical quality metrics. -1. Review traces that fail online evaluations. - - - Learn how to configure online evaluations for production monitoring. - - -### 3. Continuous improvement - -During ongoing operation and optimization, use both offline and online evaluations together in an iterative feedback loop. Online evaluations surface real-world issues and edge cases, which are then converted into dataset examples for offline testing. Offline evaluations validate fixes before redeploying, and online evaluations confirm improvements in production. - -1. Online evaluations detect an issue in production. -1. Add the failing trace to a dataset as a [new example](/langsmith/manage-datasets-in-application#add-runs-to-a-dataset). -1. Reproduce and fix the issue locally. -1. Run offline evaluation to verify the fix. -1. Deploy the updated application. -1. Confirm the fix with online evaluations. - -## Choose an evaluator - -### Reference-free evaluators (work for both) - -These evaluators don't need reference outputs and work for both offline and online evaluation: - -- **Safety checks:** Toxicity, PII detection, content policy violations -- **Format validation:** JSON structure, required fields, schema compliance -- **Heuristics:** Response length, latency, specific keywords -- **Reference-free LLM-as-judge:** Clarity, coherence, helpfulness, tone - -### Reference-based evaluators (offline only) - -These require reference outputs and only work for offline evaluation: - -- **Correctness:** Semantic similarity to reference answer -- **Factual accuracy:** Fact-checking against ground truth -- **Exact match:** Classification tasks with known labels -- **Reference-based LLM-as-judge:** Comparing output quality to a reference - -## Set up online evaluators - -For applications already running offline evaluations, follow these steps to add online monitoring: - -### Step 1. Identify critical quality metrics - -Review your offline evaluators and identify which metrics are most critical for production: -- What failures would impact users most? -- Which metrics have the tightest acceptable ranges? -- What safety checks must always pass? - -### Step 2. Adapt evaluators for online - -Convert critical offline evaluators to work without reference outputs: - -**Example:** Converting a correctness evaluator - -**Offline version (reference-based):** -```python -def evaluate_correctness(run, example): - # Compare output to reference output - actual = run["outputs"]["answer"] - expected = example["outputs"]["answer"] - return llm_judge_compare(actual, expected) -``` - -**Online version (reference-free):** -```python -def evaluate_correctness(run): - # Check output quality without reference - actual = run["outputs"]["answer"] - question = run["inputs"]["question"] - return llm_judge_quality(question, actual, criteria=[ - "Answers the question directly", - "Is factually coherent", - "Provides adequate detail" - ]) -``` - -### Step 3. Configure online evaluators - -1. Go to your [tracing project](/langsmith/observability-concepts#tracing-projects). -1. Create [new online evaluators](/langsmith/online-evaluations#configure-online-evaluators). -1. Start with a [sampling rate](/langsmith/online-evaluations#4-optional-configure-a-sampling-rate) of 10% to control costs. -1. 
Monitor results and adjust sampling rate based on value vs cost. - -### Step 4. Close the loop - -Create a process to convert online evaluation failures into offline test cases: - -1. Set up [alerts](/langsmith/alerts) for online evaluation failures. -1. Review failing traces regularly. -1. Add representative failures to your [offline datasets](/langsmith/manage-datasets-in-application#add-runs-to-a-dataset). -1. Fix issues and validate with offline evaluation before redeploying. - -## Learn more - - - - Detailed guide on configuring online evaluators - - - - Core concepts and comparison of evaluation types - - - - Learn how to backtest new versions against production data - - - - Collect human feedback on production traces - - From 062b4e8d439e892fe48d2489e9378007fd3cc6ae Mon Sep 17 00:00:00 2001 From: Kathryn May Date: Mon, 24 Nov 2025 15:32:53 -0500 Subject: [PATCH 4/4] Add updates from feedback --- src/docs.json | 2 + src/langsmith/evaluation-concepts.mdx | 237 +++++++-------------- src/langsmith/evaluation-types.mdx | 70 ++++++ src/langsmith/evaluation.mdx | 12 +- src/langsmith/experiment-configuration.mdx | 34 +++ 5 files changed, 188 insertions(+), 167 deletions(-) create mode 100644 src/langsmith/evaluation-types.mdx create mode 100644 src/langsmith/experiment-configuration.mdx diff --git a/src/docs.json b/src/docs.json index 1943a5fdb0..62c77cdc57 100644 --- a/src/docs.json +++ b/src/docs.json @@ -1074,6 +1074,7 @@ { "group": "Evaluation types", "pages": [ + "langsmith/evaluation-types", "langsmith/code-evaluator", "langsmith/llm-as-judge", "langsmith/composite-evaluators", @@ -1093,6 +1094,7 @@ { "group": "Evaluation techniques", "pages": [ + "langsmith/experiment-configuration", "langsmith/define-target-function", "langsmith/evaluate-on-intermediate-steps", "langsmith/multiple-scores", diff --git a/src/langsmith/evaluation-concepts.mdx b/src/langsmith/evaluation-concepts.mdx index cf18c106dc..aea88a8c06 100644 --- a/src/langsmith/evaluation-concepts.mdx +++ b/src/langsmith/evaluation-concepts.mdx @@ -1,18 +1,45 @@ --- -title: Evaluation Concepts +title: Evaluation concepts sidebarTitle: Concepts --- -LLM outputs are non-deterministic, subjective quality often matters more than correctness, and real-world performance can diverge significantly from controlled tests. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring. +LLM outputs are non-deterministic, which makes response quality hard to assess. Evaluations (evals) are a way to breakdown what "good" looks like and measure it. LangSmith Evaluation provides a framework for measuring quality throughout the application lifecycle, from pre-deployment testing to production monitoring. - -LangSmith supports two types of evaluations based on **what objects they run on**: +## What to evaluate -- [Offline evaluations](#offline-evaluation) run on _examples_ from _datasets_. Examples include reference outputs for comparison. Use offline evaluations for pre-deployment testing, benchmarking, and regression testing. -- [Online evaluations](#online-evaluation) run on _runs_ or _threads_ from [tracing](/langsmith/observability-quickstart) projects. These are real production traces without reference outputs. Use online evaluations for production monitoring and real-time performance tracking. +Before building evaluations, identify what matters for your application. 
Break down your system into its critical components—LLM calls, retrieval steps, tool invocations, output formatting—and determine quality criteria for each. -**The key distinction:** offline evaluations have access to reference outputs (what you expect), while online evaluations only have access to actual inputs and outputs (what actually happened). - +**Start with manually curated examples.** Create 5-10 examples of what "good" looks like for each critical component. These examples serve as your ground truth and inform which evaluation approaches to use. For instance: +- **RAG system**: Examples of good retrievals (relevant documents) and good answers (accurate, complete). +- **Agent**: Examples of correct tool selection and proper argument formatting. +- **Chatbot**: Examples of helpful, on-brand responses that address user intent. + +Once you've defined "good" through examples, you can measure how often your system produces similar quality outputs. + +## Offline and online evaluations + +LangSmith supports two types of evaluations that serve different purposes in your development workflow: + +### Offline evaluations + + Use offline evaluations for **pre-deployment testing**: +- **Benchmarking**: Compare multiple versions to find the best performer. +- **Regression testing**: Ensure new versions don't degrade quality. +- **Unit testing**: Verify correctness of individual components. +- **Backtesting**: Test new versions against historical data. + +Offline evaluations target [_examples_](#examples) from [_datasets_](#datasets)—curated test cases with reference outputs that define what "good" looks like. + +### Online evaluations + + Use online evaluations for **production monitoring**: +- **Real-time monitoring**: Track quality continuously on live traffic. +- **Anomaly detection**: Flag unusual patterns or edge cases. +- **Production feedback**: Identify issues to add to offline datasets. + +Online evaluations target [_runs_](#runs) and [_threads_](#threads) from [tracing](/langsmith/observability-quickstart)—real production traces without reference outputs. + +This difference in targets determines what you can evaluate: offline evaluations can check correctness against expected answers, while online evaluations focus on quality patterns, safety, and real-world behavior. ## Evaluation lifecycle @@ -40,40 +67,23 @@ graph LR Before production deployment, use offline evaluations to validate functionality, benchmark different approaches, and build confidence. -1. Create a [dataset](/langsmith/manage-datasets) with representative test cases. -1. Run [offline evaluations](/langsmith/evaluate-llm-application) to measure performance. -1. Iterate on your application based on results. -1. Compare experiments to find the best configuration. - Follow the [quickstart](/langsmith/evaluation-quickstart) to run your first offline evaluation. ### 2. Initial deployment with online evaluation After deployment, use online evaluations to monitor production quality, detect unexpected issues, and collect real-world data. -1. Set up [online evaluators](/langsmith/online-evaluations) in your tracing project. -1. Start with basic checks (e.g., no errors, response format validation). -1. Configure [alerts](/langsmith/alerts) for critical quality metrics. -1. Review traces that fail online evaluations. - Learn how to [configure online evaluations](/langsmith/online-evaluations) for production monitoring. ### 3. Continuous improvement Use both evaluation types together in an iterative feedback loop. 
Online evaluations surface issues that become offline test cases, offline evaluations validate fixes, and online evaluations confirm production improvements. -1. Online evaluations detect an issue in production. -1. Add the failing trace to a dataset as a [new example](/langsmith/manage-datasets-in-application#add-runs-to-a-dataset). -1. Reproduce and fix the issue locally. -1. Run offline evaluation to verify the fix. -1. Deploy the updated application. -1. Confirm the fix with online evaluations. +## Core evaluation targets -## Core evaluation objects +Evaluations run on different targets depending on whether they are offline or online. Understanding these targets is essential for choosing the right evaluation approach. -Evaluations run on different objects depending on whether they are offline or online. Understanding these objects is essential for choosing the right evaluation approach. - -### Objects for offline evaluation +### Targets for offline evaluation Offline evaluations run on datasets and examples. The presence of reference outputs enables comparison between expected and actual results. @@ -93,45 +103,9 @@ Each example consists of: ![Example](/langsmith/images/example-concept.png) -#### Dataset curation - -There are various ways to build datasets for evaluation, including: - -- [Manually curated examples](#manually-curated-examples) -- [Historical traces](#historical-traces) -- [Synthetic data](#synthetic-data) - -#### Manually curated examples +Learn more about [managing datasets](/langsmith/manage-datasets). -This is the recommended starting point for creating datasets. When building an application, you'll have some idea of what types of inputs the application should handle, and what "good" responses should be. Start by covering a few different common edge cases or situations. Even 10–20 high-quality, manually curated examples can be sufficient for initial testing. - -#### Historical traces - -Once an application is in production, it collects valuable information about real-world usage patterns. These production runs make excellent examples because they reflect actual user behavior. - -For high-traffic applications, several techniques help identify valuable runs to add to a dataset: - -- **User feedback**: Collect end user feedback to identify datapoints that received negative feedback. These represent cases where the application did not perform well and should be added to the dataset to test against in the future. -- **Heuristics**: Use other heuristics to identify interesting datapoints. For example, runs that took a long time to complete are worth examining and potentially adding to a dataset. -- **LLM feedback**: Use another LLM to detect noteworthy runs. For example, an LLM can label chatbot conversations where the user had to rephrase their question or correct the model, indicating the chatbot did not initially respond correctly. - -#### Synthetic data - -_Synthetic data generation_ creates additional examples artificially from existing ones. This approach works best when starting with several high-quality, hand-crafted examples, because the synthetic data typically uses these as templates. This provides a quick way to expand dataset size. - -#### Splits - -_Splits_ are partitions of a dataset that enable targeted evaluation on different subsets of examples. 
Splits serve two primary purposes: performance optimization (using a smaller split for rapid iterations and a larger split for final evaluation) and interpretability (evaluating different types of inputs separately). For example, a RAG application might use splits to focus on different question types (factual, opinion, etc.) and evaluate performance on each type separately. - -Learn how to [create and manage dataset splits](/langsmith/manage-datasets-in-application#create-and-manage-dataset-splits). - -#### Versions - -_Dataset versions_ track changes to datasets over time. When you add, update or delete an example, LangSmith creates a new [version](/langsmith/manage-datasets#version-a-dataset) automatically. This enables inspection and reversal of changes when needed. You can [tag](/langsmith/manage-datasets#tag-a-version) versions with human-readable names to mark important milestones in a dataset's history. - -Evaluations can target specific dataset versions. This is particularly useful in CI pipelines to ensure dataset updates do not break existing evaluation workflows. - -### Objects for online evaluation +### Targets for online evaluation Online evaluations run on runs and threads from production traffic. Without reference outputs, evaluators focus on detecting issues, anomalies, and quality degradation in real-time. @@ -188,7 +162,7 @@ Evaluators return one or more metrics as a dictionary or list of dictionaries. E LangSmith supports several evaluation approaches: - [Human](#human) -- [Heuristic](#heuristic) +- [Code](#code) - [LLM-as-judge](#llm-as-judge) - [Pairwise](#pairwise) @@ -198,9 +172,9 @@ _Human evaluation_ involves manual review of application outputs and execution t [Annotation queues](#annotation-queues) streamline the process of collecting human feedback on application outputs. -#### Heuristic +#### Code -_Heuristic evaluators_ are deterministic, rule-based functions. They work well for simple checks such as verifying that a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly. +_Code evaluators_ are deterministic, rule-based functions. They work well for checks such as verifying the structure of a chatbot's response is not empty, that generated code compiles, or that a classification matches exactly. #### LLM-as-judge @@ -251,129 +225,70 @@ Multiple experiments typically run on a given dataset to test different applicat Learn [how to analyze experiment results](/langsmith/analyze-an-experiment). -## Experiment configuration - -LangSmith supports several configuration options for experiments: - -- [Repetitions](#repetitions) -- [Concurrency](#concurrency) -- [Caching](#caching) +## Evaluation types -### Repetitions +LangSmith supports various evaluation approaches for different stages of development and deployment. Understanding when to use each type helps build a comprehensive evaluation strategy. -_Repetitions_ run an experiment multiple times to account for LLM output variability. Since LLM outputs are non-deterministic, multiple repetitions provide a more accurate performance estimate. 
+Offline and online evaluations serve different purposes: +- **Offline evaluation types** test pre-deployment on curated datasets with reference outputs +- **Online evaluation types** monitor production behavior on live traffic without reference outputs -Configure repetitions by passing the `num_repetitions` argument to `evaluate` / `aevaluate` ([Python](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate), [TypeScript](https://docs.smith.langchain.com/reference/js/interfaces/evaluation.EvaluateOptions#numrepetitions)). Each repetition re-runs both the target function and all evaluators. +Learn more about [evaluation types and when to use each](/langsmith/evaluation-types). -Learn more in the [repetitions how-to guide](/langsmith/repetition). +## Best practices -### Concurrency +### Building datasets -_Concurrency_ controls how many examples run simultaneously during an experiment. Configure it by passing the `max_concurrency` argument to `evaluate` / `aevaluate`. The semantics differ between the two functions: +There are various strategies for building datasets: -#### `evaluate` +**Manually curated examples** -The `max_concurrency` argument specifies the maximum number of concurrent threads for running both the target function and evaluators. +This is the recommended starting point. Create 10–20 high-quality examples covering common scenarios and edge cases. These examples define what "good" looks like for your application. -#### `aevaluate` +**Historical traces** -The `max_concurrency` argument uses a semaphore to limit concurrent tasks. `aevaluate` creates a task for each example, where each task runs the target function and all evaluators for that example. The `max_concurrency` argument specifies the maximum number of concurrent examples to process. +Once in production, convert real traces into examples. For high-traffic applications: +- **User feedback**: Add runs that received negative feedback to test against. +- **Heuristics**: Identify interesting runs (e.g., long latency, errors). +- **LLM feedback**: Use LLMs to detect noteworthy conversations. -### Caching +**Synthetic data** -_Caching_ stores API call results to disk to speed up future experiments. Set the `LANGSMITH_TEST_CACHE` environment variable to a valid folder path with write access. Future experiments that make identical API calls will reuse cached results instead of making new requests. +Generate additional examples from existing ones. Works best when starting with several high-quality, hand-crafted examples as templates. -## Offline evaluation +### Dataset organization -Offline evaluation tests applications on curated datasets before deployment. By running evaluations on examples with reference outputs, teams can compare versions, validate functionality, and build confidence before exposing changes to users. +**Splits** -Run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)) or server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or [automations](/langsmith/rules). +Partition datasets into subsets for targeted evaluation. Use splits for performance optimization (smaller splits for rapid iteration) and interpretability (evaluate different input types separately). 
-![Offline](/langsmith/images/offline.png) - -### Benchmarking - -_Benchmarking_ compares multiple application versions on a curated dataset to identify the best performer. This process involves creating a dataset of representative inputs, defining performance metrics, and testing each version. - -Benchmarking requires dataset curation with gold-standard reference outputs and well-designed comparison metrics. Examples: -- **RAG Q&A bot**: Dataset of questions and reference answers, with an LLM-as-judge evaluator checking semantic equivalence between actual and reference answers -- **ReACT agent**: Dataset of user requests and reference tool calls, with a heuristic evaluator verifying all expected tool calls were made - -### Unit tests - -_Unit tests_ verify the correctness of individual system components. In LLM contexts, [unit tests are often rule-based assertions](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) on inputs or outputs (e.g., verifying LLM-generated code compiles, JSON loads successfully) that validate basic functionality. - -Unit tests typically expect consistent passing results, making them suitable for CI pipelines. When running in CI, configure caching to minimize LLM API calls and associated costs. - -### Regression tests - -_Regression tests_ measure performance consistency across application versions over time. They ensure new versions do not degrade performance on cases the current version handles correctly, and ideally demonstrate improvements over the baseline. These tests typically run when making updates expected to affect user experience (e.g., model or architecture changes). - -LangSmith's comparison view highlights regressions (red) and improvements (green) relative to the baseline, enabling quick identification of changes. - -![Comparison view](/langsmith/images/comparison-view.png) - -### Backtesting - -_Backtesting_ evaluates new application versions against historical production data. Production logs are converted into a dataset, then newer versions process these examples to assess performance on past, realistic user inputs. - -This approach is commonly used for evaluating new model releases. For example, when a new model becomes available, test it on the most recent production runs and compare results to actual production outcomes. - -### Pairwise evaluation - -_Pairwise evaluation_ compares outputs from two versions by determining relative quality rather than assigning absolute scores. For some tasks, [determining "version A is better than B"](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) is easier than scoring each version independently. - -This approach proves particularly useful for LLM-as-judge evaluations on subjective tasks. For example, in summarization, determining "Which summary is clearer and more concise?" is often simpler than assigning numeric clarity scores. - -Learn [how run pairwise evaluations](/langsmith/evaluate-pairwise). - -## Online evaluation - -Online evaluation assesses production application outputs in near real-time. Without reference outputs, these evaluations focus on detecting issues, monitoring quality trends, and identifying edge cases that inform future offline testing. - -Online evaluators typically run server-side. LangSmith provides built-in [LLM-as-judge evaluators](/langsmith/llm-as-judge) for configuration, and supports custom code evaluators that run within LangSmith. 
- -![Online](/langsmith/images/online.png) - -### Real-time monitoring - -Monitor application quality continuously as users interact with the system. Online evaluations run automatically on production traffic, providing immediate feedback on each interaction. This enables detection of quality degradation, unusual patterns, or unexpected behaviors before they impact significant user populations. - -### Anomaly detection - -Identify outliers and edge cases that deviate from expected patterns. Online evaluators can flag runs with unusual characteristics—extremely long or short responses, unexpected error rates, or outputs that fail safety checks—for human review and potential addition to offline datasets. - -### Production feedback loop - -Use insights from production to improve offline evaluation. Online evaluations surface real-world issues and usage patterns that may not appear in curated datasets. Failed production runs become candidates for dataset examples, creating an iterative cycle where production experience continuously refines testing coverage. +Learn how to [create and manage dataset splits](/langsmith/manage-datasets-in-application#create-and-manage-dataset-splits). -## Human feedback mechanisms +**Versions** -Human feedback often provides the most valuable assessment of application quality, particularly for subjective dimensions that automated evaluators struggle to capture. +LangSmith automatically creates dataset [versions](/langsmith/manage-datasets#version-a-dataset) when examples change. [Tag versions](/langsmith/manage-datasets#tag-a-version) to mark important milestones. Target specific versions in CI pipelines to ensure dataset updates don't break workflows. -### Annotation queues +### Human feedback collection -_Annotation queues_ enable structured collection of human feedback by organizing runs for review. [Annotation queues](/langsmith/annotation-queues) flag specific application runs for annotation. Annotators review these runs in a streamlined interface to provide feedback. Annotated runs can then be transferred to a [dataset](#datasets) for future evaluations. +Human feedback often provides the most valuable assessment, particularly for subjective quality dimensions. -Annotation queues complement [inline annotation](/langsmith/annotate-traces-inline) by offering additional capabilities: grouping runs together, specifying annotation criteria, and configuring reviewer permissions. +**Annotation queues** -Learn more about [annotation queues and human feedback](/langsmith/annotation-queues). +[Annotation queues](/langsmith/annotation-queues) enable structured collection of human feedback. Flag specific runs for review, collect annotations in a streamlined interface, and transfer annotated runs to datasets for future evaluations. -## Testing +Annotation queues complement [inline annotation](/langsmith/annotate-traces-inline) by offering additional capabilities: grouping runs, specifying criteria, and configuring reviewer permissions. ### Evaluations vs testing -Testing and evaluation are similar and overlapping concepts that are frequently conflated. +Testing and evaluation are similar but distinct concepts. -**Evaluation measures performance according to metrics.** Evaluation metrics can be fuzzy or subjective, and prove more useful in relative terms than absolute ones. They typically compare two systems against each other rather than assert properties about an individual system. 
+**Evaluation measures performance according to metrics.** Metrics can be fuzzy or subjective, and prove more useful in relative terms. They typically compare systems against each other. **Testing asserts correctness.** A system can only be deployed if it passes all tests. -Evaluation metrics can be converted into tests. For example, regression tests can assert that any new system version must outperform a baseline version on relevant evaluation metrics. - -Running tests and evaluations together can be more resource efficient when a system is expensive to run and has overlapping datasets for tests and evaluations. +Evaluation metrics can be converted into tests. For example, regression tests can assert that new versions must outperform baseline versions on relevant metrics. Run tests and evaluations together for efficiency when systems are expensive to run. -Evaluations can also be written using standard software testing tools like [pytest](/langsmith/pytest) or [Vitest/Jest](/langsmith/vitest-jest) for convenience. +Evaluations can be written using standard testing tools like [pytest](/langsmith/pytest) or [Vitest/Jest](/langsmith/vitest-jest). ## Quick reference: Offline vs online evaluation diff --git a/src/langsmith/evaluation-types.mdx b/src/langsmith/evaluation-types.mdx new file mode 100644 index 0000000000..ddcd4a8f06 --- /dev/null +++ b/src/langsmith/evaluation-types.mdx @@ -0,0 +1,70 @@ +--- +title: Evaluation types +sidebarTitle: Evaluation types +--- + +LangSmith supports various evaluation types for different stages of development and deployment. Understanding when to use each helps build a comprehensive evaluation strategy. + +## Offline evaluation types + +Offline evaluation tests applications on curated datasets before deployment. By running evaluations on examples with reference outputs, teams can compare versions, validate functionality, and build confidence before exposing changes to users. + +Run offline evaluations client-side using the LangSmith SDK ([Python](https://docs.smith.langchain.com/reference/python/reference) and [TypeScript](https://docs.smith.langchain.com/reference/js)) or server-side via the [Prompt Playground](/langsmith/observability-concepts#prompt-playground) or [automations](/langsmith/rules). + +![Offline](/langsmith/images/offline.png) + +### Benchmarking + +_Benchmarking_ compares multiple application versions on a curated dataset to identify the best performer. This process involves creating a dataset of representative inputs, defining performance metrics, and testing each version. + +Benchmarking requires dataset curation with gold-standard reference outputs and well-designed comparison metrics. Examples: +- **RAG Q&A bot**: Dataset of questions and reference answers, with an LLM-as-judge evaluator checking semantic equivalence between actual and reference answers. +- **ReACT agent**: Dataset of user requests and reference tool calls, with a heuristic evaluator verifying all expected tool calls were made. + +### Unit tests + +_Unit tests_ verify the correctness of individual system components. In LLM contexts, [unit tests are often rule-based assertions](https://hamel.dev/blog/posts/evals/#level-1-unit-tests) on inputs or outputs (e.g., verifying LLM-generated code compiles, JSON loads successfully) that validate basic functionality. + +Unit tests typically expect consistent passing results, making them suitable for CI pipelines. When running in CI, configure caching to minimize LLM API calls and associated costs. 
+ +### Regression tests + +_Regression tests_ measure performance consistency across application versions over time. They ensure new versions do not degrade performance on cases the current version handles correctly, and ideally demonstrate improvements over the baseline. These tests typically run when making updates expected to affect user experience (e.g., model or architecture changes). + +LangSmith's comparison view highlights regressions (red) and improvements (green) relative to the baseline, enabling quick identification of changes. + +![Comparison view](/langsmith/images/comparison-view.png) + +### Backtesting + +_Backtesting_ evaluates new application versions against historical production data. Production logs are converted into a dataset, then newer versions process these examples to assess performance on past, realistic user inputs. + +This approach is commonly used for evaluating new model releases. For example, when a new model becomes available, test it on the most recent production runs and compare results to actual production outcomes. + +### Pairwise evaluation + +_Pairwise evaluation_ compares outputs from two versions by determining relative quality rather than assigning absolute scores. For some tasks, [determining "version A is better than B"](https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/) is easier than scoring each version independently. + +This approach proves particularly useful for LLM-as-judge evaluations on subjective tasks. For example, in summarization, determining "Which summary is clearer and more concise?" is often simpler than assigning numeric clarity scores. + +Learn [how run pairwise evaluations](/langsmith/evaluate-pairwise). + +## Online evaluation types + +Online evaluation assesses production application outputs in near real-time. Without reference outputs, these evaluations focus on detecting issues, monitoring quality trends, and identifying edge cases that inform future offline testing. + +Online evaluators typically run server-side. LangSmith provides built-in [LLM-as-judge evaluators](/langsmith/llm-as-judge) for configuration, and supports custom code evaluators that run within LangSmith. + +![Online](/langsmith/images/online.png) + +### Real-time monitoring + +Monitor application quality continuously as users interact with the system. Online evaluations run automatically on production traffic, providing immediate feedback on each interaction. This enables detection of quality degradation, unusual patterns, or unexpected behaviors before they impact significant user populations. + +### Anomaly detection + +Identify outliers and edge cases that deviate from expected patterns. Online evaluators can flag runs with unusual characteristics—extremely long or short responses, unexpected error rates, or outputs that fail safety checks—for human review and potential addition to offline datasets. + +### Production feedback loop + +Use insights from production to improve offline evaluation. Online evaluations surface real-world issues and usage patterns that may not appear in curated datasets. Failed production runs become candidates for dataset examples, creating an iterative cycle where production experience continuously refines testing coverage. 
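+To make this loop concrete, the following sketch uses the LangSmith Python SDK to copy recent failed production runs into a dataset for later offline evaluation. The project and dataset names are placeholders, and the `error=True` filter is only one example of how flagged runs might be selected.
+
+```python
+from langsmith import Client
+
+client = Client()
+
+# Placeholder names; adjust to your own tracing project and dataset.
+dataset = client.create_dataset(dataset_name="production-regressions")
+
+# One possible selection: recent production runs that raised an error.
+failed_runs = client.list_runs(project_name="my-production-app", error=True)
+
+for run in failed_runs:
+    # Reference outputs can be added later, once a human has reviewed the run.
+    client.create_example(
+        inputs=run.inputs,
+        outputs=run.outputs,
+        dataset_id=dataset.id,
+    )
+```
+
+Runs added this way become ordinary examples, so the next offline experiment exercises exactly the cases that failed in production.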
diff --git a/src/langsmith/evaluation.mdx b/src/langsmith/evaluation.mdx
index 4be9428915..532e9c7734 100644
--- a/src/langsmith/evaluation.mdx
+++ b/src/langsmith/evaluation.mdx
@@ -42,17 +42,17 @@ LangSmith supports two types of evaluations based on when and where they run:
      Create [evaluators](/langsmith/evaluation-concepts#evaluators) to score performance:
      - [Human](/langsmith/evaluation-concepts#human) review
-      - [Heuristic](/langsmith/evaluation-concepts#heuristic) rules
+      - [Code](/langsmith/evaluation-concepts#code) rules
      - [LLM-as-judge](/langsmith/llm-as-judge)
      - [Pairwise](/langsmith/evaluate-pairwise) comparison

-      Execute your application on the dataset to create an [experiment](/langsmith/evaluation-concepts#experiment). Configure [repetitions, concurrency, and caching](/langsmith/evaluation-concepts#experiment-configuration) to optimize runs.
+      Execute your application on the dataset to create an [experiment](/langsmith/evaluation-concepts#experiment). Configure [repetitions, concurrency, and caching](/langsmith/experiment-configuration) to optimize runs.

-      Compare experiments for [benchmarking](/langsmith/evaluation-concepts#benchmarking), [unit tests](/langsmith/evaluation-concepts#unit-tests), [regression tests](/langsmith/evaluation-concepts#regression-tests), or [backtesting](/langsmith/evaluation-concepts#backtesting).
+      Compare experiments for [benchmarking](/langsmith/evaluation-types#benchmarking), [unit tests](/langsmith/evaluation-types#unit-tests), [regression tests](/langsmith/evaluation-types#regression-tests), or [backtesting](/langsmith/evaluation-types#backtesting).

@@ -62,7 +62,7 @@ LangSmith supports two types of evaluations based on when and where they run:

-      Each interaction creates a [run](/langsmith/evaluation-concepts#runs-for-online-evaluation) without reference outputs.
+      Each interaction creates a [run](/langsmith/evaluation-concepts#runs) without reference outputs.

@@ -70,7 +70,7 @@ LangSmith supports two types of evaluations based on when and where they run:

-      Evaluators run automatically on [runs](/langsmith/evaluation-concepts#runs-for-online-evaluation) or [threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators), providing real-time monitoring, anomaly detection, and alerting.
+      Evaluators run automatically on [runs](/langsmith/evaluation-concepts#runs) or [threads](/langsmith/online-evaluations#configure-multi-turn-online-evaluators), providing real-time monitoring, anomaly detection, and alerting.

@@ -82,7 +82,7 @@ LangSmith supports two types of evaluations based on when and where they run:

-For more on the differences between offline and online evaluation, refer to the [Evaluation concepts](/langsmith/evaluation-concepts#offline-vs-online-evaluation-quick-comparison) page.
+For more on the differences between offline and online evaluation, refer to the [Evaluation concepts](/langsmith/evaluation-concepts#quick-reference-offline-vs-online-evaluation) page.
## Get started
diff --git a/src/langsmith/experiment-configuration.mdx b/src/langsmith/experiment-configuration.mdx
new file mode 100644
index 0000000000..ed513ac74c
--- /dev/null
+++ b/src/langsmith/experiment-configuration.mdx
@@ -0,0 +1,34 @@
+---
+title: Experiment configuration
+sidebarTitle: Experiment configuration
+---
+
+LangSmith supports several configuration options for experiments:
+
+- [Repetitions](#repetitions)
+- [Concurrency](#concurrency)
+- [Caching](#caching)
+
+### Repetitions
+
+_Repetitions_ run an experiment multiple times to account for LLM output variability. Since LLM outputs are non-deterministic, multiple repetitions provide a more accurate performance estimate.
+
+Configure repetitions by passing the `num_repetitions` argument to `evaluate` / `aevaluate` ([Python](https://docs.smith.langchain.com/reference/python/evaluation/langsmith.evaluation._runner.evaluate), [TypeScript](https://docs.smith.langchain.com/reference/js/interfaces/evaluation.EvaluateOptions#numrepetitions)). Each repetition re-runs both the target function and all evaluators.
+
+Learn more in the [repetitions how-to guide](/langsmith/repetition).
+
+### Concurrency
+
+_Concurrency_ controls how many examples run simultaneously during an experiment. Configure it by passing the `max_concurrency` argument to `evaluate` / `aevaluate`. The semantics differ between the two functions:
+
+#### `evaluate`
+
+The `max_concurrency` argument specifies the maximum number of concurrent threads for running both the target function and evaluators.
+
+#### `aevaluate`
+
+The `max_concurrency` argument uses a semaphore to limit concurrent tasks. `aevaluate` creates a task for each example, where each task runs the target function and all evaluators for that example. The `max_concurrency` argument specifies the maximum number of concurrent examples to process.
+
+### Caching
+
+_Caching_ stores API call results to disk to speed up future experiments. Set the `LANGSMITH_TEST_CACHE` environment variable to a valid folder path with write access. Future experiments that make identical API calls will reuse cached results instead of making new requests.
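+
+A minimal sketch of how these options fit together with the Python SDK's `evaluate` entry point; the dataset name, target function, evaluator, and cache path below are placeholders.
+
+```python
+import os
+
+from langsmith import Client
+
+# Caching, as described above: point LANGSMITH_TEST_CACHE at a writable folder (placeholder path).
+os.environ.setdefault("LANGSMITH_TEST_CACHE", "tests/cassettes")
+
+client = Client()
+
+
+def target(inputs: dict) -> dict:
+    # Placeholder application logic; replace with your real target function.
+    return {"answer": inputs["question"].upper()}
+
+
+def exact_match(outputs: dict, reference_outputs: dict) -> bool:
+    # Placeholder evaluator comparing actual and reference outputs.
+    return outputs["answer"] == reference_outputs["answer"]
+
+
+results = client.evaluate(
+    target,
+    data="my-dataset",       # placeholder dataset name
+    evaluators=[exact_match],
+    num_repetitions=3,       # re-run the target and evaluators three times per example
+    max_concurrency=4,       # limit how many examples run at once
+)
+```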