62 changes: 62 additions & 0 deletions docs/guides/web-ui/live-evals.md
@@ -0,0 +1,62 @@
---
title: "Live Evaluations: Monitor AI Systems in Production"
description: "View real-time online-evaluation activity for your agents and functions in Pydantic Logfire. Track pass rates, categorical labels, and numeric scores across production traffic."
---
# Live Evaluations

Live Evaluations streams evaluator results from production (or staging) traffic into the Pydantic Logfire web UI. Every target that emits evaluation events appears here, grouped by the target's name, with one row per evaluator and a sparkline showing activity over the selected time window.

To wire up evaluators on the Python side, see the [Online Evaluation guide](https://ai.pydantic.dev/evals/online-evaluation/) in the Pydantic AI docs. Live Evaluations renders any ingested `gen_ai.evaluation.result` OpenTelemetry event that follows the GenAI semantic conventions — the `@evaluate` decorator and the [`OnlineEvaluation`](https://ai.pydantic.dev/evals/online-evaluation/#agent-integration) agent capability from `pydantic-evals` are the supported entry points today.

## The Directory

Click **Evals: Live** in the sidebar to open the directory. Each row is one **target** — the name of the function or agent that produced the evaluation event.

Each row shows:

- **Target** --- the target name, with an agent or function icon derived from the parent span
- **Type** --- `agent` if the parent span carries `gen_ai.agent.name`, otherwise `function`
- **Evaluations** --- one compact cell per evaluator with a pass rate, numeric average, or categorical label summary plus a sparkline
- **Events** --- total number of evaluation events in the window
- **Last activity** --- when the most recent event arrived
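Conceptually, each row is a small aggregation over a target's events. A minimal sketch of how the **Type**, **Events**, and **Last activity** columns could be derived — the event dicts, their keys, and `directory_row` are hypothetical illustrations, not Logfire internals:

```python
from datetime import datetime, timezone

def directory_row(target: str, events: list[dict]) -> dict:
    """Summarize one target's evaluation events into a directory row."""
    # "agent" if any parent span carried gen_ai.agent.name, else "function"
    is_agent = any("gen_ai.agent.name" in e["parent_span_attrs"] for e in events)
    return {
        "target": target,
        "type": "agent" if is_agent else "function",
        "events": len(events),
        "last_activity": max(e["time"] for e in events),
    }

events = [
    {"parent_span_attrs": {"gen_ai.agent.name": "support-bot"},
     "time": datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)},
    {"parent_span_attrs": {"gen_ai.agent.name": "support-bot"},
     "time": datetime(2024, 5, 1, 12, 5, tzinfo=timezone.utc)},
]
row = directory_row("support-bot", events)
```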

Use the time-window tabs in the page header (**1h**, **6h**, **24h**, **7d**, **30d**) to adjust the range. Narrowing the window never empties the page: a target that was active anywhere in the last 30 days still appears as a silent row so you can see that evaluators are configured even if traffic is quiet right now.
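The window-versus-retention behavior can be sketched as follows — data shapes and function names are hypothetical, not the Logfire implementation:

```python
from datetime import datetime, timedelta, timezone

# Window lengths for the time-range tabs in the page header.
WINDOWS = {"1h": timedelta(hours=1), "6h": timedelta(hours=6),
           "24h": timedelta(hours=24), "7d": timedelta(days=7),
           "30d": timedelta(days=30)}

def visible_targets(event_times: dict, window: str, now: datetime) -> dict:
    """Map each listed target to its event count in the selected window.

    A count of 0 is a "silent row": active within the last 30 days, but
    not within the current window. Targets idle for over 30 days drop off.
    """
    cutoff = now - WINDOWS[window]
    retention = now - WINDOWS["30d"]
    rows = {}
    for target, times in event_times.items():
        if not any(t >= retention for t in times):
            continue  # no activity in 30 days: not listed at all
        rows[target] = sum(t >= cutoff for t in times)
    return rows

now = datetime(2024, 5, 10, tzinfo=timezone.utc)
rows = visible_targets(
    {"support-bot": [now - timedelta(days=2)],  # silent in a 1h window
     "old-job": [now - timedelta(days=40)]},    # dropped entirely
    "1h", now,
)
```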

The column headers are sortable --- click **Target**, **Type**, **Events**, or **Last activity** to re-order. The sort state is persisted in the URL so you can share a specific view.
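Carrying sort state in the URL can be sketched with the standard library — the `sort`/`dir` parameter names here are illustrative, not the actual Logfire URL scheme:

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def with_sort(url: str, column: str, direction: str) -> str:
    """Encode a sort column and direction into shareable query parameters."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    params["sort"] = [column]      # e.g. "events", "last_activity"
    params["dir"] = [direction]    # "asc" or "desc"
    query = urlencode(params, doseq=True)
    return f"{parts.scheme}://{parts.netloc}{parts.path}?{query}"

link = with_sort("https://example.com/evals/live", "events", "desc")
```

Anyone opening the shared link then sees the directory in the same order.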

If you haven't wired up any evaluators yet, the empty state shows getting-started snippets for the two entry points (the `@evaluate` decorator for any function, and the `OnlineEvaluation` capability for a Pydantic AI agent). Once events arrive, the same snippets remain accessible via the collapsible **How do I send events here?** strip above the directory.

## Target Detail Page

Click a target row to open its detail page. The page shows:

- Each evaluator attached to the target as its own row, with a larger sparkline, a numeric/pass-rate/categorical summary, and a failure count when the evaluator raised
- A **Recent events** table with the 20 most recent evaluation events for the target

The **Evaluator** filter dropdown narrows the recent-events table to a single evaluator. The time-window tabs behave the same as on the directory.

Each recent-events row includes an **open trace** link that jumps to the live trace view for the span the evaluation was attached to --- useful for seeing the full context of a low-scoring call.

## Evaluator Shapes

The **Evaluations** cell adapts to the shape of scores an evaluator produces:

- **Pass rate** --- evaluators that return `bool`. The cell shows a percentage and colors a dot based on health (≥95% green, ≥80% amber, otherwise red).
- **Numeric average** --- evaluators that return `int` or `float`. The cell shows the average score over the window.
- **Categorical labels** --- evaluators that return a string. The cell shows the most common label with a **`+N`** badge indicating how many other distinct labels were seen.
- **Error-only** --- evaluators that only raised in the window. The cell shows the failure count in red.

Evaluators that emit multiple score keys (a mapping return type) show up as one row per key.
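The shape detection above can be sketched like this — the pass-rate thresholds come from the list; the function and its return shape are otherwise illustrative:

```python
from collections import Counter

def summarize(scores: list) -> dict:
    """Summarize one evaluator's scores the way the Evaluations cell does."""
    # Check bool first: in Python, bool is a subclass of int.
    if all(isinstance(s, bool) for s in scores):
        rate = 100 * sum(scores) / len(scores)
        health = "green" if rate >= 95 else "amber" if rate >= 80 else "red"
        return {"shape": "pass_rate", "value": rate, "health": health}
    if all(isinstance(s, (int, float)) for s in scores):
        return {"shape": "numeric", "value": sum(scores) / len(scores)}
    counts = Counter(scores)
    top, _ = counts.most_common(1)[0]
    # "others" backs the +N badge: distinct labels besides the top one.
    return {"shape": "categorical", "top": top, "others": len(counts) - 1}
```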

## Integration with Traces

Each `gen_ai.evaluation.result` event is parented to the span whose call was evaluated, so the evaluation appears nested inside the original function call in the trace view. This makes it easy to jump from a failing evaluator back to the exact prompt, tool calls, and response that produced the result.

The event attributes follow the OTel GenAI semantic conventions:

- `gen_ai.evaluation.target` --- the function or agent name
- `gen_ai.evaluation.name` --- the evaluator name
- `gen_ai.evaluation.score.value` / `gen_ai.evaluation.score.label` --- score payload
- `gen_ai.evaluation.explanation` --- optional free-text reason
- `gen_ai.evaluation.evaluator_version` --- optional version tag so retired evaluator revisions can be filtered out
- `error.type` --- set when the evaluator raised
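Put together, one event's attribute payload might look like this — the keys come from the list above, while every value is purely illustrative:

```python
# Hypothetical attributes for a single gen_ai.evaluation.result event.
event_attributes = {
    "gen_ai.evaluation.target": "support_agent",        # function or agent name
    "gen_ai.evaluation.name": "answer_relevance",       # evaluator name
    "gen_ai.evaluation.score.value": 0.82,              # numeric score payload
    "gen_ai.evaluation.explanation": "Answer addresses the question directly.",
    "gen_ai.evaluation.evaluator_version": "v2",        # optional version tag
    # "error.type" would be set instead if the evaluator raised.
}
```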
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -197,6 +197,7 @@ nav:
- Detect Service is Down: how-to-guides/detect-service-is-down.md
- Evaluate:
- Evals: guides/web-ui/evals.md
- Live Evaluations: guides/web-ui/live-evals.md
- Datasets:
- Datasets: evaluate/datasets/index.md
- Web UI Guide: evaluate/datasets/ui.md
@@ -466,6 +467,7 @@ plugins:
- how-to-guides/detect-service-is-down.md
Evaluate:
- guides/web-ui/evals.md
- guides/web-ui/live-evals.md
- evaluate/datasets/*.md
Manage & Configure:
- guides/web-ui/organizations-and-projects.md