From c3b3362f4ccf095a9672fd1b97cddfb4b5f4f42f Mon Sep 17 00:00:00 2001
From: adk-bot
Date: Mon, 26 Jan 2026 22:25:25 +0000
Subject: [PATCH] docs: Add section on Custom Metrics in evaluation criteria.

---
 docs/evaluate/criteria.md | 76 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)

diff --git a/docs/evaluate/criteria.md b/docs/evaluate/criteria.md
index 7dc333502..9ae34be9d 100644
--- a/docs/evaluate/criteria.md
+++ b/docs/evaluate/criteria.md
@@ -544,3 +544,79 @@ turns in which the user simulator's response was judged
 to be valid according to the conversation scenario. A score of 1.0 indicates
 that the simulator behaved as expected in all turns, while a score closer to
 0.0 indicates that the simulator deviated in many turns. Higher values are better.
+
+## Custom Metrics
+
+In addition to the standard criteria, you can define your own custom metrics
+to evaluate agent performance. This lets you tailor the evaluation to your
+specific needs by providing Python functions that compute the scores.
+
+### When To Use Custom Metrics?
+
+Use custom metrics when you need to:
+
+* Evaluate aspects of agent performance that are not covered by the standard
+  criteria.
+* Implement domain-specific scoring logic that is unique to your use case.
+* Experiment with new or alternative evaluation methods.
+
+### How To Use Custom Metrics?
+
+To use a custom metric, you need to:
+
+1. **Define the metric in a Python function.** This function takes the agent's
+   `Trajectory` and the `EvalCase` as input and returns a score. A sketch of
+   such a function appears at the end of this section.
+
+2. **Configure the metric in your `test_config.json` file.** Add the metric to
+   the `criteria` dictionary and provide its implementation details in the
+   `custom_metrics` section.
+
+Under the `custom_metrics` field, you specify a `code_config` that points to
+the Python function implementing your metric.
+
+Example `EvalConfig` entry:
+
+```json
+{
+  "criteria": {
+    "my_custom_metric": 0.5,
+    "my_simple_metric": 0.8
+  },
+  "custom_metrics": {
+    "my_simple_metric": {
+      "code_config": {
+        "name": "path.to.my.simple.metric.function"
+      }
+    },
+    "my_custom_metric": {
+      "code_config": {
+        "name": "path.to.my.custom.metric.function"
+      },
+      "metric": {
+        "metric_name": "my_custom_metric",
+        "min_value": -10.0,
+        "max_value": 10.0,
+        "description": "My custom metric."
+      }
+    }
+  }
+}
+```
+
+In this example:
+
+* `my_custom_metric` and `my_simple_metric` are two custom metrics.
+* The `criteria` dictionary sets the passing threshold for each metric.
+* The `custom_metrics` dictionary maps each metric to its configuration.
+* `code_config.name` is the import path of the Python function that implements
+  the metric's logic.
+* Optionally, the `metric` entry can carry metadata such as `min_value`,
+  `max_value`, and a `description`.
+
+### Output And How To Interpret
+
+A custom metric returns a score whose meaning is defined by the logic in your
+Python function. The score is then compared against the threshold set for that
+metric in the `criteria` dictionary to determine whether the test passes or
+fails.
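+
+### Example Metric Function
+
+To make step 1 concrete, here is a minimal sketch of such a function. It
+follows the shape described above (a `Trajectory` and an `EvalCase` in, a
+numeric score out), but the attribute names it reads (`expected_keywords`,
+`final_response`) are illustrative assumptions rather than a confirmed API;
+adapt them to the fields your objects actually expose.
+
+```python
+def keyword_coverage_metric(trajectory, eval_case) -> float:
+    """Scores the fraction of expected keywords in the final response.
+
+    Returns a value in [0.0, 1.0]; higher is better.
+    """
+    # Hypothetical reference data attached to the eval case.
+    expected = getattr(eval_case, "expected_keywords", None) or []
+    if not expected:
+        # Nothing to check against; treat the case as fully passing.
+        return 1.0
+
+    # Hypothetical accessor for the agent's final response text.
+    response_text = str(getattr(trajectory, "final_response", "")).lower()
+    hits = sum(1 for keyword in expected if keyword.lower() in response_text)
+    return hits / len(expected)
+```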
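+
+If this function were saved in `my_project/metrics.py` (an illustrative path),
+its `code_config.name` would be `my_project.metrics.keyword_coverage_metric`,
+and a `criteria` threshold of 0.8 would require at least 80% keyword coverage
+for the test to pass (assuming, as with the standard criteria, that scores at
+or above the threshold pass).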