Commit 5044ba1 (parent 0cdb8db)
FEATURE: Add docs for AI evals (#82)
1 file changed: +131 −0

---
title: Run Discourse AI evals
short_title: AI evals
id: ai-evals
---

# Overview

The Discourse AI plugin ships a Ruby CLI under `plugins/discourse-ai/evals` that exercises AI features against YAML definitions and records results. Use it to benchmark prompts, compare model outputs, and regression-test AI behaviors without touching the app database.

## Core concepts (what users need to know)

- **Eval case**: A YAML definition under `evals/cases/<group>/<id>.yml` that pairs inputs (`args`) with an expected outcome. Evals can check exact strings, regexes, or expected tool calls.
- **Feature**: The Discourse AI behavior under test, identified as `module:feature_name` (for example, `summarization:topic_summaries`). `--list-features` shows the valid keys.
- **Persona**: The system prompt wrapped around the LLM call. Runs default to the built-in prompt unless you pass `--persona-keys` to load alternate prompts from `evals/personas/*.yml`. Pass multiple keys to compare prompts in one run.
- **Judge**: A rubric embedded in some evals that requires a second LLM to grade outputs. Think of it as an automated reviewer: it reads the model output and scores it against the criteria. If an eval defines `judge`, pass `--judge <model>` or accept the default judge (`gpt-4o`). Without a judge, outputs are matched directly against the expected value.
- **Comparison modes**: `--compare personas` (one model, many personas) or `--compare llms` (one persona, many models). The judge picks a winner and reports ratings; non-comparison runs just report pass/fail.
- **Datasets**: Instead of YAML cases, pass `--dataset path.csv --feature module:feature` to build cases from CSV rows (`content` and `expected_output` columns required).
- **Logs**: Every run writes plain-text logs and structured traces to `plugins/discourse-ai/evals/log/` with timestamps and persona keys. Use them to inspect failures, skipped models, and judge decisions.

## Prerequisites

- Have a working Discourse development environment with the Discourse AI plugin present. The runner loads `config/environment` (from the repository root, or from `DISCOURSE_PATH` if set).
- LLMs are defined in `plugins/discourse-ai/config/eval-llms.yml`; copy it to `eval-llms.local.yml` to override entries locally. Each entry expects an `api_key_env` (or inline `api_key`), so export the matching environment variables before running, for example:
  - `OPENAI_API_KEY=...`
  - `ANTHROPIC_API_KEY=...`
  - `GEMINI_API_KEY=...`
- From the repository root, change into `plugins/discourse-ai/evals` and run `./run --help` to confirm the CLI is wired up (see the sketch below). If `evals/cases` is missing, it is cloned automatically from `discourse/discourse-ai-evals`.
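
For example, a minimal first-run check (the key value below is a placeholder; export whichever provider variables your configured models actually need):

```sh
# placeholder key; use the variables your eval-llms config expects
export OPENAI_API_KEY=sk-...

# run the CLI from its own directory and confirm it is wired up
cd plugins/discourse-ai/evals
./run --help
```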

## Discover available inputs

- `./run --list` lists all eval ids from `evals/cases/*/*.yml`.
- `./run --list-features` prints feature keys grouped by module (format: `module:feature`).
- `./run --list-models` shows LLM configs that can be hydrated from `eval-llms.yml`/`.local.yml`.
- `./run --list-personas` lists persona keys defined under `evals/personas/*.yml` plus the built-in `default`.
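
Taken together, a quick discovery pass before a run might look like:

```sh
./run --list-features   # feature keys, grouped by module
./run --list-models     # LLM configs that can hydrate
./run --list            # all eval ids
./run --list-personas   # persona keys plus the built-in default
```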

## Run evals

- Run a single eval against specific models:

```sh
OPENAI_API_KEY=... ./run --eval simple_summarization --models gpt-4o-mini
```

- Run every eval for a feature (or the whole suite) against multiple models:

```sh
./run --feature summarization:topic_summaries --models gpt-4o-mini,claude-3-5-sonnet-latest
```

Omitting `--models` hydrates every configured LLM. Models that cannot hydrate (missing API keys, etc.) are skipped with a log message.

- Some evals define a `judge` block. When any selected eval requires judging, the runner defaults to `--judge gpt-4o` unless you pass `--judge <name>`. Invalid or missing judge configs cause the CLI to exit before running.
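
For instance, a judged run with a non-default judge might look like the following, assuming `simple_summarization` defines a `judge` block and both model keys hydrate from your config:

```sh
./run --eval simple_summarization --models gpt-4o-mini --judge claude-3-5-sonnet-latest
```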

## Personas and comparison modes

- Supply custom prompts with `--persona-keys key1,key2`. Keys resolve to YAML files in `evals/personas`; each file takes an optional `key` (defaults to the filename), a `system_prompt`, and an optional `description`.
- Minimal persona example (`evals/personas/topic_summary_eval.yml`):

```yml
key: topic_summary_eval
description: Variant tuned for eval comparisons
system_prompt: |
  Summarize the topic in 2–4 sentences. Keep the original language and avoid new facts.
```

- `--compare personas` runs one model against multiple personas. The built-in `default` persona is automatically prepended so you can compare YAML prompts against stock behavior, and at least two personas are required (see the example after this list).
- `--compare llms` runs one persona (the default unless overridden) across multiple models and asks the judge to score them side by side.
- Non-comparison runs accept a single persona; pass one `--persona-keys` value or rely on the default prompt.
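
A hypothetical persona-comparison run, assuming the persona file above and a `topic-summary` eval id (the id that appears in the sample output further down):

```sh
./run --eval topic-summary --models gpt-4o-mini --persona-keys topic_summary_eval --compare personas
```

Because the built-in `default` persona is prepended automatically, a single `--persona-keys` value is enough to satisfy the two-persona minimum.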

## Dataset-driven runs

- Generate eval cases from a CSV instead of YAML by passing `--dataset path/to/file.csv --feature module:feature`. The CSV must include `content` and `expected_output` columns; each row becomes its own eval id (`dataset-<filename>-<row>`).
- Minimal CSV example:

```csv
content,expected_output
"This is spam!!! Buy now!",true
"Genuine question about hosting",false
```

- Example run:

```sh
./run --dataset evals/cases/spam/spam_eval_dataset.csv --feature spam:inspect_posts --models gpt-4o-mini
```

## Writing eval cases

- Store cases under `evals/cases/<group>/<name>.yml`. Each file must declare `id`, `name`, `description`, and `feature` (the `module:feature` key registered with the plugin).
- Provide inputs under `args`. Keys ending in `_path` (or `path`) are expanded relative to the YAML directory so you can reference fixture files. For multi-case files, `args` can contain arrays (for example, `cases:`) that runners iterate over.
- Expected results can be declared with one of:
  - `expected_output`: exact string match
  - `expected_output_regex`: treated as a multiline regular expression
  - `expected_tool_call`: expected tool invocation payload
- Set `vision: true` for evals that require a vision-capable model. Include a `judge` section (`pass_rating`, `criteria`, and optional `label`) to have outputs scored by a judge LLM. A sketch of a full case follows this list.
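
A minimal sketch of a judged case file under the schema above. The top-level fields and the `judge` keys match what this section documents; the `args` key name, fixture path, and rating threshold are illustrative only:

```yml
id: simple_summarization
name: Simple summarization
description: Summarize a short fixture topic
feature: summarization:topic_summaries
args:
  # hypothetical key; *_path keys expand relative to this YAML's directory
  topic_path: fixtures/topic.txt
# judged variant; alternatively declare one of expected_output,
# expected_output_regex, or expected_tool_call for direct matching
judge:
  label: concise-summary   # optional
  pass_rating: 7           # illustrative threshold
  criteria: |
    The summary is 2-4 sentences, factual, and stays in the topic's language.
```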

## Results and logs

- CLI output shows pass/fail per model and prints expected vs. actual details on failures. Comparison runs also stream the judge's winner and ratings.
- Example pass/fail snippet:

```
gpt-4o-mini: Passed 🟢
claude-3-5-sonnet-latest: Failed 🔴
---- Expected ----
true
---- Actual ----
false
```

- Comparison winner snippet:

```
Comparing personas for topic-summary
Winner: topic_summary_eval
Reason: Captured key details and stayed concise.
- default: 7/10 — missed concrete use case
- topic_summary_eval: 9/10 — mentioned service dogs and tone was neutral
```

- Each run writes plain logs and structured traces to `plugins/discourse-ai/evals/log/` (timestamped `.log` and `.json` files). The JSON traces can be opened in [ui.perfetto.dev](https://ui.perfetto.dev) to inspect the structured steps.
- On completion the runner echoes the log paths; use them to audit skipped models, judge decisions, and raw outputs when iterating on prompts or features.

## Common features (what to try first)

- `summarization:topic_summaries`: Summarize a conversation.
- `spam:inspect_posts`: Spam/ham classification.
- `translation:topic_title_translator`: Translate topic titles while preserving tone/formatting.
- `ai_helper:rewrite`: Prompt the AI helper for rewrites.
- `tool_calls:tool_calls_with_no_tool` and `tool_calls:tool_call_chains`: Validate structured tool call behavior.
