---
title: Run Discourse AI evals
short_title: AI evals
id: ai-evals
---

# Overview

The Discourse AI plugin ships a Ruby CLI under `plugins/discourse-ai/evals` that exercises AI features against YAML definitions and records results. Use it to benchmark prompts, compare model outputs, and regression-test AI behaviors without touching the app database.

## Core concepts (what users need to know)

- **Eval case**: A YAML definition under `evals/cases/<group>/<id>.yml` that pairs inputs (`args`) with an expected outcome. Evals can check exact strings, regexes, or expected tool calls.
- **Feature**: The Discourse AI behavior under test, identified as `module:feature_name` (for example, `summarization:topic_summaries`). `--list-features` shows the valid keys.
- **Persona**: The system prompt wrapped around the LLM call. Runs default to the built-in prompt unless you pass `--persona-keys` to load alternate prompts from `evals/personas/*.yml`. Add multiple keys to compare prompts in one run.
- **Judge**: A rubric embedded in some evals that requires a second LLM to grade outputs. Think of it as an automated reviewer: it reads the model output and scores it against the criteria. If an eval defines `judge`, you must supply or accept the default judge model (`--judge`, default `gpt-4o`). Without a judge, outputs are matched directly against the expected value.
- **Comparison modes**: `--compare personas` (one model, many personas) or `--compare llms` (one persona, many models). The judge picks a winner and reports ratings; non-comparison runs just report pass/fail.
- **Datasets**: Instead of YAML cases, pass `--dataset path.csv --feature module:feature` to build cases from CSV rows (`content` and `expected_output` columns required).
- **Logs**: Every run writes plain text and structured traces to `plugins/discourse-ai/evals/log/` with timestamps and persona keys. Use them to inspect failures, skipped models, and judge decisions.

## Prerequisites

- Have a working Discourse development environment with the Discourse AI plugin present. The runner loads `config/environment` from the repository root, or from `DISCOURSE_PATH` if it is set.
- LLMs are defined in `plugins/discourse-ai/config/eval-llms.yml`; copy it to `eval-llms.local.yml` to override entries locally. Each entry expects an `api_key_env` (or inline `api_key`), so export the matching environment variables before running (a sample entry is sketched after this list), for example:
  - `OPENAI_API_KEY=...`
  - `ANTHROPIC_API_KEY=...`
  - `GEMINI_API_KEY=...`
- From the repository root, change into `plugins/discourse-ai/evals` and run `./run --help` to confirm the CLI is wired up. If `evals/cases` is missing, it is cloned automatically from `discourse/discourse-ai-evals`.
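
As referenced above, an override entry in `eval-llms.local.yml` might look roughly like the sketch below. Only `api_key_env` (or an inline `api_key`) is described in this guide; the other field names are assumptions, so mirror an existing entry in the shipped `eval-llms.yml` for the exact schema.

```yml
# Hypothetical override entry; copy the structure of an existing entry in eval-llms.yml.
# Field names other than api_key_env are assumptions for illustration.
gpt-4o-mini:
  display_name: GPT-4o mini
  name: gpt-4o-mini
  provider: open_ai
  api_key_env: OPENAI_API_KEY   # export this variable before running, as listed above
```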

## Discover available inputs

- `./run --list` lists all eval ids from `evals/cases/*/*.yml`.
- `./run --list-features` prints feature keys grouped by module (format: `module:feature`).
- `./run --list-models` shows LLM configs that can be hydrated from `eval-llms.yml`/`.local.yml`.
- `./run --list-personas` lists persona keys defined under `evals/personas/*.yml` plus the built-in `default`.
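
A minimal first session, assuming your API keys are exported and you are running from a Discourse checkout, might chain the setup and discovery commands like this (the key value is intentionally elided):

```sh
cd plugins/discourse-ai/evals   # run the CLI from the plugin's evals directory
export OPENAI_API_KEY=...       # match the api_key_env names in eval-llms.yml
./run --help                    # confirm the CLI is wired up
./run --list-features           # pick a module:feature key
./run --list-models             # see which LLM configs can hydrate
```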

## Run evals

- Run a single eval against specific models:

  ```sh
  OPENAI_API_KEY=... ./run --eval simple_summarization --models gpt-4o-mini
  ```

- Run every eval for a feature (or the whole suite) against multiple models:

  ```sh
  ./run --feature summarization:topic_summaries --models gpt-4o-mini,claude-3-5-sonnet-latest
  ```

  Omitting `--models` hydrates every configured LLM. Models that cannot hydrate (missing API keys, etc.) are skipped with a log message.

- Some evals define a `judge` block. When any selected eval requires judging, the runner defaults to `--judge gpt-4o` unless you pass `--judge name`. Invalid or missing judge configs cause the CLI to exit before running.
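
  For example, to pin the judge explicitly rather than relying on the `gpt-4o` default (eval and model names are reused from the examples above; whether a given eval actually requires judging depends on its YAML):

  ```sh
  ./run --eval simple_summarization --models gpt-4o-mini --judge claude-3-5-sonnet-latest
  ```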

## Personas and comparison modes

- Supply custom prompts with `--persona-keys key1,key2`. Keys resolve to YAML files in `evals/personas`; each defines a `system_prompt`, plus an optional `key` (defaults to the filename) and an optional `description`.
- Minimal persona example (`evals/personas/topic_summary_eval.yml`):

  ```yml
  key: topic_summary_eval
  description: Variant tuned for eval comparisons
  system_prompt: |
    Summarize the topic in 2–4 sentences. Keep the original language and avoid new facts.
  ```

- `--compare personas` runs one model against multiple personas (see the example after this list). The built-in `default` persona is automatically prepended so you can compare YAML prompts against stock behavior, and at least two personas are required.
- `--compare llms` runs one persona (default unless overridden) across multiple models and asks the judge to score them side by side.
- Non-comparison runs accept a single persona; pass one `--persona-keys` value or rely on the default prompt.
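
As referenced above, a persona comparison run combining these pieces might look like this (eval and model names are reused from earlier examples; `topic_summary_eval` is the sample persona defined above, and the judge falls back to `gpt-4o` if you omit `--judge`):

```sh
./run --eval simple_summarization --models gpt-4o-mini \
  --persona-keys topic_summary_eval --compare personas
```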

## Dataset-driven runs

- Generate eval cases from a CSV instead of YAML by passing `--dataset path/to/file.csv --feature module:feature`. The CSV must include `content` and `expected_output` columns; each row becomes its own eval id (`dataset-<filename>-<row>`).
- Minimal CSV example:

  ```csv
  content,expected_output
  "This is spam!!! Buy now!",true
  "Genuine question about hosting",false
  ```

- Example run:

  ```sh
  ./run --dataset evals/cases/spam/spam_eval_dataset.csv --feature spam:inspect_posts --models gpt-4o-mini
  ```

## Writing eval cases

- Store cases under `evals/cases/<group>/<name>.yml`. Each file must declare `id`, `name`, `description`, and `feature` (the `module:feature` key registered with the plugin). A hypothetical case is sketched after this list.
- Provide inputs under `args`. Keys ending in `_path` (or `path`) are expanded relative to the YAML directory so you can reference fixture files. For multi-case files, `args` can contain arrays (for example, `cases:`) that runners iterate over.
- Expected results can be declared with one of:
  - `expected_output`: exact string match
  - `expected_output_regex`: treated as a multiline regular expression
  - `expected_tool_call`: expected tool invocation payload
- Set `vision: true` for evals that require a vision-capable model. Include a `judge` section (`pass_rating`, `criteria`, and optional `label`) to have outputs scored by a judge LLM.
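
As a sketch of how those fields fit together, a hypothetical case file might look like this. The `args` key (`post`) and the sample values are invented for illustration; the top-level fields, the expected-result options, and the `judge` fields are the ones documented above.

```yml
# evals/cases/spam/short_spam_post.yml (hypothetical example)
id: short_spam_post
name: Short spam post
description: Flags an obvious advertisement as spam
feature: spam:inspect_posts
args:
  post: "LIMITED OFFER!!! Click here to buy cheap watches"   # assumed args key for this feature
expected_output: "true"
# Alternatives: expected_output_regex, expected_tool_call, or a judge block
# (pass_rating, criteria, optional label) for LLM-scored output.
```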

## Results and logs

- CLI output shows pass/fail per model and prints expected vs actual details on failures. Comparison runs also stream the judge’s winner and ratings.
- Example pass/fail snippet:

  ```
  gpt-4o-mini: Passed 🟢
  claude-3-5-sonnet-latest: Failed 🔴
  ---- Expected ----
  true
  ---- Actual ----
  false
  ```

- Comparison winner snippet:

  ```
  Comparing personas for topic-summary
  Winner: topic_summary_eval
  Reason: Captured key details and stayed concise.
  - default: 7/10 — missed concrete use case
  - topic_summary_eval: 9/10 — mentioned service dogs and tone was neutral
  ```

- Each run writes plain logs and structured traces to `plugins/discourse-ai/evals/log/` (timestamped `.log` and `.json` files). The `.json` traces can be loaded into [ui.perfetto.dev](https://ui.perfetto.dev) to inspect the structured steps.
- On completion the runner echoes the log paths; use them to audit skipped models, judge decisions, and raw outputs when iterating on prompts or features.

## Common features (what to try first)

- `summarization:topic_summaries`: Summarize a conversation.
- `spam:inspect_posts`: Spam/ham classification.
- `translation:topic_title_translator`: Translate topic titles while preserving tone/formatting.
- `ai_helper:rewrite`: Prompt the AI helper for rewrites.
- `tool_calls:tool_calls_with_no_tool` and `tool_calls:tool_call_chains`: Validate structured tool call behavior.