162 changes: 162 additions & 0 deletions .cursor/skills/run-interact-evals/SKILL.md
---
name: run-interact-evals
description: Run the @wix/interact rules evaluation pipeline using subagents, then score and visualize results. Use when the user asks to run evals, test rules, benchmark interact, run the eval pipeline, compare with-rules vs no-context, or check rule quality.
---

# Run Interact Evals

Evaluates @wix/interact rule files by generating LLM outputs via subagents and scoring them locally. Compares **with-rules** (trigger-specific rules) vs **no-context** (bare prompt).

The evals package lives at `packages/interact-evals/` in the monorepo.

## Prerequisites

Ensure dependencies are installed from the repo root:

```bash
nvm use
yarn install
```

## Workflow

### Step 0: Clean previous results

Always start fresh to avoid stale data leaking into subagent context.

```bash
find packages/interact-evals/results -type f ! -name 'dashboard.html' ! -name '.gitkeep' -delete
```

This removes all batch files, response files, parsed outputs, scores, manifest, and prompts — keeping only the dashboard template and .gitkeep.

### Step 1: Generate batch prompts

```bash
cd packages/interact-evals
node scripts/cursor-batch.js with-rules
node scripts/cursor-batch.js no-context
```

This creates `results/cursor-batch-{variant}-{category}.md` files (6 categories × 2 variants = 12 files).
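As a quick sanity check (no `-response` files exist yet at this step, so a plain count works):

```shell
# Count the generated batch prompt files; 6 categories x 2 variants = 12
ls packages/interact-evals/results/cursor-batch-*.md | wc -l
# Expected: 12
```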

### Step 2: Dispatch subagents

Launch subagents to process each batch file. **Max 4 concurrent subagents.**

Categories: `click`, `hover`, `viewenter`, `viewprogress`, `pointermove`, `integration`
Variants: `with-rules`, `no-context`

For each combination, launch a Task subagent with this prompt template:

```
CRITICAL ISOLATION RULES — read these first:
- Do NOT read, search, or explore ANY other files in the workspace.
- Do NOT use Glob, Grep, SemanticSearch, or Read on any file except the ONE batch file specified below.
- Do NOT look at tests/, assertions/, .rules-cache/, scripts/, or any other directory.
- Generate code ONLY from the instructions and context provided in the batch file.
- Your ONLY tools should be: Read (for the one batch file), and Write (for the response file).

Now read {workspace}/packages/interact-evals/results/cursor-batch-{variant}-{category}.md and follow ALL
its instructions exactly. Generate the requested code for every test case. Use the
EXACT "=== ID: {test-id} ===" format. Write the complete response to
{workspace}/packages/interact-evals/results/cursor-batch-{variant}-{category}-response.md using the
Write tool. You MUST write the file.
```

Use `subagent_type: "generalPurpose"`.

For **integration** tests, add to the prompt: "Generate the COMPLETE code including imports, config, and HTML/JSX."

#### Dispatch order (respecting max 4 concurrent)

Batch 1 (4 agents): with-rules click, hover, viewenter, viewprogress
Batch 2 (4 agents): with-rules pointermove, with-rules integration, no-context click, no-context hover
Batch 3 (4 agents): no-context viewenter, viewprogress, pointermove, integration

### Step 3: Verify all response files exist

After each batch, verify response files were written. If any are missing, re-dispatch that subagent with an emphasized "You MUST use the Write tool" instruction.

Check with:

```bash
ls packages/interact-evals/results/cursor-batch-*-response.md | wc -l
# Expected: 12
```
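To see which specific combinations are missing rather than just the count, a short loop over the variant/category matrix helps (a sketch using the paths from Step 2):

```shell
# Print one MISSING line per absent response file
results_dir="packages/interact-evals/results"
for variant in with-rules no-context; do
  for category in click hover viewenter viewprogress pointermove integration; do
    f="$results_dir/cursor-batch-$variant-$category-response.md"
    [ -f "$f" ] || echo "MISSING: $f"
  done
done
```

Re-dispatch a subagent for each file it prints.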

### Step 4: Parse and score

```bash
cd packages/interact-evals
node scripts/generate-prompts.js
node scripts/parse-cursor-response.js with-rules
node scripts/parse-cursor-response.js no-context
node scripts/score-outputs.js
```

Note: `generate-prompts.js` must run before `score-outputs.js` to create `manifest.json`.

Expected output: 29 outputs per variant (58 total).

If any variant parses fewer than 29, check which response files are missing and re-run Step 3 for those.
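To see the per-variant counts before re-running anything, count the parsed files directly (a sketch; it assumes parsed outputs land in `results/outputs/` with the variant name in each filename — check `parse-cursor-response.js` for the actual layout):

```shell
# Count parsed outputs per variant; each should report 29
for variant in with-rules no-context; do
  count=$(ls packages/interact-evals/results/outputs/ 2>/dev/null | grep -c -- "$variant")
  echo "$variant: $count (expected 29)"
done
```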

### Step 5: Rebuild dashboard and display results

After scoring, rebuild the HTML dashboard:

```bash
cd packages/interact-evals
node -e "
const fs = require('fs');
const scores = fs.readFileSync('results/scores.json', 'utf8');
let html = fs.readFileSync('results/dashboard.html', 'utf8');
html = html.replace(/\/\*SCORES_DATA\*\/\[[\s\S]*?\];/, '/*SCORES_DATA*/' + scores.trim() + ';');
fs.writeFileSync('results/dashboard.html', html);
"
```

The scorer prints three tables:

1. **Scores** — pass rate per category per variant with delta
2. **Input tokens** — context window cost per category
3. **Output tokens** — response size per category

Report the results to the user. Highlight:

- Categories where with-rules outperforms no-context
- Categories where no-context matches (rules add cost but not quality)
- Any failed tests with their failure reasons
- Token cost tradeoffs (input tokens for with-rules vs no-context)

## Adding a new variant

To add a variant, update these files:

- `packages/interact-evals/scripts/cursor-batch.js` — add to the variant validation list
- `packages/interact-evals/scripts/score-outputs.js` — add to `VARIANTS` and `VARIANT_LABELS`
- `packages/interact-evals/run-eval.sh` — add to the `VARIANTS` array

## Adding new test cases

Add YAML entries to `packages/interact-evals/tests/{category}.yaml`. Each test needs:

```yaml
- description: 'Short description'
vars:
rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/{trigger}.md
prompt: >
The user request to generate code for.
expected_trigger: click|hover|viewEnter|viewProgress|pointerMove
expected:
params_type: alternate|repeat|state
effect_type: namedEffect|keyframeEffect|transition|customEffect
assert:
- type: contains-any
value: ['expected', 'strings']
weight: 2
metric: metric_name
```

Rules are fetched from GitHub and cached locally in `packages/interact-evals/.rules-cache/`. To force a
re-fetch (e.g. after rule updates on master), delete that directory.
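In shell terms:

```shell
# Force the next eval run to re-fetch rules from GitHub master
rm -rf packages/interact-evals/.rules-cache/
```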
9 changes: 9 additions & 0 deletions packages/interact-evals/.gitignore
.rules-cache/

# Generated eval results (keep only dashboard template and .gitkeep)
results/cursor-batch-*.md
results/outputs/
results/prompts/
results/manifest.json
results/scores.json
results/latest.json
190 changes: 190 additions & 0 deletions packages/interact-evals/README.md
# @wix/interact — Rules Evaluation Pipeline

Automated evaluation of the interact LLM rules using [Promptfoo](https://promptfoo.dev).

## What this tests

Every test case runs **twice** — once with the full rules (SKILL.md + trigger-specific rule file) and once **without rules** (SKILL.md only). This lets you:

1. **Measure rule quality** — see if rules improve LLM output vs baseline
2. **Detect regressions** — compare scores before and after editing a rule
3. **Compare models** — uncomment additional providers in the config

## Quick start

### Option A: With API key (fully automated via Promptfoo)

```bash
cd packages/interact-evals
npm install
export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY
npx promptfoo@latest eval # run all tests
npx promptfoo@latest view # open the web UI to see results
```

### Option B: Without API key (using Cursor or any LLM)

No API key needed — use Cursor, ChatGPT, or any LLM you have access to.

```bash
cd packages/interact-evals
npm install

# Step 1: Generate batch prompt files (one per trigger category)
node scripts/cursor-batch.js with-rules
node scripts/cursor-batch.js without-rules
```

This creates files like:
- `results/cursor-batch-with-rules-click.md` (5 tests)
- `results/cursor-batch-with-rules-hover.md` (5 tests)
- `results/cursor-batch-without-rules-click.md` (5 tests)
- ...etc (6 categories × 2 variants = 12 files)

```bash
# Step 2: For each batch file, paste into Cursor chat and save the response
# e.g. paste cursor-batch-with-rules-click.md into Cursor
# Save response as: results/cursor-batch-with-rules-click-response.md
# Repeat for all 12 files

# Step 3: Parse all responses into individual outputs
node scripts/parse-cursor-response.js with-rules
node scripts/parse-cursor-response.js without-rules

# Step 4: Score everything locally (no API key needed)
node scripts/score-outputs.js
```

The scorer prints a comparison table:

```
Category With Rules Without Rules Delta
─────────────────────────────────────────────────────────
click 92% 68% +24% ✓
hover 88% 72% +16% ✓
viewenter 95% 60% +35% ✓
...
OVERALL 90% 67% +23%
```

Alternatively, generate individual prompt files for one-by-one collection:

```bash
node scripts/generate-prompts.js
# Prompts are in results/prompts/ — paste each into your LLM
# Save responses in results/outputs/ with matching filenames
node scripts/score-outputs.js
```

## Structure

```
packages/interact-evals/
├── promptfooconfig.yaml # Promptfoo config (for Option A with API key)
├── package.json # Dependencies (yaml) and npm scripts
├── prompts/
│ ├── with-rules.json # Chat template: SKILL.md + trigger rules
│ └── without-rules.json # Chat template: SKILL.md only (baseline)
├── context/
│ └── skill.md # General interact library reference
├── assertions/
│ ├── valid-config.js # Structural validation (all tests)
│ ├── anti-patterns.js # Known pitfall detection (all tests)
│ ├── click-checks.js # Click-specific semantic checks
│ ├── hover-checks.js # Hover-specific semantic checks
│ ├── viewenter-checks.js # ViewEnter-specific semantic checks
│ ├── viewprogress-checks.js # ViewProgress-specific semantic checks
│ └── pointermove-checks.js # PointerMove-specific semantic checks
├── tests/
│ ├── click.yaml # 5 click trigger test cases
│ ├── hover.yaml # 5 hover trigger test cases
│ ├── viewenter.yaml # 5 viewEnter trigger test cases
│ ├── viewprogress.yaml # 5 viewProgress trigger test cases
│ ├── pointermove.yaml # 5 pointerMove trigger test cases
│ └── integration.yaml # 4 integration/setup test cases
├── scripts/
│ ├── generate-prompts.js # Generate individual prompt files
│ ├── cursor-batch.js # Generate a single Cursor-friendly prompt
│ ├── parse-cursor-response.js # Parse Cursor batch response into outputs
│ └── score-outputs.js # Score collected outputs locally (no API key)
└── results/ # Output directory (gitignored)
├── prompts/ # Generated prompt files
├── outputs/ # Collected LLM outputs
├── scores.json # Detailed scoring results
└── manifest.json # Test case metadata
```

## How scoring works

Each test case is scored across multiple metrics:

| Metric | Source | What it checks |
|---|---|---|
| `structure` | `valid-config.js` | Has key, trigger, effects, valid effect type |
| `anti_patterns` | `anti-patterns.js` | No keyframeEffect+pointerMove 2D, no layout props, etc. |
| `semantic` | `*-checks.js` | Correct trigger, params, effect type, cross-targeting |
| `effect_choice` | inline asserts | Uses the right named effect preset |
| `completeness` | inline asserts | Includes all required properties |
| `compliance` | inline asserts | Doesn't refuse the task |

The web UI shows per-test scores and aggregate metrics, with side-by-side "with rules" vs "without rules" comparison.

## Workflow for rule changes

1. Run baseline: `npx promptfoo@latest eval`
2. Edit a rule file (e.g., `packages/interact/rules/hover.md`)
3. Run again: `npx promptfoo@latest eval`
4. Compare: `npx promptfoo@latest view` — the UI shows both runs

## Adding new test cases

Add a new entry to the relevant YAML file in `tests/`. Each test case needs:

```yaml
- description: 'Short description of what is tested'
vars:
rules: file://../packages/interact/rules/<trigger>.md
prompt: >
Natural language description of the desired interaction.
expected_trigger: <trigger>
expected:
params_type: alternate # optional: expected params.type
params_method: toggle # optional: expected params.method
effect_type: namedEffect # optional: namedEffect|keyframeEffect|transition|customEffect
named_effect: FadeIn # optional: specific preset name
cross_target: true # optional: source != target
has_fill_both: true # optional: needs fill:'both'
has_reversed: true # optional: needs reversed:true
has_duration: true # optional: needs a duration value
has_easing: true # optional: needs an easing value
assert:
- type: javascript
value: file://assertions/<trigger>-checks.js
weight: 3
metric: semantic
# Add any extra inline assertions
```
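For instance, a filled-in click test following this template might look like the following (values are illustrative, not taken from the actual suite):

```yaml
- description: 'Click toggles a fade on a separate card element'
  vars:
    rules: file://../packages/interact/rules/click.md
    prompt: >
      When the user clicks the button, fade the info card in;
      clicking again should fade it back out.
    expected_trigger: click
    expected:
      params_type: alternate
      params_method: toggle
      effect_type: namedEffect
      named_effect: FadeIn
      cross_target: true
      has_fill_both: true
      has_reversed: true
  assert:
    - type: javascript
      value: file://assertions/click-checks.js
      weight: 3
      metric: semantic
    - type: contains-any
      value: ['FadeIn']
      weight: 2
      metric: effect_choice
```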

## Adding LLM-as-judge (optional)

For subjective quality scoring, add `llm-rubric` assertions:

```yaml
assert:
- type: llm-rubric
value: >
The output should be a valid @wix/interact config that uses
the click trigger with alternate type, has separate source and
target keys, and includes fill:'both' and reversed:true.
weight: 2
metric: quality
```

This costs ~$0.01-0.05 per assertion but catches nuanced quality issues.

## Tips

- Set `temperature: 0` in the provider config for reproducible results
- Run multiple times (`npx promptfoo@latest eval --repeat 3`) to reduce noise
- Use `--filter-pattern "Click"` to run only matching tests
- Add `--no-cache` to force fresh LLM calls