162 changes: 162 additions & 0 deletions .cursor/skills/run-interact-evals/SKILL.md
---
name: run-interact-evals
description: Run the @wix/interact rules evaluation pipeline using subagents, then score and visualize results. Use when the user asks to run evals, test rules, benchmark interact, run the eval pipeline, compare with-rules vs no-context, or check rule quality.
---

# Run Interact Evals

Evaluates @wix/interact rule files by generating LLM outputs via subagents and scoring them locally. Compares **with-rules** (trigger-specific rules) vs **no-context** (bare prompt).

The evals package lives at `packages/interact-evals/` in the monorepo.

## Prerequisites

Ensure dependencies are installed from the repo root:

```bash
nvm use
yarn install
```

## Workflow

### Step 0: Clean previous results

Always start fresh to avoid stale data leaking into subagent context.

```bash
find packages/interact-evals/results -type f ! -name 'dashboard.html' ! -name '.gitkeep' -delete
```

This removes all batch files, response files, parsed outputs, scores, manifest, and prompts — keeping only the dashboard template and .gitkeep.

### Step 1: Generate batch prompts

```bash
cd packages/interact-evals
node scripts/cursor-batch.js with-rules
node scripts/cursor-batch.js no-context
```

This creates `results/cursor-batch-{variant}-{category}.md` files (6 categories × 2 variants = 12 files).
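As a quick sanity check (no `-response` files exist yet at this step, so a plain count works):

```shell
# Count the generated batch prompt files; 6 categories x 2 variants = 12
ls packages/interact-evals/results/cursor-batch-*.md | wc -l
# Expected: 12
```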

### Step 2: Dispatch subagents

Launch subagents to process each batch file. **Max 4 concurrent subagents.**

Categories: `click`, `hover`, `viewenter`, `viewprogress`, `pointermove`, `integration`
Variants: `with-rules`, `no-context`

For each combination, launch a Task subagent with this prompt template:

```
CRITICAL ISOLATION RULES — read these first:
- Do NOT read, search, or explore ANY other files in the workspace.
- Do NOT use Glob, Grep, SemanticSearch, or Read on any file except the ONE batch file specified below.
- Do NOT look at tests/, assertions/, .rules-cache/, scripts/, or any other directory.
- Generate code ONLY from the instructions and context provided in the batch file.
- Your ONLY tools should be: Read (for the one batch file), and Write (for the response file).

Now read {workspace}/packages/interact-evals/results/cursor-batch-{variant}-{category}.md and follow ALL
its instructions exactly. Generate the requested code for every test case. Use the
EXACT "=== ID: {test-id} ===" format. Write the complete response to
{workspace}/packages/interact-evals/results/cursor-batch-{variant}-{category}-response.md using the
Write tool. You MUST write the file.
```

Use `subagent_type: "generalPurpose"`.

For **integration** tests, add to the prompt: "Generate the COMPLETE code including imports, config, and HTML/JSX."

#### Dispatch order (respecting max 4 concurrent)

Batch 1 (4 agents): with-rules click, hover, viewenter, viewprogress
Batch 2 (4 agents): with-rules pointermove, with-rules integration, no-context click, no-context hover
Batch 3 (4 agents): no-context viewenter, viewprogress, pointermove, integration

### Step 3: Verify all response files exist

After each batch, verify response files were written. If any are missing, re-dispatch that subagent with an emphasized "You MUST use the Write tool" instruction.

Check with:

```bash
ls packages/interact-evals/results/cursor-batch-*-response.md | wc -l
# Expected: 12
```
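To see which specific combinations are missing rather than just the count, a short loop over the variant/category matrix helps (a sketch using the paths from Step 2):

```shell
# Print one MISSING line per absent response file
results_dir="packages/interact-evals/results"
for variant in with-rules no-context; do
  for category in click hover viewenter viewprogress pointermove integration; do
    f="$results_dir/cursor-batch-$variant-$category-response.md"
    [ -f "$f" ] || echo "MISSING: $f"
  done
done
```

Re-dispatch a subagent for each file it prints.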

### Step 4: Parse and score

```bash
cd packages/interact-evals
node scripts/generate-prompts.js
node scripts/parse-cursor-response.js with-rules
node scripts/parse-cursor-response.js no-context
node scripts/score-outputs.js
```

Note: `generate-prompts.js` must run before `score-outputs.js` to create `manifest.json`.

Expected output: 29 outputs per variant (58 total).

If any variant parses fewer than 29, check which response files are missing and re-run Step 3 for those.
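To see the per-variant counts before re-running anything, count the parsed files directly (a sketch; it assumes parsed outputs land in `results/outputs/` with the variant name in each filename — check `parse-cursor-response.js` for the actual layout):

```shell
# Count parsed outputs per variant; each should report 29
for variant in with-rules no-context; do
  count=$(ls packages/interact-evals/results/outputs/ 2>/dev/null | grep -c -- "$variant")
  echo "$variant: $count (expected 29)"
done
```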

### Step 5: Rebuild dashboard and display results

After scoring, rebuild the HTML dashboard:

```bash
cd packages/interact-evals
node -e "
const fs = require('fs');
const scores = fs.readFileSync('results/scores.json', 'utf8');
let html = fs.readFileSync('results/dashboard.html', 'utf8');
html = html.replace(/\/\*SCORES_DATA\*\/\[[\s\S]*?\];/, '/*SCORES_DATA*/' + scores.trim() + ';');
fs.writeFileSync('results/dashboard.html', html);
"
```

The scorer prints three tables:

1. **Scores** — pass rate per category per variant with delta
2. **Input tokens** — context window cost per category
3. **Output tokens** — response size per category

Report the results to the user. Highlight:

- Categories where with-rules outperforms no-context
- Categories where no-context matches (rules add cost but not quality)
- Any failed tests with their failure reasons
- Token cost tradeoffs (input tokens for with-rules vs no-context)

## Adding a new variant

To add a variant, update these files:

- `packages/interact-evals/scripts/cursor-batch.js` — add to the variant validation list
- `packages/interact-evals/scripts/score-outputs.js` — add to `VARIANTS` and `VARIANT_LABELS`
- `packages/interact-evals/run-eval.sh` — add to the `VARIANTS` array

## Adding new test cases

Add YAML entries to `packages/interact-evals/tests/{category}.yaml`. Each test needs:

```yaml
- description: 'Short description'
vars:
rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/{trigger}.md
prompt: >
The user request to generate code for.
expected_trigger: click|hover|viewEnter|viewProgress|pointerMove
expected:
params_type: alternate|repeat|state
effect_type: namedEffect|keyframeEffect|transition|customEffect
assert:
- type: contains-any
value: ['expected', 'strings']
weight: 2
metric: metric_name
```

Rules are fetched from GitHub and cached locally in `packages/interact-evals/.rules-cache/`. To force a
re-fetch (e.g. after rule updates on master), delete that directory.
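In shell terms:

```shell
# Force the next eval run to re-fetch rules from GitHub master
rm -rf packages/interact-evals/.rules-cache/
```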
9 changes: 9 additions & 0 deletions packages/interact-evals/.gitignore
.rules-cache/

# Generated eval results (keep only dashboard template and .gitkeep)
results/cursor-batch-*.md
results/outputs/
results/prompts/
results/manifest.json
results/scores.json
results/latest.json
190 changes: 190 additions & 0 deletions packages/interact-evals/README.md
# @wix/interact — Rules Evaluation Pipeline

Automated evaluation of the interact LLM rules using [Promptfoo](https://promptfoo.dev).

## What this tests

Every test case runs **twice** — once with the full rules (SKILL.md + trigger-specific rule file) and once **without rules** (SKILL.md only). This lets you:

1. **Measure rule quality** — see if rules improve LLM output vs baseline
2. **Detect regressions** — compare scores before and after editing a rule
3. **Compare models** — uncomment additional providers in the config

## Quick start

### Option A: With API key (fully automated via Promptfoo)

```bash
cd packages/interact-evals
npm install
export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY
npx promptfoo@latest eval # run all tests
npx promptfoo@latest view # open the web UI to see results
```

### Option B: Without API key (using Cursor or any LLM)

No API key needed — use Cursor, ChatGPT, or any LLM you have access to.

```bash
cd packages/interact-evals
npm install

# Step 1: Generate batch prompt files (one per trigger category)
node scripts/cursor-batch.js with-rules
node scripts/cursor-batch.js without-rules
```

This creates files like:
- `results/cursor-batch-with-rules-click.md` (5 tests)
- `results/cursor-batch-with-rules-hover.md` (5 tests)
- `results/cursor-batch-without-rules-click.md` (5 tests)
- ...etc (6 categories × 2 variants = 12 files)

```bash
# Step 2: For each batch file, paste into Cursor chat and save the response
# e.g. paste cursor-batch-with-rules-click.md into Cursor
# Save response as: results/cursor-batch-with-rules-click-response.md
# Repeat for all 12 files

# Step 3: Parse all responses into individual outputs
node scripts/parse-cursor-response.js with-rules
node scripts/parse-cursor-response.js without-rules

# Step 4: Score everything locally (no API key needed)
node scripts/score-outputs.js
```

The scorer prints a comparison table:

```
Category With Rules Without Rules Delta
─────────────────────────────────────────────────────────
click 92% 68% +24% ✓
hover 88% 72% +16% ✓
viewenter 95% 60% +35% ✓
...
OVERALL 90% 67% +23%
```

Alternatively, generate individual prompt files for one-by-one collection:

```bash
node scripts/generate-prompts.js
# Prompts are in results/prompts/ — paste each into your LLM
# Save responses in results/outputs/ with matching filenames
node scripts/score-outputs.js
```

## Structure

```
packages/interact-evals/
├── promptfooconfig.yaml # Promptfoo config (for Option A with API key)
├── package.json # Dependencies (yaml) and npm scripts
├── prompts/
│ ├── with-rules.json # Chat template: SKILL.md + trigger rules
│ └── without-rules.json # Chat template: SKILL.md only (baseline)
├── context/
│ └── skill.md # General interact library reference
├── assertions/
│ ├── valid-config.js # Structural validation (all tests)
│ ├── anti-patterns.js # Known pitfall detection (all tests)
│ ├── click-checks.js # Click-specific semantic checks
│ ├── hover-checks.js # Hover-specific semantic checks
│ ├── viewenter-checks.js # ViewEnter-specific semantic checks
│ ├── viewprogress-checks.js # ViewProgress-specific semantic checks
│ └── pointermove-checks.js # PointerMove-specific semantic checks
├── tests/
│ ├── click.yaml # 5 click trigger test cases
│ ├── hover.yaml # 5 hover trigger test cases
│ ├── viewenter.yaml # 5 viewEnter trigger test cases
│ ├── viewprogress.yaml # 5 viewProgress trigger test cases
│ ├── pointermove.yaml # 5 pointerMove trigger test cases
│ └── integration.yaml # 4 integration/setup test cases
├── scripts/
│ ├── generate-prompts.js # Generate individual prompt files
│ ├── cursor-batch.js # Generate a single Cursor-friendly prompt
│ ├── parse-cursor-response.js # Parse Cursor batch response into outputs
│ └── score-outputs.js # Score collected outputs locally (no API key)
└── results/ # Output directory (gitignored)
├── prompts/ # Generated prompt files
├── outputs/ # Collected LLM outputs
├── scores.json # Detailed scoring results
└── manifest.json # Test case metadata
```

## How scoring works

Each test case is scored across multiple metrics:

| Metric | Source | What it checks |
|---|---|---|
| `structure` | `valid-config.js` | Has key, trigger, effects, valid effect type |
| `anti_patterns` | `anti-patterns.js` | No keyframeEffect+pointerMove 2D, no layout props, etc. |
| `semantic` | `*-checks.js` | Correct trigger, params, effect type, cross-targeting |
| `effect_choice` | inline asserts | Uses the right named effect preset |
| `completeness` | inline asserts | Includes all required properties |
| `compliance` | inline asserts | Doesn't refuse the task |

The web UI shows per-test scores and aggregate metrics, with side-by-side "with rules" vs "without rules" comparison.

## Workflow for rule changes

1. Run baseline: `npx promptfoo@latest eval`
2. Edit a rule file (e.g., `packages/interact/rules/hover.md`)
3. Run again: `npx promptfoo@latest eval`
4. Compare: `npx promptfoo@latest view` — the UI shows both runs

## Adding new test cases

Add a new entry to the relevant YAML file in `tests/`. Each test case needs:

```yaml
- description: 'Short description of what is tested'
vars:
rules: file://../packages/interact/rules/<trigger>.md
prompt: >
Natural language description of the desired interaction.
expected_trigger: <trigger>
expected:
params_type: alternate # optional: expected params.type
params_method: toggle # optional: expected params.method
effect_type: namedEffect # optional: namedEffect|keyframeEffect|transition|customEffect
named_effect: FadeIn # optional: specific preset name
cross_target: true # optional: source != target
has_fill_both: true # optional: needs fill:'both'
has_reversed: true # optional: needs reversed:true
has_duration: true # optional: needs a duration value
has_easing: true # optional: needs an easing value
assert:
- type: javascript
value: file://assertions/<trigger>-checks.js
weight: 3
metric: semantic
# Add any extra inline assertions
```
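For instance, a filled-in click test following this template might look like the following (values are illustrative, not taken from the actual suite):

```yaml
- description: 'Click toggles a fade on a separate card element'
  vars:
    rules: file://../packages/interact/rules/click.md
    prompt: >
      When the user clicks the button, fade the info card in;
      clicking again should fade it back out.
    expected_trigger: click
    expected:
      params_type: alternate
      params_method: toggle
      effect_type: namedEffect
      named_effect: FadeIn
      cross_target: true
      has_fill_both: true
      has_reversed: true
  assert:
    - type: javascript
      value: file://assertions/click-checks.js
      weight: 3
      metric: semantic
    - type: contains-any
      value: ['FadeIn']
      weight: 2
      metric: effect_choice
```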

## Adding LLM-as-judge (optional)

For subjective quality scoring, add `llm-rubric` assertions:

```yaml
assert:
- type: llm-rubric
value: >
The output should be a valid @wix/interact config that uses
the click trigger with alternate type, has separate source and
target keys, and includes fill:'both' and reversed:true.
weight: 2
metric: quality
```

This costs ~$0.01-0.05 per assertion but catches nuanced quality issues.

## Tips

- Set `temperature: 0` in the provider config for reproducible results
- Run multiple times (`npx promptfoo@latest eval --repeat 3`) to reduce noise
- Use `--filter-pattern "Click"` to run only matching tests
- Add `--no-cache` to force fresh LLM calls