diff --git a/.cursor/skills/run-interact-evals/SKILL.md b/.cursor/skills/run-interact-evals/SKILL.md new file mode 100644 index 00000000..0cfa0f5f --- /dev/null +++ b/.cursor/skills/run-interact-evals/SKILL.md @@ -0,0 +1,162 @@ +--- +name: run-interact-evals +description: Run the @wix/interact rules evaluation pipeline using subagents, then score and visualize results. Use when the user asks to run evals, test rules, benchmark interact, run the eval pipeline, compare with-rules vs no-context, or check rule quality. +--- + +# Run Interact Evals + +Evaluates @wix/interact rule files by generating LLM outputs via subagents and scoring them locally. Compares **with-rules** (trigger-specific rules) vs **no-context** (bare prompt). + +The evals package lives at `packages/interact-evals/` in the monorepo. + +## Prerequisites + +Ensure dependencies are installed from the repo root: + +```bash +nvm use +yarn install +``` + +## Workflow + +### Step 0: Clean previous results + +Always start fresh to avoid stale data leaking into subagent context. + +```bash +find packages/interact-evals/results -type f ! -name 'dashboard.html' ! -name '.gitkeep' -delete +``` + +This removes all batch files, response files, parsed outputs, scores, manifest, and prompts — keeping only the dashboard template and .gitkeep. + +### Step 1: Generate batch prompts + +```bash +cd packages/interact-evals +node scripts/cursor-batch.js with-rules +node scripts/cursor-batch.js no-context +``` + +This creates `results/cursor-batch-{variant}-{category}.md` files (6 categories x 2 variants = 12 files). + +### Step 2: Dispatch subagents + +Launch subagents to process each batch file. 
**Max 4 concurrent subagents.** + +Categories: `click`, `hover`, `viewenter`, `viewprogress`, `pointermove`, `integration` +Variants: `with-rules`, `no-context` + +For each combination, launch a Task subagent with this prompt template: + +``` +CRITICAL ISOLATION RULES — read these first: +- Do NOT read, search, or explore ANY other files in the workspace. +- Do NOT use Glob, Grep, SemanticSearch, or Read on any file except the ONE batch file specified below. +- Do NOT look at tests/, assertions/, .rules-cache/, scripts/, or any other directory. +- Generate code ONLY from the instructions and context provided in the batch file. +- Your ONLY tools should be: Read (for the one batch file), and Write (for the response file). + +Now read {workspace}/packages/interact-evals/results/cursor-batch-{variant}-{category}.md and follow ALL +its instructions exactly. Generate the requested code for every test case. Use the +EXACT "=== ID: {test-id} ===" format. Write the complete response to +{workspace}/packages/interact-evals/results/cursor-batch-{variant}-{category}-response.md using the +Write tool. You MUST write the file. +``` + +Use `subagent_type: "generalPurpose"`. + +For **integration** tests, add to the prompt: "Generate the COMPLETE code including imports, config, and HTML/JSX." + +#### Dispatch order (respecting max 4 concurrent) + +Batch 1 (4 agents): with-rules click, hover, viewenter, viewprogress +Batch 2 (4 agents): with-rules pointermove, with-rules integration, no-context click, no-context hover +Batch 3 (4 agents): no-context viewenter, viewprogress, pointermove, integration + +### Step 3: Verify all response files exist + +After each batch, verify response files were written. If any are missing, re-dispatch that subagent with an emphasized "You MUST use the Write tool" instruction. 
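The dispatch plan can be sketched as a small Node script — a hypothetical helper for illustration, not one of the pipeline's scripts — that enumerates every variant × category pair in variant-major order and chunks the list into waves of at most four concurrent subagents:

```javascript
// Hypothetical helper: build the subagent job list and split it into
// dispatch waves of at most MAX_CONCURRENT jobs each.
const CATEGORIES = ['click', 'hover', 'viewenter', 'viewprogress', 'pointermove', 'integration'];
const VARIANTS = ['with-rules', 'no-context'];
const MAX_CONCURRENT = 4;

// One job per variant × category pair, in variant-major order.
const jobs = VARIANTS.flatMap((variant) =>
  CATEGORIES.map((category) => ({
    variant,
    category,
    batchFile: `results/cursor-batch-${variant}-${category}.md`,
    responseFile: `results/cursor-batch-${variant}-${category}-response.md`,
  })),
);

// Chunk into waves that respect the concurrency limit.
const waves = [];
for (let i = 0; i < jobs.length; i += MAX_CONCURRENT) {
  waves.push(jobs.slice(i, i + MAX_CONCURRENT));
}

console.log(`${jobs.length} jobs in ${waves.length} waves`); // 12 jobs in 3 waves
```

Chunking the variant-major list by four reproduces the three dispatch batches listed in this step.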
+ +Check with: + +```bash +ls packages/interact-evals/results/cursor-batch-*-response.md | wc -l +# Expected: 12 +``` + +### Step 4: Parse and score + +```bash +cd packages/interact-evals +node scripts/generate-prompts.js +node scripts/parse-cursor-response.js with-rules +node scripts/parse-cursor-response.js no-context +node scripts/score-outputs.js +``` + +Note: `generate-prompts.js` must run before `score-outputs.js` to create `manifest.json`. + +Expected output: 29 outputs per variant (58 total). + +If any variant parses fewer than 29, check which response files are missing and re-run Step 3 for those. + +### Step 5: Rebuild dashboard and display results + +After scoring, rebuild the HTML dashboard: + +```bash +cd packages/interact-evals +node -e " +const fs = require('fs'); +const scores = fs.readFileSync('results/scores.json', 'utf8'); +let html = fs.readFileSync('results/dashboard.html', 'utf8'); +html = html.replace(/\/\*SCORES_DATA\*\/\[[\s\S]*?\];/, '/*SCORES_DATA*/' + scores.trim() + ';'); +fs.writeFileSync('results/dashboard.html', html); +" +``` + +The scorer prints three tables: + +1. **Scores** — pass rate per category per variant with delta +2. **Input tokens** — context window cost per category +3. **Output tokens** — response size per category + +Report the results to the user. Highlight: + +- Categories where with-rules outperforms no-context +- Categories where no-context matches (rules add cost but not quality) +- Any failed tests with their failure reasons +- Token cost tradeoffs (input tokens for with-rules vs no-context) + +## Adding a new variant + +To add a variant, update these files: + +- `packages/interact-evals/scripts/cursor-batch.js` — add to the variant validation list +- `packages/interact-evals/scripts/score-outputs.js` — add to `VARIANTS` and `VARIANT_LABELS` +- `packages/interact-evals/run-eval.sh` — add to the `VARIANTS` array + +## Adding new test cases + +Add YAML entries to `packages/interact-evals/tests/{category}.yaml`. 
Each test needs: + +```yaml +- description: 'Short description' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/{trigger}.md + prompt: > + The user request to generate code for. + expected_trigger: click|hover|viewEnter|viewProgress|pointerMove + expected: + params_type: alternate|repeat|state + effect_type: namedEffect|keyframeEffect|transition|customEffect + assert: + - type: contains-any + value: ['expected', 'strings'] + weight: 2 + metric: metric_name +``` + +Rules are fetched from GitHub and cached locally in `packages/interact-evals/.rules-cache/`. To force a +re-fetch (e.g. after rule updates on master), delete that directory. diff --git a/packages/interact-evals/.gitignore b/packages/interact-evals/.gitignore new file mode 100644 index 00000000..ab90080e --- /dev/null +++ b/packages/interact-evals/.gitignore @@ -0,0 +1,9 @@ +.rules-cache/ + +# Generated eval results (keep only dashboard template and .gitkeep) +results/cursor-batch-*.md +results/outputs/ +results/prompts/ +results/manifest.json +results/scores.json +results/latest.json diff --git a/packages/interact-evals/README.md b/packages/interact-evals/README.md new file mode 100644 index 00000000..f5c5b5a5 --- /dev/null +++ b/packages/interact-evals/README.md @@ -0,0 +1,190 @@ +# @wix/interact — Rules Evaluation Pipeline + +Automated evaluation of the interact LLM rules using [Promptfoo](https://promptfoo.dev). + +## What this tests + +Every test case runs **twice** — once with the full rules (SKILL.md + trigger-specific rule file) and once **without rules** (SKILL.md only). This lets you: + +1. **Measure rule quality** — see if rules improve LLM output vs baseline +2. **Detect regressions** — compare scores before and after editing a rule +3. 
**Compare models** — uncomment additional providers in the config + +## Quick start + +### Option A: With API key (fully automated via Promptfoo) + +```bash +cd evals +npm install +export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY +npx promptfoo@latest eval # run all tests +npx promptfoo@latest view # open the web UI to see results +``` + +### Option B: Without API key (using Cursor or any LLM) + +No API key needed — use Cursor, ChatGPT, or any LLM you have access to. + +```bash +cd evals +npm install + +# Step 1: Generate batch prompt files (one per trigger category) +node scripts/cursor-batch.js with-rules +node scripts/cursor-batch.js without-rules +``` + +This creates files like: +- `results/cursor-batch-with-rules-click.md` (5 tests) +- `results/cursor-batch-with-rules-hover.md` (5 tests) +- `results/cursor-batch-without-rules-click.md` (5 tests) +- ...etc (6 categories × 2 variants = 12 files) + +```bash +# Step 2: For each batch file, paste into Cursor chat and save the response +# e.g. paste cursor-batch-with-rules-click.md into Cursor +# Save response as: results/cursor-batch-with-rules-click-response.md +# Repeat for all 12 files + +# Step 3: Parse all responses into individual outputs +node scripts/parse-cursor-response.js with-rules +node scripts/parse-cursor-response.js without-rules + +# Step 4: Score everything locally (no API key needed) +node scripts/score-outputs.js +``` + +The scorer prints a comparison table: + +``` + Category With Rules Without Rules Delta + ───────────────────────────────────────────────────────── + click 92% 68% +24% ✓ + hover 88% 72% +16% ✓ + viewenter 95% 60% +35% ✓ + ... 
+ OVERALL 90% 67% +23% +``` + +Alternatively, generate individual prompt files for one-by-one collection: + +```bash +node scripts/generate-prompts.js +# Prompts are in results/prompts/ — paste each into your LLM +# Save responses in results/outputs/ with matching filenames +node scripts/score-outputs.js +``` + +## Structure + +``` +evals/ +├── promptfooconfig.yaml # Promptfoo config (for Option A with API key) +├── package.json # Dependencies (yaml) and npm scripts +├── prompts/ +│ ├── with-rules.json # Chat template: SKILL.md + trigger rules +│ └── without-rules.json # Chat template: SKILL.md only (baseline) +├── context/ +│ └── skill.md # General interact library reference +├── assertions/ +│ ├── valid-config.js # Structural validation (all tests) +│ ├── anti-patterns.js # Known pitfall detection (all tests) +│ ├── click-checks.js # Click-specific semantic checks +│ ├── hover-checks.js # Hover-specific semantic checks +│ ├── viewenter-checks.js # ViewEnter-specific semantic checks +│ ├── viewprogress-checks.js # ViewProgress-specific semantic checks +│ └── pointermove-checks.js # PointerMove-specific semantic checks +├── tests/ +│ ├── click.yaml # 5 click trigger test cases +│ ├── hover.yaml # 5 hover trigger test cases +│ ├── viewenter.yaml # 5 viewEnter trigger test cases +│ ├── viewprogress.yaml # 5 viewProgress trigger test cases +│ ├── pointermove.yaml # 5 pointerMove trigger test cases +│ └── integration.yaml # 4 integration/setup test cases +├── scripts/ +│ ├── generate-prompts.js # Generate individual prompt files +│ ├── cursor-batch.js # Generate a single Cursor-friendly prompt +│ ├── parse-cursor-response.js # Parse Cursor batch response into outputs +│ └── score-outputs.js # Score collected outputs locally (no API key) +└── results/ # Output directory (gitignored) + ├── prompts/ # Generated prompt files + ├── outputs/ # Collected LLM outputs + ├── scores.json # Detailed scoring results + └── manifest.json # Test case metadata +``` + +## How scoring 
works + +Each test case is scored across multiple metrics: + +| Metric | Source | What it checks | +|---|---|---| +| `structure` | `valid-config.js` | Has key, trigger, effects, valid effect type | +| `anti_patterns` | `anti-patterns.js` | No keyframeEffect+pointerMove 2D, no layout props, etc. | +| `semantic` | `*-checks.js` | Correct trigger, params, effect type, cross-targeting | +| `effect_choice` | inline asserts | Uses the right named effect preset | +| `completeness` | inline asserts | Includes all required properties | +| `compliance` | inline asserts | Doesn't refuse the task | + +The web UI shows per-test scores and aggregate metrics, with side-by-side "with rules" vs "without rules" comparison. + +## Workflow for rule changes + +1. Run baseline: `npx promptfoo@latest eval` +2. Edit a rule file (e.g., `packages/interact/rules/hover.md`) +3. Run again: `npx promptfoo@latest eval` +4. Compare: `npx promptfoo@latest view` — the UI shows both runs + +## Adding new test cases + +Add a new entry to the relevant YAML file in `tests/`. Each test case needs: + +```yaml +- description: 'Short description of what is tested' + vars: + rules: file://../packages/interact/rules/.md + prompt: > + Natural language description of the desired interaction. 
+ expected_trigger: + expected: + params_type: alternate # optional: expected params.type + params_method: toggle # optional: expected params.method + effect_type: namedEffect # optional: namedEffect|keyframeEffect|transition|customEffect + named_effect: FadeIn # optional: specific preset name + cross_target: true # optional: source != target + has_fill_both: true # optional: needs fill:'both' + has_reversed: true # optional: needs reversed:true + has_duration: true # optional: needs a duration value + has_easing: true # optional: needs an easing value + assert: + - type: javascript + value: file://assertions/-checks.js + weight: 3 + metric: semantic + # Add any extra inline assertions +``` + +## Adding LLM-as-judge (optional) + +For subjective quality scoring, add `llm-rubric` assertions: + +```yaml +assert: + - type: llm-rubric + value: > + The output should be a valid @wix/interact config that uses + the click trigger with alternate type, has separate source and + target keys, and includes fill:'both' and reversed:true. + weight: 2 + metric: quality +``` + +This costs ~$0.01-0.05 per assertion but catches nuanced quality issues. + +## Tips + +- Set `temperature: 0` in the provider config for reproducible results +- Run multiple times (`npx promptfoo@latest eval --repeat 3`) to reduce noise +- Use `--filter-pattern "Click"` to run only matching tests +- Add `--no-cache` to force fresh LLM calls diff --git a/packages/interact-evals/assertions/anti-patterns.js b/packages/interact-evals/assertions/anti-patterns.js new file mode 100644 index 00000000..8e7a51cf --- /dev/null +++ b/packages/interact-evals/assertions/anti-patterns.js @@ -0,0 +1,90 @@ +/** + * Anti-pattern detection: checks for known mistakes from the rules' "Critical Pitfalls" section. + * These are things that indicate the LLM misunderstood the library. 
+ */ +module.exports = (output, { vars }) => { + const reasons = []; + let score = 1.0; + const penalty = 0.25; + + const has = (str) => output.includes(str); + const hasI = (str) => output.toLowerCase().includes(str.toLowerCase()); + const triggerType = vars.expected_trigger || ''; + + // Anti-pattern: using vanilla JS event listeners instead of declarative config + if (/addEventListener\s*\(/.test(output)) { + reasons.push('Uses addEventListener — should use declarative InteractConfig'); + score -= penalty; + } + + // Anti-pattern: direct DOM manipulation instead of config + if (/document\.(querySelector|getElementById|getElementsBy)/.test(output) && !has('customEffect')) { + reasons.push('Uses direct DOM queries outside customEffect — should use declarative key-based targeting'); + score -= penalty; + } + + // Anti-pattern: CSS @keyframes or style tags instead of interact config + if (/@keyframes\s/.test(output) && !has('customEffect')) { + reasons.push('Uses CSS @keyframes instead of interact keyframeEffect/namedEffect'); + score -= penalty; + } + + // Anti-pattern: keyframeEffect with pointerMove for 2D effects + if (triggerType === 'pointerMove' && has('keyframeEffect') && !has('axis:')) { + reasons.push('Uses keyframeEffect with pointerMove without axis param — should use namedEffect mouse presets or customEffect for 2D'); + score -= penalty; + } + + // Anti-pattern: viewProgress namedEffect missing range option + if (triggerType === 'viewProgress' && has('namedEffect') && !has('rangeStart') && !has('rangeEnd')) { + reasons.push('viewProgress namedEffect missing rangeStart/rangeEnd'); + score -= penalty; + } + + // Anti-pattern: viewEnter alternate/repeat with same source and target key + if (triggerType === 'viewEnter') { + const altOrRepeat = /type:\s*['"](?:alternate|repeat)['"]/i.test(output); + if (altOrRepeat) { + const keys = [...output.matchAll(/key:\s*['"]([^'"]+)['"]/g)].map((m) => m[1]); + if (keys.length > 0 && new Set(keys).size === 1) { + 
reasons.push('viewEnter alternate/repeat uses same source and target key — animation may retrigger'); + score -= penalty; + } + } + } + + // Anti-pattern: using width/height/margin in keyframe animations (layout thrashing) + const layoutProps = ['width:', 'height:', 'margin:', 'padding:']; + const hasLayoutAnimation = layoutProps.some((p) => { + return output.includes('keyframes') && output.includes(p); + }); + if (hasLayoutAnimation) { + const transformBased = has('transform') || has('opacity') || has('filter'); + if (!transformBased) { + reasons.push('Animates layout properties without transform/opacity — may cause jank'); + score -= penalty * 0.5; + } + } + + // Anti-pattern: overflow: hidden with viewProgress + if (triggerType === 'viewProgress' && /overflow:\s*['"]?hidden/i.test(output)) { + reasons.push('Uses overflow:hidden with viewProgress — should use overflow:clip'); + score -= penalty; + } + + // Anti-pattern: using wrong trigger name (e.g. 'mousemove' instead of 'pointerMove') + const wrongTriggers = ['mousemove', 'mouseenter', 'mouseleave', 'scroll', 'intersection']; + for (const wt of wrongTriggers) { + if (new RegExp(`trigger:\\s*['"]${wt}['"]`, 'i').test(output)) { + reasons.push(`Uses incorrect trigger name '${wt}' instead of interact trigger`); + score -= penalty; + } + } + + score = Math.max(0, Math.min(1, score)); + return { + pass: score >= 0.5, + score, + reason: reasons.length ? reasons.join('; ') : 'No anti-patterns detected', + }; +}; diff --git a/packages/interact-evals/assertions/click-checks.js b/packages/interact-evals/assertions/click-checks.js new file mode 100644 index 00000000..7cab37a5 --- /dev/null +++ b/packages/interact-evals/assertions/click-checks.js @@ -0,0 +1,73 @@ +/** + * Semantic checks for click trigger test cases. 
+ */ +module.exports = (output, { vars }) => { + const checks = []; + let passed = 0; + let total = 0; + + const check = (name, condition) => { + total++; + if (condition) passed++; + else checks.push(`FAIL: ${name}`); + }; + + const expected = vars.expected || {}; + + check('has click trigger', /trigger:\s*['"]click['"]/.test(output)); + + if (expected.params_type) { + check( + `has params type '${expected.params_type}'`, + new RegExp(`type:\\s*['"]${expected.params_type}['"]`).test(output), + ); + check( + 'params block exists', + /params\s*:\s*\{/.test(output), + ); + } + + if (expected.params_method) { + check( + `has params method '${expected.params_method}'`, + new RegExp(`method:\\s*['"]${expected.params_method}['"]`).test(output), + ); + } + + if (expected.effect_type === 'transition') { + check('uses transition effect', /transition\s*:/.test(output) || /transitionProperties\s*:/.test(output)); + check('does not use namedEffect for transition case', !/namedEffect/.test(output)); + } + + if (expected.effect_type === 'namedEffect') { + check('uses namedEffect', /namedEffect\s*:/.test(output)); + } + + if (expected.cross_target) { + const keys = [...output.matchAll(/key:\s*['"]([^'"]+)['"]/g)].map((m) => m[1]); + check('source and target keys differ (cross-targeting)', new Set(keys).size >= 2); + } + + if (expected.has_fill_both) { + check('has fill: both', /fill:\s*['"]both['"]/.test(output)); + } + + if (expected.has_reversed) { + check('has reversed: true', /reversed:\s*true/.test(output)); + } + + if (expected.has_duration) { + check('has duration', /duration:\s*\d+/.test(output)); + } + + if (expected.has_easing) { + check('has easing', /easing:\s*['"]/.test(output)); + } + + const score = total > 0 ? passed / total : 1; + return { + pass: score >= 0.8, + score, + reason: checks.length ? 
checks.join('; ') : `All ${total} checks passed`, + }; +}; diff --git a/packages/interact-evals/assertions/hover-checks.js b/packages/interact-evals/assertions/hover-checks.js new file mode 100644 index 00000000..9cbc9b1b --- /dev/null +++ b/packages/interact-evals/assertions/hover-checks.js @@ -0,0 +1,58 @@ +/** + * Semantic checks for hover trigger test cases. + */ +module.exports = (output, { vars }) => { + const checks = []; + let passed = 0; + let total = 0; + + const check = (name, condition) => { + total++; + if (condition) passed++; + else checks.push(`FAIL: ${name}`); + }; + + const expected = vars.expected || {}; + + check('has hover trigger', /trigger:\s*['"]hover['"]/.test(output)); + + check('does not use CSS :hover pseudo-class', !/:hover\s*\{/.test(output)); + + if (expected.effect_type === 'transition') { + check( + 'uses transition effect', + /transition\s*:/.test(output) || /transitionProperties\s*:/.test(output), + ); + check('does not use keyframeEffect for transition case', !/keyframeEffect/.test(output)); + } else if (expected.effect_type === 'keyframeEffect') { + check('uses keyframeEffect', /keyframeEffect\s*:/.test(output)); + } else if (expected.effect_type === 'namedEffect') { + check('uses namedEffect', /namedEffect\s*:/.test(output)); + } + + if (expected.has_fill_both) { + check('has fill: both', /fill:\s*['"]both['"]/.test(output)); + } + + if (expected.has_duration) { + check('has duration', /duration:\s*\d+/.test(output)); + } + + if (expected.has_transform) { + check('uses transform', /transform/.test(output)); + } + + if (expected.cross_target) { + const keys = [...output.matchAll(/key:\s*['"]([^'"]+)['"]/g)].map( + (m) => m[1], + ); + check('source and target keys differ', new Set(keys).size >= 2); + } + + const score = total > 0 ? passed / total : 1; + return { + pass: score >= 0.8, + score, + reason: checks.length ? 
checks.join('; ') : `All ${total} checks passed`, + }; +}; diff --git a/packages/interact-evals/assertions/pointermove-checks.js b/packages/interact-evals/assertions/pointermove-checks.js new file mode 100644 index 00000000..98ec5162 --- /dev/null +++ b/packages/interact-evals/assertions/pointermove-checks.js @@ -0,0 +1,63 @@ +/** + * Semantic checks for pointerMove trigger test cases. + */ +module.exports = (output, { vars }) => { + const checks = []; + let passed = 0; + let total = 0; + + const check = (name, condition) => { + total++; + if (condition) passed++; + else checks.push(`FAIL: ${name}`); + }; + + const expected = vars.expected || {}; + + check('has pointerMove trigger', /trigger:\s*['"]pointerMove['"]/.test(output)); + + check('does not use mousemove event listener', !/addEventListener\s*\(\s*['"]mousemove/.test(output)); + + if (expected.hit_area) { + check( + `has hitArea '${expected.hit_area}'`, + new RegExp(`hitArea:\\s*['"]${expected.hit_area}['"]`).test(output), + ); + } + + if (expected.effect_type === 'namedEffect') { + check('uses namedEffect (correct for pointerMove)', /namedEffect\s*:/.test(output)); + check('does NOT use keyframeEffect for 2D pointer (anti-pattern)', !/keyframeEffect/.test(output)); + } else if (expected.effect_type === 'customEffect') { + check('uses customEffect', /customEffect\s*:/.test(output)); + } else if (expected.effect_type === 'keyframeEffect') { + check('uses keyframeEffect', /keyframeEffect\s*:/.test(output)); + check('has axis param for keyframeEffect', /axis:\s*['"]/.test(output)); + } + + if (expected.named_effect) { + check( + `uses ${expected.named_effect} preset`, + new RegExp(expected.named_effect, 'i').test(output), + ); + } + + if (expected.has_centeredToTarget !== undefined) { + check( + `centeredToTarget is ${expected.has_centeredToTarget}`, + new RegExp(`centeredToTarget:\\s*${expected.has_centeredToTarget}`).test(output), + ); + } + + if (expected.multi_layer) { + const effectCount = 
(output.match(/namedEffect\s*:|customEffect\s*:/g) || []).length; + check('has multiple effects (multi-layer)', effectCount >= 2); + } + + const score = total > 0 ? passed / total : 1; + return { + pass: score >= 0.8, + score, + reason: checks.length ? checks.join('; ') : `All ${total} checks passed`, + }; +}; diff --git a/packages/interact-evals/assertions/valid-config.js b/packages/interact-evals/assertions/valid-config.js new file mode 100644 index 00000000..0dabaab2 --- /dev/null +++ b/packages/interact-evals/assertions/valid-config.js @@ -0,0 +1,75 @@ +/** + * Structural validation: checks that the LLM output looks like a valid InteractConfig. + * Runs on every test case regardless of trigger type. + */ +module.exports = (output, { vars }) => { + const reasons = []; + let score = 1.0; + const penalty = 0.25; + + const has = (str) => output.includes(str); + + if (!has('trigger:')) { + reasons.push("Missing 'trigger' field"); + score -= penalty; + } + + const expectedTrigger = vars.expected_trigger; + if (expectedTrigger) { + const triggerRegex = new RegExp(`trigger:\\s*['"]${expectedTrigger}['"]`); + if (!triggerRegex.test(output)) { + reasons.push(`Wrong or missing trigger value (expected '${expectedTrigger}')`); + score -= penalty; + } + } + + if (!has('key:')) { + reasons.push("Missing interaction 'key' field"); + score -= penalty; + } + + if (!has('effects:') && !has('effects :')) { + reasons.push("Missing 'effects' array"); + score -= penalty; + } + + const effectTypes = [ + 'namedEffect', + 'keyframeEffect', + 'transition:', + 'transitionProperties', + 'customEffect', + 'effectId', + ]; + if (!effectTypes.some((t) => has(t))) { + reasons.push('No valid effect type found'); + score -= penalty; + } + + const expected = vars.expected || {}; + if (expected.effect_type) { + const etMap = { + namedEffect: /namedEffect\s*:/, + keyframeEffect: /keyframeEffect\s*:/, + transition: /transition\s*:|transitionProperties\s*:/, + customEffect: /customEffect\s*:/, + }; + 
const regex = etMap[expected.effect_type]; + if (regex && !regex.test(output)) { + reasons.push(`Expected effect type '${expected.effect_type}' not found`); + score -= penalty; + } + } + + if (!has('interactions') && !has('trigger:')) { + reasons.push("Missing 'interactions' array or inline interaction"); + score -= penalty; + } + + score = Math.max(0, Math.min(1, score)); + return { + pass: score >= 0.75, + score, + reason: reasons.length ? reasons.join('; ') : 'Valid config structure', + }; +}; diff --git a/packages/interact-evals/assertions/viewenter-checks.js b/packages/interact-evals/assertions/viewenter-checks.js new file mode 100644 index 00000000..2ca0b959 --- /dev/null +++ b/packages/interact-evals/assertions/viewenter-checks.js @@ -0,0 +1,77 @@ +/** + * Semantic checks for viewEnter trigger test cases. + */ +module.exports = (output, { vars }) => { + const checks = []; + let passed = 0; + let total = 0; + + const check = (name, condition) => { + total++; + if (condition) passed++; + else checks.push(`FAIL: ${name}`); + }; + + const expected = vars.expected || {}; + + check('has viewEnter trigger', /trigger:\s*['"]viewEnter['"]/.test(output)); + + check('does not use IntersectionObserver directly', !/IntersectionObserver/.test(output)); + + if (expected.params_type) { + check( + `has params type '${expected.params_type}'`, + new RegExp(`type:\\s*['"]${expected.params_type}['"]`).test(output), + ); + check( + 'params block exists', + /params\s*:\s*\{/.test(output), + ); + } + + if (expected.has_threshold) { + check('has threshold', /threshold:\s*[\d.]+/.test(output)); + } + + if (expected.effect_type === 'namedEffect') { + check('uses namedEffect', /namedEffect\s*:/.test(output)); + } else if (expected.effect_type === 'customEffect') { + check('uses customEffect', /customEffect\s*:/.test(output)); + } + + if (expected.named_effect) { + check( + `uses ${expected.named_effect} preset`, + new RegExp(expected.named_effect, 'i').test(output), + ); + } + + if 
(expected.has_duration) { + check('has duration', /duration:\s*\d+/.test(output)); + } + + if (expected.staggered) { + check('has staggered delays', /delay:\s*\d+/.test(output)); + const delays = [...output.matchAll(/delay:\s*(\d+)/g)].map((m) => + parseInt(m[1]), + ); + check( + 'delays increase progressively', + delays.length >= 2 && delays[delays.length - 1] > delays[0], + ); + } + + if (expected.cross_target) { + const keys = [...output.matchAll(/key:\s*['"]([^'"]+)['"]/g)].map( + (m) => m[1], + ); + check('source and target keys differ', new Set(keys).size >= 2); + } + + const score = total > 0 ? passed / total : 1; + return { + pass: score >= 0.8, + score, + reason: checks.length ? checks.join('; ') : `All ${total} checks passed`, + }; +}; diff --git a/packages/interact-evals/assertions/viewprogress-checks.js b/packages/interact-evals/assertions/viewprogress-checks.js new file mode 100644 index 00000000..26fba39b --- /dev/null +++ b/packages/interact-evals/assertions/viewprogress-checks.js @@ -0,0 +1,66 @@ +/** + * Semantic checks for viewProgress trigger test cases. 
+ */ +module.exports = (output, { vars }) => { + const checks = []; + let passed = 0; + let total = 0; + + const check = (name, condition) => { + total++; + if (condition) passed++; + else checks.push(`FAIL: ${name}`); + }; + + const expected = vars.expected || {}; + + check( + 'has viewProgress trigger', + /trigger:\s*['"]viewProgress['"]/.test(output), + ); + + check('has rangeStart', /rangeStart/.test(output)); + check('has rangeEnd', /rangeEnd/.test(output)); + + check('does not use scroll event listener', !/addEventListener\s*\(\s*['"]scroll/.test(output)); + + if (expected.range_name) { + check( + `uses '${expected.range_name}' range`, + new RegExp(`name:\\s*['"]${expected.range_name}['"]`).test(output), + ); + } + + if (expected.effect_type === 'namedEffect') { + check('uses namedEffect', /namedEffect\s*:/.test(output)); + } else if (expected.effect_type === 'keyframeEffect') { + check('uses keyframeEffect', /keyframeEffect\s*:/.test(output)); + } else if (expected.effect_type === 'customEffect') { + check('uses customEffect', /customEffect\s*:/.test(output)); + } + + if (expected.named_effect) { + check( + `uses ${expected.named_effect} preset`, + new RegExp(expected.named_effect, 'i').test(output), + ); + } + + if (expected.has_easing_linear) { + check('uses linear easing', /easing:\s*['"]linear['"]/.test(output)); + } + + if (expected.cross_target) { + const keys = [...output.matchAll(/key:\s*['"]([^'"]+)['"]/g)].map( + (m) => m[1], + ); + check('source and target keys differ', new Set(keys).size >= 2); + } + + const score = total > 0 ? passed / total : 1; + return { + pass: score >= 0.8, + score, + reason: checks.length ? 
checks.join('; ') : `All ${total} checks passed`, + }; +}; diff --git a/packages/interact-evals/package.json b/packages/interact-evals/package.json new file mode 100644 index 00000000..c96c7ca0 --- /dev/null +++ b/packages/interact-evals/package.json @@ -0,0 +1,18 @@ +{ + "name": "@wix/interact-evals", + "version": "1.0.0", + "private": true, + "description": "Evaluation pipeline for @wix/interact LLM rules", + "scripts": { + "eval": "npx promptfoo@latest eval", + "view": "npx promptfoo@latest view", + "generate": "node scripts/generate-prompts.js", + "score": "node scripts/score-outputs.js", + "cursor:with-rules": "node scripts/cursor-batch.js with-rules", + "cursor:without-rules": "node scripts/cursor-batch.js without-rules", + "cursor:parse": "node scripts/parse-cursor-response.js" + }, + "dependencies": { + "yaml": "^2.7.0" + } +} diff --git a/packages/interact-evals/promptfooconfig.yaml b/packages/interact-evals/promptfooconfig.yaml new file mode 100644 index 00000000..f3125071 --- /dev/null +++ b/packages/interact-evals/promptfooconfig.yaml @@ -0,0 +1,71 @@ +# @wix/interact Rules Evaluation Pipeline +# +# Two prompts are compared side-by-side: +# 1. "with-rules" — SKILL.md + trigger-specific rule file +# 2. 
"without-rules" — SKILL.md only (baseline, no detailed rules) +# +# Run: npx promptfoo eval +# View: npx promptfoo view + +description: '@wix/interact rules evaluation' + +prompts: + - id: with-rules + label: 'With Rules' + raw: file://prompts/with-rules.json + - id: without-rules + label: 'Without Rules' + raw: file://prompts/without-rules.json + +providers: + - id: anthropic:messages:claude-sonnet-4-20250514 + config: + max_tokens: 4096 + temperature: 0 + +# To compare models, uncomment additional providers: +# - id: openai:gpt-4o +# config: +# max_tokens: 4096 +# temperature: 0 + +# Default assertions applied to EVERY test case +defaultTest: + vars: + skill: file://context/skill.md + assert: + - type: javascript + value: file://assertions/valid-config.js + weight: 2 + metric: structure + - type: javascript + value: file://assertions/anti-patterns.js + weight: 2 + metric: anti_patterns + - type: not-icontains + value: "I can't" + weight: 1 + metric: compliance + - type: not-icontains + value: 'I cannot' + weight: 1 + metric: compliance + +# Test cases by trigger type +tests: + - file://tests/click.yaml + - file://tests/hover.yaml + - file://tests/viewenter.yaml + - file://tests/viewprogress.yaml + - file://tests/pointermove.yaml + - file://tests/integration.yaml + +# Aggregate scoring across named metrics +derivedMetrics: + - name: overall_quality + value: '(structure + semantic + anti_patterns + compliance) / 4' + - name: rules_effectiveness + value: '(semantic + effect_choice + completeness) / 3' + +# Output results +outputPath: results/latest.json diff --git a/packages/interact-evals/prompts/with-rules.json b/packages/interact-evals/prompts/with-rules.json new file mode 100644 index 00000000..74308551 --- /dev/null +++ b/packages/interact-evals/prompts/with-rules.json @@ -0,0 +1,10 @@ +[ + { + "role": "system", + "content": "You are an expert at the @wix/interact animation library. Generate valid InteractConfig code based on user requests. 
Output ONLY the config object (or the interactions array and effects registry if applicable). Do not include imports, HTML, or explanatory text — just the config.\n\n{{skill}}\n\n{{rules}}" + }, + { + "role": "user", + "content": "{{prompt}}" + } +] diff --git a/packages/interact-evals/prompts/without-rules.json b/packages/interact-evals/prompts/without-rules.json new file mode 100644 index 00000000..bdce4894 --- /dev/null +++ b/packages/interact-evals/prompts/without-rules.json @@ -0,0 +1,10 @@ +[ + { + "role": "system", + "content": "You are an expert at the @wix/interact animation library. Generate valid InteractConfig code based on user requests. Output ONLY the config object (or the interactions array and effects registry if applicable). Do not include imports, HTML, or explanatory text — just the config.\n\n{{skill}}" + }, + { + "role": "user", + "content": "{{prompt}}" + } +] diff --git a/packages/interact-evals/results/dashboard.html b/packages/interact-evals/results/dashboard.html new file mode 100644 index 00000000..c54a901c --- /dev/null +++ b/packages/interact-evals/results/dashboard.html @@ -0,0 +1,3825 @@ + + + + + +Interact Eval Dashboard + + + + +
+<!--
+  dashboard.html (3,825 lines; markup omitted). The page renders:
+    - Header: "@wix/interact Eval Dashboard" with summary cards
+    - "Pass Rate by Category" chart (legend: With Rules / No Context)
+    - "Token Usage" charts: Avg Input Tokens, Avg Output Tokens (legend: With Rules / No Context)
+    - "Test Results" table with columns: Test | Category | Variant | Score | Status | Input Tok | Output Tok | Checks
+-->
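The shell runner below and `scripts/parse-cursor-response.js` both depend on the `=== ID: {test-id} ===` delimiter contract that the batch files instruct the model to follow. Here is a minimal sketch of how a response is split under that contract — the regex is the one used in `parse-cursor-response.js`; the sample response and test IDs are illustrative, not real eval output:

```javascript
// Sketch: split a batch response on "=== ID: {test-id} ===" markers,
// mirroring the regex in scripts/parse-cursor-response.js.
// The sample response and IDs below are illustrative only.
const sample = [
  '=== ID: 01-click-hamburger-menu-toggle ===',
  '{ interactions: [/* config for test 01 */] }',
  '=== ID: 02-click-theme-toggle ===',
  '{ interactions: [/* config for test 02 */] }',
].join('\n');

// Lazily capture everything up to the next marker (or end of input).
const regex = /===\s*ID:\s*([^\s=]+)\s*===([\s\S]*?)(?=(?:===\s*ID:|$))/g;
const outputs = {};
let match;
while ((match = regex.exec(sample)) !== null) {
  const id = match[1].trim();
  outputs[id] = match[2].trim(); // the parser additionally strips markdown fences
}

console.log(Object.keys(outputs).length); // → 2
console.log(outputs['02-click-theme-toggle']); // → { interactions: [/* config for test 02 */] }
```

In the real pipeline each section would then be written to `results/outputs/{id}__{variant}.txt` for scoring.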
+ + + + + diff --git a/packages/interact-evals/run-eval.sh b/packages/interact-evals/run-eval.sh new file mode 100755 index 00000000..1559928a --- /dev/null +++ b/packages/interact-evals/run-eval.sh @@ -0,0 +1,150 @@ +#!/usr/bin/env bash +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +cd "$SCRIPT_DIR" + +RESULTS_DIR="$SCRIPT_DIR/results" +CATEGORIES=(click hover viewenter viewprogress pointermove integration) +VARIANTS=(with-rules no-context) + +BOLD='\033[1m' +DIM='\033[2m' +GREEN='\033[0;32m' +YELLOW='\033[0;33m' +CYAN='\033[0;36m' +RED='\033[0;31m' +RESET='\033[0m' + +step=0 +total_steps=5 +step() { + step=$((step + 1)) + echo "" + echo -e "${BOLD}[$step/$total_steps] $1${RESET}" + echo "" +} + +# ────────────────────────────────────────────────────────────────────── +step "Installing dependencies" +# ────────────────────────────────────────────────────────────────────── + +if ! node -e "require('yaml')" 2>/dev/null; then + echo -e "${DIM}Running yarn install from repo root...${RESET}" + (cd "$(git rev-parse --show-toplevel)" && yarn install) +else + echo -e "${DIM}Dependencies available, skipping install.${RESET}" +fi + +# ────────────────────────────────────────────────────────────────────── +step "Generating batch prompt files" +# ────────────────────────────────────────────────────────────────────── + +for variant in "${VARIANTS[@]}"; do + node scripts/cursor-batch.js "$variant" +done + +echo "" +echo -e "${GREEN}Generated prompt files in results/${RESET}" + +# ────────────────────────────────────────────────────────────────────── +step "Collecting LLM responses via Cursor" +# ────────────────────────────────────────────────────────────────────── + +echo -e "For each prompt file below, you need to:" +echo -e " ${CYAN}1.${RESET} Copy the file content to clipboard" +echo -e " ${CYAN}2.${RESET} Open a ${BOLD}new Cursor chat${RESET} and paste it" +echo -e " ${CYAN}3.${RESET} Wait for the full response" +echo -e " 
${CYAN}4.${RESET} Copy the ${BOLD}entire${RESET} response and save it to the response file"
+echo ""
+echo -e "${DIM}Tip: You can open multiple Cursor chats in parallel to speed this up.${RESET}"
+echo ""
+
+pending=()
+for variant in "${VARIANTS[@]}"; do
+  for category in "${CATEGORIES[@]}"; do
+    prompt_file="$RESULTS_DIR/cursor-batch-${variant}-${category}.md"
+    response_file="$RESULTS_DIR/cursor-batch-${variant}-${category}-response.md"
+
+    if [ ! -f "$prompt_file" ]; then
+      continue
+    fi
+
+    if [ -f "$response_file" ] && [ -s "$response_file" ]; then
+      echo -e "  ${GREEN}✓${RESET} ${variant}/${category} — response already exists, skipping"
+      continue
+    fi
+
+    pending+=("${variant}|${category}")
+  done
+done
+
+if [ ${#pending[@]} -eq 0 ]; then
+  echo -e "${GREEN}All responses already collected! Skipping to scoring.${RESET}"
+else
+  echo -e "${YELLOW}${#pending[@]} prompt(s) still need responses:${RESET}"
+  echo ""
+
+  for entry in "${pending[@]}"; do
+    IFS='|' read -r variant category <<< "$entry"
+    prompt_file="cursor-batch-${variant}-${category}.md"
+    response_file="cursor-batch-${variant}-${category}-response.md"
+    echo -e "  ${BOLD}${variant}/${category}${RESET}"
+    echo -e "    Prompt:  ${CYAN}results/${prompt_file}${RESET}"
+    echo -e "    Save to: ${CYAN}results/${response_file}${RESET}"
+    echo ""
+  done
+
+  echo -e "${BOLD}How to do this quickly:${RESET}"
+  echo ""
+  echo "  For each pending prompt above, run in a separate terminal:"
+  echo ""
+  echo -e "    ${DIM}# Copy prompt to clipboard (macOS)${RESET}"
+  echo -e "    cat results/<prompt-file> | pbcopy"
+  echo ""
+  echo -e "    ${DIM}# Paste into Cursor chat, get response, then save it:${RESET}"
+  echo -e "    pbpaste > results/<response-file>"
+  echo ""
+
+  echo -e "${YELLOW}Press Enter when all responses are saved...${RESET}"
+  read -r
+
+  # Verify all responses exist
+  missing=0
+  for entry in "${pending[@]}"; do
+    IFS='|' read -r variant category <<< "$entry"
+    response_file="$RESULTS_DIR/cursor-batch-${variant}-${category}-response.md"
+    if [ ! 
-f "$response_file" ] || [ ! -s "$response_file" ]; then + echo -e " ${RED}✗${RESET} Missing: results/cursor-batch-${variant}-${category}-response.md" + missing=$((missing + 1)) + else + echo -e " ${GREEN}✓${RESET} Found: results/cursor-batch-${variant}-${category}-response.md" + fi + done + + if [ $missing -gt 0 ]; then + echo "" + echo -e "${YELLOW}Warning: $missing response(s) missing. Scoring will be partial.${RESET}" + echo -e "Press Enter to continue anyway, or Ctrl+C to abort..." + read -r + fi +fi + +# ────────────────────────────────────────────────────────────────────── +step "Parsing responses into individual outputs" +# ────────────────────────────────────────────────────────────────────── + +for variant in "${VARIANTS[@]}"; do + echo -e "${CYAN}Parsing ${variant}...${RESET}" + node scripts/parse-cursor-response.js "$variant" || true +done + +# ────────────────────────────────────────────────────────────────────── +step "Scoring outputs" +# ────────────────────────────────────────────────────────────────────── + +node scripts/score-outputs.js + +echo "" +echo -e "${GREEN}${BOLD}Done!${RESET}" +echo -e "Detailed scores: ${CYAN}results/scores.json${RESET}" diff --git a/packages/interact-evals/scripts/cursor-batch.js b/packages/interact-evals/scripts/cursor-batch.js new file mode 100644 index 00000000..b47cd298 --- /dev/null +++ b/packages/interact-evals/scripts/cursor-batch.js @@ -0,0 +1,153 @@ +#!/usr/bin/env node + +/** + * Generate a single Cursor-friendly prompt file that asks the LLM to produce + * outputs for ALL test cases in one shot. The response can then be split and + * scored automatically. 
+ * + * Usage: + * node scripts/cursor-batch.js [with-rules|no-context] + * + * Output: + * results/cursor-batch-{variant}-{category}.md — paste into Cursor chat + * + * After getting the response, save it as: + * results/cursor-batch-{variant}-response.md + * + * Then run: + * node scripts/parse-cursor-response.js {variant} + * node scripts/score-outputs.js + */ + +const fs = require("fs"); +const path = require("path"); +const yaml = require("yaml"); +const { resolveRules } = require("./resolve-rules"); +const { generateTestId } = require("./test-id"); + +const EVALS_DIR = path.resolve(__dirname, ".."); +const RESULTS_DIR = path.join(EVALS_DIR, "results"); + +function ensureDir(dir) { + if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true }); +} + +function main() { + const variant = process.argv[2] || "with-rules"; + if (!["with-rules", "no-context"].includes(variant)) { + console.error( + "Usage: node scripts/cursor-batch.js [with-rules|no-context]", + ); + process.exit(1); + } + + ensureDir(RESULTS_DIR); + + const testFiles = fs + .readdirSync(path.join(EVALS_DIR, "tests")) + .filter((f) => f.endsWith(".yaml")) + .sort(); + + // Group tests by category, one file per category (keeps each prompt manageable) + const categorized = {}; + let index = 0; + + for (const testFile of testFiles) { + const tests = yaml.parse( + fs.readFileSync(path.join(EVALS_DIR, "tests", testFile), "utf8"), + ); + const category = testFile.replace(".yaml", ""); + + if (!categorized[category]) categorized[category] = []; + + for (const test of tests) { + index++; + const id = generateTestId(index, category, test.description); + const userPrompt = + typeof test.vars.prompt === "string" + ? 
test.vars.prompt.trim() + : test.vars.prompt; + + let rulesContent = ""; + if (variant === "with-rules" && test.vars.rules) { + rulesContent = resolveRules(test.vars.rules, EVALS_DIR); + } + + categorized[category].push({ id, userPrompt, rulesContent }); + } + } + + const totalTests = index; + const files = []; + + for (const [category, entries] of Object.entries(categorized)) { + const parts = []; + + parts.push("You are an expert at the @wix/interact animation library."); + parts.push(""); + + if (variant === "with-rules" && entries[0].rulesContent) { + parts.push(""); + parts.push(entries[0].rulesContent); + parts.push(""); + parts.push(""); + } + + parts.push( + `I need you to generate @wix/interact InteractConfig code for ${entries.length} different requests.`, + ); + parts.push(""); + if (category === "integration") { + parts.push( + "For each request, output the COMPLETE code including imports, config, and HTML/JSX as requested. No explanatory text.", + ); + } else { + parts.push( + "For each request, output ONLY the config object. 
No imports, no HTML, no explanation.", + ); + } + parts.push(""); + parts.push("Use this EXACT format for each response:"); + parts.push(""); + parts.push("```"); + parts.push("=== ID: {test-id} ==="); + parts.push("{config code here}"); + parts.push("```"); + parts.push(""); + + for (const entry of entries) { + parts.push(`---`); + parts.push(""); + parts.push(`### Test: ${entry.id}`); + parts.push(""); + parts.push(`**Request:** ${entry.userPrompt}`); + parts.push(""); + } + + const outputPath = path.join( + RESULTS_DIR, + `cursor-batch-${variant}-${category}.md`, + ); + fs.writeFileSync(outputPath, parts.join("\n")); + files.push({ path: outputPath, category, count: entries.length }); + } + + console.log( + `Generated ${files.length} batch files for ${totalTests} test cases (${variant}):`, + ); + for (const f of files) { + console.log(` ${path.basename(f.path)} (${f.count} tests)`); + } + console.log(""); + console.log("Next steps:"); + console.log( + " 1. Open each batch file and paste its content into Cursor chat", + ); + console.log( + ` 2. Save each response as: results/cursor-batch-${variant}-{category}-response.md`, + ); + console.log(` 3. Run: node scripts/parse-cursor-response.js ${variant}`); + console.log(" 4. Run: node scripts/score-outputs.js"); +} + +main(); diff --git a/packages/interact-evals/scripts/generate-prompts.js b/packages/interact-evals/scripts/generate-prompts.js new file mode 100644 index 00000000..ffb2b5c9 --- /dev/null +++ b/packages/interact-evals/scripts/generate-prompts.js @@ -0,0 +1,129 @@ +#!/usr/bin/env node + +/** + * Phase 1: Generate all eval prompts as ready-to-paste text files. 
+ * + * Usage: node scripts/generate-prompts.js + * Output: prompts are written to results/prompts/ + * + * For each test case, generates two files: + * - {id}__with-rules.txt (trigger rules + user prompt) + * - {id}__no-context.txt (user prompt only) + * + * You can then paste these into Cursor, ChatGPT, or any LLM and save the + * output into results/outputs/{id}__with-rules.txt etc. + */ + +const fs = require("fs"); +const path = require("path"); +const yaml = require("yaml"); +const { resolveRules } = require("./resolve-rules"); +const { generateTestId } = require("./test-id"); + +const EVALS_DIR = path.resolve(__dirname, ".."); +const RESULTS_DIR = path.join(EVALS_DIR, "results"); +const PROMPTS_OUT = path.join(RESULTS_DIR, "prompts"); +const OUTPUTS_DIR = path.join(RESULTS_DIR, "outputs"); + +function ensureDir(dir) { + if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true }); +} + +function loadTestFile(filePath) { + const content = fs.readFileSync(filePath, "utf8"); + return yaml.parse(content); +} + +function main() { + ensureDir(PROMPTS_OUT); + ensureDir(OUTPUTS_DIR); + + const testFiles = fs + .readdirSync(path.join(EVALS_DIR, "tests")) + .filter((f) => f.endsWith(".yaml")); + + const manifest = []; + let index = 0; + + for (const testFile of testFiles) { + const tests = loadTestFile(path.join(EVALS_DIR, "tests", testFile)); + const category = testFile.replace(".yaml", ""); + + for (const test of tests) { + index++; + const id = generateTestId(index, category, test.description); + const userPrompt = + typeof test.vars.prompt === "string" + ? test.vars.prompt.trim() + : test.vars.prompt; + + const rulesContent = test.vars.rules + ? resolveRules(test.vars.rules, EVALS_DIR) + : ""; + + const outputInstruction = + category === "integration" + ? "Generate valid @wix/interact code based on user requests. Include the COMPLETE code: imports, config, and HTML/JSX as requested. No explanatory text." 
+ : "Generate valid InteractConfig code based on user requests. Output ONLY the config object (or the interactions array and effects registry if applicable). Do not include imports, HTML, or explanatory text — just the config."; + + const withRulesPrompt = [ + "=== SYSTEM ===", + `You are an expert at the @wix/interact animation library. ${outputInstruction}`, + "", + rulesContent, + "", + "=== USER ===", + userPrompt, + ].join("\n"); + + const noContextPrompt = [ + "=== SYSTEM ===", + `You are an expert at the @wix/interact animation library. ${outputInstruction}`, + "", + "=== USER ===", + userPrompt, + ].join("\n"); + + fs.writeFileSync( + path.join(PROMPTS_OUT, `${id}__with-rules.txt`), + withRulesPrompt, + ); + fs.writeFileSync( + path.join(PROMPTS_OUT, `${id}__no-context.txt`), + noContextPrompt, + ); + + manifest.push({ + id, + category, + description: test.description, + prompt: userPrompt, + rulesRef: test.vars.rules || null, + expected_trigger: test.vars.expected_trigger || "", + expected: test.vars.expected || {}, + asserts: test.assert || [], + }); + } + } + + fs.writeFileSync( + path.join(RESULTS_DIR, "manifest.json"), + JSON.stringify(manifest, null, 2), + ); + + console.log( + `Generated ${index} test cases × 2 variants = ${index * 2} prompt files`, + ); + console.log(` Prompts: ${PROMPTS_OUT}/`); + console.log(` Outputs: ${OUTPUTS_DIR}/ (put LLM responses here)`); + console.log(` Manifest: ${path.join(RESULTS_DIR, "manifest.json")}`); + console.log(""); + console.log("Next steps:"); + console.log(" 1. Feed each prompt file to your LLM (Cursor, ChatGPT, etc.)"); + console.log( + " 2. Save each response in results/outputs/ with the SAME filename", + ); + console.log(" 3. 
Run: node scripts/score-outputs.js"); +} + +main(); diff --git a/packages/interact-evals/scripts/parse-cursor-response.js b/packages/interact-evals/scripts/parse-cursor-response.js new file mode 100644 index 00000000..d5b485f7 --- /dev/null +++ b/packages/interact-evals/scripts/parse-cursor-response.js @@ -0,0 +1,101 @@ +#!/usr/bin/env node + +/** + * Parse a Cursor batch response into individual output files for scoring. + * + * Usage: node scripts/parse-cursor-response.js [with-rules|without-rules] + * + * Reads: results/cursor-batch-{variant}-response.md + * Writes: results/outputs/{id}__{variant}.txt (one per test case) + */ + +const fs = require('fs'); +const path = require('path'); + +const EVALS_DIR = path.resolve(__dirname, '..'); +const RESULTS_DIR = path.join(EVALS_DIR, 'results'); +const OUTPUTS_DIR = path.join(RESULTS_DIR, 'outputs'); + +function ensureDir(dir) { + if (!fs.existsSync(dir)) fs.mkdirSync(dir, { recursive: true }); +} + +function parseResponseFile(filePath, variant) { + const content = fs.readFileSync(filePath, 'utf8'); + let count = 0; + + // Parse sections delimited by "=== ID: {id} ===" + const regex = /===\s*ID:\s*([^\s=]+)\s*===([\s\S]*?)(?=(?:===\s*ID:|$))/g; + let match; + + while ((match = regex.exec(content)) !== null) { + const id = match[1].trim(); + let code = match[2].trim(); + + // Strip markdown code fences if present + code = code.replace(/^```[\w]*\n?/, '').replace(/\n?```$/, '').trim(); + + const outputPath = path.join(OUTPUTS_DIR, `${id}__${variant}.txt`); + fs.writeFileSync(outputPath, code); + count++; + } + + if (count === 0) { + // Try alternative parsing: look for sections between "### Test:" headers + const altRegex = /###\s*Test:\s*([^\n]+)\n([\s\S]*?)(?=(?:###\s*Test:|$))/g; + while ((match = altRegex.exec(content)) !== null) { + const id = match[1].trim(); + let code = match[2].trim(); + code = code.replace(/^```[\w]*\n?/, '').replace(/\n?```$/, '').trim(); + const outputPath = path.join(OUTPUTS_DIR, 
`${id}__${variant}.txt`);
+      fs.writeFileSync(outputPath, code);
+      count++;
+    }
+  }
+
+  return count;
+}
+
+function main() {
+  const variant = process.argv[2] || 'with-rules';
+
+  ensureDir(OUTPUTS_DIR);
+
+  // Find all response files for this variant
+  const responseFiles = fs
+    .readdirSync(RESULTS_DIR)
+    .filter((f) => f.startsWith(`cursor-batch-${variant}-`) && f.endsWith('-response.md'))
+    .map((f) => path.join(RESULTS_DIR, f));
+
+  // Also check for a single combined response file. Guard against adding it
+  // twice: it already matches the prefix/suffix filter above.
+  const singleResponse = path.join(RESULTS_DIR, `cursor-batch-${variant}-response.md`);
+  if (fs.existsSync(singleResponse) && !responseFiles.includes(singleResponse)) {
+    responseFiles.push(singleResponse);
+  }
+
+  if (responseFiles.length === 0) {
+    console.error(`No response files found for variant "${variant}".`);
+    console.error('Expected files matching: results/cursor-batch-' + variant + '-*-response.md');
+    process.exit(1);
+  }
+
+  let totalCount = 0;
+  for (const filePath of responseFiles) {
+    const count = parseResponseFile(filePath, variant);
+    console.log(`  ${path.basename(filePath)}: ${count} outputs`);
+    totalCount += count;
+  }
+
+  if (totalCount === 0) {
+    console.error('Could not parse any test outputs from the response files.');
+    console.error('Expected format: "=== ID: {test-id} ===" followed by the config code.');
+    process.exit(1);
+  }
+
+  console.log(`\nTotal: ${totalCount} outputs parsed`);
+  console.log(`Written to: ${OUTPUTS_DIR}/`);
+  console.log('');
+  console.log('Next: node scripts/score-outputs.js');
+}
+
+main();
diff --git a/packages/interact-evals/scripts/resolve-rules.js b/packages/interact-evals/scripts/resolve-rules.js
new file mode 100644
index 00000000..8450eca1
--- /dev/null
+++ b/packages/interact-evals/scripts/resolve-rules.js
@@ -0,0 +1,91 @@
+/**
+ * Shared utility for resolving rule references.
+ * Supports both local file:// paths and https:// URLs (fetched via curl, cached locally). 
+ */ + +const fs = require('fs'); +const path = require('path'); +const { execSync } = require('child_process'); + +const CACHE_DIR = path.resolve(__dirname, '..', '.rules-cache'); + +const memoryCache = {}; + +function ensureCacheDir() { + if (!fs.existsSync(CACHE_DIR)) fs.mkdirSync(CACHE_DIR, { recursive: true }); +} + +function urlToCachePath(url) { + const slug = url + .replace(/^https?:\/\//, '') + .replace(/[^a-zA-Z0-9._-]/g, '_'); + return path.join(CACHE_DIR, slug); +} + +function fetchUrl(url) { + if (memoryCache[url]) return memoryCache[url]; + + ensureCacheDir(); + const cachePath = urlToCachePath(url); + + if (fs.existsSync(cachePath)) { + const content = fs.readFileSync(cachePath, 'utf8'); + memoryCache[url] = content; + return content; + } + + try { + const content = execSync(`curl -fsSL "${url}"`, { + encoding: 'utf8', + timeout: 15000, + }); + fs.writeFileSync(cachePath, content); + memoryCache[url] = content; + return content; + } catch (err) { + console.error(`Failed to fetch ${url}: ${err.message}`); + return ''; + } +} + +function readFileRef(ref, baseDir) { + const filePath = ref.replace(/^file:\/\//, ''); + const resolved = path.resolve(baseDir, filePath); + return fs.readFileSync(resolved, 'utf8'); +} + +/** + * Resolve a rules reference to its content string. + * @param {string} ref - A file:// path or https:// URL + * @param {string} baseDir - Base directory for resolving file:// paths + * @returns {string} The rule file content + */ +function resolveRules(ref, baseDir) { + if (!ref) return ''; + if (ref.startsWith('https://') || ref.startsWith('http://')) { + return fetchUrl(ref); + } + if (ref.startsWith('file://')) { + return readFileRef(ref, baseDir); + } + // Assume local path + const resolved = path.resolve(baseDir, ref); + if (fs.existsSync(resolved)) { + return fs.readFileSync(resolved, 'utf8'); + } + return ''; +} + +/** + * Clear the local rules cache. 
+ */ +function clearCache() { + if (fs.existsSync(CACHE_DIR)) { + for (const f of fs.readdirSync(CACHE_DIR)) { + fs.unlinkSync(path.join(CACHE_DIR, f)); + } + } + Object.keys(memoryCache).forEach((k) => delete memoryCache[k]); +} + +module.exports = { resolveRules, clearCache, CACHE_DIR }; diff --git a/packages/interact-evals/scripts/score-outputs.js b/packages/interact-evals/scripts/score-outputs.js new file mode 100644 index 00000000..960d13ab --- /dev/null +++ b/packages/interact-evals/scripts/score-outputs.js @@ -0,0 +1,435 @@ +#!/usr/bin/env node + +/** + * Phase 2: Score collected LLM outputs against assertions. + * No API key required — all scoring is local. + * + * Usage: node scripts/score-outputs.js + * + * Reads: + * - results/manifest.json (test case metadata) + * - results/outputs/*.txt (LLM outputs you collected) + * + * Outputs: + * - results/scores.json (detailed per-test scores) + * - Console summary (aggregated results) + */ + +const fs = require('fs'); +const path = require('path'); +const { resolveRules } = require('./resolve-rules'); + +const EVALS_DIR = path.resolve(__dirname, '..'); +const RESULTS_DIR = path.join(EVALS_DIR, 'results'); +const OUTPUTS_DIR = path.join(RESULTS_DIR, 'outputs'); + +const validConfig = require(path.join(EVALS_DIR, 'assertions', 'valid-config.js')); +const antiPatterns = require(path.join(EVALS_DIR, 'assertions', 'anti-patterns.js')); + +const triggerCheckers = {}; +for (const trigger of ['click', 'hover', 'viewenter', 'viewprogress', 'pointermove']) { + const filePath = path.join(EVALS_DIR, 'assertions', `${trigger}-checks.js`); + if (fs.existsSync(filePath)) { + triggerCheckers[trigger] = require(filePath); + } +} + +/** + * Estimate token count from text. + * Uses a word/symbol splitting heuristic that closely approximates + * BPE tokenizers (GPT/Claude) for code-heavy content. + */ +function estimateTokens(text) { + // Split on whitespace, then count sub-word tokens for long identifiers + // and punctuation. 
Roughly: 1 word ≈ 1.3 tokens for code. + const words = text.split(/\s+/).filter(Boolean); + let tokens = 0; + for (const word of words) { + if (word.length <= 4) { + tokens += 1; + } else if (word.length <= 10) { + tokens += Math.ceil(word.length / 4); + } else { + tokens += Math.ceil(word.length / 3.5); + } + } + return tokens; +} + +function runContainsAssert(output, assertion) { + const value = assertion.value; + if (typeof value === 'string') { + const pass = output.includes(value); + return { pass, score: pass ? 1 : 0, reason: pass ? `Contains "${value}"` : `Missing "${value}"` }; + } + return { pass: true, score: 1, reason: 'skip' }; +} + +function runContainsAnyAssert(output, assertion) { + const values = assertion.value; + if (Array.isArray(values)) { + const found = values.find((v) => output.includes(v)); + const pass = !!found; + return { + pass, + score: pass ? 1 : 0, + reason: pass ? `Contains "${found}"` : `Missing any of: ${values.join(', ')}`, + }; + } + return { pass: true, score: 1, reason: 'skip' }; +} + +function runNotContainsAssert(output, assertion) { + const value = assertion.value; + if (typeof value === 'string') { + const lower = output.toLowerCase(); + const pass = !lower.includes(value.toLowerCase()); + return { + pass, + score: pass ? 1 : 0, + reason: pass ? 
`Does not contain "${value}"` : `Unexpectedly contains "${value}"`, + }; + } + return { pass: true, score: 1, reason: 'skip' }; +} + +function scoreOutput(output, testCase) { + const results = []; + const vars = { + expected_trigger: testCase.expected_trigger, + expected: testCase.expected, + }; + + // Default assertions + results.push({ + name: 'structure', + ...validConfig(output, { vars }), + weight: 2, + }); + + results.push({ + name: 'anti_patterns', + ...antiPatterns(output, { vars }), + weight: 2, + }); + + results.push({ + name: 'compliance', + ...runNotContainsAssert(output, { value: "I can't" }), + weight: 1, + }); + + results.push({ + name: 'compliance_2', + ...runNotContainsAssert(output, { value: 'I cannot' }), + weight: 1, + }); + + // Trigger-specific checks + const trigger = testCase.expected_trigger; + const triggerKey = trigger ? trigger.toLowerCase() : ''; + if (triggerCheckers[triggerKey]) { + results.push({ + name: 'semantic', + ...triggerCheckers[triggerKey](output, { vars }), + weight: 3, + }); + } + + // Inline assertions from test case + for (const assert of testCase.asserts) { + if (assert.type === 'javascript') continue; // already handled above + + let result; + switch (assert.type) { + case 'contains': + result = runContainsAssert(output, assert); + break; + case 'contains-any': + result = runContainsAnyAssert(output, assert); + break; + case 'not-contains': + case 'not-icontains': + result = runNotContainsAssert(output, assert); + break; + default: + continue; + } + + results.push({ + name: assert.metric || assert.type, + ...result, + weight: assert.weight || 1, + }); + } + + // Weighted average + let totalWeight = 0; + let weightedScore = 0; + for (const r of results) { + totalWeight += r.weight; + weightedScore += r.score * r.weight; + } + const finalScore = totalWeight > 0 ? 
weightedScore / totalWeight : 0; + const allPassed = results.every((r) => r.pass); + + return { + score: Math.round(finalScore * 100) / 100, + pass: allPassed, + checks: results, + }; +} + +function main() { + const manifestPath = path.join(RESULTS_DIR, 'manifest.json'); + if (!fs.existsSync(manifestPath)) { + console.error('No manifest.json found. Run generate-prompts.js first.'); + process.exit(1); + } + + const manifest = JSON.parse(fs.readFileSync(manifestPath, 'utf8')); + + if (!fs.existsSync(OUTPUTS_DIR)) { + console.error('No results/outputs/ directory. Collect LLM outputs first.'); + process.exit(1); + } + + const outputFiles = fs.readdirSync(OUTPUTS_DIR).filter((f) => f.endsWith('.txt')); + if (outputFiles.length === 0) { + console.error('No output files found in results/outputs/. Collect LLM outputs first.'); + process.exit(1); + } + + const VARIANTS = ['with-rules', 'no-context']; + const VARIANT_LABELS = { + 'with-rules': 'With Rules', + 'no-context': 'No Context', + }; + + function buildInputText(testCase, variant) { + const isIntegration = testCase.category === 'integration'; + const instruction = isIntegration + ? 'You are an expert at the @wix/interact animation library. Generate valid @wix/interact code based on user requests. Include the COMPLETE code: imports, config, and HTML/JSX as requested. No explanatory text.' + : 'You are an expert at the @wix/interact animation library. Generate valid InteractConfig code based on user requests. Output ONLY the config object. 
Do not include imports, HTML, or explanatory text — just the config.'; + + const parts = [instruction]; + + if (variant === 'with-rules' && testCase.rulesRef) { + parts.push(resolveRules(testCase.rulesRef, EVALS_DIR)); + } + + parts.push(testCase.prompt || ''); + return parts.join('\n'); + } + + const allScores = []; + const categoryScores = {}; + const categoryOutputTokens = {}; + const categoryInputTokens = {}; + const variantScores = {}; + const variantOutputTokens = {}; + const variantInputTokens = {}; + for (const v of VARIANTS) { + variantScores[v] = []; + variantOutputTokens[v] = []; + variantInputTokens[v] = []; + } + + for (const testCase of manifest) { + for (const variant of VARIANTS) { + const filename = `${testCase.id}__${variant}.txt`; + const filePath = path.join(OUTPUTS_DIR, filename); + + if (!fs.existsSync(filePath)) continue; + + const output = fs.readFileSync(filePath, 'utf8'); + const result = scoreOutput(output, testCase); + const outputTokens = estimateTokens(output); + const inputText = buildInputText(testCase, variant); + const inputTokens = estimateTokens(inputText); + + const entry = { + id: testCase.id, + variant, + category: testCase.category, + description: testCase.description, + inputTokens, + outputTokens, + chars: output.length, + ...result, + }; + + allScores.push(entry); + variantScores[variant].push(result.score); + variantOutputTokens[variant].push(outputTokens); + variantInputTokens[variant].push(inputTokens); + + if (!categoryScores[testCase.category]) { + categoryScores[testCase.category] = {}; + categoryOutputTokens[testCase.category] = {}; + categoryInputTokens[testCase.category] = {}; + for (const v of VARIANTS) { + categoryScores[testCase.category][v] = []; + categoryOutputTokens[testCase.category][v] = []; + categoryInputTokens[testCase.category][v] = []; + } + } + categoryScores[testCase.category][variant].push(result.score); + categoryOutputTokens[testCase.category][variant].push(outputTokens); + 
categoryInputTokens[testCase.category][variant].push(inputTokens); + } + } + + // Save detailed results + fs.writeFileSync( + path.join(RESULTS_DIR, 'scores.json'), + JSON.stringify(allScores, null, 2), + ); + + // Determine which variants have data + const activeVariants = VARIANTS.filter((v) => variantScores[v].length > 0); + + const avg = (arr) => (arr.length ? (arr.reduce((a, b) => a + b, 0) / arr.length) : 0); + const pct = (n) => `${Math.round(n * 100)}%`; + + console.log(''); + console.log('╔════════════════════════════════════════════════════════════════════════════╗'); + console.log('║ @wix/interact Rules Eval Results ║'); + console.log('╠════════════════════════════════════════════════════════════════════════════╣'); + console.log(''); + + // Header + const colWidth = 13; + let header = ' ' + 'Category'.padEnd(18); + for (const v of activeVariants) header += VARIANT_LABELS[v].padStart(colWidth); + if (activeVariants.length >= 2) header += ' Delta(1v' + activeVariants.length + ')'; + console.log(header); + console.log(' ' + '─'.repeat(18 + activeVariants.length * colWidth + 12)); + + for (const [cat, scores] of Object.entries(categoryScores).sort()) { + let line = ' ' + cat.padEnd(18); + const avgs = {}; + for (const v of activeVariants) { + avgs[v] = avg(scores[v]); + line += pct(avgs[v]).padStart(colWidth); + } + if (activeVariants.length >= 2) { + const best = avgs[activeVariants[0]]; + const worst = avgs[activeVariants[activeVariants.length - 1]]; + const delta = best - worst; + const deltaStr = delta > 0 ? `+${pct(delta)}` : delta < 0 ? pct(delta) : '0%'; + const indicator = delta > 0.05 ? ' ✓' : delta < -0.05 ? 
' ✗' : ''; + line += ` ${deltaStr.padStart(5)}${indicator}`; + } + console.log(line); + } + + console.log(' ' + '─'.repeat(18 + activeVariants.length * colWidth + 12)); + + let overallLine = ' ' + 'OVERALL'.padEnd(18); + const overallAvgs = {}; + for (const v of activeVariants) { + overallAvgs[v] = avg(variantScores[v]); + overallLine += pct(overallAvgs[v]).padStart(colWidth); + } + if (activeVariants.length >= 2) { + const best = overallAvgs[activeVariants[0]]; + const worst = overallAvgs[activeVariants[activeVariants.length - 1]]; + const delta = best - worst; + const deltaStr = delta > 0 ? `+${pct(delta)}` : pct(delta); + overallLine += ` ${deltaStr.padStart(5)}`; + } + console.log(overallLine); + + // Token usage tables + const num = (n) => Math.round(n).toLocaleString(); + const divider = ' ' + '─'.repeat(18 + activeVariants.length * colWidth); + + // --- Input tokens --- + console.log(''); + console.log(' Avg Input Tokens (context window)'); + console.log(''); + + let inHeader = ' ' + 'Category'.padEnd(18); + for (const v of activeVariants) inHeader += VARIANT_LABELS[v].padStart(colWidth); + console.log(inHeader); + console.log(divider); + + for (const [cat, tokens] of Object.entries(categoryInputTokens).sort()) { + let line = ' ' + cat.padEnd(18); + for (const v of activeVariants) { + line += num(avg(tokens[v])).padStart(colWidth); + } + console.log(line); + } + + console.log(divider); + + let inOverall = ' ' + 'OVERALL (avg)'.padEnd(18); + for (const v of activeVariants) { + inOverall += num(avg(variantInputTokens[v])).padStart(colWidth); + } + console.log(inOverall); + + let inTotal = ' ' + 'TOTAL'.padEnd(18); + for (const v of activeVariants) { + const sum = variantInputTokens[v].reduce((a, b) => a + b, 0); + inTotal += num(sum).padStart(colWidth); + } + console.log(inTotal); + + // --- Output tokens --- + console.log(''); + console.log(' Avg Output Tokens'); + console.log(''); + + let outHeader = ' ' + 'Category'.padEnd(18); + for (const v of 
activeVariants) outHeader += VARIANT_LABELS[v].padStart(colWidth); + console.log(outHeader); + console.log(divider); + + for (const [cat, tokens] of Object.entries(categoryOutputTokens).sort()) { + let line = ' ' + cat.padEnd(18); + for (const v of activeVariants) { + line += num(avg(tokens[v])).padStart(colWidth); + } + console.log(line); + } + + console.log(divider); + + let outOverall = ' ' + 'OVERALL (avg)'.padEnd(18); + for (const v of activeVariants) { + outOverall += num(avg(variantOutputTokens[v])).padStart(colWidth); + } + console.log(outOverall); + + let outTotal = ' ' + 'TOTAL'.padEnd(18); + for (const v of activeVariants) { + const sum = variantOutputTokens[v].reduce((a, b) => a + b, 0); + outTotal += num(sum).padStart(colWidth); + } + console.log(outTotal); + + console.log(''); + const scoredParts = activeVariants.map((v) => `${variantScores[v].length} ${VARIANT_LABELS[v].toLowerCase()}`); + console.log(` Scored: ${allScores.length} outputs (${scoredParts.join(', ')})`); + console.log(` Details: ${path.join(RESULTS_DIR, 'scores.json')}`); + console.log(''); + + // Print per-test failures + const failures = allScores.filter((s) => !s.pass); + if (failures.length > 0) { + console.log(' Failed tests:'); + for (const f of failures) { + const failedChecks = f.checks.filter((c) => !c.pass).map((c) => c.reason).join('; '); + console.log(` ✗ [${f.variant}] ${f.description}`); + console.log(` Score: ${pct(f.score)} — ${failedChecks}`); + } + console.log(''); + } + + console.log('╚════════════════════════════════════════════════════════════════════════════╝'); +} + +main(); diff --git a/packages/interact-evals/scripts/test-id.js b/packages/interact-evals/scripts/test-id.js new file mode 100644 index 00000000..59d6cbfa --- /dev/null +++ b/packages/interact-evals/scripts/test-id.js @@ -0,0 +1,31 @@ +/** + * Shared test ID generation. Used by cursor-batch.js and generate-prompts.js + * to ensure IDs are consistent across the pipeline. 
+ */ + +function slugify(str) { + return str + .toLowerCase() + .replace(/[^a-z0-9]+/g, '-') + .replace(/^-|-$/g, '') + .slice(0, 60); +} + +/** + * Strip expected-answer hints from test descriptions before slugifying. + * Removes parenthetical hints like "(alternate + SlideIn)" and + * effect type suffixes like "with keyframeEffect". + */ +function stripHints(description) { + return description + .replace(/\s*\([^)]*\)\s*/g, ' ') + .replace(/\s+with\s+(named\s*effect|keyframe\s*effect|transition|custom\s*effect)\b/gi, '') + .replace(/\s+/g, ' ') + .trim(); +} + +function generateTestId(index, category, description) { + return `${String(index).padStart(2, '0')}-${category}-${slugify(stripHints(description))}`; +} + +module.exports = { slugify, stripHints, generateTestId }; diff --git a/packages/interact-evals/tests/click.yaml b/packages/interact-evals/tests/click.yaml new file mode 100644 index 00000000..bc5833fd --- /dev/null +++ b/packages/interact-evals/tests/click.yaml @@ -0,0 +1,110 @@ +- description: 'Click — hamburger menu toggle (alternate + SlideIn)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/click.md + prompt: > + Create a hamburger menu button that slides in a navigation sidebar from + the left when clicked, and slides it back out when clicked again. + expected_trigger: click + expected: + params_type: alternate + effect_type: namedEffect + cross_target: true + has_fill_both: true + has_reversed: true + has_duration: true + has_easing: true + assert: + - type: javascript + value: file://assertions/click-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['SlideIn', 'slideIn'] + weight: 1 + metric: effect_choice + +- description: 'Click — theme toggle (transition + method toggle)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/click.md + prompt: > + Create a dark mode toggle button. 
When clicked, transition the page + background to #1a1a1a, text color to white, and border-color to #333. + expected_trigger: click + expected: + params_method: toggle + effect_type: transition + cross_target: true + has_duration: true + assert: + - type: javascript + value: file://assertions/click-checks.js + weight: 3 + metric: semantic + - type: contains + value: 'background-color' + weight: 1 + metric: completeness + - type: contains-any + value: ['white', '#fff', '#ffffff'] + weight: 1 + metric: completeness + +- description: 'Click — button pulse feedback (repeat)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/click.md + prompt: > + Make a "Save" button pulse with a scale-up animation each time it is + clicked, restarting from scratch every time. + expected_trigger: click + expected: + params_type: repeat + has_duration: true + assert: + - type: javascript + value: file://assertions/click-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['scale', 'Scale', 'Pulse'] + weight: 1 + metric: effect_choice + +- description: 'Click — play/pause spinner (state)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/click.md + prompt: > + Create a button that toggles a loading spinner between playing and paused + states. The spinner should rotate continuously (infinite loop) when + playing. + expected_trigger: click + expected: + params_type: state + has_duration: true + assert: + - type: javascript + value: file://assertions/click-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['Infinity', 'iterations'] + weight: 1 + metric: effect_choice + +- description: 'Click — accordion expand/collapse' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/click.md + prompt: > + Build an accordion where clicking the header expands the content section + with a reveal animation, and clicking again collapses it.
+ expected_trigger: click + expected: + params_type: alternate + cross_target: true + has_fill_both: true + has_reversed: true + has_duration: true + assert: + - type: javascript + value: file://assertions/click-checks.js + weight: 3 + metric: semantic diff --git a/packages/interact-evals/tests/hover.yaml b/packages/interact-evals/tests/hover.yaml new file mode 100644 index 00000000..c4b545fb --- /dev/null +++ b/packages/interact-evals/tests/hover.yaml @@ -0,0 +1,101 @@ +- description: 'Hover — card lift with shadow' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/hover.md + prompt: > + Create a hover effect on a product card: when the mouse enters, the card + lifts up (translateY) and gains a deeper box-shadow. Reverse on leave. + expected_trigger: hover + expected: + effect_type: keyframeEffect + has_fill_both: true + has_duration: true + has_transform: true + assert: + - type: javascript + value: file://assertions/hover-checks.js + weight: 3 + metric: semantic + - type: contains + value: 'translateY' + weight: 1 + metric: completeness + - type: contains-any + value: ['boxShadow', 'box-shadow'] + weight: 1 + metric: completeness + +- description: 'Hover — button scale' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/hover.md + prompt: > + Make a button scale up to 1.1x on hover. 300ms ease-out. + expected_trigger: hover + expected: + has_duration: true + has_transform: true + assert: + - type: javascript + value: file://assertions/hover-checks.js + weight: 3 + metric: semantic + - type: contains + value: 'scale' + weight: 1 + metric: completeness + +- description: 'Hover — image zoom on hover' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/hover.md + prompt: > + When hovering over a gallery card, zoom the image inside it to 1.15x + scale. The image is a child of the card. 
+ expected_trigger: hover + expected: + has_fill_both: true + has_duration: true + has_transform: true + assert: + - type: javascript + value: file://assertions/hover-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['scale(1.15)', 'scale(1.1', '1.15'] + weight: 1 + metric: completeness + +- description: 'Hover — multi-target hover (card + icon)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/hover.md + prompt: > + On hovering over a navigation item, simultaneously scale the item's icon + and change the background color of the item itself. + expected_trigger: hover + expected: + cross_target: true + has_duration: true + assert: + - type: javascript + value: file://assertions/hover-checks.js + weight: 3 + metric: semantic + +- description: 'Hover — color change with named effect' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/hover.md + prompt: > + Create a hover interaction on a CTA button that changes its background + color to blue (#2563eb) and text color to white on hover, reverting on + leave. 
+ expected_trigger: hover + expected: + has_duration: true + assert: + - type: javascript + value: file://assertions/hover-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['background-color', 'backgroundColor', '#2563eb'] + weight: 1 + metric: completeness diff --git a/packages/interact-evals/tests/integration.yaml b/packages/interact-evals/tests/integration.yaml new file mode 100644 index 00000000..23cbe95a --- /dev/null +++ b/packages/interact-evals/tests/integration.yaml @@ -0,0 +1,91 @@ +- description: 'Integration — full web setup with HTML + config' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/integration.md + prompt: > + Generate a complete @wix/interact web setup: import statements, a config + with a hover scale effect on a button, and the matching HTML with + interact-element wrapping. + expected_trigger: hover + expected: {} + assert: + - type: contains-any + value: ["@wix/interact/web", "@wix/interact"] + weight: 2 + metric: setup + - type: contains + value: 'Interact.create' + weight: 2 + metric: setup + - type: contains + value: 'interact-element' + weight: 2 + metric: html + - type: contains + value: 'data-interact-key' + weight: 2 + metric: html + +- description: 'Integration — reusable effects registry' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/integration.md + prompt: > + Create an InteractConfig that defines a reusable "scale-up" effect in + the effects registry, then references it from two different interactions + (a button hover and a card hover) using effectId. 
+ expected_trigger: hover + expected: {} + assert: + - type: contains + value: 'effects:' + weight: 2 + metric: registry + - type: contains + value: 'effectId' + weight: 2 + metric: registry + +- description: 'Integration — conditions with media queries' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/integration.md + prompt: > + Create an InteractConfig with a conditions section that includes a + desktop-only media query (min-width: 1024px) and a prefers-reduced-motion + condition. Apply both conditions to a viewEnter entrance animation. + expected_trigger: viewEnter + expected: {} + assert: + - type: contains + value: 'conditions' + weight: 2 + metric: conditions + - type: contains-any + value: ['min-width', 'prefers-reduced-motion'] + weight: 2 + metric: conditions + - type: contains + value: 'viewEnter' + weight: 1 + metric: trigger + +- description: 'Integration — React setup with Interaction component' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/integration.md + prompt: > + Generate a React component that uses @wix/interact/react. It should + render an Interaction component with a click toggle interaction that + slides in a sidebar. 
+ expected_trigger: click + expected: {} + assert: + - type: contains-any + value: ["@wix/interact/react", "interact/react"] + weight: 2 + metric: setup + - type: contains + value: 'Interaction' + weight: 2 + metric: component + - type: contains + value: 'interactKey' + weight: 2 + metric: component diff --git a/packages/interact-evals/tests/pointermove.yaml b/packages/interact-evals/tests/pointermove.yaml new file mode 100644 index 00000000..01ccd347 --- /dev/null +++ b/packages/interact-evals/tests/pointermove.yaml @@ -0,0 +1,92 @@ +- description: 'PointerMove — 3D tilt card (namedEffect)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/pointermove.md + prompt: > + Create an interactive product card that tilts in 3D following the mouse + cursor when hovering over it. + expected_trigger: pointerMove + expected: + hit_area: self + effect_type: namedEffect + named_effect: Tilt3DMouse + assert: + - type: javascript + value: file://assertions/pointermove-checks.js + weight: 3 + metric: semantic + +- description: 'PointerMove — global cursor follower' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/pointermove.md + prompt: > + Create a decorative dot element that follows the mouse cursor anywhere + on the page with a slight lag/floaty feel. 
+ expected_trigger: pointerMove + expected: + hit_area: root + effect_type: namedEffect + has_centeredToTarget: false + assert: + - type: javascript + value: file://assertions/pointermove-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['TrackMouse', 'AiryMouse'] + weight: 1 + metric: effect_choice + +- description: 'PointerMove — multi-layer parallax' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/pointermove.md + prompt: > + Create a multi-layer pointer parallax effect on a hero section with + 3 layers (background, midground, foreground) that move at different + speeds following the mouse. Each layer should use a different distance. + expected_trigger: pointerMove + expected: + hit_area: self + effect_type: namedEffect + multi_layer: true + assert: + - type: javascript + value: file://assertions/pointermove-checks.js + weight: 3 + metric: semantic + +- description: 'PointerMove — custom magnetic button effect' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/pointermove.md + prompt: > + Create a magnetic button effect where the button subtly moves toward + the mouse cursor when hovering over it. Use a customEffect with + distance calculations. + expected_trigger: pointerMove + expected: + hit_area: self + effect_type: customEffect + assert: + - type: javascript + value: file://assertions/pointermove-checks.js + weight: 3 + metric: semantic + +- description: 'PointerMove — horizontal slider with keyframeEffect' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/pointermove.md + prompt: > + Create a horizontal slider-like interaction where moving the mouse + left to right over a container moves a slider element from 0px to + 220px along the x axis. Use keyframeEffect with axis mapping. 
+ expected_trigger: pointerMove + expected: + effect_type: keyframeEffect + assert: + - type: javascript + value: file://assertions/pointermove-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ["axis: 'x'", 'axis: "x"'] + weight: 2 + metric: axis_config diff --git a/packages/interact-evals/tests/viewenter.yaml b/packages/interact-evals/tests/viewenter.yaml new file mode 100644 index 00000000..b68ff7e1 --- /dev/null +++ b/packages/interact-evals/tests/viewenter.yaml @@ -0,0 +1,100 @@ +- description: 'ViewEnter — hero fade-in once' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewenter.md + prompt: > + Make a hero section fade in when it scrolls into view. The animation + should play once and never repeat. + expected_trigger: viewEnter + expected: + params_type: once + effect_type: namedEffect + named_effect: FadeIn + has_duration: true + assert: + - type: javascript + value: file://assertions/viewenter-checks.js + weight: 3 + metric: semantic + +- description: 'ViewEnter — staggered card grid entrance' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewenter.md + prompt: > + Create a staggered entrance for a grid of 3 cards. Each card should + slide in from the bottom with increasing delays (0ms, 150ms, 300ms) + when the section scrolls into view. + expected_trigger: viewEnter + expected: + params_type: once + has_duration: true + staggered: true + assert: + - type: javascript + value: file://assertions/viewenter-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['SlideIn', 'translateY'] + weight: 1 + metric: effect_choice + +- description: 'ViewEnter — repeating counter animation' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewenter.md + prompt: > + Create a counter animation that replays each time the stats section + scrolls into view. 
Use a separate observer element from the animated + counter. + expected_trigger: viewEnter + expected: + params_type: repeat + cross_target: true + assert: + - type: javascript + value: file://assertions/viewenter-checks.js + weight: 3 + metric: semantic + +- description: 'ViewEnter — ambient floating loop (state)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewenter.md + prompt: > + Start a continuously floating animation (translateY up and down) on + decorative icons when they enter the viewport. The animation should loop + infinitely and pause when the element leaves the viewport. + expected_trigger: viewEnter + expected: + params_type: state + has_duration: true + assert: + - type: javascript + value: file://assertions/viewenter-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['Infinity', 'iterations'] + weight: 1 + metric: effect_choice + +- description: 'ViewEnter — chained entrance (title then content)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewenter.md + prompt: > + When a section enters the viewport, first fade in the title, then after + it finishes, slide in the content paragraph. Use animation chaining. 
+ expected_trigger: viewEnter + expected: + params_type: once + assert: + - type: javascript + value: file://assertions/viewenter-checks.js + weight: 3 + metric: semantic + - type: contains + value: 'animationEnd' + weight: 2 + metric: chaining + - type: contains + value: 'effectId' + weight: 1 + metric: chaining diff --git a/packages/interact-evals/tests/viewprogress.yaml b/packages/interact-evals/tests/viewprogress.yaml new file mode 100644 index 00000000..cb9dee1c --- /dev/null +++ b/packages/interact-evals/tests/viewprogress.yaml @@ -0,0 +1,104 @@ +- description: 'ViewProgress — background parallax' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewprogress.md + prompt: > + Create a parallax effect on a hero section background that moves slower + than the page scroll using ParallaxScroll named effect. + expected_trigger: viewProgress + expected: + effect_type: namedEffect + named_effect: ParallaxScroll + range_name: cover + has_easing_linear: true + cross_target: true + assert: + - type: javascript + value: file://assertions/viewprogress-checks.js + weight: 3 + metric: semantic + +- description: 'ViewProgress — content fade on entry range' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewprogress.md + prompt: > + Create a scroll-driven fade effect on a content block that gradually + fades in as it enters the viewport. Use the entry range. 
+ expected_trigger: viewProgress + expected: + range_name: entry + assert: + - type: javascript + value: file://assertions/viewprogress-checks.js + weight: 3 + metric: semantic + - type: contains-any + value: ['FadeScroll', 'opacity'] + weight: 1 + metric: effect_choice + +- description: 'ViewProgress — custom keyframe scroll (entry + exit)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewprogress.md + prompt: > + Create a scroll-driven animation that fades in and slides up a card as + it enters the viewport, using custom keyframes (translateY and opacity). + Use the entry range from 0% to 70%. + expected_trigger: viewProgress + expected: + effect_type: keyframeEffect + range_name: entry + assert: + - type: javascript + value: file://assertions/viewprogress-checks.js + weight: 3 + metric: semantic + - type: contains + value: 'translateY' + weight: 1 + metric: completeness + - type: contains + value: 'opacity' + weight: 1 + metric: completeness + +- description: 'ViewProgress — multi-range (entry + exit phases)' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewprogress.md + prompt: > + Create a complex scroll animation for a section: content fades in + during entry, and fades out + scales down during exit. Use two separate + effects with entry and exit ranges. + expected_trigger: viewProgress + expected: {} + assert: + - type: javascript + value: file://assertions/viewprogress-checks.js + weight: 3 + metric: semantic + - type: contains + value: 'entry' + weight: 1 + metric: completeness + - type: contains + value: 'exit' + weight: 1 + metric: completeness + +- description: 'ViewProgress — scroll counter with customEffect' + vars: + rules: https://raw.githubusercontent.com/wix/interact/master/packages/interact/rules/viewprogress.md + prompt: > + Create a scroll-driven counter that updates a percentage display from + 0% to 100% as you scroll through the stats section. 
Use a customEffect. + expected_trigger: viewProgress + expected: + effect_type: customEffect + assert: + - type: javascript + value: file://assertions/viewprogress-checks.js + weight: 3 + metric: semantic + - type: contains + value: 'customEffect' + weight: 2 + metric: effect_type diff --git a/yarn.lock b/yarn.lock index c104a8dd..ad555658 100644 --- a/yarn.lock +++ b/yarn.lock @@ -1373,6 +1373,14 @@ __metadata: languageName: unknown linkType: soft +"@wix/interact-evals@workspace:packages/interact-evals": + version: 0.0.0-use.local + resolution: "@wix/interact-evals@workspace:packages/interact-evals" + dependencies: + yaml: "npm:^2.7.0" + languageName: unknown + linkType: soft + "@wix/interact@npm:^2.0.1, @wix/interact@workspace:packages/interact": version: 0.0.0-use.local resolution: "@wix/interact@workspace:packages/interact" @@ -6255,6 +6263,15 @@ __metadata: languageName: node linkType: hard +"yaml@npm:^2.7.0": + version: 2.8.2 + resolution: "yaml@npm:2.8.2" + bin: + yaml: bin.mjs + checksum: 10/4eab0074da6bc5a5bffd25b9b359cf7061b771b95d1b3b571852098380db3b1b8f96e0f1f354b56cc7216aa97cea25163377ccbc33a2e9ce00316fe8d02f4539 + languageName: node + linkType: hard + "yocto-queue@npm:^0.1.0": version: 0.1.0 resolution: "yocto-queue@npm:0.1.0"