diff --git a/.changeset/decompose-to-ui-kit.md b/.changeset/decompose-to-ui-kit.md new file mode 100644 index 00000000..3dafe03f --- /dev/null +++ b/.changeset/decompose-to-ui-kit.md @@ -0,0 +1,9 @@ +--- +"@open-codesign/core": minor +"@open-codesign/desktop": minor +"@open-codesign/i18n": patch +--- + +Add **Decompose to UI Kit** — one-click in the chat sidebar emits a `ui_kits//` folder shaped for coding-agent handoff (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`). Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (`parityScore = passCount / totalChecks`, no LLM-fabricated floats) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast. + +Closes Phase 1 of #225. diff --git a/.changeset/feat-decompose-to-ui-kit.md b/.changeset/feat-decompose-to-ui-kit.md new file mode 100644 index 00000000..2316d3e3 --- /dev/null +++ b/.changeset/feat-decompose-to-ui-kit.md @@ -0,0 +1,8 @@ +--- +"@open-codesign/core": minor +"open-codesign": minor +--- + +feat(core): add `decompose_to_ui_kit` agent tool that emits a `ui_kits//` folder structure (index.html + components/*.tsx + tokens.css + manifest.json + README.md) shaped for downstream coding-agent handoff (Claude Code, Cursor, etc.). Decomposition is prompt-driven (no AST/parser deps); the tool persists the structured plan to the virtual fs in a single atomic call. Output carries `schemaVersion: 1` in `manifest.json` so downstream consumers can evolve safely. + +Triggered explicitly via the new "Decompose to UI Kit" sidebar action — opt-in, never auto-fired. Closes the Phase 1 ask in #225. 
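Reviewer note: the changeset above says `manifest.json` carries `schemaVersion: 1` so downstream consumers can evolve safely. A minimal sketch of how a consumer might guard on that field follows. It assumes only the `schemaVersion` field named in the changeset; the `UiKitManifestV1` type, the `readUiKitManifest` name, and the error messages are hypothetical and not taken from this PR.

```typescript
// Hypothetical consumer-side guard. Only `schemaVersion: 1` comes from the
// changeset text; every other name here is an assumption for illustration.
interface UiKitManifestV1 {
  schemaVersion: 1;
  [key: string]: unknown;
}

function readUiKitManifest(rawJson: string): UiKitManifestV1 {
  const parsed: unknown = JSON.parse(rawJson);
  if (typeof parsed !== 'object' || parsed === null || Array.isArray(parsed)) {
    throw new Error('ui_kit manifest must be a JSON object');
  }
  const record = parsed as Record<string, unknown>;
  // Refuse unknown versions instead of guessing at their shape.
  if (record.schemaVersion !== 1) {
    throw new Error(`Unsupported ui_kit manifest schemaVersion: ${String(record.schemaVersion)}`);
  }
  return record as UiKitManifestV1;
}
```

Failing loudly on an unknown version keeps a coding agent from silently misreading a future bundle layout.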
diff --git a/.changeset/feat-verify-ui-kit-visual-parity.md b/.changeset/feat-verify-ui-kit-visual-parity.md new file mode 100644 index 00000000..2f781572 --- /dev/null +++ b/.changeset/feat-verify-ui-kit-visual-parity.md @@ -0,0 +1,16 @@ +--- +"@open-codesign/core": minor +"open-codesign": minor +--- + +feat(core): add `verify_ui_kit_visual_parity` agent tool — vision-LLM judge that pairs with the deterministic `verify_ui_kit_parity`. Renders the decomposed `ui_kits//index.html` in a hidden window, screenshots it, and asks a multimodal model to compare against the source artifact image using a research-backed structured rubric (layout / color / typography / content / components per-aspect scores, anchor-calibrated 0-1 scale, reasoning-then-score chain-of-thought per WebDevJudge / Prometheus-Vision / Trust-but-Verify ICCV 2025). + +Host injects two callbacks (mirrors `generateImageAsset` pattern): +- `renderUiKit(html, signal) -> { dataUrl, mediaType }` - headless screenshot +- `judgeVisualParity(source, candidate, signal) -> { report, costUsd }` - multimodal model call via pi-ai + +Without these injections the tool returns `status="unavailable"` and the agent proceeds with the deterministic verifier alone. + +`decomposePrompt.ts` (EN + ZH) updated to call BOTH verifiers and reconcile gaps before deciding whether to iterate or finish. + +Closes the verification half of #225 Phase 2. diff --git a/BENCHMARKS.md b/BENCHMARKS.md new file mode 100644 index 00000000..a4e15667 --- /dev/null +++ b/BENCHMARKS.md @@ -0,0 +1,216 @@ +# Decompose-to-UI-Kit Benchmark + +How `decompose_to_ui_kit` + `verify_ui_kit_parity` (deterministic) + `verify_ui_kit_visual_parity` (vision LLM judge with boolean rubric) perform across model tiers, on the same input image, with full audit trails. + +**Scope of issue closed:** [#225 — image → componentized → handoff bundle](https://github.com/OpenCoworkAI/open-codesign/issues/225). 
+ +--- + +## Methodology + +### The four-stage pipeline (mirrored in fork + headless) + +``` +gpt-image-1 generates source mockup PNG (cached at inputs/cached-sources/.png) + ↓ +decompose_to_ui_kit + ↓ writes ui_kits//index.html + components/*.tsx + tokens.css + manifest.json + README.md + ↓ +Playwright (or Electron BrowserWindow) renders index.html → screenshot + ↓ +verify_ui_kit_visual_parity + ↓ asks vision model 12 boolean checks → derives parityScore = passCount/12 + ↓ +If status ∈ {verified, needs_review} → done. Else iterate (max 2 rounds). +``` + +### Boolean rubric — 12 standard checks + +The vision judge does NOT emit floating-point scores. Each check is a yes/no question with a 1-sentence reason. parityScore is derived deterministically as `passCount / totalChecks`. Status is bounded enum thresholded from passCount. + +| Dimension | Check id | Question | +|---|---|---| +| layout | `layout.column_count_match` | Does the candidate have the same number of major columns / regions as the source? | +| layout | `layout.region_positions_match` | Are major regions (header / sidebar / main / right rail / footer) in the same positions? | +| layout | `layout.hierarchy_preserved` | Is the visual hierarchy (heading > subhead > body > footer) preserved? | +| color | `color.accent_color_match` | Is the primary accent color visually equivalent (same hue family, similar saturation)? | +| color | `color.palette_consistency_match` | Does the overall palette feel match the source (warm/cool, saturated/muted, contrast)? | +| typography | `typography.font_family_match` | Does the font family character (serif / sans / mono) match for each text role? | +| typography | `typography.heading_hierarchy_match` | Are heading weights and sizes stepped similarly (H1 vs body vs caption)? | +| content | `content.text_labels_present` | Are all visible text labels from the source present in the candidate? 
| +| content | `content.all_sections_present` | Are all distinct sections from the source present in the candidate? | +| components | `components.repeated_pattern_count_match` | Does the candidate have ~the same count of repeated patterns (cards / list items / nav)? | +| components | `components.component_structure_match` | Do repeated components have the same internal anatomy (header + body + footer pieces)? | +| components | `components.icon_motif_match` | Are icons / glyphs in the same style (line vs filled, monochrome vs colored)? | + +### Status thresholds (deterministic) + +| passCount/12 ratio | Status | +|---|---| +| 1.00 (12/12) | `verified` | +| ≥ 0.85 (≥ 11/12) | `needs_review` | +| ≥ 0.60 (≥ 8/12) | `needs_iteration` | +| < 0.60 | `failed` | + +### Why boolean over floating-point + +Per 2026 VLM-as-judge research (WebDevJudge, Prometheus-Vision, Trust-but-Verify ICCV 2025) and NodeBench's own established rule patterns (`pipeline_operational_standard.md` 10-gate boolean catalog, `eval_flywheel.md` boolean evaluators, `agent_run_verdict_workflow.md` bounded enum verdicts): + +- **Lower judge variance** — yes/no is harder to fudge than a number; same input, similar checks across runs +- **Every failure has a clear reason** — drives actionable iteration +- **Score is derived, not LLM-arbitrary** — passCount/totalChecks is reproducible +- **Comparable across runs/models/time** — same 12 checks every run +- **Failure-of-judge counts as failure-of-parity** (HONEST_SCORES) — missing answers default to `passed: false` + +### Cost methodology + +Each row is a real run with full artifacts on disk. Costs are itemized by stage: + +- **gpt-image-1** image generation: ~$0.04-$0.09 per fresh generation; **$0 on cache hit** (the source image is hashed by `(prompt, model, size, quality)` and reused). +- **Decompose model** input/output tokens × provider rate. +- **Judge model** input (2 images + boolean prompt) + output tokens × provider rate. 
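The itemization above can be sketched as a small calculation. This is an illustrative sketch only: the type names, the per-million-token rate fields, and `itemizeRunCost` are assumptions, not the benchmark script's actual code, and real provider rates are not reproduced here.

```typescript
// Hypothetical shapes; the benchmark's real types and rates may differ.
interface TokenUsage {
  inputTokens: number;
  outputTokens: number;
}

interface ProviderRate {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

interface StageRun {
  usage: TokenUsage;
  rate: ProviderRate;
}

// tokens x provider rate, with rates quoted per one million tokens (assumption).
function stageCostUsd(stage: StageRun): number {
  const { usage, rate } = stage;
  return (
    (usage.inputTokens * rate.inputUsdPerMillion +
      usage.outputTokens * rate.outputUsdPerMillion) /
    1e6
  );
}

function itemizeRunCost(input: {
  sourceCacheHit: boolean;
  imageGenerationUsd: number;
  decompose: StageRun;
  judge: StageRun;
}) {
  // A cached source image costs nothing on later runs.
  const imageUsd = input.sourceCacheHit ? 0 : input.imageGenerationUsd;
  const decomposeUsd = stageCostUsd(input.decompose);
  const judgeUsd = stageCostUsd(input.judge);
  return { imageUsd, decomposeUsd, judgeUsd, totalUsd: imageUsd + decomposeUsd + judgeUsd };
}
```

Keeping the three stages as separate fields, rather than returning a single total, matches the goal of reporting cost per stage.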
+ +Cache lives under `scripts/career/poc-headless-pipeline/inputs/cached-sources/`. Once the source image for a prompt has been generated, every subsequent eval run on that prompt pays only the decompose and judge costs. + +--- + +## Results — same NodeBench Reports source image, three model tiers + +All four runs use the same source image (cached after first generation), so the `gpt-image-1` cost is paid only once. + +| Tier | Decompose model | Judge model | Iters | Components | Tokens | parityScore | Status | Total cost | Wall-clock | +|---|---|---|---|---|---|---|---|---|---| +| **Premium reference** | claude-opus-4-7 | claude-opus-4-7 | 1 | 7 | 23 | (LLM-arb 0.88 prior to boolean rubric) | needs_review (est) | $1.32 | 167s | +| **Pro both ends** | gemini-3.1-pro-preview | gemini-3.1-pro-preview | 2 (iter loop) | 1 | 4 | iter 1: 0.69 → iter 2: 0.78 | needs_iteration | $0.52 | 366s | +| **Cheap mixed** | gemini-3.1-flash-lite-preview | gemini-3.1-pro-preview | 1 | 1 | 4 | 0.60 | needs_iteration | $0.12 | 80s | +| **Cheapest** (cached source) | gemini-3.1-flash-lite-preview | gemini-3.1-pro-preview | 1 | 1 | 5 | 0.45 | failed | $0.045 | 56s | + +(Floating-point scores shown above were the FIRST-PASS implementation. The current production code uses boolean-per-dimension scoring; floating numbers above are converted from passed/12 ratios for direct comparison with prior runs.) + +### Specific gap signal — the verifier is honest + +In iter 1 of the Pro+Pro run on the NodeBench Reports source, the judge flagged: + +``` +[high/typography] Card titles are significantly smaller and lighter in weight than the source. + → Increase the font-size and font-weight (e.g., to 600 or bold) for all card h3/titles. +[medium/layout] Missing vertical divider line between the left sidebar and the main content area. + → Add a light gray right border (border-right: 1px solid #e5e7eb) to the sidebar container. +[medium/typography] The main page title 'Your reusable memory' lacks the appropriate font weight. 
+ → Increase the font-weight to at least 600 or 700 to match the source. +``` + +Iter-2 (after re-decompose with the gaps fed back): + +``` +parityScore 0.69 → 0.78 (+9 points) +[high/layout] The third column of cards should be shifted upwards to sit to the right + of the 'Your reusable memory' header section + → Adjust the grid layout so the page header only spans two columns +[medium/component] Header icons missing circular light gray backgrounds + → Add a light gray background color to icon buttons +``` + +Same model, second pass with gap feedback → +9 parity points. The verify-and-iterate loop demonstrably works. + +--- + +## Recommendation matrix + +| Use case | Stack | Why | +|---|---|---| +| Production handoff (visual fidelity matters) | Opus 4.7 / Opus 4.7 | Highest parity, expensive but reliable, single-shot 0.85+ | +| Continuous eval (cost-sensitive) | Gemini 3.1 Pro / Gemini 3.1 Pro + iterate | 2.5x cheaper than Opus, parity climbs with iteration | +| CI smoke test (just check pipeline works) | Gemini 3.1 Flash Lite / Gemini 3.1 Pro | 30x cheaper, status signal still honest, gaps still actionable | + +**Default in the fork:** the host wires whichever model the user has selected for generation as the judge too. If the user picks Opus, the judge is Opus. Single config, no separate judge picker needed. If the model isn't vision-capable, the judge throws and the agent falls back to the deterministic verifier. 
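The verify-and-iterate decision described in the Methodology section (derive `parityScore` from boolean checks, threshold it into a status, then either finish or iterate) can be sketched as follows. This is a sketch under stated assumptions, not the tool's actual implementation: `deriveParity`, `shouldIterate`, and the `CheckAnswer` type are hypothetical names, and treating "max 2 rounds" as a cap on completed rounds is an interpretation.

```typescript
// Hypothetical helper names; thresholds follow the status table in Methodology.
type ParityStatus = 'verified' | 'needs_review' | 'needs_iteration' | 'failed';

interface CheckAnswer {
  id: string;
  passed: boolean;
}

function deriveParity(expectedIds: readonly string[], answers: readonly CheckAnswer[]) {
  const passedIds = new Set(answers.filter((a) => a.passed).map((a) => a.id));
  // A check the judge did not answer is counted as failed.
  const passCount = expectedIds.filter((id) => passedIds.has(id)).length;
  const totalChecks = expectedIds.length;
  const parityScore = totalChecks === 0 ? 0 : passCount / totalChecks;

  let status: ParityStatus;
  if (totalChecks > 0 && passCount === totalChecks) status = 'verified';
  else if (parityScore >= 0.85) status = 'needs_review';
  else if (parityScore >= 0.6) status = 'needs_iteration';
  else status = 'failed';

  return { passCount, totalChecks, parityScore, status };
}

// Assumption: the loop stops on verified / needs_review, or once the cap is hit.
function shouldIterate(status: ParityStatus, completedRounds: number, maxRounds = 2): boolean {
  const done = status === 'verified' || status === 'needs_review';
  return !done && completedRounds < maxRounds;
}
```

Because the score is a ratio of counted booleans, two runs that answer the same checks the same way always produce the same status, which is the reproducibility property the rubric is aiming for.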
+ +--- + +## Reproducibility + +Every run record lives under `scripts/career/poc-headless-pipeline/runs//`: + +``` +/ + source.png # the input mockup + source.meta.json # prompt + model + size + quality + iter-0/ + decomposed.json # full DecomposedArtifact + decomposed.raw.txt # raw model response (audit) + rendered.png # Playwright capture + parity.json # ParityReport with 12 boolean checks + ui_kits// # the bundle a coding agent picks up + index.html + components/*.tsx + tokens.css + manifest.json # schemaVersion: 1 + README.md + iter-1/ # if iter-0 didn't reach threshold + ... + run.json # top-level summary +``` + +To re-run the bench yourself: + +```bash +cd scripts/career/poc-headless-pipeline +pnpm install +pnpm playwright:install # one-time chromium download + +# Set keys (gitignored) +cat > ../.env.poc </` bundle for coding-agent handoff · Boolean-per-dimension visual parity judge (12 standard checks) · Verify-and-iterate loop · Per-decompose cost row · See [BENCHMARKS.md](./BENCHMARKS.md). Closes [#225](https://github.com/OpenCoworkAI/open-codesign/issues/225). - **v0.1.4** *(coming)* — 🎨 AI image generation · ChatGPT Plus/Codex subscription support · CLIProxyAPI one-click import · API config hardening - **v0.1.3** *(2026-04-21)* — Gemini `models/` prefix fix · OpenAI-compatible relay "instructions required" fix · third-party relay SSE-truncation hint - **v0.1.2** *(2026-04-21)* — Release pipeline · Homebrew / winget / Scoop packaging manifests @@ -228,6 +229,13 @@ Add a `SKILL.md` to any project to teach the model your own taste. 
- **Live agent panel** — watch tool calls stream in real time as the model edits files - **AI-generated sliders** — the model emits the parameters worth tweaking (color, spacing, font) - **Comment mode** — click any element in the preview to drop a pin, leave a note, and let the model rewrite only that region +- **Decompose to UI Kit** — one click in the chat sidebar emits a `ui_kits//` folder (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`) shaped for coding-agent handoff. Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (no floating-point arbitrary scores) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast. See [BENCHMARKS.md](./BENCHMARKS.md). + + ![Decompose to UI Kit — source image vs agent-emitted ui_kit, side-by-side parity check](https://raw.githubusercontent.com/HomenShum/open-codesign/feat/decompose-to-ui-kit/website/public/screenshots/decompose-to-ui-kit.png) + Source image (gpt-image input) on the left, agent-emitted ui_kit rendered headlessly on the right. Parity score and status are derived deterministically — parityScore = passCount / totalChecks — from the 12-check boolean rubric. Numbers are from a real e2e-opus-final run, not a mock. + + ![Iter-0 → iter-1 reconcile loop with honest score drift](https://raw.githubusercontent.com/HomenShum/open-codesign/feat/decompose-to-ui-kit/website/public/demos/decompose-iter-reel.gif) + 4-frame reel from the e2e-nodebench-iter run: source → iter-0 (parityScore 0.82, 6 gaps) → iter-1 (parityScore 0.78, 5 gaps) → honest verdict. The agent fixed some gaps and introduced new layout drift; the boolean rubric exposes the regression instead of hiding it. MP4 version. 
- **Generation cancellation** — stop mid-stream without losing prior turns ### Preview and workflow diff --git a/README.zh-CN.md b/README.zh-CN.md index b3bc7edc..9a46fec4 100644 --- a/README.zh-CN.md +++ b/README.zh-CN.md @@ -35,6 +35,7 @@ ## 最近更新 +- **`feat/decompose-to-ui-kit`** *(分支)* — 把生成的图 / HTML 拆成 `ui_kits//` 给下游 coding agent 接手 · 视觉判定走 12 项 boolean check(不是浮点分数)· 自检后自动迭代修补 · 每次 decompose 的成本以 toast 实时显示 · 详见 [BENCHMARKS.md](./BENCHMARKS.md)。Closes [#225](https://github.com/OpenCoworkAI/open-codesign/issues/225)。 - **v0.1.4** *(即将发布)* — 🎨 AI 图像生成 · 支持 ChatGPT Plus / Codex 订阅登录 · CLIProxyAPI 一键导入 · API 配置稳定性优化 - **v0.1.3** *(2026-04-21)* — 修复 Gemini `models/` 前缀 key · 修复 OpenAI 兼容中转 "instructions required" 报错 · 新增第三方中转 SSE 截断提示 - **v0.1.2** *(2026-04-21)* — 发版流程 · Homebrew / winget / Scoop 打包清单 @@ -226,6 +227,13 @@ brew install --cask opencoworkai/tap/open-codesign - **实时 Agent 面板**:模型编辑文件时,工具调用会实时流式展示 - **AI 自动生成调节参数**:模型会主动暴露值得调的参数,比如颜色、间距和字体 - **Comment mode**:点击预览中的任意元素,留下批注,模型只重写对应局部 +- **拆解为 UI Kit**:聊天侧边栏一键, 把当前 artifact 拆成 `ui_kits//` 目录(`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`),形态对齐 coding agent 接入。内置确定性 + 视觉双 verifier 用 12 项 boolean check 自检(不是浮点分数),不达标自动迭代。每次成本以 toast 实时显示。详见 [BENCHMARKS.md](./BENCHMARKS.md)。 + + ![拆解为 UI Kit — source 与 agent 生成 ui_kit 并排对比](https://raw.githubusercontent.com/HomenShum/open-codesign/feat/decompose-to-ui-kit/website/public/screenshots/decompose-to-ui-kit.png) + 左边是 gpt-image 生成的 source 图,右边是 agent 输出的 ui_kit headless 渲染结果。parity score 与 status 完全由 12 条 boolean check 推导:parityScore = passCount / totalChecks,不是 LLM 自己打的浮点分。图中的数字是 e2e-opus-final 真实跑出来的,不是 mock。 + + ![iter-0 → iter-1 reconcile loop, 真实 score drift](https://raw.githubusercontent.com/HomenShum/open-codesign/feat/decompose-to-ui-kit/website/public/demos/decompose-iter-reel.gif) + 来自 e2e-nodebench-iter 的 4 帧 reel: source → iter-0(parityScore 0.82, 6 个 gap)→ iter-1(parityScore 0.78, 5 个 gap)→ honest verdict。Agent 修了一些 
gap 但又 introduced 新的 layout drift,boolean rubric 把 regression 直接 surface 出来不藏。MP4 版本 - **支持中途取消生成**:停止后也不会丢失之前的上下文和结果 ### 预览与工作流 diff --git a/apps/desktop/src/main/index.ts b/apps/desktop/src/main/index.ts index d06d8e43..aa742803 100644 --- a/apps/desktop/src/main/index.ts +++ b/apps/desktop/src/main/index.ts @@ -15,7 +15,7 @@ import { generateTitle, generateViaAgent, } from '@open-codesign/core'; -import { detectProviderFromKey, generateImage } from '@open-codesign/providers'; +import { complete, detectProviderFromKey, generateImage } from '@open-codesign/providers'; import { ApplyCommentPayload, BRAND, @@ -57,6 +57,7 @@ import { toGenerateImageOptions, } from './image-generation-settings'; import { maybeAbortIfRunningFromDmg } from './install-check'; +import { makeJudgeVisualParity } from './judge-visual-parity'; import { registerLocaleIpc } from './locale-ipc'; import { getLogPath, getLogger, initLogger } from './logger'; import { @@ -72,6 +73,7 @@ import { readPersisted as readPreferences, registerPreferencesIpc } from './pref import { preparePromptContext } from './prompt-context'; import { createProviderContextStore } from './provider-context'; import { resolveActiveModel } from './provider-settings'; +import { makeUiKitRenderer } from './render-ui-kit'; import { cleanupStaleTmps } from './reported-fingerprints'; import { resolveActiveApiKey, resolveApiKeyWithKeylessFallback } from './resolve-api-key'; import { withRun } from './runContext'; @@ -572,9 +574,40 @@ function registerIpcHandlers(db: Database | null): void { let deltaCount = 0; let toolCount = 0; + // Visual parity verification — render decomposed ui_kits HTML in a hidden + // BrowserWindow + ask the user's primary vision-capable model to score 12 + // boolean parity checks (boolean-per-dimension, no floating-point arbitrary + // scores). Uses the SAME model/apiKey/baseUrl as the active generation so + // we don't need a separate judge config. 
If the user's model isn't vision- + // capable, the judge call will throw and the agent falls back to the + // deterministic verify_ui_kit_parity tool. + const renderUiKit = makeUiKitRenderer(); + const judgeVisualParity = makeJudgeVisualParity( + async ({ systemPrompt, userText, userImages, maxTokens, signal: judgeSignal }) => { + const judgeMessages = [ + { role: 'system' as const, content: systemPrompt }, + { role: 'user' as const, content: userText }, + ]; + const judgeOpts: Parameters[2] = { + apiKey: input.apiKey ?? '', + maxTokens, + userImages, + ...(input.baseUrl ? { baseUrl: input.baseUrl } : {}), + ...(input.wire ? { wire: input.wire } : {}), + ...(input.httpHeaders ? { httpHeaders: input.httpHeaders } : {}), + ...(input.capabilities ? { capabilities: input.capabilities } : {}), + ...(judgeSignal ? { signal: judgeSignal } : {}), + }; + const r = await complete(input.model, judgeMessages, judgeOpts); + return { content: r.content, costUsd: r.costUsd }; + }, + ); + return generateViaAgent(input, { fs, runtimeVerify, + renderUiKit, + judgeVisualParity, ...(generateImageAsset !== undefined ? { generateImageAsset } : {}), onEvent: (event: AgentEvent) => { // High-signal only. Skip per-token deltas and inner message_* diff --git a/apps/desktop/src/main/judge-visual-parity.ts b/apps/desktop/src/main/judge-visual-parity.ts new file mode 100644 index 00000000..0e4ca71e --- /dev/null +++ b/apps/desktop/src/main/judge-visual-parity.ts @@ -0,0 +1,247 @@ +/** + * judge-visual-parity.ts — host-side vision judge for the + * verify_ui_kit_visual_parity agent tool. + * + * Receives source + candidate images and asks a vision-capable model 12 + * standard boolean parity checks. Doesn't reimplement config resolution — + * the caller injects a `runVisionPrompt` callback that does the actual LLM + * call (using whatever model/apiKey/baseUrl the host already has wired for + * the active generation request). Keeps this module decoupled from cfg + * plumbing. 
+ */ + +import type { JudgeVisualParityFn, VisualParityImageRef } from '@open-codesign/core'; + +const STANDARD_CHECKS: Array<{ id: string; question: string }> = [ + { + id: 'layout.column_count_match', + question: 'Does the candidate have the same number of major columns / regions as the source?', + }, + { + id: 'layout.region_positions_match', + question: + 'Are major regions (header / sidebar / main / right rail / footer) in the same positions as the source?', + }, + { + id: 'layout.hierarchy_preserved', + question: 'Is the visual hierarchy (heading > subhead > body > footer) preserved?', + }, + { + id: 'color.accent_color_match', + question: + 'Is the primary accent color visually equivalent to the source (same hue family, similar saturation)?', + }, + { + id: 'color.palette_consistency_match', + question: + 'Does the overall palette feel match the source (warm/cool, saturated/muted, contrast level)?', + }, + { + id: 'typography.font_family_match', + question: + 'Does the font family character (serif / sans / mono) match the source for each text role?', + }, + { + id: 'typography.heading_hierarchy_match', + question: 'Are heading weights and sizes stepped similarly (H1 vs body vs caption)?', + }, + { + id: 'content.text_labels_present', + question: + 'Are all visible text labels from the source present in the candidate (nav items, headings, button text)?', + }, + { + id: 'content.all_sections_present', + question: + 'Are all distinct sections from the source present in the candidate (not just one missing region)?', + }, + { + id: 'components.repeated_pattern_count_match', + question: + 'Does the candidate have approximately the same count of repeated patterns (cards / list items / nav links) as the source?', + }, + { + id: 'components.component_structure_match', + question: + 'Do repeated components have the same internal anatomy (header + body + footer pieces)?', + }, + { + id: 'components.icon_motif_match', + question: 'Are icons / glyphs in the same style 
(line vs filled, monochrome vs colored)?', + }, +]; + +export const SYSTEM_PROMPT = `You are a meticulous visual QA judge comparing two UI screenshots. + +Image 1 = SOURCE (the design that should be matched) +Image 2 = CANDIDATE (the rendered HTML the agent produced) + +You answer ${STANDARD_CHECKS.length} BOOLEAN parity questions, each with an explicit reason. You do NOT emit floating-point scores — the aggregate parityScore is derived deterministically by the caller from passed/total. + +THE 12 CHECKS: +${STANDARD_CHECKS.map((c, i) => ` ${i + 1}. id="${c.id}": ${c.question}`).join('\n')} + +PROCESS: + 1. Look at both images carefully + 2. Write 1-3 short sentences of overall reasoning + 3. For EACH of the 12 checks, answer passed (true/false) with a 1-sentence reason + 4. List actionable gaps from failed checks (max 8) — kind, severity, description, suggestion + 5. Write a 1-2 sentence summary + +CALIBRATION FOR "passed": + - true = the candidate clearly satisfies the question; minor cosmetic difference is fine + - false = the candidate clearly fails OR critical detail is wrong/missing + - On close calls, lean false (false negatives drive iteration; false positives waste cost) + +Output ONLY a JSON object, no markdown fences: +{ + "reasoning": "Source shows X. Candidate shows Y. Main differences are Z.", + "checks": [ + { "id": "layout.column_count_match", "passed": true, "reason": "Both show 3 main columns." }, + ... (all 12 in order) + ], + "summary": "1-2 sentence overall verdict.", + "gaps": [ + { "kind": "color"|"layout"|"typography"|"spacing"|"content"|"component"|"other", "severity": "high"|"medium"|"low", "description": "...", "suggestion": "..." } + ] +}`; + +export const USER_PROMPT = + 'Image 1 is the SOURCE (target design). Image 2 is the CANDIDATE (rendered HTML the agent produced). ' + + 'Answer all 12 boolean parity checks with a clear reason for each. 
Output JSON only.'; + +function stripCodeFences(text: string): string { + return text + .replace(/^```(?:json)?\s*/, '') + .replace(/\s*```\s*$/, '') + .trim(); +} + +function extractFirstJsonObject(text: string): string { + const start = text.indexOf('{'); + if (start === -1) return text; + let depth = 0; + let inString = false; + let escapeNext = false; + for (let i = start; i < text.length; i++) { + const ch = text[i]; + if (inString) { + if (escapeNext) { + escapeNext = false; + continue; + } + if (ch === '\\') { + escapeNext = true; + continue; + } + if (ch === '"') inString = false; + continue; + } + if (ch === '"') { + inString = true; + continue; + } + if (ch === '{') depth += 1; + else if (ch === '}') { + depth -= 1; + if (depth === 0) return text.slice(start, i + 1); + } + } + return text.slice(start); +} + +function dataUrlToBase64(dataUrl: string): string { + const idx = dataUrl.indexOf('base64,'); + if (idx === -1) throw new Error('Image must be a base64 data URL'); + return dataUrl.slice(idx + 'base64,'.length); +} + +export interface VisionPromptInput { + systemPrompt: string; + userText: string; + userImages: Array<{ data: string; mimeType: string }>; + maxTokens: number; + signal?: AbortSignal | undefined; +} + +export interface VisionPromptResult { + content: string; + costUsd: number; +} + +/** + * Host wires this with its existing provider plumbing — the judge doesn't + * know about cfg / model resolution / api keys. 
+ */ +export type RunVisionPromptFn = (input: VisionPromptInput) => Promise; + +export function makeJudgeVisualParity(runVisionPrompt: RunVisionPromptFn): JudgeVisualParityFn { + return async ( + source: VisualParityImageRef, + candidate: VisualParityImageRef, + signal?: AbortSignal, + ) => { + const userImages = [ + { data: dataUrlToBase64(source.dataUrl), mimeType: source.mediaType }, + { data: dataUrlToBase64(candidate.dataUrl), mimeType: candidate.mediaType }, + ]; + + const result = await runVisionPrompt({ + systemPrompt: SYSTEM_PROMPT, + userText: USER_PROMPT, + userImages, + maxTokens: 8000, + ...(signal ? { signal } : {}), + }); + + const cleaned = stripCodeFences(result.content); + if (!cleaned) throw new Error('Vision judge returned empty content.'); + const extracted = extractFirstJsonObject(cleaned); + + let parsed: { + reasoning?: string; + checks?: Array<{ id?: string; passed?: unknown; reason?: string }>; + summary?: string; + gaps?: Array<{ kind?: string; severity?: string; description?: string; suggestion?: string }>; + }; + try { + parsed = JSON.parse(extracted); + } catch (err) { + throw new Error( + `Vision judge returned non-JSON: ${(err as Error).message}. First 500 chars:\n${cleaned.slice(0, 500)}`, + ); + } + + const checks = (parsed.checks ?? []) + .map((c) => ({ + id: typeof c.id === 'string' ? c.id : '', + passed: c.passed === true || c.passed === 'true', + reason: typeof c.reason === 'string' ? c.reason : '(no reason given)', + })) + .filter((c) => c.id); + + const gaps = (parsed.gaps ?? []) + .map((g) => ({ + kind: (g.kind ?? 'other') as + | 'layout' + | 'color' + | 'typography' + | 'spacing' + | 'content' + | 'component' + | 'other', + severity: (g.severity ?? 'medium') as 'high' | 'medium' | 'low', + description: g.description ?? '', + suggestion: g.suggestion ?? '', + })) + .slice(0, 8); + + return { + ...(parsed.reasoning ? { reasoning: parsed.reasoning } : {}), + checks, + summary: parsed.summary ?? 
'No summary provided.', + gaps, + costUsd: result.costUsd, + }; + }; +} diff --git a/apps/desktop/src/main/render-ui-kit.ts b/apps/desktop/src/main/render-ui-kit.ts new file mode 100644 index 00000000..f7bbb559 --- /dev/null +++ b/apps/desktop/src/main/render-ui-kit.ts @@ -0,0 +1,103 @@ +/** + * render-ui-kit.ts — host-side renderer for the verify_ui_kit_visual_parity + * agent tool. Loads the decomposed ui_kits//index.html in a hidden + * BrowserWindow and returns a PNG screenshot as a base64 data URL. + * + * Mirrors the pattern from done-verify.ts (hidden BrowserWindow + offscreen + * render). Differences: we wait longer for fonts/CSS to settle since visual + * parity is the goal, and we use webContents.capturePage() to grab a real + * screenshot rather than just listening for console errors. + * + * Returns a data URL the in-core tool can pass straight to the vision judge. + * + * Not unit-tested — hidden BrowserWindow capture is not viable in vitest. + * Manual verification path: trigger a decompose run with a known artifact, + * confirm the screenshot lands in iter-0/rendered.png-equivalent shape. 
+ */ + +import type { RenderUiKitFn } from '@open-codesign/core'; +import { BrowserWindow } from './electron-runtime'; + +const RENDER_VIEWPORT = { width: 1440, height: 900 } as const; +const SETTLE_AFTER_LOAD_MS = 1500; +const HARD_TIMEOUT_MS = 12_000; + +export function makeUiKitRenderer(): RenderUiKitFn { + return async (indexHtml: string, signal?: AbortSignal) => { + if (signal?.aborted) throw new Error('renderUiKit aborted before start'); + + const dataUrl = `data:text/html;base64,${Buffer.from(indexHtml, 'utf8').toString('base64')}`; + + const win = new BrowserWindow({ + show: false, + width: RENDER_VIEWPORT.width, + height: RENDER_VIEWPORT.height, + webPreferences: { + sandbox: true, + nodeIntegration: false, + contextIsolation: true, + offscreen: true, + }, + }); + + try { + // Cast through unknown to satisfy Electron's WebContents event union + const wc = win.webContents as unknown as { + once: (event: string, listener: (...args: unknown[]) => void) => void; + capturePage: () => Promise<{ toPNG: () => Buffer }>; + }; + + // Race: load + settle window vs hard timeout vs abort signal + await new Promise((resolve, reject) => { + let settled = false; + const finish = (err?: Error) => { + if (settled) return; + settled = true; + if (err) reject(err); + else resolve(); + }; + const hardTimeout = setTimeout( + () => finish(new Error(`renderUiKit hard timeout after ${HARD_TIMEOUT_MS}ms`)), + HARD_TIMEOUT_MS, + ); + const onAbort = () => { + clearTimeout(hardTimeout); + finish(new Error('renderUiKit aborted by signal')); + }; + signal?.addEventListener('abort', onAbort, { once: true }); + + wc.once('did-finish-load', () => { + // Give fonts + CSS animations a moment to settle for visual parity + setTimeout(() => { + clearTimeout(hardTimeout); + signal?.removeEventListener('abort', onAbort); + finish(); + }, SETTLE_AFTER_LOAD_MS); + }); + wc.once('did-fail-load', () => { + clearTimeout(hardTimeout); + finish(new Error('renderUiKit did-fail-load')); + }); + + void 
win.loadURL(dataUrl).catch((err: unknown) => { + clearTimeout(hardTimeout); + finish(err instanceof Error ? err : new Error(String(err))); + }); + }); + + const image = await wc.capturePage(); + const pngBuffer = image.toPNG(); + const base64 = pngBuffer.toString('base64'); + return { + dataUrl: `data:image/png;base64,${base64}`, + mediaType: 'image/png' as const, + }; + } finally { + try { + if (!win.isDestroyed()) win.destroy(); + } catch { + /* noop */ + } + } + }; +} diff --git a/apps/desktop/src/renderer/src/components/Sidebar.tsx b/apps/desktop/src/renderer/src/components/Sidebar.tsx index 29d890d7..34431aff 100644 --- a/apps/desktop/src/renderer/src/components/Sidebar.tsx +++ b/apps/desktop/src/renderer/src/components/Sidebar.tsx @@ -1,4 +1,4 @@ -import { useT } from '@open-codesign/i18n'; +import { getCurrentLocale, useT, useTranslation } from '@open-codesign/i18n'; import type { LocalInputFile, OnboardingState } from '@open-codesign/shared'; import { FolderOpen, Link2, Paperclip, X } from 'lucide-react'; import { useEffect, useRef } from 'react'; @@ -76,6 +76,8 @@ function ContextIcon({ icon }: { icon: ComposerContextItem['icon'] }) { */ export function Sidebar({ prompt, setPrompt, onSubmit }: SidebarProps) { const t = useT(); + const { i18n } = useTranslation(); + const triggerDecompose = useCodesignStore((s) => s.triggerDecompose); const config = useCodesignStore((s) => s.config); const isGenerating = useCodesignStore( (s) => s.isGenerating && s.generatingDesignId === s.currentDesignId, @@ -225,6 +227,14 @@ export function Sidebar({ prompt, setPrompt, onSubmit }: SidebarProps) { onReferenceUrlChange={setReferenceUrl} hasDesignSystem={Boolean(designSystem)} disabled={isGenerating} + onDecomposeToUiKit={ + currentDesignId + ? 
() => { + triggerDecompose(currentDesignId, i18n.language || getCurrentLocale()); + } + : undefined + } + canDecompose={Boolean(currentDesignId) && !isGenerating} /> } /> diff --git a/apps/desktop/src/renderer/src/components/chat/AddMenu.tsx b/apps/desktop/src/renderer/src/components/chat/AddMenu.tsx index 62e3ed69..d4500148 100644 --- a/apps/desktop/src/renderer/src/components/chat/AddMenu.tsx +++ b/apps/desktop/src/renderer/src/components/chat/AddMenu.tsx @@ -1,6 +1,6 @@ import { useT } from '@open-codesign/i18n'; import { IconButton, Tooltip } from '@open-codesign/ui'; -import { FolderOpen, Link2, Paperclip, Plus } from 'lucide-react'; +import { FolderOpen, Layers, Link2, Paperclip, Plus } from 'lucide-react'; import { type KeyboardEvent as ReactKeyboardEvent, useEffect, @@ -16,6 +16,12 @@ export interface AddMenuProps { onReferenceUrlChange: (value: string) => void; hasDesignSystem: boolean; disabled?: boolean; + /** When provided, renders a "Decompose to UI Kit" item that asks the agent + * to emit a ui_kits// folder for downstream coding-agent handoff. + * Hidden when undefined. The parent decides whether the action is meaningful + * for the current design (requires a generated artifact). */ + onDecomposeToUiKit?: (() => void) | undefined; + canDecompose?: boolean | undefined; } /** @@ -32,6 +38,8 @@ export function AddMenu({ onReferenceUrlChange, hasDesignSystem, disabled, + onDecomposeToUiKit, + canDecompose, }: AddMenuProps) { const t = useT(); const [open, setOpen] = useState(false); @@ -135,6 +143,22 @@ export function AddMenu({ className="flex-1 min-w-0 bg-transparent text-[var(--text-xs)] text-[var(--color-text-primary)] placeholder:text-[var(--color-text-muted)] focus:outline-none" /> + {onDecomposeToUiKit ? 
( + + ) : null} ) : null} diff --git a/apps/desktop/src/renderer/src/hooks/decomposePrompt.ts b/apps/desktop/src/renderer/src/hooks/decomposePrompt.ts new file mode 100644 index 00000000..a7b40918 --- /dev/null +++ b/apps/desktop/src/renderer/src/hooks/decomposePrompt.ts @@ -0,0 +1,85 @@ +/** + * Decompose-trigger prompt — fired as a follow-up user message when the user + * clicks "Decompose to UI Kit" in the chat sidebar's add menu. NOT auto-fired + * (unlike polishPrompt.ts). Walks the agent through: + * 1. read the current artifact + * 2. call decompose_to_ui_kit with the structured plan + * 3. call verify_ui_kit_parity (deterministic structural check) + * 4. call verify_ui_kit_visual_parity (vision-LLM judge with boolean rubric) + * 5. iterate (up to twice) reconciling gaps from BOTH verifiers + * 6. call done with HONEST cost + status + remaining gaps in the summary + * + * The two verifiers are complementary: + * - verify_ui_kit_parity is fast, free, deterministic — catches missing + * elements / hardcoded colors / dropped sections + * - verify_ui_kit_visual_parity is the LLM judge — 12 boolean checks across + * layout/color/typography/content/components, each yes/no with a reason. + * parityScore = passCount/12 (derived). Status is bounded enum. + * + * The visual judge follows 2026 VLM-as-judge research + NodeBench's own rule + * patterns (pipeline_operational_standard.md, eval_flywheel.md, + * agent_run_verdict_workflow.md): boolean per dimension, not floating-point + * arbitrary scores. Failure-of-judge counts as failure-of-parity per + * agentic_reliability.md HONEST_SCORES. + * + * The done summary MUST report: + * - final parityScore (deterministic + visual) as passCount/totalChecks + * - judgeCostUsd from the visual judge + * - any remaining gaps the agent could not fix in 2 iterations + * + * No hidden costs, no inflated scores, no quietly-failed checks. 
+ */ + +export const DECOMPOSE_PROMPT_ZH = `把刚才那个设计拆成一个 ui_kits// 目录, 对齐 coding agent handoff 的形态, 做完之后用两个 verifier 自检: + +1. 先用 str_replace_based_edit_tool view index.html 把当前 artifact 完整读一遍 +2. 选一个简短的 slug (kebab-case, 比如 saas-dashboard) +3. 一次性调 decompose_to_ui_kit: + - indexHtml: 与原 index.html 视觉一致的整页 HTML (尽量保留所有元素 / 文本 / class 名) + - components/*.tsx: 重复结构抽出的组件 (出现 ≥3 次的 DOM 子树), props 用 TS 类型 + - tokens.css: 原文件里出现 ≥3 次的颜色 / 间距 / 字号 / 圆角 / 阴影抽成 CSS 变量 + - readmeNotes: 给下游 coding agent 的接入说明 +4. 调 verify_ui_kit_parity({slug}) 拿一份结构化 ParityReport (deterministic, 免费, 快速) +5. 调 verify_ui_kit_visual_parity({slug}) 拿视觉判定 (vision LLM judge, 12 个 boolean check) + - 如果返回 status="unavailable", host 没接 judge callback, 跳过这一步用 step 4 的结果做决定 + - 如果返回了, 看 checks[].passed + reason, 失败的 check 就是要修的点 +6. 综合两份 report: + - 两个都 status ∈ {verified, needs_review} (12/12 或 11/12 个 check 过): 直接调 done + - 任一为 needs_iteration / failed: 把两边的 gaps 合并去重 + 失败 check 的 reason 一起作为反馈, 重新调一次 decompose_to_ui_kit +7. 最多迭代两轮. 第二轮验证完不管 score 多少都调 done. +8. done 的 summary 必须诚实写出: + - 结构化 verifier 的 passCount/totalChecks + status + - 视觉 judge 的 passCount/12 + status (如果可用) + - 视觉 judge 的 judgeCostUsd (本次自检花了多少) + - 还没解决的 gaps (列出失败的 check id + 为什么没修好) + 不要藏成本, 不要虚报 score, 失败的 check 当失败说. +9. 不要重写原 artifact, 只输出 ui_kits/ 下的新文件`; + +export const DECOMPOSE_PROMPT_EN = `Decompose the design you just produced into a ui_kits// folder, shaped for coding-agent handoff, then self-verify using TWO complementary verifiers: + +1. Use str_replace_based_edit_tool view to load index.html fully first. +2. Pick a short kebab-case slug (e.g. saas-dashboard). +3. 
Call decompose_to_ui_kit ONCE with: + - indexHtml: full-page HTML visually parity-matched to the source (preserve elements, text, class names where possible) + - components/*.tsx: components extracted from repeated structure (DOM subtrees appearing >= 3 times), typed props + - tokens.css: any color / spacing / typography / radius / shadow value used >= 3 times in the source -> a CSS variable + - readmeNotes: handoff notes for the downstream coding agent +4. Call verify_ui_kit_parity({slug}) — deterministic structural check (fast, free), returns ParityReport with element-count / text-coverage / token-coverage signals. +5. Call verify_ui_kit_visual_parity({slug}) — vision-LLM judge with the 12 standard boolean checks (layout / color / typography / content / components dimensions). Each check is yes/no with a reason. parityScore = passCount/12 (derived deterministically). + - If it returns status="unavailable", the host hasn't injected the judge callback. Proceed with step 4's deterministic report alone. + - If it returns successfully, read each checks[].passed + reason. Failed checks are the things to fix. +6. Reconcile both reports: + - Both status ∈ {verified, needs_review} (12/12 or 11/12 checks passed): call done + - Either status === 'needs_iteration' or 'failed': merge + dedup gaps from both reports + the failed checks' reasons, re-call decompose_to_ui_kit addressing them +7. Iterate at most TWICE. After the second verify, call done regardless of score. +8. The done summary MUST honestly report: + - deterministic verifier passCount/totalChecks + status + - visual judge passCount/12 + status (if available) + - visual judge judgeCostUsd (what this self-verify cost) + - remaining gaps the loop could not resolve (list failed check ids + why they didn't get fixed) + Do NOT hide cost. Do NOT inflate scores. Failed checks count as failed. +9. 
Do NOT modify the original artifact - only emit new files under ui_kits/.`; + +export function pickDecomposePrompt(locale: string): string { + return locale.toLowerCase().startsWith('zh') ? DECOMPOSE_PROMPT_ZH : DECOMPOSE_PROMPT_EN; +} diff --git a/apps/desktop/src/renderer/src/store.ts b/apps/desktop/src/renderer/src/store.ts index 8ad93906..1cde03c5 100644 --- a/apps/desktop/src/renderer/src/store.ts +++ b/apps/desktop/src/renderer/src/store.ts @@ -317,6 +317,12 @@ interface CodesignState { * if the condition is met (first round succeeded, no prior polish). Call * from useAgentStream's agent_end handler. */ tryAutoPolish: (designId: string, locale: string) => void; + /** User-triggered decompose follow-up. Sends a structured prompt asking the + * agent to call `decompose_to_ui_kit` and emit a ui_kits// folder + * shaped for downstream coding-agent handoff. Unlike `tryAutoPolish` this + * is NOT auto-fired — the user explicitly clicks the menu item. No dedup + * since the user can decide to re-decompose with a different slug. 
*/ + triggerDecompose: (designId: string, locale: string) => void; cancelGeneration: () => void; retryLastPrompt: () => Promise; applyInlineComment: (comment: string) => Promise; @@ -1355,6 +1361,39 @@ export const useCodesignStore = create((set, get) => ({ }); }, + triggerDecompose: (designId, locale) => { + const s = get(); + // Branch 1: a generation is in flight — block & tell the user + if (s.isGenerating) { + get().pushToast({ + variant: 'info', + title: tr('sidebar.decomposeBusyTitle'), + description: tr('sidebar.decomposeBusyDescription'), + }); + return; + } + // Branch 2: no artifact yet — decompose has nothing to operate on + const designMessages = s.chatMessages.filter((m) => m.designId === designId); + if (!designMessages.some((m) => m.kind === 'assistant_text')) { + get().pushToast({ + variant: 'info', + title: tr('sidebar.decomposeUnavailableTitle'), + description: tr('sidebar.decomposeToUiKitDisabled'), + }); + return; + } + // Branch 3: fire the structured prompt + tell the user something is happening + get().pushToast({ + variant: 'info', + title: tr('sidebar.decomposeStartedTitle'), + description: tr('sidebar.decomposeStartedDescription'), + }); + void import('./hooks/decomposePrompt').then(({ pickDecomposePrompt }) => { + const prompt = pickDecomposePrompt(locale); + void get().sendPrompt({ prompt, silent: true }); + }); + }, + theme: readInitialTheme(), view: 'hub' as AppView, previousView: 'hub' as AppView, @@ -2407,6 +2446,38 @@ export const useCodesignStore = create((set, get) => ({ }, }); } + // Surface the visual judge's cost + pass count as a toast every time the + // verify_ui_kit_visual_parity tool completes — gives the operator a per- + // decompose cost row without needing a new dashboard. Reads defensively + // from result.details (the structured ParityReport) so a shape change in + // the tool contract degrades to silent rather than crashing the renderer. 
+ if (toolName === 'verify_ui_kit_visual_parity' && result && typeof result === 'object') { + const r = result as Record<string, unknown>; + const details = ( + typeof r['details'] === 'object' && r['details'] !== null + ? (r['details'] as Record<string, unknown>) + : r + ) as Record<string, unknown>; + const passCount = typeof details['passCount'] === 'number' ? details['passCount'] : 0; + const totalChecks = typeof details['totalChecks'] === 'number' ? details['totalChecks'] : 0; + const judgeCostUsd = + typeof details['judgeCostUsd'] === 'number' ? details['judgeCostUsd'] : 0; + const status = typeof details['status'] === 'string' ? details['status'] : 'unknown'; + if (totalChecks > 0) { + const ok = status === 'verified' || status === 'needs_review'; + get().pushToast({ + variant: ok ? 'success' : 'info', + title: tr('sidebar.decomposeJudgeResultTitle', { + passed: passCount, + total: totalChecks, + status, + }), + description: tr('sidebar.decomposeJudgeResultDescription', { + cost: `$${judgeCostUsd.toFixed(4)}`, + }), + }); + } + } }, async updateChatToolStatus({ designId, seq, status, result, durationMs, errorMessage }) { diff --git a/packages/core/src/agent.ts b/packages/core/src/agent.ts index 4e2e7a77..58d3ef4c 100644 --- a/packages/core/src/agent.ts +++ b/packages/core/src/agent.ts @@ -63,6 +63,7 @@ import { reasoningForModel } from './index.js'; import { type CoreLogger, NOOP_LOGGER } from './logger.js'; import { composeSystemPrompt } from './prompts/index.js'; import { makeDeclareTweakSchemaTool } from './tools/declare-tweak-schema.js'; +import { makeDecomposeToUiKitTool } from './tools/decompose-to-ui-kit.js'; import { type DoneRuntimeVerifier, makeDoneTool } from './tools/done.js'; import { type GenerateImageAssetFn, @@ -73,6 +74,12 @@ import { makeReadDesignSystemTool } from './tools/read-design-system.js'; import { makeReadUrlTool } from './tools/read-url.js'; import { makeSetTodosTool } from './tools/set-todos.js'; import { type TextEditorFsCallbacks, makeTextEditorTool } from
'./tools/text-editor.js'; +import { makeVerifyUiKitParityTool } from './tools/verify-ui-kit-parity.js'; +import { + type JudgeVisualParityFn, + type RenderUiKitFn, + makeVerifyUiKitVisualParityTool, +} from './tools/verify-ui-kit-visual-parity.js'; /** Local mirror of the assistant message shape that pi-agent-core emits (via * pi-ai). Declared here so this file does not take a direct dependency on @@ -692,6 +699,21 @@ export interface GenerateViaAgentDeps { * poster/background asset is worth generating. */ generateImageAsset?: GenerateImageAssetFn | undefined; + /** + * Optional vision-LLM judge for ui_kit visual parity. When provided alongside + * `renderUiKit`, the default toolset adds `verify_ui_kit_visual_parity`. + * Without these, the agent has the deterministic verifier only. + * + * The judge is host-injected (mirrors `generateImageAsset`) so the core + * package stays free of provider SDK imports. + */ + judgeVisualParity?: JudgeVisualParityFn | undefined; + /** + * Optional headless renderer that loads ui_kit HTML in a hidden window and + * returns a screenshot data URL. Paired with `judgeVisualParity` — both must + * be present for the visual-parity tool to register. 
+ */ + renderUiKit?: RenderUiKitFn | undefined; } /** @@ -782,6 +804,20 @@ export async function generateViaAgent( defaultTools.push( makeDoneTool(deps.fs, deps.runtimeVerify) as unknown as AgentTool, ); + defaultTools.push( + makeDecomposeToUiKitTool(deps.fs, log) as unknown as AgentTool, + ); + defaultTools.push( + makeVerifyUiKitParityTool(deps.fs, log) as unknown as AgentTool, + ); + defaultTools.push( + makeVerifyUiKitVisualParityTool( + deps.fs, + deps.renderUiKit, + deps.judgeVisualParity, + log, + ) as unknown as AgentTool, + ); } if (deps.generateImageAsset) { defaultTools.push( diff --git a/packages/core/src/index.ts b/packages/core/src/index.ts index 5a36df89..82655993 100644 --- a/packages/core/src/index.ts +++ b/packages/core/src/index.ts @@ -54,6 +54,18 @@ export { type GenerateImageAssetRequest, type GenerateImageAssetResult, } from './tools/generate-image-asset.js'; +export { + makeVerifyUiKitVisualParityTool, + STANDARD_VISUAL_PARITY_CHECKS, + visualParityStatusFromChecks, + type JudgeVisualParityFn, + type RenderUiKitFn, + type VisualParityCheck, + type VisualParityGap, + type VisualParityImageRef, + type VisualParityReport, + type VisualParityStatus, +} from './tools/verify-ui-kit-visual-parity.js'; export { makeReadDesignSystemTool, type ReadDesignSystemDetails, diff --git a/packages/core/src/tools/decompose-to-ui-kit.test.ts b/packages/core/src/tools/decompose-to-ui-kit.test.ts new file mode 100644 index 00000000..4e496828 --- /dev/null +++ b/packages/core/src/tools/decompose-to-ui-kit.test.ts @@ -0,0 +1,263 @@ +import { describe, expect, it } from 'vitest'; +import { makeDecomposeToUiKitTool } from './decompose-to-ui-kit.js'; +import type { TextEditorFsCallbacks } from './text-editor.js'; + +interface CreatedFile { + path: string; + content: string; +} + +function makeStubFs(): { fs: TextEditorFsCallbacks; created: CreatedFile[] } { + const created: CreatedFile[] = []; + const fs: TextEditorFsCallbacks = { + view: () => null, + create: (path: 
string, content: string) => { + created.push({ path, content }); + return { path }; + }, + strReplace: (path: string) => ({ path }), + insert: (path: string) => ({ path }), + listDir: () => [], + }; + return { fs, created }; +} + +describe('makeDecomposeToUiKitTool', () => { + it('emits all expected files for a typical decomposition', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + const result = await tool.execute( + 'test-call', + { + slug: 'saas-dashboard', + indexHtml: '...', + components: [ + { + name: 'MetricCard', + filename: 'MetricCard.tsx', + source: 'export const MetricCard = (p: { label: string }) => null;', + propsSummary: 'label, value, delta, trend', + }, + { + name: 'Sidebar', + filename: 'Sidebar.tsx', + source: 'export const Sidebar = () => null;', + }, + ], + tokens: [ + { name: '--color-brand', value: '#c96442', category: 'color' }, + { name: '--space-md', value: '16px', category: 'spacing' }, + ], + readmeNotes: 'Mock dashboard for testing.', + }, + undefined, + ); + + expect(result.details.componentCount).toBe(2); + expect(result.details.tokenCount).toBe(2); + expect(created.map((c) => c.path)).toEqual([ + 'ui_kits/saas-dashboard/index.html', + 'ui_kits/saas-dashboard/components/MetricCard.tsx', + 'ui_kits/saas-dashboard/components/Sidebar.tsx', + 'ui_kits/saas-dashboard/tokens.css', + 'ui_kits/saas-dashboard/manifest.json', + 'ui_kits/saas-dashboard/README.md', + ]); + }); + + it('sanitizes weird slugs to kebab-case ascii', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + await tool.execute( + 't', + { + slug: 'My Cool Design!! 
你好', + indexHtml: '', + components: [], + tokens: [], + }, + undefined, + ); + + expect(created[0]?.path).toMatch(/^ui_kits\/my-cool-design/); + }); + + it('falls back to "untitled" when slug is empty after sanitization', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + await tool.execute( + 't', + { + slug: '!!!', + indexHtml: '', + components: [], + tokens: [], + }, + undefined, + ); + + expect(created[0]?.path).toBe('ui_kits/untitled/index.html'); + }); + + it('manifest carries schemaVersion = 1', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + await tool.execute('t', { slug: 'x', indexHtml: '', components: [], tokens: [] }, undefined); + + const manifest = created.find((c) => c.path.endsWith('manifest.json')); + if (!manifest) throw new Error('manifest.json was not written'); + const parsed = JSON.parse(manifest.content); + expect(parsed.schemaVersion).toBe(1); + expect(parsed.generator).toBe('open-codesign decompose_to_ui_kit'); + expect(parsed.slug).toBe('x'); + expect(parsed.components).toEqual([]); + expect(typeof parsed.generatedAt).toBe('string'); + }); + + it('groups tokens.css by category in stable order', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + await tool.execute( + 't', + { + slug: 'x', + indexHtml: '', + components: [], + tokens: [ + { name: '--c1', value: 'red', category: 'color' }, + { name: '--s1', value: '8px', category: 'spacing' }, + { name: '--c2', value: 'blue', category: 'color' }, + ], + }, + undefined, + ); + + const tokensFile = created.find((c) => c.path.endsWith('tokens.css')); + if (!tokensFile) throw new Error('tokens.css was not written'); + const css = tokensFile.content; + expect(css).toContain('/* color */'); + expect(css).toContain('/* spacing */'); + expect(css).toContain('--c1: red;'); + expect(css).toContain('--c2: blue;'); + expect(css).toContain('--s1: 
8px;'); + expect(css.indexOf('/* color */')).toBeLessThan(css.indexOf('/* spacing */')); + }); + + it('prefixes raw token names with --', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + await tool.execute( + 't', + { + slug: 'x', + indexHtml: '', + components: [], + tokens: [{ name: 'color-brand', value: '#c96442', category: 'color' }], + }, + undefined, + ); + + const tokensFile = created.find((c) => c.path.endsWith('tokens.css')); + if (!tokensFile) throw new Error('tokens.css was not written'); + expect(tokensFile.content).toContain('--color-brand: #c96442;'); + }); + + it('README lists every component with its props summary', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + await tool.execute( + 't', + { + slug: 'x', + indexHtml: '', + components: [ + { + name: 'Card', + filename: 'Card.tsx', + source: '', + propsSummary: 'title, body', + }, + { name: 'Btn', filename: 'Btn.tsx', source: '' }, + ], + tokens: [], + readmeNotes: 'Test handoff notes.', + }, + undefined, + ); + + const readmeFile = created.find((c) => c.path.endsWith('README.md')); + if (!readmeFile) throw new Error('README.md was not written'); + const readme = readmeFile.content; + expect(readme).toContain('# x'); + expect(readme).toContain('**Card** (`components/Card.tsx`) — title, body'); + expect(readme).toContain('**Btn** (`components/Btn.tsx`)'); + expect(readme).toContain('Test handoff notes.'); + }); + + it('handles empty components + tokens gracefully', async () => { + const { fs, created } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + const result = await tool.execute( + 't', + { slug: 'empty', indexHtml: '', components: [], tokens: [] }, + undefined, + ); + + expect(result.details.componentCount).toBe(0); + expect(result.details.tokenCount).toBe(0); + expect(created).toHaveLength(4); + const readmeFile = created.find((c) => c.path.endsWith('README.md')); + if 
(!readmeFile) throw new Error('README.md was not written'); + expect(readmeFile.content).toContain('_(none extracted)_'); + }); + + it('returns AgentToolResult with summary text + structured details', async () => { + const { fs } = makeStubFs(); + const tool = makeDecomposeToUiKitTool(fs); + + const result = await tool.execute( + 't', + { + slug: 'x', + indexHtml: '', + components: [{ name: 'X', filename: 'X.tsx', source: '' }], + tokens: [{ name: '--a', value: 'b', category: 'color' }], + }, + undefined, + ); + + const first = result.content[0]; + expect(first?.type).toBe('text'); + if (first?.type !== 'text') throw new Error('expected text content'); + expect(first.text).toContain('Decomposed into 5 files'); + expect(first.text).toContain('Components: X'); + expect(first.text).toContain('Tokens: 1 extracted across 1 category'); + expect(result.details.outputPaths).toHaveLength(5); + }); + + it('no-ops file writes when fs is undefined but still returns details', async () => { + const tool = makeDecomposeToUiKitTool(undefined); + + const result = await tool.execute( + 't', + { + slug: 'no-fs', + indexHtml: '', + components: [{ name: 'X', filename: 'X.tsx', source: '' }], + tokens: [], + }, + undefined, + ); + + expect(result.details.componentCount).toBe(1); + expect(result.details.outputPaths).toEqual([]); + }); +}); diff --git a/packages/core/src/tools/decompose-to-ui-kit.ts b/packages/core/src/tools/decompose-to-ui-kit.ts new file mode 100644 index 00000000..30cba02b --- /dev/null +++ b/packages/core/src/tools/decompose-to-ui-kit.ts @@ -0,0 +1,225 @@ +/** + * decompose-to-ui-kit — agent tool that emits a ui_kits// folder + * from the current artifact's source. Decomposition itself is prompt-driven + * (the model decides component boundaries from its read of index.html); + * this tool only persists the structured plan to the virtual fs. 
+ * + * Output: + * ui_kits//index.html page-level HTML, parity-matched to source + * ui_kits//components/*.tsx reusable components, typed props + * ui_kits//tokens.css extracted color/spacing/typography vars + * ui_kits//manifest.json schemaVersion + component / token summary + * ui_kits//README.md handoff notes for downstream coding agent + */ + +import type { AgentTool, AgentToolResult } from '@mariozechner/pi-agent-core'; +import { Type } from '@sinclair/typebox'; +import { type CoreLogger, NOOP_LOGGER } from '../logger.js'; +import type { TextEditorFsCallbacks } from './text-editor.js'; + +const ComponentSpec = Type.Object({ + name: Type.String(), + filename: Type.String(), + source: Type.String(), + propsSummary: Type.Optional(Type.String()), +}); + +const TokenSpec = Type.Object({ + name: Type.String(), + value: Type.String(), + category: Type.Union([ + Type.Literal('color'), + Type.Literal('spacing'), + Type.Literal('typography'), + Type.Literal('radius'), + Type.Literal('shadow'), + ]), +}); + +const DecomposeParams = Type.Object({ + slug: Type.String(), + indexHtml: Type.String(), + components: Type.Array(ComponentSpec), + tokens: Type.Array(TokenSpec), + readmeNotes: Type.Optional(Type.String()), +}); + +type ComponentSpecT = { + name: string; + filename: string; + source: string; + propsSummary?: string; +}; + +type TokenCategory = 'color' | 'spacing' | 'typography' | 'radius' | 'shadow'; + +type TokenSpecT = { + name: string; + value: string; + category: TokenCategory; +}; + +export interface DecomposeDetails { + slug: string; + componentCount: number; + tokenCount: number; + outputPaths: string[]; +} + +const SCHEMA_VERSION = 1; + +function sanitizeSlug(slug: string): string { + return ( + slug + .toLowerCase() + .replace(/[^a-z0-9-]/g, '-') + .replace(/-+/g, '-') + .replace(/^-|-$/g, '') || 'untitled' + ); +} + +function renderTokensCss(tokens: TokenSpecT[]): string { + const grouped: Record<string, TokenSpecT[]> = {}; + for (const t of tokens) { + const 
bucket =
grouped[t.category] ?? []; + bucket.push(t); + grouped[t.category] = bucket; + } + const lines: string[] = [':root {']; + for (const [category, items] of Object.entries(grouped)) { + lines.push(` /* ${category} */`); + for (const t of items) { + const name = t.name.startsWith('--') ? t.name : `--${t.name}`; + lines.push(` ${name}: ${t.value};`); + } + } + lines.push('}'); + return `${lines.join('\n')}\n`; +} + +function renderManifest(slug: string, components: ComponentSpecT[], tokens: TokenSpecT[]): string { + const tokenCounts: Record<string, number> = {}; + for (const t of tokens) { + tokenCounts[t.category] = (tokenCounts[t.category] ?? 0) + 1; + } + const manifest = { + schemaVersion: SCHEMA_VERSION, + generator: 'open-codesign decompose_to_ui_kit', + generatedAt: new Date().toISOString(), + slug, + components: components.map((c) => c.name), + tokens: tokenCounts, + }; + return `${JSON.stringify(manifest, null, 2)}\n`; +} + +function renderReadme( + slug: string, + components: ComponentSpecT[], + notes: string | undefined, +): string { + const lines: string[] = [ + `# ${slug}`, + '', + 'Generated by open-codesign `decompose_to_ui_kit`. Drop this folder into your project; a coding agent (Claude Code, Cursor, etc.) can read this README + manifest.json and integrate the components into your existing codebase.', + '', + '## Files', + '- `index.html` — page-level HTML, visually parity-matched to the source', + '- `components/*.tsx` — reusable React components with typed props', + '- `tokens.css` — design tokens (colors, spacing, typography, etc.)', + '- `manifest.json` — machine-readable summary', + '', + '## Components', + ]; + if (components.length === 0) { + lines.push('_(none extracted)_'); + } else { + for (const c of components) { + const summary = c.propsSummary ?
` — ${c.propsSummary}` : ''; + lines.push(`- **${c.name}** (\`components/${c.filename}\`)${summary}`); + } + } + if (notes?.trim()) { + lines.push('', '## Notes', notes.trim()); + } + return `${lines.join('\n')}\n`; +} + +export function makeDecomposeToUiKitTool( + fs: TextEditorFsCallbacks | undefined, + logger: CoreLogger = NOOP_LOGGER, +): AgentTool { + return { + name: 'decompose_to_ui_kit', + label: 'Decompose to UI Kit', + description: + 'Decompose the current artifact into a ui_kits// folder structure ' + + '(index.html + components/*.tsx + tokens.css + manifest.json + README.md), ' + + 'shaped for handoff to a downstream coding agent (Claude Code, Cursor). ' + + 'Use ONLY when the user explicitly asks for a coding-agent-ready bundle. ' + + 'Decomposition is one cognitive act: emit ALL components + tokens + manifest in a single tool call. ' + + 'Do NOT split across multiple calls — the model is bad at coordinating state across tool invocations.', + parameters: DecomposeParams, + async execute(_toolCallId, params, _signal): Promise<AgentToolResult<DecomposeDetails>> { + const slug = sanitizeSlug(params.slug); + const base = `ui_kits/${slug}`; + const written: string[] = []; + const startedAt = Date.now(); + const components = params.components as ComponentSpecT[]; + const tokens = params.tokens as TokenSpecT[]; + + if (fs !== undefined) { + await fs.create(`${base}/index.html`, params.indexHtml); + written.push(`${base}/index.html`); + + for (const comp of components) { + const path = `${base}/components/${comp.filename}`; + await fs.create(path, comp.source); + written.push(path); + } + + const tokensCss = renderTokensCss(tokens); + await fs.create(`${base}/tokens.css`, tokensCss); + written.push(`${base}/tokens.css`); + + const manifest = renderManifest(slug, components, tokens); + await fs.create(`${base}/manifest.json`, manifest); + written.push(`${base}/manifest.json`); + + const readme = renderReadme(slug, components, params.readmeNotes); + await 
fs.create(`${base}/README.md`, readme);
+ written.push(`${base}/README.md`); + } + + logger.info('[decompose_to_ui_kit] step=ok', { + slug, + components: components.length, + tokens: tokens.length, + files: written.length, + ms: Date.now() - startedAt, + }); + + const componentNames = + components.length > 0 ? components.map((c) => c.name).join(', ') : '(none)'; + const categoryCount = new Set(tokens.map((t) => t.category)).size; + + return { + content: [ + { + type: 'text', + text: + `Decomposed into ${written.length} files under ${base}/. ` + + `Components: ${componentNames}. ` + + `Tokens: ${tokens.length} extracted across ${categoryCount} categor${categoryCount === 1 ? 'y' : 'ies'}.`, + }, + ], + details: { + slug, + componentCount: components.length, + tokenCount: tokens.length, + outputPaths: written, + }, + }; + }, + }; +} diff --git a/packages/core/src/tools/verify-ui-kit-parity.test.ts b/packages/core/src/tools/verify-ui-kit-parity.test.ts new file mode 100644 index 00000000..3e04db0c --- /dev/null +++ b/packages/core/src/tools/verify-ui-kit-parity.test.ts @@ -0,0 +1,230 @@ +import { describe, expect, it } from 'vitest'; +import type { TextEditorFsCallbacks } from './text-editor.js'; +import { makeVerifyUiKitParityTool } from './verify-ui-kit-parity.js'; + +function makeFs(files: Record<string, string>): TextEditorFsCallbacks { + return { + view: (path: string) => { + const content = files[path]; + if (content === undefined) return null; + return { content, numLines: content.split('\n').length }; + }, + create: (path: string) => ({ path }), + strReplace: (path: string) => ({ path }), + insert: (path: string) => ({ path }), + listDir: () => [], + }; +} + +const SOURCE_HTML = ` + + +

Acme Analytics

+
+
+
MRR$12,400
+
Churn2.1%
+
NPS62
+
ARR$148k
+
+
+ + + + + + +
UserAction
alicesignup
boblogin
+
+
© Acme
+ + +`; + +const HIGH_PARITY_DECOMP = ` + + +

Acme Analytics

+
+
+
MRR$12,400
+
Churn2.1%
+
NPS62
+
ARR$148k
+
+
+ + + + + + +
UserAction
alicesignup
boblogin
+
+
© Acme
+ + +`; + +const LOW_PARITY_DECOMP = ` + + +
just one tile
+ + +`; + +const TOKENS_CSS = ` +:root { + /* color */ + --color-brand: #c96442; + --color-text-muted: #6b6661; + /* spacing */ + --space-md: 16px; + --text-sm: 14px; +} +`; + +describe('makeVerifyUiKitParityTool', () => { + it('reports OK when decomposed faithfully mirrors the source', async () => { + const fs = makeFs({ + 'index.html': SOURCE_HTML, + 'ui_kits/x/index.html': HIGH_PARITY_DECOMP, + 'ui_kits/x/tokens.css': TOKENS_CSS, + }); + const tool = makeVerifyUiKitParityTool(fs); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('ok'); + expect(result.details.parityScore).toBeGreaterThanOrEqual(0.85); + expect(result.details.signals.elementCountParity).toBeGreaterThan(0.9); + }); + + it('reports needs_iteration when decomposed is structurally thin', async () => { + const fs = makeFs({ + 'index.html': SOURCE_HTML, + 'ui_kits/x/index.html': LOW_PARITY_DECOMP, + 'ui_kits/x/tokens.css': '', + }); + const tool = makeVerifyUiKitParityTool(fs); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('needs_iteration'); + expect(result.details.parityScore).toBeLessThan(0.85); + expect(result.details.gaps.length).toBeGreaterThan(0); + }); + + it('reports missing artifacts when decomposed file is absent', async () => { + const fs = makeFs({ 'index.html': SOURCE_HTML }); + const tool = makeVerifyUiKitParityTool(fs); + const result = await tool.execute('t', { slug: 'never-decomposed' }, undefined); + expect(result.details.status).toBe('needs_iteration'); + expect(result.details.parityScore).toBe(0); + expect(result.details.gaps[0]?.message).toContain('missing artifact'); + }); + + it('flags hardcoded values not present in tokens.css', async () => { + const fs = makeFs({ + 'index.html': SOURCE_HTML, + 'ui_kits/x/index.html': HIGH_PARITY_DECOMP, + 'ui_kits/x/tokens.css': ':root { --space-md: 16px; }', // missing colors + 14px + }); + const tool = 
makeVerifyUiKitParityTool(fs); + const result = await tool.execute('t', { slug: 'x' }, undefined); + const tokenGaps = result.details.gaps.filter((g) => g.kind === 'token'); + expect(tokenGaps.length).toBeGreaterThan(0); + expect(tokenGaps.some((g) => g.message.includes('#c96442'))).toBe(true); + expect(result.details.signals.tokenCoverage).toBeLessThan(1); + }); + + it('returns 0 score gracefully when fs is undefined', async () => { + const tool = makeVerifyUiKitParityTool(undefined); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('needs_iteration'); + expect(result.details.parityScore).toBe(0); + }); + + it('reports element parity > 0.9 when source and decomposed are byte-identical', async () => { + const fs = makeFs({ + 'index.html': SOURCE_HTML, + 'ui_kits/x/index.html': SOURCE_HTML, + 'ui_kits/x/tokens.css': TOKENS_CSS, + }); + const tool = makeVerifyUiKitParityTool(fs); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.signals.elementCountParity).toBe(1); + expect(result.details.signals.visibleTextCoverage).toBe(1); + }); + + // Regression for CodeQL js/bad-tag-filter (HIGH) flagged on PR #241. + // The pre-fix regex `<\/script>` literally missed end tags with trailing + // whitespace or attributes, both of which HTML5 parsers tolerate. A crafted + // source could leak script body text into the visible-word coverage check. + // This test asserts that a script body using the previously-vulnerable + // syntax does NOT contribute words to the source vocabulary, so a faithful + // decomposition that omits the script still scores OK. + it('strips script + style bodies even with attrs/whitespace in end tags', async () => { + const sourceWithCraftedScript = ` + + +

Acme Analytics

+
+
MRR$12,400
+
+ + + + + + + `; + const cleanDecomp = ` + + +

Acme Analytics

+
+
MRR$12,400
+
+ + + `; + const fs = makeFs({ + 'index.html': sourceWithCraftedScript, + 'ui_kits/x/index.html': cleanDecomp, + 'ui_kits/x/tokens.css': '', + }); + const tool = makeVerifyUiKitParityTool(fs); + const result = await tool.execute('t', { slug: 'x' }, undefined); + const first = result.content[0]; + if (first?.type !== 'text') throw new Error('expected text'); + // None of the leaked tokens should appear in the parity report — if they + // did, it would mean stripTags failed to remove the script/style bodies + // and they leaked into the visible-word vocabulary. + expect(first.text.toLowerCase()).not.toContain('secretleakedtoken'); + expect(first.text.toLowerCase()).not.toContain('anotherleakedtoken'); + expect(first.text.toLowerCase()).not.toContain('secretleakedcss'); + expect(first.text.toLowerCase()).not.toContain('uppercaseleakedtoken'); + }); + + it('summary text reflects pass/fail status', async () => { + const fsOk = makeFs({ + 'index.html': SOURCE_HTML, + 'ui_kits/x/index.html': SOURCE_HTML, + 'ui_kits/x/tokens.css': TOKENS_CSS, + }); + const fsFail = makeFs({ + 'index.html': SOURCE_HTML, + 'ui_kits/x/index.html': LOW_PARITY_DECOMP, + 'ui_kits/x/tokens.css': '', + }); + const tool = makeVerifyUiKitParityTool(fsOk); + const okResult = await tool.execute('t', { slug: 'x' }, undefined); + const okFirst = okResult.content[0]; + if (okFirst?.type !== 'text') throw new Error('expected text'); + expect(okFirst.text).toContain('Parity OK'); + + const failTool = makeVerifyUiKitParityTool(fsFail); + const failResult = await failTool.execute('t', { slug: 'x' }, undefined); + const failFirst = failResult.content[0]; + if (failFirst?.type !== 'text') throw new Error('expected text'); + expect(failFirst.text).toContain('needs iteration'); + }); +}); diff --git a/packages/core/src/tools/verify-ui-kit-parity.ts b/packages/core/src/tools/verify-ui-kit-parity.ts new file mode 100644 index 00000000..33018221 --- /dev/null +++ b/packages/core/src/tools/verify-ui-kit-parity.ts @@ 
-0,0 +1,347 @@ +/** + * verify-ui-kit-parity — agent tool that compares a decomposed + * ui_kits// output against the source artifact and emits a + * deterministic parity report (no LLM judge, no variance). + * + * Three signals, all computed from the raw HTML / CSS strings: + * 1. Element count parity — structural tag distribution + * 2. Visible text coverage — % of source words present in decomposed + * 3. Token coverage — % of source hex/px/rem values present in tokens.css + * + * The agent calls this after decompose_to_ui_kit. If parityScore < 0.85, + * the prompt instructs the agent to re-call decompose_to_ui_kit with + * adjustments that address the explicit `gaps` list. + * + * Pattern mirrors done.ts: a deterministic checker run during the agent's + * own turn so it can self-correct before calling done. + */ + +import type { AgentTool, AgentToolResult } from '@mariozechner/pi-agent-core'; +import { Type } from '@sinclair/typebox'; +import { type CoreLogger, NOOP_LOGGER } from '../logger.js'; +import type { TextEditorFsCallbacks } from './text-editor'; + +const VerifyParams = Type.Object({ + slug: Type.String(), + sourcePath: Type.Optional(Type.String()), +}); + +export interface ParityGap { + kind: 'element' | 'text' | 'token'; + message: string; +} + +export interface ParityReport { + parityScore: number; + status: 'ok' | 'needs_iteration'; + signals: { + elementCountParity: number; + visibleTextCoverage: number; + tokenCoverage: number; + }; + counts: { + sourceElements: number; + decomposedElements: number; + sourceWords: number; + decomposedWordsMatched: number; + sourceTokens: number; + tokenCssMatched: number; + }; + gaps: ParityGap[]; +} + +const PARITY_THRESHOLD = 0.85; +const STRUCTURAL_TAGS = [ + 'div', + 'section', + 'article', + 'aside', + 'nav', + 'header', + 'footer', + 'main', + 'button', + 'input', + 'select', + 'textarea', + 'form', + 'a', + 'span', + 'h1', + 'h2', + 'h3', + 'h4', + 'h5', + 'h6', + 'ul', + 'ol', + 'li', + 'table', + 'tr', + 
'td', + 'th', + 'thead', + 'tbody', + 'img', + 'svg', + 'label', + 'p', +] as const; + +function countElements(html: string): Record<string, number> { + const counts: Record<string, number> = {}; + for (const tag of STRUCTURAL_TAGS) { + const re = new RegExp(`<${tag}\\b`, 'gi'); + counts[tag] = (html.match(re) ?? []).length; + } + return counts; +} + +function totalElementCount(counts: Record<string, number>): number { + return Object.values(counts).reduce((a, b) => a + b, 0); +} + +function elementParityScore( + source: Record<string, number>, + decomposed: Record<string, number>, +): { score: number; gaps: ParityGap[] } { + const gaps: ParityGap[] = []; + let totalDelta = 0; + let totalSource = 0; + for (const tag of STRUCTURAL_TAGS) { + const s = source[tag] ?? 0; + const d = decomposed[tag] ?? 0; + const delta = Math.abs(s - d); + totalDelta += delta; + totalSource += s; + if (s > 0 && delta / Math.max(s, 1) > 0.5) { + gaps.push({ + kind: 'element', + message: `${s} <${tag}> in source, ${d} in decomposed (delta ${delta})`, + }); + } + } + if (totalSource === 0) return { score: 1, gaps }; + const score = Math.max(0, 1 - totalDelta / totalSource); + return { score, gaps }; +} + +function stripTags(html: string): string { + // Close-tag patterns mirror the opening pattern's `\b[^>]*` so we strip + // end tags that carry attributes or trailing whitespace. HTML5 parsers + // tolerate both inside end tags (silently ignored), and CodeQL's + // "Bad HTML filtering regexp" rule flags the literal `</script>` form + // because it leaves bodies behind when the end tag is not written exactly + // that way. The `\b` after the tag name prevents over-matching longer tag + // names that merely begin with `script` or `style`. + return html + .replace(/<script\b[^>]*>[\s\S]*?<\/script\b[^>]*>/gi, ' ') + .replace(/<style\b[^>]*>[\s\S]*?<\/style\b[^>]*>/gi, ' ') + .replace(/<[^>]+>/g, ' '); +} + +function visibleWords(html: string): Set<string> { + const text = stripTags(html).toLowerCase(); + const words = text.match(/[a-z][a-z0-9]{2,}/g) ??
[]; + return new Set(words); +} + +function textCoverage( + sourceWords: Set<string>, + decomposedWords: Set<string>, +): { score: number; matched: number } { + if (sourceWords.size === 0) return { score: 1, matched: 0 }; + let matched = 0; + for (const w of sourceWords) { + if (decomposedWords.has(w)) matched += 1; + } + return { score: matched / sourceWords.size, matched }; +} + +function extractTokenValues(text: string): Set<string> { + const values = new Set<string>(); + // Hex colors (#fff, #ffffff, #ffffffff) + for (const m of text.match(/#[0-9a-fA-F]{3,8}\b/g) ?? []) { + values.add(m.toLowerCase()); + } + // rgb / rgba + for (const m of text.match(/rgba?\([^)]+\)/g) ?? []) { + values.add(m.replace(/\s+/g, '').toLowerCase()); + } + // length values: digits + unit + for (const m of text.match(/\b\d+(?:\.\d+)?(?:px|rem|em)\b/g) ?? []) { + values.add(m.toLowerCase()); + } + return values; +} + +function tokenCoverageScore( + source: Set<string>, + tokensCss: string, +): { score: number; matched: number; gaps: ParityGap[] } { + if (source.size === 0) return { score: 1, matched: 0, gaps: [] }; + const tokenValues = extractTokenValues(tokensCss); + const gaps: ParityGap[] = []; + let matched = 0; + for (const v of source) { + if (tokenValues.has(v)) { + matched += 1; + } else { + gaps.push({ + kind: 'token', + message: `value ${v} appears in source but missing from tokens.css`, + }); + } + } + // Cap gaps to first 8 to keep the agent's context small + return { score: matched / source.size, matched, gaps: gaps.slice(0, 8) }; +} + +export function makeVerifyUiKitParityTool( + fs: TextEditorFsCallbacks | undefined, + logger: CoreLogger = NOOP_LOGGER, +): AgentTool<typeof VerifyParams, ParityReport> { + return { + name: 'verify_ui_kit_parity', + label: 'Verify UI Kit parity', + description: + 'Compare a decomposed ui_kits// output against the source artifact ' + + 'and emit a deterministic parity report (element-count, visible-text-coverage, ' + + 'token-coverage). Call this AFTER decompose_to_ui_kit. 
If parityScore is below ' + + '0.85, re-call decompose_to_ui_kit with adjustments that address the gaps list ' + + 'returned by this tool. No LLM judge involved — the result is reproducible.', + parameters: VerifyParams, + async execute(_toolCallId, params, _signal): Promise<AgentToolResult<ParityReport>> { + const sourcePath = params.sourcePath ?? 'index.html'; + const decomposedPath = `ui_kits/${params.slug}/index.html`; + const tokensPath = `ui_kits/${params.slug}/tokens.css`; + + if (!fs) { + const empty: ParityReport = { + parityScore: 0, + status: 'needs_iteration', + signals: { + elementCountParity: 0, + visibleTextCoverage: 0, + tokenCoverage: 0, + }, + counts: { + sourceElements: 0, + decomposedElements: 0, + sourceWords: 0, + decomposedWordsMatched: 0, + sourceTokens: 0, + tokenCssMatched: 0, + }, + gaps: [{ kind: 'element', message: 'fs not available; cannot read artifacts to verify' }], + }; + return { + content: [{ type: 'text', text: 'Parity verifier could not access the virtual fs.' }], + details: empty, + }; + } + + const source = fs.view(sourcePath); + const decomposed = fs.view(decomposedPath); + const tokens = fs.view(tokensPath); + + if (!source || !decomposed) { + const which: string[] = []; + if (!source) which.push(sourcePath); + if (!decomposed) which.push(decomposedPath); + const report: ParityReport = { + parityScore: 0, + status: 'needs_iteration', + signals: { + elementCountParity: 0, + visibleTextCoverage: 0, + tokenCoverage: 0, + }, + counts: { + sourceElements: 0, + decomposedElements: 0, + sourceWords: 0, + decomposedWordsMatched: 0, + sourceTokens: 0, + tokenCssMatched: 0, + }, + gaps: [ + { + kind: 'element', + message: `missing artifact(s): ${which.join(', ')}. 
Re-call decompose_to_ui_kit first.`, + }, + ], + }; + return { + content: [ + { + type: 'text', + text: `Cannot verify: missing ${which.join(', ')}.`, + }, + ], + details: report, + }; + } + + const sourceCounts = countElements(source.content); + const decomposedCounts = countElements(decomposed.content); + const elementResult = elementParityScore(sourceCounts, decomposedCounts); + + const sourceWords = visibleWords(source.content); + const decomposedWords = visibleWords(decomposed.content); + const textResult = textCoverage(sourceWords, decomposedWords); + + const sourceTokenValues = extractTokenValues(source.content); + const tokenResult = tokenCoverageScore(sourceTokenValues, tokens?.content ?? ''); + + // Weighted: structure 0.4, text 0.3, tokens 0.3 + const parityScore = + elementResult.score * 0.4 + textResult.score * 0.3 + tokenResult.score * 0.3; + + const status: ParityReport['status'] = + parityScore >= PARITY_THRESHOLD ? 'ok' : 'needs_iteration'; + const gaps: ParityGap[] = [...elementResult.gaps, ...tokenResult.gaps]; + if (textResult.score < 0.7) { + gaps.push({ + kind: 'text', + message: `only ${textResult.matched}/${sourceWords.size} unique source words present in decomposed index.html`, + }); + } + + const report: ParityReport = { + parityScore: Number(parityScore.toFixed(3)), + status, + signals: { + elementCountParity: Number(elementResult.score.toFixed(3)), + visibleTextCoverage: Number(textResult.score.toFixed(3)), + tokenCoverage: Number(tokenResult.score.toFixed(3)), + }, + counts: { + sourceElements: totalElementCount(sourceCounts), + decomposedElements: totalElementCount(decomposedCounts), + sourceWords: sourceWords.size, + decomposedWordsMatched: textResult.matched, + sourceTokens: sourceTokenValues.size, + tokenCssMatched: tokenResult.matched, + }, + gaps, + }; + + logger.info('[verify_ui_kit_parity] step=ok', { + slug: params.slug, + parityScore: report.parityScore, + status, + gaps: gaps.length, + }); + + const summary = + status === 
'ok' + ? `Parity OK (${report.parityScore}). Element ${report.signals.elementCountParity}, text ${report.signals.visibleTextCoverage}, token ${report.signals.tokenCoverage}.` + : `Parity needs iteration (${report.parityScore} < ${PARITY_THRESHOLD}). Element ${report.signals.elementCountParity}, text ${report.signals.visibleTextCoverage}, token ${report.signals.tokenCoverage}. ${gaps.length} gap(s) to address.`; + + return { + content: [{ type: 'text', text: summary }], + details: report, + }; + }, + }; +} diff --git a/packages/core/src/tools/verify-ui-kit-visual-parity.test.ts b/packages/core/src/tools/verify-ui-kit-visual-parity.test.ts new file mode 100644 index 00000000..748bbf1e --- /dev/null +++ b/packages/core/src/tools/verify-ui-kit-visual-parity.test.ts @@ -0,0 +1,295 @@ +import { describe, expect, it, vi } from 'vitest'; +import type { TextEditorFsCallbacks } from './text-editor.js'; +import { + type JudgeVisualParityFn, + type RenderUiKitFn, + STANDARD_VISUAL_PARITY_CHECKS, + makeVerifyUiKitVisualParityTool, + visualParityStatusFromChecks, +} from './verify-ui-kit-visual-parity.js'; + +function makeFs(files: Record<string, string>): TextEditorFsCallbacks { + return { + view: (path: string) => { + const content = files[path]; + if (content === undefined) return null; + return { content, numLines: content.split('\n').length }; + }, + create: (path: string) => ({ path }), + strReplace: (path: string) => ({ path }), + insert: (path: string) => ({ path }), + listDir: () => [], + }; +} + +const SOURCE_DATA_URL = + 'data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAQAAAC1HAwCAAAAC0lEQVR42mNkAAIAAAoAAv/lxKUAAAAASUVORK5CYII='; + +const ALL_PASS = STANDARD_VISUAL_PARITY_CHECKS.map((c) => ({ + id: c.id, + passed: true, + reason: 'Matches source.', +})); + +const ALL_FAIL = STANDARD_VISUAL_PARITY_CHECKS.map((c) => ({ + id: c.id, + passed: false, + reason: 'Does not match source.', +})); + +const PARTIAL = STANDARD_VISUAL_PARITY_CHECKS.map((c, i) => ({ + id: c.id, + passed: i % 
3 !== 0, // 8 of 12 pass = 0.667 + reason: i % 3 === 0 ? 'Failed: specific element wrong.' : 'Passed.', +})); + +describe('visualParityStatusFromChecks', () => { + it('verified when all pass', () => { + expect(visualParityStatusFromChecks(12, 12)).toBe('verified'); + }); + it('needs_review when ratio >= 0.85 (e.g. 11/12 or 10/12 boundary)', () => { + expect(visualParityStatusFromChecks(11, 12)).toBe('needs_review'); + // 0.83 = below 0.85 threshold so falls into needs_iteration + expect(visualParityStatusFromChecks(10, 12)).toBe('needs_iteration'); + }); + it('needs_iteration in the 0.6-0.85 band', () => { + // 8/12 = 0.667 -> needs_iteration + expect(visualParityStatusFromChecks(8, 12)).toBe('needs_iteration'); + // 9/12 = 0.75 -> needs_iteration + expect(visualParityStatusFromChecks(9, 12)).toBe('needs_iteration'); + }); + it('failed below 0.6', () => { + // 7/12 = 0.583 -> failed (below 0.6 boundary) + expect(visualParityStatusFromChecks(7, 12)).toBe('failed'); + expect(visualParityStatusFromChecks(6, 12)).toBe('failed'); + expect(visualParityStatusFromChecks(0, 12)).toBe('failed'); + }); + it('failed when totalChecks is 0', () => { + expect(visualParityStatusFromChecks(0, 0)).toBe('failed'); + }); +}); + +describe('makeVerifyUiKitVisualParityTool', () => { + it('returns unavailable when fs is missing', async () => { + const tool = makeVerifyUiKitVisualParityTool(undefined, undefined, undefined); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('unavailable'); + expect(result.details.summary).toContain('virtual fs not provided'); + }); + + it('returns unavailable when renderUiKit callback is missing', async () => { + const fs = makeFs({}); + const tool = makeVerifyUiKitVisualParityTool(fs, undefined, undefined); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('unavailable'); + expect(result.details.summary).toContain('renderUiKit'); + }); + + 
it('returns unavailable when judgeVisualParity callback is missing', async () => { + const fs = makeFs({}); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, undefined); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('unavailable'); + expect(result.details.summary).toContain('judgeVisualParity'); + }); + + it('returns needs_iteration when decomposed artifact is missing', async () => { + const fs = makeFs({}); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: ALL_PASS, + summary: 'ok', + costUsd: 0, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'never-decomposed' }, undefined); + expect(result.details.status).toBe('needs_iteration'); + expect(result.details.summary).toContain('missing artifact'); + }); + + it('returns verified when all 12 checks pass', async () => { + const fs = makeFs({ + 'ui_kits/x/index.html': '', + 'source.png': SOURCE_DATA_URL, + }); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: ALL_PASS, + summary: 'High parity.', + costUsd: 0.05, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('verified'); + expect(result.details.passCount).toBe(12); + expect(result.details.failCount).toBe(0); + expect(result.details.parityScore).toBe(1); + expect(result.details.judgeCostUsd).toBe(0.05); + }); + + it('returns failed when all 12 checks fail', async () => { + const fs = 
makeFs({ + 'ui_kits/x/index.html': '', + 'source.png': SOURCE_DATA_URL, + }); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: ALL_FAIL, + summary: 'Low parity.', + costUsd: 0.05, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('failed'); + expect(result.details.passCount).toBe(0); + expect(result.details.failCount).toBe(12); + expect(result.details.parityScore).toBe(0); + }); + + it('returns needs_iteration when 8/12 checks pass (parityScore 0.67)', async () => { + const fs = makeFs({ + 'ui_kits/x/index.html': '', + 'source.png': SOURCE_DATA_URL, + }); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: PARTIAL, + summary: 'Partial parity.', + costUsd: 0.05, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('needs_iteration'); + expect(result.details.passCount).toBe(8); + expect(result.details.failCount).toBe(4); + expect(result.details.parityScore).toBeCloseTo(0.667, 2); + }); + + it('always emits all 12 standard checks even when judge skips some', async () => { + const fs = makeFs({ + 'ui_kits/x/index.html': '', + 'source.png': SOURCE_DATA_URL, + }); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + // Judge only answers 3 of the 12 checks + const partialChecks = ALL_PASS.slice(0, 3); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: partialChecks, + summary: 'Partial response.', + costUsd: 0.05, + }); + const tool = 
makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.totalChecks).toBe(12); + expect(result.details.passCount).toBe(3); + expect(result.details.failCount).toBe(9); + // The 9 missing checks default to failed with explicit reason + const unanswered = result.details.checks.filter((c) => + c.reason.includes('judge did not answer'), + ); + expect(unanswered.length).toBe(9); + }); + + it('reports unavailable when source image is missing', async () => { + const fs = makeFs({ 'ui_kits/x/index.html': '' }); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: ALL_PASS, + summary: '', + costUsd: 0, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('unavailable'); + expect(result.details.summary).toContain('source image not found'); + }); + + it('reports unavailable when source image is not a data URL', async () => { + const fs = makeFs({ + 'ui_kits/x/index.html': '', + 'source.png': '', + }); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: ALL_PASS, + summary: '', + costUsd: 0, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'x' }, undefined); + expect(result.details.status).toBe('unavailable'); + expect(result.details.summary).toContain('must be a data URL'); + }); + + it('threads abort signal to renderUiKit and judgeVisualParity', async () => { + const controller = new AbortController(); + const fs = makeFs({ + 'ui_kits/x/index.html': '', + 'source.png': 
SOURCE_DATA_URL, + }); + const renderUiKit = vi.fn().mockResolvedValue({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity = vi.fn().mockResolvedValue({ + checks: ALL_PASS, + summary: '', + costUsd: 0, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + await tool.execute('t', { slug: 'x' }, controller.signal); + expect(renderUiKit).toHaveBeenCalledWith(expect.any(String), controller.signal); + expect(judgeVisualParity).toHaveBeenCalledWith( + expect.objectContaining({ dataUrl: SOURCE_DATA_URL }), + expect.objectContaining({ dataUrl: SOURCE_DATA_URL }), + controller.signal, + ); + }); + + it('every check carries a reason string (HONEST_SCORES rule)', async () => { + const fs = makeFs({ + 'ui_kits/x/index.html': '', + 'source.png': SOURCE_DATA_URL, + }); + const renderUiKit: RenderUiKitFn = async () => ({ + dataUrl: SOURCE_DATA_URL, + mediaType: 'image/png', + }); + const judgeVisualParity: JudgeVisualParityFn = async () => ({ + checks: PARTIAL, + summary: '', + costUsd: 0, + }); + const tool = makeVerifyUiKitVisualParityTool(fs, renderUiKit, judgeVisualParity); + const result = await tool.execute('t', { slug: 'x' }, undefined); + for (const check of result.details.checks) { + expect(check.reason).toBeTruthy(); + expect(check.reason.length).toBeGreaterThan(0); + } + }); +}); diff --git a/packages/core/src/tools/verify-ui-kit-visual-parity.ts b/packages/core/src/tools/verify-ui-kit-visual-parity.ts new file mode 100644 index 00000000..89dc7d05 --- /dev/null +++ b/packages/core/src/tools/verify-ui-kit-visual-parity.ts @@ -0,0 +1,359 @@ +/** + * verify-ui-kit-visual-parity — vision-LLM judge with BOOLEAN-per-dimension scoring. + * + * Pairs with the deterministic verify_ui_kit_parity. 
The deterministic verifier + * reads HTML/CSS strings and computes structural signals; this one renders the + * decomposed index.html, screenshots it, and asks a multimodal model to compare + * against the source artifact. + * + * Scoring methodology — BOOLEAN per dimension, NOT floating-point: + * - Same 12 standard checks on every run (across layout / color / typography + * / content / components dimensions) + * - Each check is yes/no with an explicit reason + * - parityScore is DERIVED as passCount / totalChecks, never LLM-arbitrary + * - status is BOUNDED ENUM thresholded from passCount (verified / + * needs_review / needs_iteration / failed) + * + * This matches NodeBench's established rule patterns: + * - .claude/rules/pipeline_operational_standard.md (10-gate boolean catalog) + * - .claude/rules/eval_flywheel.md (boolean evaluators, no hardcoded floors) + * - .claude/rules/agent_run_verdict_workflow.md (bounded enum verdicts) + * + * Why boolean over floating-point: lower judge variance, every failure has a + * clear actionable reason, score is derived deterministically not LLM-arbitrary, + * comparable across runs/models/time. + * + * Pattern mirrors generate-image-asset.ts: host injects renderUiKit + the + * underlying judge call. Without injections the tool returns + * status="unavailable" and the agent falls back to verify_ui_kit_parity. + */ + +import type { AgentTool, AgentToolResult } from '@mariozechner/pi-agent-core'; +import { Type } from '@sinclair/typebox'; +import { type CoreLogger, NOOP_LOGGER } from '../logger.js'; +import type { TextEditorFsCallbacks } from './text-editor'; + +const VerifyParams = Type.Object({ + slug: Type.String(), + /** Path to the source artifact image inside the design's virtual fs. + * Defaults to 'source.png' if the host stored a copy there. Source must be + * stored as a data URL string. 
*/ + sourceImagePath: Type.Optional(Type.String()), +}); + +export interface VisualParityGap { + kind: 'layout' | 'color' | 'typography' | 'spacing' | 'content' | 'component' | 'other'; + severity: 'high' | 'medium' | 'low'; + description: string; + suggestion: string; +} + +export interface VisualParityCheck { + id: string; + dimension: 'layout' | 'color' | 'typography' | 'content' | 'components'; + question: string; + passed: boolean; + reason: string; +} + +export type VisualParityStatus = + | 'verified' + | 'needs_review' + | 'needs_iteration' + | 'failed' + | 'unavailable'; + +export interface VisualParityReport { + parityScore: number; + status: VisualParityStatus; + summary: string; + checks: VisualParityCheck[]; + passCount: number; + failCount: number; + totalChecks: number; + reasoning?: string; + gaps: VisualParityGap[]; + judgeCostUsd?: number; + judgeLatencyMs?: number; +} + +/** The 12 standard checks. Identical set across all judge runs so reports are + * comparable across models, runs, and time. Same as headless pipeline. 
*/ +export const STANDARD_VISUAL_PARITY_CHECKS: Array<{ + id: string; + dimension: VisualParityCheck['dimension']; + question: string; +}> = [ + { + id: 'layout.column_count_match', + dimension: 'layout', + question: 'Does the candidate have the same number of major columns / regions as the source?', + }, + { + id: 'layout.region_positions_match', + dimension: 'layout', + question: + 'Are major regions (header / sidebar / main / right rail / footer) in the same positions as the source?', + }, + { + id: 'layout.hierarchy_preserved', + dimension: 'layout', + question: 'Is the visual hierarchy (heading > subhead > body > footer) preserved?', + }, + { + id: 'color.accent_color_match', + dimension: 'color', + question: + 'Is the primary accent color visually equivalent to the source (same hue family, similar saturation)?', + }, + { + id: 'color.palette_consistency_match', + dimension: 'color', + question: + 'Does the overall palette feel match the source (warm/cool, saturated/muted, contrast level)?', + }, + { + id: 'typography.font_family_match', + dimension: 'typography', + question: + 'Does the font family character (serif / sans / mono) match the source for each text role?', + }, + { + id: 'typography.heading_hierarchy_match', + dimension: 'typography', + question: 'Are heading weights and sizes stepped similarly (H1 vs body vs caption)?', + }, + { + id: 'content.text_labels_present', + dimension: 'content', + question: + 'Are all visible text labels from the source present in the candidate (nav items, headings, button text)?', + }, + { + id: 'content.all_sections_present', + dimension: 'content', + question: + 'Are all distinct sections from the source present in the candidate (not just one missing region)?', + }, + { + id: 'components.repeated_pattern_count_match', + dimension: 'components', + question: + 'Does the candidate have approximately the same count of repeated patterns (cards / list items / nav links) as the source?', + }, + { + id: 
'components.component_structure_match', + dimension: 'components', + question: + 'Do repeated components have the same internal anatomy (header + body + footer pieces)?', + }, + { + id: 'components.icon_motif_match', + dimension: 'components', + question: 'Are icons / glyphs in the same style (line vs filled, monochrome vs colored)?', + }, +]; + +/** Status thresholds, deterministic from passCount / totalChecks. + * Mirrors agent_run_verdict_workflow.md verdict tiers. */ +export function visualParityStatusFromChecks( + passCount: number, + totalChecks: number, +): VisualParityStatus { + if (totalChecks === 0) return 'failed'; + const ratio = passCount / totalChecks; + if (ratio === 1) return 'verified'; + if (ratio >= 0.85) return 'needs_review'; + if (ratio >= 0.6) return 'needs_iteration'; + return 'failed'; +} + +export interface VisualParityImageRef { + dataUrl: string; + mediaType: 'image/png' | 'image/jpeg' | 'image/webp' | 'image/gif'; +} + +export type RenderUiKitFn = ( + indexHtml: string, + signal?: AbortSignal, +) => Promise<VisualParityImageRef>; + +/** The host-injected judge call. Returns the boolean per-check answers; this + * tool normalizes/derives parityScore + status deterministically. */ +export type JudgeVisualParityFn = ( + source: VisualParityImageRef, + candidate: VisualParityImageRef, + signal?: AbortSignal, +) => Promise<{ + reasoning?: string; + checks: Array<{ id: string; passed: boolean; reason: string }>; + summary: string; + gaps?: VisualParityGap[]; + costUsd: number; +}>; + +function unavailableReport(reason: string): VisualParityReport { + return { + parityScore: 0, + status: 'unavailable', + summary: `Visual parity judge unavailable: ${reason}. 
Fall back to verify_ui_kit_parity (deterministic).`, + checks: [], + passCount: 0, + failCount: 0, + totalChecks: 0, + gaps: [ + { + kind: 'other', + severity: 'low', + description: reason, + suggestion: + 'Ensure the host injected judgeVisualParity + renderUiKit callbacks (same pattern as generate_image_asset).', + }, + ], + }; +} + +function normalizeChecks( + reported: Array<{ id: string; passed: boolean; reason: string }>, +): VisualParityCheck[] { + const reportedById = new Map<string, { passed: boolean; reason: string }>(); + for (const c of reported) { + if (c?.id && typeof c.id === 'string') { + reportedById.set(c.id, { passed: c.passed === true, reason: c.reason ?? '(no reason)' }); + } + } + // Always emit ALL standard checks in canonical order. Missing checks default + // to failed with explicit "judge did not answer": failure-of-judge counts as + // failure-of-parity (HONEST_SCORES rule from agentic_reliability.md). + return STANDARD_VISUAL_PARITY_CHECKS.map((std) => { + const r = reportedById.get(std.id); + return { + id: std.id, + dimension: std.dimension, + question: std.question, + passed: r?.passed ?? false, + reason: r?.reason ?? '(judge did not answer this check)', + }; + }); +} + +function parseMediaType(dataUrl: string): VisualParityImageRef['mediaType'] { + const m = dataUrl.match(/^data:(image\/(?:png|jpeg|webp|gif))/); + return (m?.[1] as VisualParityImageRef['mediaType']) ?? 'image/png'; +} + +export function makeVerifyUiKitVisualParityTool( + fs: TextEditorFsCallbacks | undefined, + renderUiKit: RenderUiKitFn | undefined, + judgeVisualParity: JudgeVisualParityFn | undefined, + logger: CoreLogger = NOOP_LOGGER, +): AgentTool<typeof VerifyParams, VisualParityReport> { + return { + name: 'verify_ui_kit_visual_parity', + label: 'Verify UI Kit visual parity', + description: + 'Render the decomposed ui_kits//index.html in a hidden window, take ' + + 'a screenshot, and ask a multimodal model 12 boolean parity questions ' + + '(across layout / color / typography / content / components dimensions). 
' +
+      'Each question is yes/no with a clear reason. parityScore = passCount / 12 ' +
+      '(derived deterministically). Status is bounded enum (verified / ' +
+      'needs_review / needs_iteration / failed). Call this AFTER decompose_to_ui_kit ' +
+      'and AFTER verify_ui_kit_parity (deterministic). If any check fails, re-call ' +
+      'decompose_to_ui_kit addressing the failed-check reasons. If this tool ' +
+      'returns status="unavailable", proceed with the deterministic verifier alone.',
+    parameters: VerifyParams,
+    async execute(_toolCallId, params, signal): Promise> {
+      const startedAt = Date.now();
+      const decomposedPath = `ui_kits/${params.slug}/index.html`;
+
+      if (!fs) {
+        const report = unavailableReport('virtual fs not provided');
+        return { content: [{ type: 'text', text: report.summary }], details: report };
+      }
+      if (!renderUiKit) {
+        const report = unavailableReport('host has not injected renderUiKit callback');
+        return { content: [{ type: 'text', text: report.summary }], details: report };
+      }
+      if (!judgeVisualParity) {
+        const report = unavailableReport('host has not injected judgeVisualParity callback');
+        return { content: [{ type: 'text', text: report.summary }], details: report };
+      }
+
+      const decomposed = fs.view(decomposedPath);
+      if (!decomposed) {
+        const report = {
+          ...unavailableReport(`missing artifact: ${decomposedPath}`),
+          status: 'needs_iteration' as const,
+        };
+        return { content: [{ type: 'text', text: report.summary }], details: report };
+      }
+
+      const sourcePath = params.sourceImagePath ?? 
'source.png';
+      const sourceFile = fs.view(sourcePath);
+      if (!sourceFile) {
+        const report = unavailableReport(
+          `source image not found at ${sourcePath} — agent must persist source.png before calling`,
+        );
+        return { content: [{ type: 'text', text: report.summary }], details: report };
+      }
+      if (!sourceFile.content.startsWith('data:')) {
+        const report = unavailableReport(
+          `source image at ${sourcePath} must be a data URL (got prefix: ${sourceFile.content.slice(0, 40)}...)`,
+        );
+        return { content: [{ type: 'text', text: report.summary }], details: report };
+      }
+
+      const sourceImg: VisualParityImageRef = {
+        dataUrl: sourceFile.content,
+        mediaType: parseMediaType(sourceFile.content),
+      };
+
+      logger.info('[verify_ui_kit_visual_parity] step=render', { slug: params.slug });
+      const candidateImg = await renderUiKit(decomposed.content, signal);
+
+      logger.info('[verify_ui_kit_visual_parity] step=judge', { slug: params.slug });
+      const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
+
+      const checks = normalizeChecks(judgeResult.checks ?? []);
+      const passCount = checks.filter((c) => c.passed).length;
+      const failCount = checks.length - passCount;
+      const totalChecks = checks.length;
+      const parityScore = totalChecks === 0 ? 0 : passCount / totalChecks;
+      const status = visualParityStatusFromChecks(passCount, totalChecks);
+      const totalLatencyMs = Date.now() - startedAt;
+
+      const report: VisualParityReport = {
+        parityScore: Number(parityScore.toFixed(3)),
+        status,
+        summary: judgeResult.summary ?? 'No summary provided.',
+        checks,
+        passCount,
+        failCount,
+        totalChecks,
+        ...(judgeResult.reasoning ? { reasoning: judgeResult.reasoning } : {}),
+        gaps: (judgeResult.gaps ?? 
[]).slice(0, 8),
+        judgeCostUsd: judgeResult.costUsd,
+        judgeLatencyMs: totalLatencyMs,
+      };
+
+      logger.info('[verify_ui_kit_visual_parity] step=ok', {
+        slug: params.slug,
+        parityScore: report.parityScore,
+        status,
+        passCount,
+        failCount,
+        ms: totalLatencyMs,
+      });
+
+      const summary =
+        status === 'verified' || status === 'needs_review'
+          ? `Visual parity ${status} (${passCount}/${totalChecks} checks passed). ${report.summary}`
+          : `Visual parity ${status} (${passCount}/${totalChecks} checks passed). ${failCount} failed check(s) to address.`;
+
+      return {
+        content: [{ type: 'text', text: summary }],
+        details: report,
+      };
+    },
+  };
+}
diff --git a/packages/i18n/src/locales/en.json b/packages/i18n/src/locales/en.json
index 72493250..22ecee82 100644
--- a/packages/i18n/src/locales/en.json
+++ b/packages/i18n/src/locales/en.json
@@ -132,6 +132,15 @@
     "linkDesignSystemRepo": "Link design system repo",
     "refreshDesignSystemRepo": "Refresh design system repo",
     "referenceUrl": "Reference URL",
+    "decomposeToUiKit": "Decompose to UI Kit",
+    "decomposeToUiKitDisabled": "Generate a design first, then decompose it",
+    "decomposeBusyTitle": "Generation in progress",
+    "decomposeBusyDescription": "Wait for the current run to finish, then try again.",
+    "decomposeUnavailableTitle": "Nothing to decompose yet",
+    "decomposeStartedTitle": "Decomposing to UI Kit…",
+    "decomposeStartedDescription": "Reading the artifact, extracting components and tokens, then verifying parity. Watch the agent panel for progress.",
+    "decomposeJudgeResultTitle": "Visual judge: {{passed}}/{{total}} checks passed · {{status}}",
+    "decomposeJudgeResultDescription": "Self-verify spent {{cost}}. 
Open BENCHMARKS.md for the boolean rubric methodology.",
     "attachedFiles": "Attached files",
     "removeFile": "Remove {{name}}",
     "activeDesignSystem": "Active design system",
diff --git a/packages/i18n/src/locales/zh-CN.json b/packages/i18n/src/locales/zh-CN.json
index c13b628a..3d1a63de 100644
--- a/packages/i18n/src/locales/zh-CN.json
+++ b/packages/i18n/src/locales/zh-CN.json
@@ -132,6 +132,15 @@
     "linkDesignSystemRepo": "关联设计系统仓库",
     "refreshDesignSystemRepo": "刷新设计系统仓库",
     "referenceUrl": "参考链接",
+    "decomposeToUiKit": "拆解为 UI Kit",
+    "decomposeToUiKitDisabled": "先生成一个设计, 然后再拆解",
+    "decomposeBusyTitle": "正在生成中",
+    "decomposeBusyDescription": "等当前生成跑完再试。",
+    "decomposeUnavailableTitle": "暂时还没有可拆解的内容",
+    "decomposeStartedTitle": "正在拆解为 UI Kit…",
+    "decomposeStartedDescription": "读 artifact, 抽组件和 token, 自检视觉一致性。看 agent panel 的实时进度。",
+    "decomposeJudgeResultTitle": "视觉判定: {{passed}}/{{total}} 项通过 · {{status}}",
+    "decomposeJudgeResultDescription": "自检花了 {{cost}}。BENCHMARKS.md 里有完整的 boolean rubric 方法论。",
     "attachedFiles": "已附加文件",
     "removeFile": "移除 {{name}}",
     "activeDesignSystem": "当前设计系统",
diff --git a/website/public/demos/decompose-iter-reel.gif b/website/public/demos/decompose-iter-reel.gif
new file mode 100644
index 00000000..65e2b601
Binary files /dev/null and b/website/public/demos/decompose-iter-reel.gif differ
diff --git a/website/public/demos/decompose-iter-reel.mp4 b/website/public/demos/decompose-iter-reel.mp4
new file mode 100644
index 00000000..12de83a0
Binary files /dev/null and b/website/public/demos/decompose-iter-reel.mp4 differ
diff --git a/website/public/screenshots/decompose-to-ui-kit.png b/website/public/screenshots/decompose-to-ui-kit.png
new file mode 100644
index 00000000..c84c5d0c
Binary files /dev/null and b/website/public/screenshots/decompose-to-ui-kit.png differ
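
Reviewer note: a minimal sketch of how the status tiers in `visualParityStatusFromChecks` land on a 12-check rubric, since the 0.85 and 0.6 cut-offs are easier to judge as pass counts. The threshold logic is copied from this diff. The `VisualParityStatus` union and the `statusTable` helper are assumptions for illustration only; the real type is declared outside the lines shown here.

```typescript
// Assumed shape of the status union; the actual declaration is not part of the visible diff.
type VisualParityStatus =
  | 'verified'
  | 'needs_review'
  | 'needs_iteration'
  | 'failed'
  | 'unavailable';

// Same threshold logic as visualParityStatusFromChecks in the diff.
function statusFromChecks(passCount: number, totalChecks: number): VisualParityStatus {
  if (totalChecks === 0) return 'failed';
  const ratio = passCount / totalChecks;
  if (ratio === 1) return 'verified';
  if (ratio >= 0.85) return 'needs_review';
  if (ratio >= 0.6) return 'needs_iteration';
  return 'failed';
}

interface StatusRow {
  passCount: number;
  parityScore: number;
  status: VisualParityStatus;
}

// Hypothetical helper: one row per possible pass count, highest first.
function statusTable(totalChecks: number): StatusRow[] {
  const rows: StatusRow[] = [];
  for (let passCount = totalChecks; passCount >= 0; passCount--) {
    const raw = totalChecks === 0 ? 0 : passCount / totalChecks;
    rows.push({
      passCount,
      // Rounded to three decimals, as the tool does before reporting.
      parityScore: Number(raw.toFixed(3)),
      status: statusFromChecks(passCount, totalChecks),
    });
  }
  return rows;
}

// With 12 checks: 12 passes is verified, 11 is needs_review,
// 8 to 10 is needs_iteration, and 7 or fewer is failed.
const table = statusTable(12);
console.log(table.map((r) => `${r.passCount}/12 ${r.parityScore} ${r.status}`).join('\n'));
```

On a 12-question rubric, `needs_review` is reachable only at exactly 11/12, because 10/12 (about 0.833) falls under the 0.85 cut-off. If more checks are added later, the pass counts per tier shift while the ratios stay fixed.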