
feat(core): add decompose-to-ui-kit + boolean parity verifiers (Phase 1 of #225)#241

Open
HomenShum wants to merge 11 commits into OpenCoworkAI:main from HomenShum:feat/decompose-to-ui-kit

Conversation

@HomenShum

Summary

Phase 1 of #225: a single-image → componentized ui_kit/ decomposition pipeline that emits a coding-agent-ready bundle, plus deterministic and vision verifiers that self-check parity using a 12-question boolean rubric and re-iterate on gaps. It reuses the existing userImages plumbing (PR #193) and adds three new agent tools that mirror existing patterns (done.ts / generate-image-asset.ts). The flow ends in the chat sidebar with a one-click trigger that fires a structured prompt, walks the agent through decompose → verify → reconcile → done, and surfaces per-decompose cost as a toast. No new prod deps, no SQLite schema change; output lives in memory and surfaces via the Files panel.

This PR closes the Phase 1 part of #225 only. Phase 2 (gpt-image-2 generation in the loop) and Phase 3 (multi-page flow), which I committed to in the issue thread, are intentionally deferred.

Type of change

  • New feature

Linked issue

Closes #225 (Phase 1 only — Phase 2/3 deferred per my comment)

What's in here

3 new agent tools in packages/core/src/tools/:

  1. decompose-to-ui-kit.ts — orchestrator. Takes a source image (from chat context) + design brief, emits ui_kits/<slug>/{index.html, components/*.tsx, tokens.css, manifest.json, README.md} to the virtual FS. Output carries schemaVersion: 1 so downstream coding agents (Claude Code, Cursor) can evolve safely.
  2. verify-ui-kit-parity.ts — deterministic verifier. 3 signals: element-count parity, visible-text coverage, token coverage. Returns a ParityReport with passCount/totalChecks derived score (no LLM in the loop, no floats).
  3. verify-ui-kit-visual-parity.ts — vision-LLM judge wrapper. Takes a host-injected judgeVisualParity callback, runs a 12-check boolean rubric across 5 dimensions (layout / color / typography / content / components), returns parityScore = passCount / totalChecks and a bounded-enum status (verified | needs_review | needs_iteration | failed | unavailable).

Host wiring in apps/desktop/src/main/:

  • render-ui-kit.ts — offscreen BrowserWindow.capturePage() for the rendered ui_kit
  • judge-visual-parity.ts — vision-judge prompt builder + LLM dispatcher using the existing complete() provider abstraction
  • Both injected into the agent via agent.ts deps interface, mirroring how generate_image_asset was wired

Renderer:

  • AddMenu.tsx — new "Decompose to UI Kit" entry, disabled when no artifact / generation in flight
  • Sidebar.tsx — new triggerDecompose(designId, locale) action wired to the menu item
  • store.ts — 3-branch toast feedback (busy / unavailable / started) + per-tool-call cost row when the visual judge resolves
  • hooks/decomposePrompt.ts — locale-aware (EN/ZH) structured prompt that walks the agent through decompose → verify → reconcile → iterate (max 2) → done with HONEST cost summary

Tests — full vitest coverage in *.test.ts next to each tool:

  • decompose-to-ui-kit.test.ts (263 LOC)
  • verify-ui-kit-parity.test.ts (180 LOC)
  • verify-ui-kit-visual-parity.test.ts (295 LOC)

i18n — 9 new keys × EN + ZH for the menu entry, toast titles/descriptions, and cost row.

Design decisions

Boolean rubric, not floats. Every visual parity check is {passed: boolean}, derived parityScore = passCount / totalChecks. The status field is a bounded enum derived from thresholds (100% → verified, ≥85% → needs_review, ≥60% → needs_iteration, <60% → failed). No LLM-fabricated confidence floats, no scoring inflation. Aligns with the project's HONEST_SCORES precedent (done.ts's verified: boolean field).
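The derivation above can be sketched in a few lines. This is a minimal sketch: the type and function names are my own for illustration, not the PR's actual identifiers; only the thresholds and the passCount/totalChecks rule come from the description.

```typescript
// Illustrative sketch only: names are assumptions, not the PR's exact contract.
type ParityStatus = "verified" | "needs_review" | "needs_iteration" | "failed";

interface ParityCheck {
  passed: boolean;
  reason: string;
}

function deriveParity(checks: ParityCheck[]): { parityScore: number; status: ParityStatus } {
  const passCount = checks.filter((c) => c.passed).length;
  const parityScore = checks.length === 0 ? 0 : passCount / checks.length;
  // Status is a bounded enum derived from fixed thresholds, never an LLM-reported float.
  const status: ParityStatus =
    parityScore === 1 ? "verified"
    : parityScore >= 0.85 ? "needs_review"
    : parityScore >= 0.6 ? "needs_iteration"
    : "failed";
  return { parityScore, status };
}
```

Because the score is derived rather than reported, two runs that pass the same checks always land on the same status, which is what makes cross-run comparison meaningful.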

Host-injected callbacks, not framework lock-in. verify-ui-kit-visual-parity.ts doesn't import any LLM SDK or any Electron API. It takes RenderUiKitFn and JudgeVisualParityFn as deps. If the host doesn't inject them (e.g. a future headless CLI), the tool returns status: 'unavailable' honestly instead of crashing. Mirrors how generate_image_asset is keyed on deps.generateImageAsset.
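A minimal sketch of that injection pattern, with hypothetical signatures (the PR's real RenderUiKitFn / JudgeVisualParityFn also thread an AbortSignal, and the real status enum is richer than shown here):

```typescript
// Hypothetical shapes for illustration; not the PR's exact types.
type ParityCheckResult = { passed: boolean; reason: string };
type RenderUiKitFn = (kitHtml: string) => Promise<string>; // screenshot as a data URL
type JudgeVisualParityFn = (
  sourceImage: string,
  candidateImage: string,
) => Promise<ParityCheckResult[]>;

interface VisualParityDeps {
  renderUiKit?: RenderUiKitFn;
  judgeVisualParity?: JudgeVisualParityFn;
}

function hasVisualDeps(deps: VisualParityDeps): deps is Required<VisualParityDeps> {
  return typeof deps.renderUiKit === "function" && typeof deps.judgeVisualParity === "function";
}

async function runVisualParity(deps: VisualParityDeps, kitHtml: string, sourceImage: string) {
  // Honest degradation: a host that injects neither callback (e.g. a headless CLI)
  // gets a structured 'unavailable' result instead of a crash.
  if (!hasVisualDeps(deps)) {
    return { status: "unavailable" as const, checks: [] as ParityCheckResult[] };
  }
  const candidate = await deps.renderUiKit(kitHtml);
  const checks = await deps.judgeVisualParity(sourceImage, candidate);
  return { status: "judged" as const, checks };
}
```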

In-memory output via Files panel, no schema bump. Per my open binary in the issue thread, this PR ships option (a): the ui_kits/<slug>/ lands in the design's virtual FS, surfaces in the existing Files panel, and uses the existing ZIP export for handoff to a coding agent. No SQLite migration, smallest blast radius, consistent with how polishPrompt.ts's second-pass mutates only in-memory state.

schemaVersion: 1 on the manifest. Downstream consumers (Claude Code, Cursor) need a stable contract. Adding fields requires no version bump; renaming or removing fields requires schemaVersion: 2 and a parallel-emit window.
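A sketch of the consumer-side contract this implies. Every field beyond schemaVersion is hypothetical; only the versioning rule (additive fields free, breaking changes bump) comes from the PR.

```typescript
// Hypothetical consumer-side guard; field names beyond schemaVersion are illustrative.
interface UiKitManifest {
  schemaVersion: number;
  [extra: string]: unknown; // additive fields never force a version bump
}

function canConsume(manifest: UiKitManifest, maxSupported = 1): boolean {
  // schemaVersion 2 signals renamed/removed fields, so a v1-only consumer must bail
  // rather than silently misread the bundle.
  return Number.isInteger(manifest.schemaVersion) && manifest.schemaVersion <= maxSupported;
}
```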

Anti-hallucination guardrails

The deterministic verifier (verify-ui-kit-parity.ts) checks visible-text coverage on the emitted ui_kit against the source brief: if the agent dropped any text content, it fails BEFORE the LLM judge runs. This catches data hallucination cheaply. The LLM judge then handles only semantic-quality dimensions (visual hierarchy, color harmony, typography pairing, etc.).

Cost surfacing

Every verify_ui_kit_visual_parity resolution pushes a toast with passCount/totalChecks · status · $cost.NNNN. Reads defensively from result.details so future contract drift degrades silently rather than crashing the renderer. The done tool's prompt-driven summary additionally requires the agent to report total run cost, per the HONEST_STATUS precedent.
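The defensive read can be sketched like this (a minimal sketch; the field names match those described in this PR, but the helper and its exact formatting are hypothetical):

```typescript
// Hypothetical helper illustrating the defensive bracket-access read of result.details.
function readJudgeToast(details: unknown): string | undefined {
  if (typeof details !== "object" || details === null) return undefined;
  const d = details as Record<string, unknown>;
  const passCount = typeof d["passCount"] === "number" ? d["passCount"] : undefined;
  const totalChecks = typeof d["totalChecks"] === "number" ? d["totalChecks"] : undefined;
  const status = typeof d["status"] === "string" ? d["status"] : undefined;
  const cost = typeof d["judgeCostUsd"] === "number" ? d["judgeCostUsd"] : undefined;
  if (passCount === undefined || totalChecks === undefined || status === undefined) {
    return undefined; // contract drift degrades to "no toast", not a renderer crash
  }
  const base = `${passCount}/${totalChecks} · ${status}`;
  return cost !== undefined ? `${base} · $${cost.toFixed(4)}` : base;
}
```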

Checklist

  • I read docs/VISION.md, docs/PRINCIPLES.md, and CLAUDE.md before starting
  • Commits are signed with DCO (git commit -s)
  • pnpm lint && pnpm typecheck && pnpm test passes locally (1026 tests pass on this branch as of d6f3a00)
  • Added/updated tests for the change (738 LOC across 3 new test files)
  • Added a changeset (pnpm changeset) — see .changeset/decompose-to-ui-kit.md
  • Updated docs if behavior changed — BENCHMARKS.md (new), README.md + README.zh-CN.md (Decompose to UI Kit feature card + hero PNG + iter-reel GIF)

Dependency additions (if any)

None. All three new tools use only @mariozechner/pi-agent-core's AgentTool factory pattern that's already a prod dep.

Screenshots / recordings (UI changes)

Side-by-side hero — source vs agent-emitted ui_kit (e2e-opus-final run, parityScore 0.90):

Decompose to UI Kit hero

4-frame reconcile reel from the e2e-nodebench-iter run (iter-0 → iter-1 with honest score drift 0.82 → 0.78 — boolean rubric exposes the regression instead of hiding it):

Iter reel

MP4 version for higher fidelity.

Live-recorded session demo (real Electron app, no stitching) — recording in progress, will edit this PR description when the GIF is ready. ETA same day.

Cross-tier benchmarks

BENCHMARKS.md at repo root has the full methodology + run-by-run real-data results across model tiers (Opus, Pro+Pro+iterate, Kimi+Gemini3, NodeBench iter), reproducibility instructions, honest non-claims, and research citations (WebDevJudge, Prometheus-Vision, Trust-but-Verify ICCV 2025).

Run                          Decompose             Judge                 parityScore  Gaps surfaced
e2e-opus-final               claude-opus-4-1       claude-opus-4-1       0.90         4
e2e-nodebench-iter (iter-0)  gemini-3-pro-preview  gemini-3-pro-preview  0.82         6
e2e-nodebench-iter (iter-1)  gemini-3-pro-preview  gemini-3-pro-preview  0.78         5
e2e-bank-kimi-gemini3        kimi-k2.6             gemini-3-pro-preview  0.78         8
e2e-nodebench-B              kimi-k2.6             gemini-3-pro-preview  0.60         7

Note the iter-0 → iter-1 regression on the same source: agent fixed some gaps but introduced new layout drift. The boolean rubric exposes this honestly rather than fudging the score upward. This is the intended behavior, not a bug.

Scope discipline notes

Branch state at PR open

  • 9 commits ahead of upstream/main
  • 11 commits behind (mostly chore(deps) bumps including pi-agent-core 0.67.68 → 0.70.2; my branch is on 0.67.68)
  • Will rebase against latest main on request — wanted to open the PR with the as-built state for clarity first. The pi-agent-core 0.70.2 bump may require small adjustments to the new tools' AgentTool shape; I'll handle that in the rebase pass.

Why this is ready to review now

  • Real cross-tier benchmarks in BENCHMARKS.md, not synthetic
  • Visual proof embedded above (hero + reel)
  • Test coverage matches existing tools
  • Pattern conformance: every new file mirrors an existing precedent
  • Deliberate scope: closes Phase 1 of the issue cleanly, defers the rest visibly

Looking forward to feedback. Happy to address structural concerns first before iterating on smaller polish.

Adds a new agent tool that decomposes the current artifact into a
ui_kits/<slug>/ folder structure (index.html + components/*.tsx +
tokens.css + manifest.json + README.md), shaped for handoff to a
downstream coding agent (Claude Code, Cursor, etc.).

- New tool factory in packages/core/src/tools/decompose-to-ui-kit.ts
  follows the existing factory + AgentTool + typebox pattern from
  done.ts and generate-image-asset.ts.
- New "Decompose to UI Kit" item in chat AddMenu, gated on having
  a current design and not currently generating.
- New triggerDecompose store action + decomposePrompt.ts hook,
  mirroring the polishPrompt.ts pattern but user-triggered (no
  auto-fire). Sends the prompt as a silent follow-up so the chat
  reads as one continuous run.
- Output carries schemaVersion: 1 in manifest.json so downstream
  consumers can evolve safely.
- Decomposition is prompt-driven (model identifies repeated DOM
  subtrees and emits the structured plan); the tool just persists
  to the virtual fs in a single atomic call.
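Slug sanitization with a fallback (mentioned in the tests below) might look roughly like this. A hedged sketch: the real tool's rules and fallback value may differ.

```typescript
// Illustrative slugifier; the actual sanitization rules in decompose-to-ui-kit.ts may differ.
function toSlug(name: string, fallback = "ui-kit"): string {
  const slug = name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse runs of non-alphanumerics into single dashes
    .replace(/^-+|-+$/g, "");    // trim leading/trailing dashes
  return slug.length > 0 ? slug : fallback;
}
```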

i18n keys added in en + zh-CN. No new dependencies.

Closes the Phase 1 ask in OpenCoworkAI#225.

10 new unit tests cover: typical decomposition, slug sanitization,
fallback slug, manifest schemaVersion, token CSS grouping, token
name normalization, README rendering, empty inputs, return shape,
and undefined-fs handling.

Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (1026 desktop + 252 core tests pass)

Signed-off-by: homen <hshum2018@gmail.com>
Adds a deterministic parity verifier the agent calls AFTER decompose_to_ui_kit
and uses to self-correct before calling done. No LLM judge involved — the
parity report is reproducible from the raw HTML / CSS strings.

Three signals comparing source index.html vs ui_kits/<slug>/index.html and
ui_kits/<slug>/tokens.css:
  1. Element count parity — structural tag distribution (div/section/button/
     h1-h6/table/etc.), weighted 0.4 in overall score
  2. Visible text coverage — % of unique source words present in decomposed,
     weighted 0.3
  3. Token coverage — % of unique hex / rgb / px / rem values from source
     captured in tokens.css (gaps capped at 8 to keep agent context small),
     weighted 0.3
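The weighted combination the three signals feed into can be sketched as follows (signal computation simplified to plain 0–1 ratios; the function name is hypothetical, only the weights come from this commit message):

```typescript
// Hypothetical combiner; weights 0.4 / 0.3 / 0.3 are from the commit message above.
function overallParity(
  elementParity: number, // structural tag-distribution match, 0–1
  textCoverage: number,  // share of unique source words present, 0–1
  tokenCoverage: number, // share of source hex/rgb/px/rem values in tokens.css, 0–1
): number {
  return 0.4 * elementParity + 0.3 * textCoverage + 0.3 * tokenCoverage;
}
```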

Returns a ParityReport with an explicit gaps list. If parityScore < 0.85
the prompt instructs the agent to re-call decompose_to_ui_kit with
adjustments addressing the specific gaps, then re-verify. Iterates at most
twice to avoid loops; final done() summary honestly states the achieved
parityScore + remaining gaps.
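The iterate-or-stop decision described above is prompt-driven in the PR; sketched as host-side logic with hypothetical names, it amounts to:

```typescript
// Hypothetical guard mirroring the prompt's verify-and-iterate rule.
const MAX_ITERATIONS = 2;

function shouldIterate(parityScore: number, iteration: number): boolean {
  // Below the 0.85 bar, re-call decompose_to_ui_kit with gap-specific adjustments,
  // but at most twice so a stubborn source can't loop forever.
  return parityScore < 0.85 && iteration < MAX_ITERATIONS;
}
```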

Pattern mirrors done.ts: deterministic checker run during the agent's own
turn so it can self-correct before declaring the artifact complete.

7 new unit tests cover: high-parity faithful decomposition, low-parity thin
decomposition, missing artifact handling, hardcoded values absent from
tokens.css, undefined-fs fallback, byte-identical input, and pass/fail
summary text.

decomposePrompt.ts updated for both EN and ZH locales to walk the agent
through the verify-and-iterate loop explicitly.

Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (252 core + all other packages, 17 new tests across
  decompose-to-ui-kit + verify-ui-kit-parity)

Signed-off-by: homen <hshum2018@gmail.com>
…ension scoring

Adds the vision-LLM judge counterpart to the existing deterministic
verify_ui_kit_parity. Renders the decomposed ui_kits/<slug>/index.html
in a hidden window via the host-injected renderUiKit callback, screenshots
it, and asks a multimodal model to compare against the source artifact via
the host-injected judgeVisualParity callback.

Scoring is BOOLEAN-per-dimension, NOT floating-point — matches NodeBench's
established rule patterns (pipeline_operational_standard.md 10-gate boolean
catalog, eval_flywheel.md boolean evaluators, agent_run_verdict_workflow.md
bounded enum verdicts).

The judge answers 12 standard checks on every run (across layout / color /
typography / content / components dimensions), each yes/no with an explicit
reason string. The aggregate parityScore is DERIVED as passCount/totalChecks
(never LLM-arbitrary). Status is bounded enum (verified / needs_review /
needs_iteration / failed) thresholded deterministically:
  - 100% passed -> verified
  - >=85% passed -> needs_review
  - >=60% passed -> needs_iteration
  - <60% passed -> failed

Why boolean over floating-point: lower judge variance (yes/no is harder to
fudge than a number), every failure has a clear actionable reason, score is
derived not LLM-arbitrary, comparable across runs/models/time. Failure-of-judge
counts as failure-of-parity (HONEST_SCORES rule from agentic_reliability.md).

Pattern mirrors generate-image-asset.ts: host injects two callbacks
(renderUiKit, judgeVisualParity). Without them the tool returns
status="unavailable" and the agent falls back to the deterministic verifier.

decomposePrompt.ts (EN + ZH) updated to call BOTH verifiers and reconcile
gaps before deciding to iterate or finish.

17 new unit tests cover: status thresholds across the verified/needs_review/
needs_iteration/failed bands, all-pass/all-fail/partial check sets, missing
fs/render/judge callbacks, missing artifacts, missing source image, source
image format validation, abort signal threading, and HONEST_SCORES guarantee
that every check carries a reason.

Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (276 core including 17 new + 1026 desktop + others)

Signed-off-by: homen <hshum2018@gmail.com>
…ks + toast feedback

The verify_ui_kit_visual_parity tool was returning status="unavailable"
because the host hadn't injected its two callbacks. This commit completes
the wiring so the visual judge runs LIVE during decompose.

Three new pieces:

1. apps/desktop/src/main/render-ui-kit.ts (~110 LOC)
   Hidden BrowserWindow + offscreen render + capturePage. Mirrors
   done-verify.ts pattern. Loads the decomposed ui_kits/<slug>/index.html
   in a sandboxed offscreen window, waits for did-finish-load + a 1500ms
   settle window for fonts/CSS, then captures a PNG and returns it as a
   base64 data URL. Honors AbortSignal + 12s hard timeout.

2. apps/desktop/src/main/judge-visual-parity.ts (~230 LOC)
   Vision-LLM judge with the same 12 standard boolean parity checks as
   the in-core tool. Decoupled from cfg plumbing — takes a runVisionPrompt
   callback the host wires using its existing generation pipeline. Asks
   the model to answer each check yes/no with a reason, parses defensively
   (code-fence strip + balanced-brace extract), returns structured per-
   check answers that the in-core tool normalizes into a deterministic
   parityScore + bounded-enum status.

3. apps/desktop/src/main/index.ts wiring
   Constructs both callbacks at runGenerate time and passes them to
   generateViaAgent's deps. The judge re-uses the SAME model/apiKey/
   baseUrl/wire/capabilities as the active generation request, so we
   don't need a separate judge config — whatever model the user picked
   for generation is the model that judges parity. If the model isn't
   vision-capable the judge throws and the agent falls back to the
   deterministic verify_ui_kit_parity.
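The defensive parse in piece 2 (code-fence strip + balanced-brace extract) can be sketched roughly as below. This is a simplification with hypothetical names, and it is naive about braces inside JSON strings; the real parser may handle that.

```typescript
// Illustrative defensive extraction of a JSON object from a chatty model reply.
function extractJson(reply: string): unknown {
  const stripped = reply.replace(/```(?:json)?/g, "").trim(); // strip code fences
  const start = stripped.indexOf("{");
  if (start === -1) throw new Error("no JSON object in judge reply");
  let depth = 0;
  for (let i = start; i < stripped.length; i++) {
    if (stripped[i] === "{") depth++;
    else if (stripped[i] === "}" && --depth === 0) {
      // First balanced {...} span wins; trailing prose is ignored.
      return JSON.parse(stripped.slice(start, i + 1));
    }
  }
  throw new Error("unbalanced JSON object in judge reply");
}
```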

Bonus: triggerDecompose store action now surfaces three toasts covering
all branches (busy / no-artifact-yet / decomposing-now), with i18n keys
in en + zh-CN. Previously the action silently no-op'd when conditions
weren't met, which the user caught during dogfood.

Verified:
- pnpm lint clean (1 noShadowRestrictedNames fix on local `escape` var)
- pnpm typecheck clean (10/10 workspace tasks)
- pnpm test green (276 core + 1026 desktop + others)
- Live-DOM dogfood with Playwright in browser-mode passed all 12 checks
  including the new menu item rendering and console-error-clean reload

Signed-off-by: homen <hshum2018@gmail.com>
…t in done summary

Two pieces, no defer:

1. docs/benchmarks/DECOMPOSE_TO_UI_KIT.md (~280 lines)
   - Full methodology: 4-stage pipeline + 12 standard boolean checks +
     status thresholds + cost methodology + cache key derivation
   - Real numbers from four cross-tier runs on the same NodeBench Reports
     source (cached): Opus reference, Pro+Pro with iteration loop
     demonstration, mixed Flash-Lite-decompose + Pro-judge, cheapest tier
   - Specific gap signal showing the verify-and-iterate loop climbing
     parity 0.69 -> 0.78 in one self-correcting round
   - Recommendation matrix: production / continuous-eval / CI-smoke
   - Reproducibility instructions with exact CLI commands
   - Honest non-claims section (no claim of universal parity, no claim
     gpt-image-1 mockups are production-quality, no claim cheap tier
     hits 0.85 first-pass)
   - Documented model failures (Kimi K2.6 truncation via OpenRouter,
     GLM 4.6V malformed JSON)
   - Citations to 2026 VLM-as-judge research + NodeBench's own internal
     boolean-evaluator rule patterns

2. decomposePrompt.ts updated (EN + ZH) — done summary MUST report:
   - Deterministic verifier passCount/totalChecks + status
   - Visual judge passCount/12 + status
   - Visual judge judgeCostUsd (this run's self-verify spend)
   - Remaining unfixed gaps with failed-check ids + why
   "Do NOT hide cost. Do NOT inflate scores. Failed checks count as failed."

The cost surfacing is prompt-driven (the agent always reports it in
chat) — orthogonal to a future UI cost meter, but ensures honest cost
accounting today without renderer surgery.

Verified:
- pnpm lint clean
- pnpm typecheck clean (10/10 workspace tasks)

Signed-off-by: homen <hshum2018@gmail.com>
Tracks the boolean-rubric methodology + reproducible cross-tier results
for verify_ui_kit_visual_parity. docs/ is gitignored per CLAUDE.md so
this lives at repo root alongside README.md / CONTRIBUTING.md.

Re-publishes the decompose-to-ui-kit benchmark previously committed to
docs/benchmarks/ (which was silently dropped by .gitignore).

Signed-off-by: homen <hshum2018@gmail.com>
When `verify_ui_kit_visual_parity` resolves, the renderer now reads
`judgeCostUsd`, `passCount`, `totalChecks`, and `status` defensively from
the structured ParityReport and pushes a toast — operator sees a per-
decompose cost row without needing a new dashboard. Variant flips to
`success` for verified/needs_review, `info` otherwise. Reads the result
shape with bracket access so a future contract drift degrades to silent
rather than crashing the renderer. New i18n keys
`sidebar.decomposeJudgeResultTitle/Description` in en + zh-CN.

README + README.zh-CN now mention the Decompose to UI Kit feature under
"What's new" + Generation features so the entry point is discoverable
from the repo landing.

Signed-off-by: homen <hshum2018@gmail.com>
Side-by-side hero image: source.png (gpt-image input) vs rendered.png
(agent-emitted ui_kit, headless-rendered) from the e2e-opus-final PoC
run. parityScore badge (0.90) and status are derived deterministically
from the 12-check boolean rubric — passCount / totalChecks — not an
LLM-fabricated float. Hosted in this branch under
website/public/screenshots/decompose-to-ui-kit.png so the github raw
URL renders inline on github.com regardless of upstream merge state.

Both README.md and README.zh-CN.md now embed the same image with a
matching subcaption that calls out: real run (not mock), source ->
rendered direction, and the deterministic-derivation invariant.

Signed-off-by: homen <hshum2018@gmail.com>
4-frame reel from the e2e-nodebench-iter PoC run:
1. SOURCE      gpt-image input
2. ITER-0      parityScore 0.82, status needs_iteration, 6 gaps surfaced
3. ITER-1      parityScore 0.78, status needs_iteration, 5 gaps surfaced
4. HONEST      Δ score -0.04, Δ gaps -1 -- agent fixed some gaps but
                regressed on layout, boolean rubric exposes the drift
                instead of hiding it (HONEST_SCORES rule)

Both gif (393.9 KB, 1080px wide, 10fps) and mp4 (224.2 KB, h.264 yuv420p)
shipped under website/public/demos/ to match the existing demo asset
convention. README.md + README.zh-CN.md embed the gif inline directly
under the side-by-side hero with subcaption explaining the deliberate
choice to show drift, plus a link to the mp4 for quality-sensitive
viewers.

Hosted in this branch so the github raw URL renders inline on github.com
regardless of upstream merge state.

Signed-off-by: homen <hshum2018@gmail.com>
Closes Phase 1 of OpenCoworkAI#225.

Signed-off-by: homen <hshum2018@gmail.com>
@github-actions Bot added the docs (Documentation), area:desktop (apps/desktop: Electron shell, renderer), and area:core (packages/core: generation orchestration) labels on Apr 26, 2026
Comment thread packages/core/src/tools/verify-ui-kit-parity.ts Fixed

@github-actions Bot left a comment


Findings

  • [Major] Decompose loop success can never trigger on the first clean pass — decomposePrompt.ts requires both verifiers to return verified or needs_review, but the deterministic verifier only returns ok or needs_iteration. That forces an unnecessary extra iteration even when deterministic parity already passed, which adds avoidable cost and can regress a good bundle, evidence apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:67, packages/core/src/tools/verify-ui-kit-parity.ts:35, packages/core/src/tools/verify-ui-kit-parity.ts:294.
    Suggested fix:

    const deterministicPass = deterministic.status === 'ok';
    const visualPass =
      visual.status === 'unavailable' ||
      visual.status === 'verified' ||
      visual.status === 'needs_review';
    if (deterministicPass && visualPass) {
      // call done
    }
  • [Major] verify_ui_kit_visual_parity({slug}) has no source image on the default runtime path — the tool defaults to source.png, but the agent FS is initialized with index.html, frames, and skills only, while preparePromptContext() keeps attachments in prompt context instead of persisting them into the virtual FS. In normal runs the visual verifier therefore degrades to unavailable instead of actually judging parity, evidence packages/core/src/tools/verify-ui-kit-visual-parity.ts:291, apps/desktop/src/main/index.ts:294, apps/desktop/src/main/index.ts:879, apps/desktop/src/main/prompt-context.ts:287.
    Suggested fix:

    const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
    if (firstImage?.imageDataUrl) {
      await fs.create('source.png', firstImage.imageDataUrl);
    }
  • [Major] Judge/render failures do not fall back to a structured result — makeJudgeVisualParity() throws on empty or non-JSON model replies, and verify_ui_kit_visual_parity awaits both renderUiKit() and judgeVisualParity() without catching those failures. On a text-only model or malformed judge response, the tool errors instead of returning the advertised status: "unavailable" path, evidence apps/desktop/src/main/judge-visual-parity.ts:153, apps/desktop/src/main/judge-visual-parity.ts:197, packages/core/src/tools/verify-ui-kit-visual-parity.ts:311.
    Suggested fix:

    try {
      const candidateImg = await renderUiKit(decomposed.content, signal);
      const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
      // existing normalization...
    } catch (error) {
      const report = unavailableReport(
        error instanceof Error ? error.message : String(error),
      );
      return { content: [{ type: 'text', text: report.summary }], details: report };
    }
  • [Major] This still looks like partial work for #225, so Closes #225 is misleading — the public PR template says to use Closes only when the issue is fully resolved, but this diff stops at emitting a ui_kits/<slug>/ handoff bundle and explicitly tells the agent not to continue into the downstream prototype flow, evidence .github/PULL_REQUEST_TEMPLATE.md:11, .changeset/decompose-to-ui-kit.md:9, apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:58, packages/core/src/tools/decompose-to-ui-kit.ts:155.
    Suggested fix:

    Refs #225

Summary

  • Review mode: initial
  • Found 4 issues: one decompose-loop contract bug, two visual-verifier runtime gaps, and one incomplete issue-closure claim.

Testing

  • Not run (automation). Suggested: add one agent-runtime integration test that seeds an image attachment into the virtual FS and asserts verify_ui_kit_visual_parity can read it, plus one unit test that judge/render failures return a structured fallback instead of throwing.

open-codesign Bot

- If it returns status="unavailable", the host hasn't injected the judge callback. Proceed with step 4's deterministic report alone.
- If it returns successfully, read each checks[].passed + reason. Failed checks are the things to fix.
6. Reconcile both reports:
- Both status ∈ {verified, needs_review} (12/12 or 11/12 checks passed): call done

verify_ui_kit_parity never returns verified or needs_review, so the success branch described here can never fire. Even a clean deterministic pass gets forced into the iteration path and burns another full decompose cycle.

Suggested fix:

const deterministicPass = deterministic.status === 'ok';
const visualPass =
  visual.status === 'unavailable' ||
  visual.status === 'verified' ||
  visual.status === 'needs_review';

return { content: [{ type: 'text', text: report.summary }], details: report };
}

const sourcePath = params.sourceImagePath ?? 'source.png';

This tool defaults to source.png, but nothing in the runtime seeds an attached image into the agent FS before the tool runs. On the default path that means the visual verifier returns unavailable instead of ever judging parity.

Suggested fix:

const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
if (firstImage?.imageDataUrl) {
  await fs.create('source.png', firstImage.imageDataUrl);
}

};

logger.info('[verify_ui_kit_visual_parity] step=render', { slug: params.slug });
const candidateImg = await renderUiKit(decomposed.content, signal);

The fallback described in the prompt only works for missing dependencies/files. If renderUiKit() or judgeVisualParity() throws (for example on a text-only model or malformed JSON), this tool currently errors instead of returning a structured unavailable/failed report.

Suggested fix:

try {
  const candidateImg = await renderUiKit(decomposed.content, signal);
  const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
} catch (error) {
  const report = unavailableReport(error instanceof Error ? error.message : String(error));
  return { content: [{ type: 'text', text: report.summary }], details: report };
}


Add **Decompose to UI Kit** — one-click in the chat sidebar emits a `ui_kits/<slug>/` folder shaped for coding-agent handoff (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`). Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (`parityScore = passCount / totalChecks`, no LLM-fabricated floats) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast.

Closes Phase 1 of #225.

This release note says the phase is closed, but the shipped flow still stops at producing a handoff bundle and explicitly does not continue into the downstream prototype step. Per the public PR template, this should be Refs #225 unless the full issue scope is implemented here.

Suggested fix:

Refs #225

…itespace

CodeQL js/bad-tag-filter (HIGH) on PR OpenCoworkAI#241: the literal `<\/script>` /
`<\/style>` patterns in stripTags() left bodies behind for HTML5-tolerated
end-tag forms like `</script >` (trailing space) and `</script foo="bar">`
(end-tag attributes, silently ignored by browsers per spec). A crafted
source HTML could leak script/style body text into the visible-word
vocabulary used for parity coverage scoring.

Fix: mirror the opening-tag pattern's `\b[^>]*` on the close tag too.
The `\b` after the tag name prevents over-matching `</scripts>` while
the `[^>]*` consumes any tolerated end-tag content up to the closing `>`.
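The fix can be sketched as a simplified stripTags (only the close-tag pattern change is from this commit; the surrounding function is illustrative):

```typescript
// Simplified stripTags mirroring the described fix: `\b[^>]*` on BOTH the open and
// close tags so HTML5-tolerated end-tag forms (`</script >`, `</script foo="bar">`)
// can't leak script/style bodies into the visible-word vocabulary.
function stripTags(html: string): string {
  return html
    .replace(/<script\b[^>]*>[\s\S]*?<\/script\b[^>]*>/gi, " ")
    .replace(/<style\b[^>]*>[\s\S]*?<\/style\b[^>]*>/gi, " ");
}
```

The `\b` after the tag name is what keeps `</scripts>` from being swallowed while still consuming tolerated end-tag junk up to the closing `>`.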

Regression test covers all 4 previously-vulnerable forms:
- `</script >` (trailing whitespace)
- `</script foo="bar">` (end-tag attrs)
- `</style >` (style branch)
- `</SCRIPT>` (case)

Asserts none of 4 leaked tokens appear in the parity report when the
decomposition correctly omits the script/style content.

Signed-off-by: homen <hshum2018@gmail.com>

@github-actions Bot left a comment


Findings

  • [Major] Decompose loop success can never trigger on the first clean pass — decomposePrompt.ts still requires both verifiers to return verified or needs_review, but verify_ui_kit_parity only returns ok or needs_iteration. A clean deterministic pass therefore gets forced into another full decompose cycle, which adds avoidable cost and can regress a good bundle, evidence apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:72, packages/core/src/tools/verify-ui-kit-parity.ts:35, packages/core/src/tools/verify-ui-kit-parity.ts:300.
    Suggested fix:

    const deterministicPass = deterministic.status === 'ok';
    const visualPass =
      visual.status === 'unavailable' ||
      visual.status === 'verified' ||
      visual.status === 'needs_review';
    if (deterministicPass && visualPass) {
      // call done
    }
  • [Major] verify_ui_kit_visual_parity({slug}) still has no source image on the default runtime path — the tool defaults to source.png, but the runtime only prepares attachment context and passes images to the model wire; it never persists an attached image into the agent FS. On the normal path the visual verifier therefore returns unavailable instead of actually judging parity, evidence packages/core/src/tools/verify-ui-kit-visual-parity.ts:291, apps/desktop/src/main/prompt-context.ts:287, apps/desktop/src/main/index.ts:914, packages/core/src/index.ts:203.
    Suggested fix:

    const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
    if (firstImage?.imageDataUrl) {
      await fs.create('source.png', firstImage.imageDataUrl);
    }
  • [Major] Judge/render failures still bypass the advertised structured fallback — makeJudgeVisualParity() throws on empty or non-JSON model replies, and verify_ui_kit_visual_parity still awaits both renderUiKit() and judgeVisualParity() without catching those failures. On a text-only model or malformed judge response, the tool errors instead of returning the promised status: "unavailable" path, evidence apps/desktop/src/main/judge-visual-parity.ts:198, apps/desktop/src/main/judge-visual-parity.ts:210, packages/core/src/tools/verify-ui-kit-visual-parity.ts:312.
    Suggested fix:

    try {
      const candidateImg = await renderUiKit(decomposed.content, signal);
      const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
      // existing normalization...
    } catch (error) {
      const report = unavailableReport(
        error instanceof Error ? error.message : String(error),
      );
      return { content: [{ type: 'text', text: report.summary }], details: report };
    }
  • [Major] The issue-closing claim is still overstated for this slice — the public PR template says to use Closes only when the issue is fully resolved, but this diff still stops at emitting a ui_kits/<slug>/ handoff bundle and explicitly tells the agent not to continue into the downstream prototype step. Evidence: .github/PULL_REQUEST_TEMPLATE.md:11, .changeset/decompose-to-ui-kit.md:9, apps/desktop/src/renderer/src/hooks/decomposePrompt.ts:58, packages/core/src/tools/decompose-to-ui-kit.ts:155.
    Suggested fix:

    Refs #225

Summary

  • Review mode: follow-up after new commits. The new commit fixes the script/style stripping issue, but these 4 major issues are still present on the latest head.

Testing

  • Not run (automation). Suggested: add one agent-runtime integration test that seeds an image attachment into the virtual FS and asserts verify_ui_kit_visual_parity can read it, plus one unit test that render/judge failures return a structured fallback instead of throwing.
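
The first suggested test can be sketched as a self-contained harness. The `Attachment` and `VirtualFs` shapes and the `persistSourceImage` helper below are hypothetical stand-ins for the real runtime types, not the actual implementation:

```typescript
// Hypothetical shapes standing in for the runtime's attachment context and
// the agent's in-memory virtual FS; names are illustrative only.
interface Attachment {
  imageDataUrl?: string;
}

class VirtualFs {
  private files = new Map<string, string>();
  async create(path: string, content: string): Promise<void> {
    this.files.set(path, content);
  }
  async read(path: string): Promise<string | undefined> {
    return this.files.get(path);
  }
}

// The behavior the integration test should pin down: the first image
// attachment gets persisted as source.png before the visual verifier runs.
async function persistSourceImage(
  attachments: Attachment[],
  fs: VirtualFs,
): Promise<boolean> {
  const first = attachments.find((a) => a.imageDataUrl);
  if (!first?.imageDataUrl) return false; // verifier will report unavailable
  await fs.create('source.png', first.imageDataUrl);
  return true;
}
```

With that in place, the test seeds one attachment, calls `persistSourceImage`, and asserts that `fs.read('source.png')` returns the original data URL.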

open-codesign Bot

- If it returns status="unavailable", the host hasn't injected the judge callback. Proceed with step 4's deterministic report alone.
- If it returns successfully, read each checks[].passed + reason. Failed checks are the things to fix.
6. Reconcile both reports:
- Both status ∈ {verified, needs_review} (12/12 or 11/12 checks passed): call done

verify_ui_kit_parity still never returns verified or needs_review, so this success branch can never fire. Even a clean deterministic pass gets forced into another full decompose cycle.

Suggested fix:

const deterministicPass = deterministic.status === 'ok';
const visualPass =
  visual.status === 'unavailable' ||
  visual.status === 'verified' ||
  visual.status === 'needs_review';
if (deterministicPass && visualPass) {
  // call done
}

return { content: [{ type: 'text', text: report.summary }], details: report };
}

const sourcePath = params.sourceImagePath ?? 'source.png';

This tool still defaults to source.png, but nothing in the runtime persists an attached image into the agent FS before the tool runs. On the default path that means the visual verifier returns unavailable instead of ever judging parity.

Suggested fix:

const firstImage = promptContext.attachments.find((a) => a.imageDataUrl);
if (firstImage?.imageDataUrl) {
  await fs.create('source.png', firstImage.imageDataUrl);
}

};

logger.info('[verify_ui_kit_visual_parity] step=render', { slug: params.slug });
const candidateImg = await renderUiKit(decomposed.content, signal);

The fallback described in the prompt still only covers missing dependencies/files. If renderUiKit() or judgeVisualParity() throws (for example on a text-only model or malformed JSON), this tool errors instead of returning a structured unavailable report.

Suggested fix:

try {
  const candidateImg = await renderUiKit(decomposed.content, signal);
  const judgeResult = await judgeVisualParity(sourceImg, candidateImg, signal);
  // existing normalization of judgeResult...
} catch (error) {
  const report = unavailableReport(error instanceof Error ? error.message : String(error));
  return { content: [{ type: 'text', text: report.summary }], details: report };
}
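
The suggested try/catch can be exercised in isolation. This is a minimal sketch assuming a throwing renderer; `Report`, `unavailableReport`, and `runVisualParity` are hypothetical stand-ins for the tool internals, not the real signatures:

```typescript
// Illustrative report shape; the real tool returns a richer structure.
type Report = { status: 'verified' | 'unavailable'; summary: string };

const unavailableReport = (reason: string): Report => ({
  status: 'unavailable',
  summary: `visual parity unavailable: ${reason}`,
});

// Stand-in for the tool body after the suggested fix: render or judge
// failures surface as a structured fallback instead of a thrown error.
async function runVisualParity(render: () => Promise<string>): Promise<Report> {
  try {
    await render();
    return { status: 'verified', summary: 'judged' };
  } catch (error) {
    return unavailableReport(error instanceof Error ? error.message : String(error));
  }
}
```

The suggested unit test then asserts that a rejected render resolves to `status: 'unavailable'` with the failure reason in the summary, rather than rejecting.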


Add **Decompose to UI Kit** — one-click in the chat sidebar emits a `ui_kits/<slug>/` folder shaped for coding-agent handoff (`index.html` + `components/*.tsx` + `tokens.css` + `manifest.json` + `README.md`). Built-in deterministic + vision verifiers self-check parity using a 12-question boolean rubric (`parityScore = passCount / totalChecks`, no LLM-fabricated floats) and re-iterate on gaps. Per-decompose cost surfaces inline as a toast.

Closes Phase 1 of #225.
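
The boolean rubric in this note can be sketched as follows. The `ParityCheck` shape and the `reconcileStatus` thresholds are assumptions inferred from the 12-question rubric and the 12/12 vs 11/12 reconcile rule quoted in the prompt, not the actual ParityReport fields:

```typescript
// Assumed per-question check shape; the real ParityReport may differ.
interface ParityCheck {
  id: string;
  passed: boolean;
  reason: string;
}

// parityScore = passCount / totalChecks: derived from booleans,
// never a free-form float invented by the model.
function scoreParity(checks: ParityCheck[]) {
  const passCount = checks.filter((c) => c.passed).length;
  const totalChecks = checks.length;
  return { passCount, totalChecks, parityScore: passCount / totalChecks };
}

// Assumed thresholds: 12/12 -> verified, 11/12 -> needs_review, else iterate.
function reconcileStatus(
  passCount: number,
  totalChecks: number,
): 'verified' | 'needs_review' | 'needs_iteration' {
  if (passCount === totalChecks) return 'verified';
  if (passCount >= totalChecks - 1) return 'needs_review';
  return 'needs_iteration';
}
```

Deriving the score from counted booleans keeps the verifier deterministic: two runs over the same bundle always produce the same status.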

This release note still says the phase is closed, but the shipped flow stops at producing a handoff bundle and explicitly does not continue into the downstream prototype step. Per the public PR template, this should be Refs #225 unless the full issue scope is implemented here.

Suggested fix:

Refs #225


Labels

area:core (packages/core: generation orchestration), area:desktop (apps/desktop: Electron shell, renderer), docs (Documentation)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: image 2 is already impressive enough; what's most needed is the process of turning the generated UI into components, and then into a prototype!

2 participants