Skip to content

fix: Phase 2 — close 5 P1 structural bugs from audit#4

Merged
fbmoulin merged 2 commits intomainfrom
feat/phase2-structural-fixes
Apr 27, 2026
Merged

fix: Phase 2 — close 5 P1 structural bugs from audit#4
fbmoulin merged 2 commits intomainfrom
feat/phase2-structural-fixes

Conversation

@fbmoulin
Copy link
Copy Markdown
Owner

Summary

Phase 2 of ROADMAP. Closes 5 of 9 P1 findings from docs/AUDIT.md. Remaining open (P1-E, P1-F, P1-I) → Phase 3.

Audit Fix File
P1-A Patch D shadow walker no-op Walk live DOM, emit Declarative Shadow DOM, surface skipped_closed_shadow_roots in manifest kratos_clone/capture.py:81-157
P1-B Asset write race asyncio.create_task tracking + gather (10s timeout) before context.close() kratos_clone/capture.py:_on_response_tracked
P1-C Generators NexusFlow-only Semantic class-signature lookup (find_button_by_classes); 10 regression tests scripts/generate_design_system_v{1,2}.py
P1-D Same-origin substring bypass urlparse().netloc compare + about:srcdoc allow-list kratos_clone/capture.py:_extract_html
P1-G Iframe srcdoc unconditional win Length-ratio compare against main doc; KCD_IFRAME_MIN_RATIO (0.5) + KCD_NO_IFRAME_SRCDOC opt-out kratos_clone/capture.py:_extract_html

New env vars

  • KCD_IFRAME_MIN_RATIO (float, default 0.5) — minimum srcdoc/main-doc length ratio to prefer iframe content
  • KCD_NO_IFRAME_SRCDOC (bool, default false) — disable srcdoc detection entirely

Manifest additions

  • shadow_skipped_closed: int — count of closed shadow roots Patch D could not serialize
  • errors: ["asset_write_timeout: N pending"] if asset writes exceed 10s gather timeout

Test plan

  • pytest -q → 62 passed in 0.52s (was 52, +10 from test_generator_helpers.py)
  • ruff check + format --check clean
  • Imports OK, no new threads on import app
  • CI green (lint + smoke + pytest)
  • Bot reviews (CodeRabbit + Gemini + Code Review Doctor)
  • Manual: re-run capture on nexusflow-saas.aura.build with new code; expect same outputs (regression check). Phase 5 will add E2E job to CI.

Closes audit P1-A, P1-B, P1-C, P1-D, P1-G. Remaining open: P1-E (asset disk
cap), P1-F (BeautifulSoup rewriting), P1-I (PII strip) — Phase 3.

## P1-A — Patch D shadow walker fixed (was a no-op)

`kratos_clone/capture.py:81-157` — replaced `cloneNode(true)`-then-walk pattern
(which per HTML spec discards all shadow roots) with a recursive walker over
the LIVE DOM that emits Declarative Shadow DOM `<template shadowrootmode="open">`
for each open shadow root. Closed shadow roots are inaccessible by spec —
count surfaced in `manifest.shadow_skipped_closed` so operators see when a
clone is incomplete (e.g., Stripe Elements, some web components).

The walker now returns `{html, skipped_closed_shadow_roots}` instead of just
a string. `_extract_html` unpacks both; `manifest.json` includes the new field.

## P1-B — Asset write race resolved

`kratos_clone/capture.py:_on_response_tracked` — the `page.on("response", ...)`
handler is now a sync wrapper that creates an `asyncio.Task` and tracks it
in `self._pending_writes`. Before `context.close()`, we
`asyncio.gather(*self._pending_writes, return_exceptions=True)` with a 10s
timeout. Late response handlers no longer get aborted mid-byte; manifest
records `asset_write_timeout` if any did exceed the cap.

## P1-C — Generators no longer NexusFlow-only

`scripts/generate_design_system_v{1,2}.py` — replaced hardcoded
`inv["buttons"][2]`, `[3]`, `[7]` indices (which `IndexError`'d on any site
with fewer or differently-ordered buttons) with `find_button_by_classes()`
semantic lookup. Function takes `*required` substrings and returns first
matching button OR a `{classes:"", label:default}` stub. Empty-classes rows
are dropped from `BUTTON_ROLES` so we don't render bald buttons.

10 regression tests in `tests/test_generator_helpers.py` (52 → 62 total),
including the literal "fewer buttons than the old hardcoded indices" case
that used to crash.

## P1-D — Same-origin via urlparse().netloc

`kratos_clone/capture.py:_extract_html` — replaced
`f_url.startswith(self.url) or "srcdoc" in f_url.lower()` (where any URL
containing the substring "srcdoc" passed) with proper
`urlparse(f_url).netloc == urlparse(self.url).netloc or
f_url.startswith("about:srcdoc")`.

## P1-G — Iframe srcdoc no longer unconditional

`kratos_clone/capture.py:_extract_html` — capture main doc length first; only
prefer `iframe[srcdoc]` content when its length is at least
`KCD_IFRAME_MIN_RATIO * len(main_html)` (default 0.5). Cookie-banner srcdoc
(few KB) no longer replaces 100KB of real content. Decision is logged.
`KCD_NO_IFRAME_SRCDOC=true` opt-out for sites where srcdoc detection misfires.

## Tests + lint

  $ pytest tests/ -q
  ............................................................. [100%]
  62 passed in 0.52s

  $ ruff check + format --check
  All checks passed!

Manifest now includes `shadow_skipped_closed: int`. New env vars:
`KCD_NO_IFRAME_SRCDOC` (bool), `KCD_IFRAME_MIN_RATIO` (float, default 0.5).

## Doc updates

- TODO.md: Phase 2 marked done; Phase 3 promoted to "Now"
- docs/AUDIT.md: P1-A through P1-D + P1-G + P1-H marked ✅ RESOLVED with mechanism
- docs/WORKFLOW.md: removed "BROKEN" warning on Patch D, replaced with "Fixed Phase 2"
- CLAUDE.md: known-issues table now lists only the 3 OPEN P1 items
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 27, 2026

Warning

Rate limit exceeded

@fbmoulin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 40 minutes and 31 seconds before requesting another review.

To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 065845dc-4d1b-4e3b-aea8-43c4d0aa3a36

📥 Commits

Reviewing files that changed from the base of the PR and between b54939a and 7f6f39d.

📒 Files selected for processing (9)
  • CLAUDE.md
  • TODO.md
  • docs/AUDIT.md
  • docs/WORKFLOW.md
  • kratos_clone/capture.py
  • scripts/generate_design_system_v1.py
  • scripts/generate_design_system_v2.py
  • tests/test_generator_helpers.py
  • wsgi.py
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/phase2-structural-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@code-review-doctor code-review-doctor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Worth considering though. View full project report here.

Comment thread kratos_clone/capture.py Outdated
if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5:
self.log(
f"🔍 Using same-origin iframe content ({len(f_html) // 1024} KB)"
f"🔍 Using same-origin iframe content "
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
f"🔍 Using same-origin iframe content "
"🔍 Using same-origin iframe content "

f-string is unnecessary here. This can just be a string. More info.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements Phase 2 structural fixes, addressing several high-priority issues identified in the audit. Key changes include a rewritten live DOM walker for shadow DOM serialization, improved asset write tracking to prevent race conditions, and semantic button lookup in design system generators. Heuristics for iframe content extraction were also refined to prevent small overlays from replacing main content. Feedback includes a bug in the manual HTML serializer regarding script/style escaping, a missing doctype when using outerHTML, and a suggestion to use environment variables for iframe length ratios.

Comment thread kratos_clone/capture.py
Comment on lines +108 to +151
function serialize(node) {
// TEXT_NODE
if (node.nodeType === 3) return escText(node.nodeValue);
// COMMENT_NODE
if (node.nodeType === 8) return '<!--' + node.nodeValue + '-->';
// CDATA_SECTION_NODE — drop, not valid in HTML
if (node.nodeType !== 1) return '';

const tag = node.tagName.toLowerCase();

// Skip <script type="application/json"> bodies? No — preserve everything.
let out = '<' + tag;
for (let i = 0; i < node.attributes.length; i++) {
const a = node.attributes[i];
out += ' ' + a.name + '="' + escAttr(a.value) + '"';
}
Array.from(node.children).forEach(visit);
};
const clone = (root || document.documentElement).cloneNode(true);
visit(clone);
return '<!DOCTYPE html>\n' + clone.outerHTML;
out += '>';

if (VOID.has(tag)) return out;

// Emit shadow root BEFORE children (Declarative Shadow DOM convention)
const sr = node.shadowRoot;
if (sr) {
if (sr.mode === 'open') {
out += '<template shadowrootmode="open">';
// Walk the actual shadow tree (live, not cloned)
for (let i = 0; i < sr.childNodes.length; i++) {
out += serialize(sr.childNodes[i]);
}
out += '</template>';
} else {
// mode === 'closed' — inaccessible by spec; count and skip silently
skippedClosed++;
}
}

// Children of light DOM
for (let i = 0; i < node.childNodes.length; i++) {
out += serialize(node.childNodes[i]);
}

out += '</' + tag + '>';
return out;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The manual HTML serializer in PATCH_D_SHADOW_DOM_HELPERS incorrectly escapes text content within <script> and <style> elements. These are "raw text" elements in HTML, and escaping characters like < and & inside them (e.g., converting if (a < b) to if (a &lt; b)) will break the scripts and styles of the captured page. The serializer should detect these tags and preserve their text content without escaping.

  function serialize(node, parentTag) {
    // TEXT_NODE
    if (node.nodeType === 3) {
      return (parentTag === 'script' || parentTag === 'style') ? node.nodeValue : escText(node.nodeValue);
    }
    // COMMENT_NODE
    if (node.nodeType === 8) return '<!--' + node.nodeValue + '-->';
    // CDATA_SECTION_NODEdrop, not valid in HTML
    if (node.nodeType !== 1) return '';

    const tag = node.tagName.toLowerCase();

    // Skip <script type="application/json"> bodies? Nopreserve everything.
    let out = '<' + tag;
    for (let i = 0; i < node.attributes.length; i++) {
      const a = node.attributes[i];
      out += ' ' + a.name + '="' + escAttr(a.value) + '"';
    }
    out += '>';

    if (VOID.has(tag)) return out;

    // Emit shadow root BEFORE children (Declarative Shadow DOM convention)
    const sr = node.shadowRoot;
    if (sr) {
      if (sr.mode === 'open') {
        out += '<template shadowrootmode="open">';
        // Walk the actual shadow tree (live, not cloned)
        for (let i = 0; i < sr.childNodes.length; i++) {
          out += serialize(sr.childNodes[i], tag);
        }
        out += '</template>';
      } else {
        // mode === 'closed'inaccessible by spec; count and skip silently
        skippedClosed++;
      }
    }

    // Children of light DOM
    for (let i = 0; i < node.childNodes.length; i++) {
      out += serialize(node.childNodes[i], tag);
    }

    out += '</' + tag + '>';
    return out;
  }

Comment thread kratos_clone/capture.py
Comment on lines +579 to +581
main_html = (
await page.evaluate("() => document.documentElement.outerHTML") or ""
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Capturing main_html using document.documentElement.outerHTML excludes the <!DOCTYPE> declaration. This leads to a regression where the final captured page lacks a doctype if the shadow walker is disabled. Using page.content() instead ensures the full document structure is preserved for both length comparison and final output.

        main_html = await page.content()

Comment thread kratos_clone/capture.py
"() => document.documentElement.outerHTML"
)
if len(f_html) > 1000:
if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The length comparison for same-origin iframes uses a hardcoded 0.5 ratio. This should use the KCD_IFRAME_MIN_RATIO environment variable (which defaults to 0.5) to remain consistent with the srcdoc detection logic and allow for centralized configuration.

Suggested change
if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5:
if len(f_html) > 1000 and len(f_html) >= main_html_len * float(os.getenv("KCD_IFRAME_MIN_RATIO", "0.5")):

@fbmoulin fbmoulin merged commit ea6cf1a into main Apr 27, 2026
5 checks passed
@fbmoulin fbmoulin deleted the feat/phase2-structural-fixes branch April 27, 2026 09:09
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2f6326328

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread kratos_clone/capture.py
Comment on lines +109 to +110
// TEXT_NODE
if (node.nodeType === 3) return escText(node.nodeValue);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Preserve raw-text nodes when serializing shadow DOM

The new serializer escapes every text node via escText, including text inside <script> and <style>. In HTML raw-text elements, entities like &lt; are not decoded back to <, so JavaScript/CSS content is changed and can stop working in captured output (for example, if (a < b) becomes invalid script text). This regresses fidelity for pages that rely on inline scripts/styles in the shadow-walker path.

Useful? React with 👍 / 👎.

Comment thread kratos_clone/capture.py
// CDATA_SECTION_NODE — drop, not valid in HTML
if (node.nodeType !== 1) return '';

const tag = node.tagName.toLowerCase();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep case-sensitive SVG/MathML tag names intact

The serializer forces node.tagName.toLowerCase() for all elements, which mutates foreign-element names (e.g., SVG linearGradient, clipPath) that are case-sensitive. Reparsed snapshots can lose gradients, masks, and icon rendering because those elements no longer match valid tag names. Preserve original casing for non-HTML namespaces instead of lowercasing unconditionally.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant