fix: Phase 2 — close 5 P1 structural bugs from audit by fbmoulin · Pull Request #4 · fbmoulin/kratos-clone

fbmoulin · 2026-04-27T09:07:16Z

Summary

Phase 2 of ROADMAP. Closes 5 of 9 P1 findings from docs/AUDIT.md. Remaining open (P1-E, P1-F, P1-I) → Phase 3.

| Audit | Fix | File |
|---|---|---|
| P1-A Patch D shadow walker no-op | Walk live DOM, emit Declarative Shadow DOM, surface the closed-root count in manifest as `shadow_skipped_closed` | `kratos_clone/capture.py:81-157` |
| P1-B Asset write race | `asyncio.create_task` tracking + `gather` (10s timeout) before `context.close()` | `kratos_clone/capture.py:_on_response_tracked` |
| P1-C Generators NexusFlow-only | Semantic class-signature lookup (`find_button_by_classes`); 10 regression tests | `scripts/generate_design_system_v{1,2}.py` |
| P1-D Same-origin substring bypass | `urlparse().netloc` compare + `about:srcdoc` allow-list | `kratos_clone/capture.py:_extract_html` |
| P1-G Iframe srcdoc unconditional win | Length-ratio compare against main doc; `KCD_IFRAME_MIN_RATIO` (0.5) + `KCD_NO_IFRAME_SRCDOC` opt-out | `kratos_clone/capture.py:_extract_html` |

New env vars

KCD_IFRAME_MIN_RATIO (float, default 0.5) — minimum srcdoc/main-doc length ratio to prefer iframe content
KCD_NO_IFRAME_SRCDOC (bool, default false) — disable srcdoc detection entirely
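
A minimal sketch of how these two variables could drive the srcdoc decision. The helper names `_env_bool` and `prefer_iframe_srcdoc`, and the set of strings treated as truthy, are assumptions for illustration; the actual parsing in `capture.py` is not shown in this PR description.

```python
import os


def _env_bool(name: str, default: bool = False) -> bool:
    """Interpret common truthy strings (hypothetical parsing rules)."""
    raw = os.getenv(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def prefer_iframe_srcdoc(main_html: str, srcdoc_html: str) -> bool:
    """Return True when srcdoc content should replace the main document."""
    if _env_bool("KCD_NO_IFRAME_SRCDOC"):
        return False  # opt-out: never prefer srcdoc content
    min_ratio = float(os.getenv("KCD_IFRAME_MIN_RATIO", "0.5"))
    # Prefer srcdoc only when it is at least min_ratio of the main doc length.
    return len(srcdoc_html) >= min_ratio * len(main_html)
```

With the defaults, a 2 KB cookie-banner srcdoc next to a 100 KB main document is rejected, while a srcdoc of at least half the main document's length is accepted.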

Manifest additions

shadow_skipped_closed: int — count of closed shadow roots Patch D could not serialize
errors: ["asset_write_timeout: N pending"] if asset writes exceed 10s gather timeout
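
The `asset_write_timeout` entry can be illustrated with a small tracker. This is a sketch under assumptions, not the code in `capture.py`: the `WriteTracker` class and its method names are hypothetical, and it uses `asyncio.wait` rather than the `gather`-with-timeout the PR describes, because `asyncio.wait` returns the still-pending set directly.

```python
import asyncio


class WriteTracker:
    """Tracks fire-and-forget write tasks so they can be awaited before shutdown."""

    def __init__(self) -> None:
        self._pending_writes: set[asyncio.Task] = set()
        self.errors: list[str] = []

    def track(self, coro) -> None:
        """Sync entry point, usable as an event callback inside a running loop."""
        task = asyncio.create_task(coro)
        self._pending_writes.add(task)
        # Forget the task once it finishes so the set only holds live writes.
        task.add_done_callback(self._pending_writes.discard)

    async def drain(self, timeout: float = 10.0) -> None:
        """Wait for tracked writes; record a manifest-style error if the cap is hit."""
        if not self._pending_writes:
            return
        _done, pending = await asyncio.wait(self._pending_writes, timeout=timeout)
        if pending:
            self.errors.append(f"asset_write_timeout: {len(pending)} pending")
            for task in pending:
                task.cancel()
```

Calling `await tracker.drain()` just before closing the browser context gives late writes a bounded chance to finish, and whatever is still pending is reported rather than silently dropped.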

Test plan

pytest -q → 62 passed in 0.52s (was 52, +10 from test_generator_helpers.py)
ruff check + format --check clean
Imports OK, no new threads on import app
CI green (lint + smoke + pytest)
Bot reviews (CodeRabbit + Gemini + Code Review Doctor)
Manual: re-run capture on nexusflow-saas.aura.build with new code; expect same outputs (regression check). Phase 5 will add E2E job to CI.

Closes audit P1-A, P1-B, P1-C, P1-D, P1-G. Remaining open: P1-E (asset disk cap), P1-F (BeautifulSoup rewriting), P1-I (PII strip), all deferred to Phase 3.

## P1-A: Patch D shadow walker fixed (was a no-op)

`kratos_clone/capture.py:81-157`: replaced the `cloneNode(true)`-then-walk pattern (`cloneNode` does not copy shadow roots by default, so the clone had none to walk) with a recursive walker over the LIVE DOM that emits Declarative Shadow DOM `<template shadowrootmode="open">` for each open shadow root. Closed shadow roots are inaccessible by spec, so their count is surfaced in `manifest.shadow_skipped_closed` and operators can see when a clone is incomplete (e.g., Stripe Elements, some web components).

The walker now returns `{html, skipped_closed_shadow_roots}` instead of just a string. `_extract_html` unpacks both; `manifest.json` includes the new field.

## P1-B: Asset write race resolved

`kratos_clone/capture.py:_on_response_tracked`: the `page.on("response", ...)` handler is now a sync wrapper that creates an `asyncio.Task` and tracks it in `self._pending_writes`. Before `context.close()`, we `asyncio.gather(*self._pending_writes, return_exceptions=True)` with a 10s timeout. Late response handlers no longer get aborted mid-byte; the manifest records `asset_write_timeout` if any writes exceeded the cap.

## P1-C: Generators no longer NexusFlow-only

`scripts/generate_design_system_v{1,2}.py`: replaced hardcoded `inv["buttons"][2]`, `[3]`, `[7]` indices (which raised `IndexError` on any site with fewer or differently-ordered buttons) with a `find_button_by_classes()` semantic lookup. The function takes `*required` substrings and returns the first matching button OR a `{classes:"", label:default}` stub. Empty-classes rows are dropped from `BUTTON_ROLES` so we don't render bald buttons.

10 regression tests in `tests/test_generator_helpers.py` (52 → 62 total), including the literal "fewer buttons than the old hardcoded indices" case that used to crash.
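
The P1-C lookup can be sketched as follows. Only the `*required` substrings and the stub shape come from the description above; the `buttons` positional argument, the `default_label` keyword, and the sample inventory are assumptions for illustration.

```python
def find_button_by_classes(buttons, *required, default_label="Button"):
    """Return the first button whose class string contains every required substring.

    Falls back to a stub with empty classes so callers never hit IndexError.
    """
    for button in buttons:
        classes = button.get("classes", "")
        if all(token in classes for token in required):
            return button
    return {"classes": "", "label": default_label}


# Hypothetical inventory with only two buttons (the old [7] index would crash).
inventory = [
    {"classes": "btn bg-indigo-600 text-white", "label": "Start free"},
    {"classes": "btn border border-gray-300", "label": "Learn more"},
]

roles = {
    "primary": find_button_by_classes(inventory, "bg-indigo", default_label="Primary"),
    "outline": find_button_by_classes(inventory, "border", default_label="Outline"),
    "ghost": find_button_by_classes(inventory, "hover:bg-transparent", default_label="Ghost"),
}

# Drop roles that matched nothing so no class-less button is rendered.
BUTTON_ROLES = {role: button for role, button in roles.items() if button["classes"]}
```

Here `BUTTON_ROLES` ends up with `primary` and `outline` only; the `ghost` role resolves to the stub and is filtered out instead of raising.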
## P1-D: Same-origin via urlparse().netloc

`kratos_clone/capture.py:_extract_html`: replaced `f_url.startswith(self.url) or "srcdoc" in f_url.lower()` (where any URL containing the substring "srcdoc" passed) with proper `urlparse(f_url).netloc == urlparse(self.url).netloc or f_url.startswith("about:srcdoc")`.

## P1-G: Iframe srcdoc no longer unconditional

`kratos_clone/capture.py:_extract_html`: capture main doc length first; only prefer `iframe[srcdoc]` content when its length is at least `KCD_IFRAME_MIN_RATIO * len(main_html)` (default 0.5). Cookie-banner srcdoc (few KB) no longer replaces 100KB of real content. Decision is logged. `KCD_NO_IFRAME_SRCDOC=true` opt-out for sites where srcdoc detection misfires.

## Tests + lint

    $ pytest tests/ -q
    ............................................................. [100%]
    62 passed in 0.52s

    $ ruff check + format --check
    All checks passed!

Manifest now includes `shadow_skipped_closed: int`. New env vars: `KCD_NO_IFRAME_SRCDOC` (bool), `KCD_IFRAME_MIN_RATIO` (float, default 0.5).

## Doc updates

- TODO.md: Phase 2 marked done; Phase 3 promoted to "Now"
- docs/AUDIT.md: P1-A through P1-D + P1-G + P1-H marked ✅ RESOLVED with mechanism
- docs/WORKFLOW.md: removed "BROKEN" warning on Patch D, replaced with "Fixed Phase 2"
- CLAUDE.md: known-issues table now lists only the 3 OPEN P1 items
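
The P1-D change can be illustrated by putting the old and new checks side by side. The function names and sample URLs are hypothetical; the two expressions mirror the ones quoted in the P1-D description.

```python
from urllib.parse import urlparse


def old_check(frame_url: str, page_url: str) -> bool:
    """Pre-fix logic: prefix match plus a bare substring test."""
    return frame_url.startswith(page_url) or "srcdoc" in frame_url.lower()


def is_same_origin_frame(frame_url: str, page_url: str) -> bool:
    """Post-fix logic: compare network locations, allow about:srcdoc frames."""
    if frame_url.startswith("about:srcdoc"):
        return True
    return urlparse(frame_url).netloc == urlparse(page_url).netloc


page = "https://example.com"
lookalike = "https://example.com.evil.test/login"   # passes the old prefix match
srcdoc_path = "https://evil.test/srcdoc/banner"      # passes the old substring test
```

Both hostile URLs satisfy `old_check` and are rejected by `is_same_origin_frame`. Note that `netloc` covers host and port but not the scheme, so this is slightly looser than a strict origin comparison.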

coderabbitai · 2026-04-27T09:07:24Z

Warning

Rate limit exceeded

@fbmoulin has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 40 minutes and 31 seconds before requesting another review.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 065845dc-4d1b-4e3b-aea8-43c4d0aa3a36

📥 Commits

Reviewing files that changed from the base of the PR and between b54939a and 7f6f39d.

📒 Files selected for processing (9)

CLAUDE.md
TODO.md
docs/AUDIT.md
docs/WORKFLOW.md
kratos_clone/capture.py
scripts/generate_design_system_v1.py
scripts/generate_design_system_v2.py
tests/test_generator_helpers.py
wsgi.py


code-review-doctor

Looks good. Worth considering though. View full project report here.

code-review-doctor · 2026-04-27T09:07:35Z

+                    if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5:
                        self.log(
-                            f"🔍 Using same-origin iframe content ({len(f_html) // 1024} KB)"
+                            f"🔍 Using same-origin iframe content "


Suggested change

f"🔍 Using same-origin iframe content "

"🔍 Using same-origin iframe content "

The f-string prefix is unnecessary here because the literal contains no placeholders. This can just be a plain string.

gemini-code-assist

Code Review

This pull request implements Phase 2 structural fixes, addressing several high-priority issues identified in the audit. Key changes include a rewritten live DOM walker for shadow DOM serialization, improved asset write tracking to prevent race conditions, and semantic button lookup in design system generators. Heuristics for iframe content extraction were also refined to prevent small overlays from replacing main content. Feedback includes a bug in the manual HTML serializer regarding script/style escaping, a missing doctype when using outerHTML, and a suggestion to use environment variables for iframe length ratios.

gemini-code-assist · 2026-04-27T09:09:14Z

+  function serialize(node) {
+    // TEXT_NODE
+    if (node.nodeType === 3) return escText(node.nodeValue);
+    // COMMENT_NODE
+    if (node.nodeType === 8) return '<!--' + node.nodeValue + '-->';
+    // CDATA_SECTION_NODE — drop, not valid in HTML
+    if (node.nodeType !== 1) return '';
+
+    const tag = node.tagName.toLowerCase();
+
+    // Skip <script type="application/json"> bodies? No — preserve everything.
+    let out = '<' + tag;
+    for (let i = 0; i < node.attributes.length; i++) {
+      const a = node.attributes[i];
+      out += ' ' + a.name + '="' + escAttr(a.value) + '"';
    }
-    Array.from(node.children).forEach(visit);
-  };
-  const clone = (root || document.documentElement).cloneNode(true);
-  visit(clone);
-  return '<!DOCTYPE html>\n' + clone.outerHTML;
+    out += '>';
+
+    if (VOID.has(tag)) return out;
+
+    // Emit shadow root BEFORE children (Declarative Shadow DOM convention)
+    const sr = node.shadowRoot;
+    if (sr) {
+      if (sr.mode === 'open') {
+        out += '<template shadowrootmode="open">';
+        // Walk the actual shadow tree (live, not cloned)
+        for (let i = 0; i < sr.childNodes.length; i++) {
+          out += serialize(sr.childNodes[i]);
+        }
+        out += '</template>';
+      } else {
+        // mode === 'closed' — inaccessible by spec; count and skip silently
+        skippedClosed++;
+      }
+    }
+
+    // Children of light DOM
+    for (let i = 0; i < node.childNodes.length; i++) {
+      out += serialize(node.childNodes[i]);
+    }
+
+    out += '</' + tag + '>';
+    return out;
+  }


The manual HTML serializer in PATCH_D_SHADOW_DOM_HELPERS incorrectly escapes text content within <script> and <style> elements. These are "raw text" elements in HTML, and escaping characters like < and & inside them (e.g., converting `if (a < b)` to `if (a &lt; b)`) will break the scripts and styles of the captured page. The serializer should detect these tags and preserve their text content without escaping.

function serialize(node, parentTag) {
  // TEXT_NODE: keep raw text inside <script>/<style>, escape everywhere else
  if (node.nodeType === 3) {
    return (parentTag === 'script' || parentTag === 'style')
      ? node.nodeValue
      : escText(node.nodeValue);
  }
  // COMMENT_NODE
  if (node.nodeType === 8) return '<!--' + node.nodeValue + '-->';
  // CDATA_SECTION_NODE: drop, not valid in HTML
  if (node.nodeType !== 1) return '';

  const tag = node.tagName.toLowerCase();

  // Skip <script type="application/json"> bodies? No, preserve everything.
  let out = '<' + tag;
  for (let i = 0; i < node.attributes.length; i++) {
    const a = node.attributes[i];
    out += ' ' + a.name + '="' + escAttr(a.value) + '"';
  }
  out += '>';

  if (VOID.has(tag)) return out;

  // Emit shadow root BEFORE children (Declarative Shadow DOM convention)
  const sr = node.shadowRoot;
  if (sr) {
    if (sr.mode === 'open') {
      out += '<template shadowrootmode="open">';
      // Walk the actual shadow tree (live, not cloned)
      for (let i = 0; i < sr.childNodes.length; i++) {
        out += serialize(sr.childNodes[i], tag);
      }
      out += '</template>';
    } else {
      // mode === 'closed': inaccessible by spec; count and skip silently
      skippedClosed++;
    }
  }

  // Children of light DOM
  for (let i = 0; i < node.childNodes.length; i++) {
    out += serialize(node.childNodes[i], tag);
  }

  out += '</' + tag + '>';
  return out;
}
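
The raw-text behaviour behind this comment can be reproduced outside the browser. The sketch below uses Python's standard `html.parser` as a stand-in for a browser HTML parser (an assumption for illustration; it applies the same rule that character references are not decoded inside `<script>` and `<style>`). The `TextCollector` class and `parsed_text` helper are hypothetical names.

```python
from html import escape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text content the parser reports for a fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parsed_text(markup: str) -> str:
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


source = "if (a < b) run();"
escaped = escape(source, quote=False)  # what an escape-everything serializer emits

# In a normal element the entity is decoded again on re-parse ...
in_paragraph = parsed_text(f"<p>{escaped}</p>")
# ... but inside <script> the text is raw, so the entity survives verbatim.
in_script = parsed_text(f"<script>{escaped}</script>")
```

After re-parsing, `in_paragraph` equals the original source, while `in_script` still contains `&lt;`, which is why escaped script bodies stop working in the captured page.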

gemini-code-assist · 2026-04-27T09:09:14Z

+        main_html = (
+            await page.evaluate("() => document.documentElement.outerHTML") or ""
+        )


Capturing main_html using document.documentElement.outerHTML excludes the <!DOCTYPE> declaration. This leads to a regression where the final captured page lacks a doctype if the shadow walker is disabled. Using page.content() instead ensures the full document structure is preserved for both length comparison and final output.

main_html = await page.content()

gemini-code-assist · 2026-04-27T09:09:14Z

                        "() => document.documentElement.outerHTML"
                    )
-                    if len(f_html) > 1000:
+                    if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5:


The length comparison for same-origin iframes uses a hardcoded 0.5 ratio. This should use the KCD_IFRAME_MIN_RATIO environment variable (which defaults to 0.5) to remain consistent with the srcdoc detection logic and allow for centralized configuration.

Suggested change

if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5:

if len(f_html) > 1000 and len(f_html) >= main_html_len * float(os.getenv("KCD_IFRAME_MIN_RATIO", "0.5")):

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a2f6326328


chatgpt-codex-connector · 2026-04-27T09:09:41Z

+    // TEXT_NODE
+    if (node.nodeType === 3) return escText(node.nodeValue);


Preserve raw-text nodes when serializing shadow DOM

The new serializer escapes every text node via escText, including text inside <script> and <style>. In HTML raw-text elements, entities like `&lt;` are not decoded back to `<`, so JavaScript/CSS content is changed and can stop working in captured output (for example, `if (a < b)` becomes `if (a &lt; b)`, which is invalid script text). This regresses fidelity for pages that rely on inline scripts/styles in the shadow-walker path.


chatgpt-codex-connector · 2026-04-27T09:09:41Z

+    // CDATA_SECTION_NODE — drop, not valid in HTML
+    if (node.nodeType !== 1) return '';
+
+    const tag = node.tagName.toLowerCase();


Keep case-sensitive SVG/MathML tag names intact

The serializer forces node.tagName.toLowerCase() for all elements, which mutates foreign-element names (e.g., SVG linearGradient, clipPath) that are case-sensitive. Reparsed snapshots can lose gradients, masks, and icon rendering because those elements no longer match valid tag names. Preserve original casing for non-HTML namespaces instead of lowercasing unconditionally.


code-review-doctor Bot suggested changes Apr 27, 2026


fix(capture): drop redundant f-string prefix (Code Review Doctor)

7f6f39d

gemini-code-assist Bot reviewed Apr 27, 2026


fbmoulin merged commit ea6cf1a into main Apr 27, 2026
5 checks passed

fbmoulin deleted the feat/phase2-structural-fixes branch April 27, 2026 09:09

chatgpt-codex-connector Bot reviewed Apr 27, 2026

fbmoulin merged 2 commits into main from feat/phase2-structural-fixes
