fix: Phase 2 — close 5 P1 structural bugs from audit#4
Conversation
Closes audit P1-A, P1-B, P1-C, P1-D, P1-G. Remaining open: P1-E (asset disk
cap), P1-F (BeautifulSoup rewriting), P1-I (PII strip) — Phase 3.
## P1-A — Patch D shadow walker fixed (was a no-op)
`kratos_clone/capture.py:81-157` — replaced `cloneNode(true)`-then-walk pattern
(which per HTML spec discards all shadow roots) with a recursive walker over
the LIVE DOM that emits Declarative Shadow DOM `<template shadowrootmode="open">`
for each open shadow root. Closed shadow roots are inaccessible by spec —
count surfaced in `manifest.shadow_skipped_closed` so operators see when a
clone is incomplete (e.g., Stripe Elements, some web components).
The walker now returns `{html, skipped_closed_shadow_roots}` instead of just
a string. `_extract_html` unpacks both; `manifest.json` includes the new field.
## P1-B — Asset write race resolved
`kratos_clone/capture.py:_on_response_tracked` — the `page.on("response", ...)`
handler is now a sync wrapper that creates an `asyncio.Task` and tracks it
in `self._pending_writes`. Before `context.close()`, we
`asyncio.gather(*self._pending_writes, return_exceptions=True)` with a 10s
timeout. Late response handlers no longer get aborted mid-byte; manifest
records `asset_write_timeout` if any did exceed the cap.
## P1-C — Generators no longer NexusFlow-only
`scripts/generate_design_system_v{1,2}.py` — replaced hardcoded
`inv["buttons"][2]`, `[3]`, `[7]` indices (which `IndexError`'d on any site
with fewer or differently-ordered buttons) with `find_button_by_classes()`
semantic lookup. Function takes `*required` substrings and returns first
matching button OR a `{classes:"", label:default}` stub. Empty-classes rows
are dropped from `BUTTON_ROLES` so we don't render bald buttons.
10 regression tests in `tests/test_generator_helpers.py` (52 → 62 total),
including the literal "fewer buttons than the old hardcoded indices" case
that used to crash.
## P1-D — Same-origin via urlparse().netloc
`kratos_clone/capture.py:_extract_html` — replaced
`f_url.startswith(self.url) or "srcdoc" in f_url.lower()` (where any URL
containing the substring "srcdoc" passed) with proper
`urlparse(f_url).netloc == urlparse(self.url).netloc or
f_url.startswith("about:srcdoc")`.
## P1-G — Iframe srcdoc no longer unconditional
`kratos_clone/capture.py:_extract_html` — capture main doc length first; only
prefer `iframe[srcdoc]` content when its length is at least
`KCD_IFRAME_MIN_RATIO * len(main_html)` (default 0.5). Cookie-banner srcdoc
(few KB) no longer replaces 100KB of real content. Decision is logged.
`KCD_NO_IFRAME_SRCDOC=true` opt-out for sites where srcdoc detection misfires.
## Tests + lint
$ pytest tests/ -q
............................................................. [100%]
62 passed in 0.52s
$ ruff check + format --check
All checks passed!
Manifest now includes `shadow_skipped_closed: int`. New env vars:
`KCD_NO_IFRAME_SRCDOC` (bool), `KCD_IFRAME_MIN_RATIO` (float, default 0.5).
## Doc updates
- TODO.md: Phase 2 marked done; Phase 3 promoted to "Now"
- docs/AUDIT.md: P1-A through P1-D + P1-G + P1-H marked ✅ RESOLVED with mechanism
- docs/WORKFLOW.md: removed "BROKEN" warning on Patch D, replaced with "Fixed Phase 2"
- CLAUDE.md: known-issues table now lists only the 3 OPEN P1 items
|
Warning Rate limit exceeded
To keep reviews running without waiting, you can enable usage-based add-on for your organization. This allows additional reviews beyond the hourly cap. Account admins can enable it under billing. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (9)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Looks good. Worth considering though. View full project report here.
| if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5: | ||
| self.log( | ||
| f"🔍 Using same-origin iframe content ({len(f_html) // 1024} KB)" | ||
| f"🔍 Using same-origin iframe content " |
There was a problem hiding this comment.
| f"🔍 Using same-origin iframe content " | |
| "🔍 Using same-origin iframe content " |
f-string is unnecessary here. This can just be a string. More info.
There was a problem hiding this comment.
Code Review
This pull request implements Phase 2 structural fixes, addressing several high-priority issues identified in the audit. Key changes include a rewritten live DOM walker for shadow DOM serialization, improved asset write tracking to prevent race conditions, and semantic button lookup in design system generators. Heuristics for iframe content extraction were also refined to prevent small overlays from replacing main content. Feedback includes a bug in the manual HTML serializer regarding script/style escaping, a missing doctype when using outerHTML, and a suggestion to use environment variables for iframe length ratios.
| function serialize(node) { | ||
| // TEXT_NODE | ||
| if (node.nodeType === 3) return escText(node.nodeValue); | ||
| // COMMENT_NODE | ||
| if (node.nodeType === 8) return '<!--' + node.nodeValue + '-->'; | ||
| // CDATA_SECTION_NODE — drop, not valid in HTML | ||
| if (node.nodeType !== 1) return ''; | ||
|
|
||
| const tag = node.tagName.toLowerCase(); | ||
|
|
||
| // Skip <script type="application/json"> bodies? No — preserve everything. | ||
| let out = '<' + tag; | ||
| for (let i = 0; i < node.attributes.length; i++) { | ||
| const a = node.attributes[i]; | ||
| out += ' ' + a.name + '="' + escAttr(a.value) + '"'; | ||
| } | ||
| Array.from(node.children).forEach(visit); | ||
| }; | ||
| const clone = (root || document.documentElement).cloneNode(true); | ||
| visit(clone); | ||
| return '<!DOCTYPE html>\n' + clone.outerHTML; | ||
| out += '>'; | ||
|
|
||
| if (VOID.has(tag)) return out; | ||
|
|
||
| // Emit shadow root BEFORE children (Declarative Shadow DOM convention) | ||
| const sr = node.shadowRoot; | ||
| if (sr) { | ||
| if (sr.mode === 'open') { | ||
| out += '<template shadowrootmode="open">'; | ||
| // Walk the actual shadow tree (live, not cloned) | ||
| for (let i = 0; i < sr.childNodes.length; i++) { | ||
| out += serialize(sr.childNodes[i]); | ||
| } | ||
| out += '</template>'; | ||
| } else { | ||
| // mode === 'closed' — inaccessible by spec; count and skip silently | ||
| skippedClosed++; | ||
| } | ||
| } | ||
|
|
||
| // Children of light DOM | ||
| for (let i = 0; i < node.childNodes.length; i++) { | ||
| out += serialize(node.childNodes[i]); | ||
| } | ||
|
|
||
| out += '</' + tag + '>'; | ||
| return out; | ||
| } |
There was a problem hiding this comment.
The manual HTML serializer in PATCH_D_SHADOW_DOM_HELPERS incorrectly escapes text content within <script> and <style> elements. These are "raw text" elements in HTML, and escaping characters like < and & inside them (e.g., converting if (a < b) to if (a < b)) will break the scripts and styles of the captured page. The serializer should detect these tags and preserve their text content without escaping.
function serialize(node, parentTag) {
// TEXT_NODE
if (node.nodeType === 3) {
return (parentTag === 'script' || parentTag === 'style') ? node.nodeValue : escText(node.nodeValue);
}
// COMMENT_NODE
if (node.nodeType === 8) return '<!--' + node.nodeValue + '-->';
// CDATA_SECTION_NODE — drop, not valid in HTML
if (node.nodeType !== 1) return '';
const tag = node.tagName.toLowerCase();
// Skip <script type="application/json"> bodies? No — preserve everything.
let out = '<' + tag;
for (let i = 0; i < node.attributes.length; i++) {
const a = node.attributes[i];
out += ' ' + a.name + '="' + escAttr(a.value) + '"';
}
out += '>';
if (VOID.has(tag)) return out;
// Emit shadow root BEFORE children (Declarative Shadow DOM convention)
const sr = node.shadowRoot;
if (sr) {
if (sr.mode === 'open') {
out += '<template shadowrootmode="open">';
// Walk the actual shadow tree (live, not cloned)
for (let i = 0; i < sr.childNodes.length; i++) {
out += serialize(sr.childNodes[i], tag);
}
out += '</template>';
} else {
// mode === 'closed' — inaccessible by spec; count and skip silently
skippedClosed++;
}
}
// Children of light DOM
for (let i = 0; i < node.childNodes.length; i++) {
out += serialize(node.childNodes[i], tag);
}
out += '</' + tag + '>';
return out;
}| main_html = ( | ||
| await page.evaluate("() => document.documentElement.outerHTML") or "" | ||
| ) |
There was a problem hiding this comment.
Capturing main_html using document.documentElement.outerHTML excludes the <!DOCTYPE> declaration. This leads to a regression where the final captured page lacks a doctype if the shadow walker is disabled. Using page.content() instead ensures the full document structure is preserved for both length comparison and final output.
main_html = await page.content()| "() => document.documentElement.outerHTML" | ||
| ) | ||
| if len(f_html) > 1000: | ||
| if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5: |
There was a problem hiding this comment.
The length comparison for same-origin iframes uses a hardcoded 0.5 ratio. This should use the KCD_IFRAME_MIN_RATIO environment variable (which defaults to 0.5) to remain consistent with the srcdoc detection logic and allow for centralized configuration.
| if len(f_html) > 1000 and len(f_html) >= main_html_len * 0.5: | |
| if len(f_html) > 1000 and len(f_html) >= main_html_len * float(os.getenv("KCD_IFRAME_MIN_RATIO", "0.5")): |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a2f6326328
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| // TEXT_NODE | ||
| if (node.nodeType === 3) return escText(node.nodeValue); |
There was a problem hiding this comment.
Preserve raw-text nodes when serializing shadow DOM
The new serializer escapes every text node via escText, including text inside <script> and <style>. In HTML raw-text elements, entities like < are not decoded back to <, so JavaScript/CSS content is changed and can stop working in captured output (for example, if (a < b) becomes invalid script text). This regresses fidelity for pages that rely on inline scripts/styles in the shadow-walker path.
Useful? React with 👍 / 👎.
| // CDATA_SECTION_NODE — drop, not valid in HTML | ||
| if (node.nodeType !== 1) return ''; | ||
|
|
||
| const tag = node.tagName.toLowerCase(); |
There was a problem hiding this comment.
Keep case-sensitive SVG/MathML tag names intact
The serializer forces node.tagName.toLowerCase() for all elements, which mutates foreign-element names (e.g., SVG linearGradient, clipPath) that are case-sensitive. Reparsed snapshots can lose gradients, masks, and icon rendering because those elements no longer match valid tag names. Preserve original casing for non-HTML namespaces instead of lowercasing unconditionally.
Useful? React with 👍 / 👎.
Summary
Phase 2 of ROADMAP. Closes 5 of 9 P1 findings from
docs/AUDIT.md. Remaining open (P1-E, P1-F, P1-I) → Phase 3.skipped_closed_shadow_rootsin manifestkratos_clone/capture.py:81-157asyncio.create_tasktracking +gather(10s timeout) beforecontext.close()kratos_clone/capture.py:_on_response_trackedfind_button_by_classes); 10 regression testsscripts/generate_design_system_v{1,2}.pyurlparse().netloccompare +about:srcdocallow-listkratos_clone/capture.py:_extract_htmlKCD_IFRAME_MIN_RATIO(0.5) +KCD_NO_IFRAME_SRCDOCopt-outkratos_clone/capture.py:_extract_htmlNew env vars
KCD_IFRAME_MIN_RATIO(float, default 0.5) — minimum srcdoc/main-doc length ratio to prefer iframe contentKCD_NO_IFRAME_SRCDOC(bool, default false) — disable srcdoc detection entirelyManifest additions
shadow_skipped_closed: int— count of closed shadow roots Patch D could not serializeerrors: ["asset_write_timeout: N pending"]if asset writes exceed 10s gather timeoutTest plan
pytest -q→ 62 passed in 0.52s (was 52, +10 fromtest_generator_helpers.py)ruff check + format --checkcleanimport appnexusflow-saas.aura.buildwith new code; expect same outputs (regression check). Phase 5 will add E2E job to CI.