feat: Phase 5 — Pipeline completion + P2-8 closure #8
Conversation
- `ProbeResult` dataclass with url, status, reachable, framework, csp, headers, error
- `detect_framework`: Next.js / React / Vue / Svelte / Angular / Astro markers in the HTML body or `x-powered-by` header. Order matters (Next before React: Next serves React markers too)
- `summarize_csp`: parses the CSP header into a directive map
- `run_probe`: HEAD then conditional GET; never raises (errors are recorded in the result)
- CLI: `python -m scripts.probe <url> --output probe.json`
- 12 tests; full suite 151 passing
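The ordering note above (Next before React) can be sketched as a first-match-wins marker table; the marker strings here are illustrative assumptions, not the project's actual lists:

```python
from typing import Optional

# First-match-wins marker table: Next.js must precede React because
# Next-rendered pages carry React markers too.
FRAMEWORK_MARKERS = [
    ("Next.js", ["__NEXT_DATA__", "/_next/"]),
    ("React", ["data-reactroot", "react-dom"]),
    ("Vue", ["data-v-app", "__vue__"]),
]

def detect_framework(html: str, powered_by: str = "") -> Optional[str]:
    haystack = (html + " " + powered_by).lower()
    for name, markers in FRAMEWORK_MARKERS:  # order matters: first match wins
        if any(m.lower() in haystack for m in markers):
            return name
    return None
```

A page containing both `__NEXT_DATA__` and `data-reactroot` is classified as Next.js because its entry comes first in the table.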
- `audit_assets`: walk `<capture>/assets/`, return `AssetAudit` (count, total_bytes, paths)
- `inline_small_assets`: rewrite `<tag attr="assets/...">` with `data:` URIs for files under `max_bytes`; external (`http(s)://`) and missing files are left untouched
- CLI: `python -m scripts.post_process <capture_dir> [--inline-max-bytes 5000]`
- 6 tests; full suite 157 passing
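The inlining rule described above (files under max_bytes become `data:` URIs, everything else untouched) can be sketched roughly as follows; the helper name and threshold are illustrative, not the project's actual API:

```python
import base64
import mimetypes
from pathlib import Path
from typing import Optional

def to_data_uri(path: Path, max_bytes: int = 5000) -> Optional[str]:
    """Return a data: URI for a small local file, or None if missing/too big."""
    if not path.is_file() or path.stat().st_size > max_bytes:
        return None  # caller leaves the original assets/... reference untouched
    mime = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    payload = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime};base64,{payload}"
```

Returning `None` for oversized or missing files matches the stated behavior of leaving those references alone.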
- `coverage_scorecard(inventory)`: data-driven 13-row DTCG audit (full / partial / missing + evidence string per category). NOT hardcoded — closes audit P2-8.
- `check_asset_refs`: every `src=`/`href=` pointing at `assets/` must resolve to a file on disk
- `check_placeholders`: regex grep for lorem ipsum, 'your headline here', etc.
- `check_wcag_contrast`: WCAG 2.1 §1.4.3 contrast ratio for inline color/background
- `run_validate`: aggregate report; `passed=True` iff zero findings on all 4 checks
- `ValidationReport` dataclass with `to_json()`
- CLI: `python -m scripts.validate <dir> [--output report.json]` (exit 1 on fail)
- Visual diff via Playwright deferred (keeps the gate headless and CI-friendly)
- 21 tests; full suite 178 passing
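The WCAG 2.1 §1.4.3 check mentioned above rests on the spec's contrast-ratio formula; a minimal sketch (function names are illustrative, the math follows the spec):

```python
def _linear(channel: int) -> float:
    # sRGB channel -> linear-light value, per the WCAG relative-luminance definition
    c = channel / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4

def relative_luminance(rgb: tuple) -> float:
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: tuple, bg: tuple) -> float:
    l1, l2 = relative_luminance(fg), relative_luminance(bg)
    bright, dim = max(l1, l2), min(l1, l2)
    return (bright + 0.05) / (dim + 0.05)  # 1.0 (no contrast) .. 21.0 (black on white)
```

Level AA requires a ratio of at least 4.5:1 for normal text and 3:1 for large text.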
…corecard (closes P2-8)

Previously the 13-row scorecard + 80.8 score were a literal baked to the NexusFlow capture, producing the same numbers for any site (audit P2-8 called this "a tautology"). Now the rows and score come from `coverage_scorecard(inv)`, which judges each category against actual inventory evidence.

Honest consequence: the score on the legacy NexusFlow capture will likely drop from 80.8 to ~30-50, reflecting what `scripts/inventory.py` actually extracts today (only 7/13 categories: headings, paragraphs, buttons, arbitrary_colors, named_color_tokens, containers, v_rhythm). To raise the score, extend `inventory.py` to extract font_families, durations, shadows, gradients, transitions, stroke_styles, borders.

Tests: 178 passing (no regression).
📝 Walkthrough

This pull request ships Phase 5 of the pipeline: three new stage scripts (probe, post_process, validate) and updates scorecard generation to use runtime-driven DTCG category evaluation instead of hardcoded categories. Documentation and tests are updated accordingly; visual diff validation is deferred to a future version.
Sequence Diagram(s)

sequenceDiagram
participant User
participant Probe as probe.py
participant Website as Target Website
participant PostProc as post_process.py
participant Validate as validate.py
participant Report as ValidationReport
User->>Probe: run_probe(url)
Probe->>Website: HEAD + GET requests
Website-->>Probe: headers, HTML content
Probe->>Probe: detect_framework()<br/>summarize_csp()
Probe-->>User: ProbeResult (JSON)
User->>PostProc: inline_small_assets(capture_dir)
PostProc->>PostProc: audit_assets()<br/>scan index.html<br/>base64 encode small files
PostProc-->>User: index.inlined.html
User->>Validate: run_validate(capture_dir)
Validate->>Validate: coverage_scorecard(inventory)<br/>check_asset_refs()<br/>check_placeholders()<br/>check_wcag_contrast()
Validate->>Report: aggregate findings
Report-->>User: ValidationReport (JSON)
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~65 minutes
🚥 Pre-merge checks: 4 passed, 1 failed (1 warning)
Context (`scripts/probe.py`):

log = structlog.get_logger()
...
@dataclass

Suggested change:

-@dataclass
+@dataclass(frozen=True)

Use `frozen=True` to make the dataclasses immutable and hashable.
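The effect of the suggestion is easy to demonstrate in isolation; `ProbeSummary` here is a hypothetical stand-in, not one of the reviewed dataclasses:

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class ProbeSummary:  # hypothetical stand-in for the reviewed dataclasses
    url: str
    status: int
```

Instances now raise `FrozenInstanceError` on attribute assignment, and because `frozen=True` (with the default `eq=True`) generates `__hash__`, equal instances can be deduplicated in sets or used as dict keys.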
Context (`scripts/post_process.py`):

    return out
...
@dataclass

Suggested change:

-@dataclass
+@dataclass(frozen=True)

Use `frozen=True` to make the dataclasses immutable and hashable.
Context (`scripts/validate.py`):

# --- run_validate end-to-end ----------------------------------------------
...
@dataclass

Suggested change:

-@dataclass
+@dataclass(frozen=True)

Use `frozen=True` to make the dataclasses immutable and hashable.
Code Review
This pull request completes Phase 5 of the pipeline by introducing three new scripts: probe.py for site reconnaissance, post_process.py for asset auditing and inlining, and validate.py for quality validation. It resolves audit item P2-8 by replacing hardcoded design system coverage scores with a dynamic, inventory-driven scorecard. Documentation and roadmaps have been updated to reflect these changes. Review feedback highlights an unimplemented function mentioned in a docstring and suggests improving regex patterns in the contrast validation logic to prevent false positives.
Context (docstring):

- ``strip_scroll_fix_cli`` — wrapper around ``kratos_clone.post.strip_scroll_fix``
  for CLI invocation post-capture
The docstring mentions strip_scroll_fix_cli, but this function is not implemented in the script. This appears to be a leftover from the original Phase 5 plan, as the feature was removed from the summary in ROADMAP.md. These lines should be removed to keep the documentation consistent with the implementation.
Context (`scripts/validate.py`):

_INLINE_COLOR_RE = re.compile(r"color\s*:\s*(#[0-9a-fA-F]{3,6})", re.I)
_INLINE_BG_RE = re.compile(r"background(?:-color)?\s*:\s*(#[0-9a-fA-F]{3,6})", re.I)

The regular expressions for color and background properties can produce false positives by matching substrings of other properties (e.g., `color:` matching inside `border-color:`). This can lead to incorrect contrast validation results if the text color is not explicitly defined or appears later in the style attribute. Using a negative lookbehind to ensure the property name is not part of a longer identifier would improve the reliability of the validation gate.

Suggested change:

-_INLINE_COLOR_RE = re.compile(r"color\s*:\s*(#[0-9a-fA-F]{3,6})", re.I)
-_INLINE_BG_RE = re.compile(r"background(?:-color)?\s*:\s*(#[0-9a-fA-F]{3,6})", re.I)
+_INLINE_COLOR_RE = re.compile(r"(?<![\w-])color\s*:\s*(#[0-9a-fA-F]{3,6})", re.I)
+_INLINE_BG_RE = re.compile(r"(?<![\w-])background(?:-color)?\s*:\s*(#[0-9a-fA-F]{3,6})", re.I)
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c0c4a62f43
Context (`scripts/generate_design_system_v2.py`):

# baked to the original NexusFlow capture, producing the same 80.8/100 score
# for any site. validate.coverage_scorecard reads inventory keys and judges
# each of the 13 categories against actual extracted evidence.)
from scripts.validate import coverage_scorecard  # noqa: E402
Keep generator runnable outside package context
This new absolute import makes generate_design_system_v2.py fail in the documented standalone workflow (cp .../generate_design_system_v2.py . && python generate_design_system_v2.py), because there is no scripts package in the capture directory. In that mode the script raises ModuleNotFoundError before producing design-system.html, so the main generation path breaks for users who run the tool as described.
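One hedged way to implement the suggested fallback, assuming `validate.py` is copied alongside the generator in the standalone workflow (the function name is illustrative):

```python
import sys
from pathlib import Path

def import_coverage_scorecard(fallback_dir: Path):
    """Try the package import first; fall back to a sibling validate.py (sketch)."""
    try:
        # Normal case: run from the repo root, where ``scripts`` is a package.
        from scripts.validate import coverage_scorecard
    except ImportError:
        # Standalone case: the script was copied next to validate.py.
        if str(fallback_dir) not in sys.path:
            sys.path.insert(0, str(fallback_dir))
        from validate import coverage_scorecard
    return coverage_scorecard
```

In the real script, `fallback_dir` would be `Path(__file__).resolve().parent` so the copied capture directory is searched.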
Context (`scripts/probe.py`):

return ProbeResult(
    url=url,
    status=head.status_code,
    reachable=True,
Treat failed GET responses as unreachable
After a successful HEAD, the code unconditionally marks the probe as reachable and reports the HEAD status, even if the GET request returns an HTTP error (for example, HEAD 200 but GET 403/503). That causes false-positive probe results and can send blocked/broken targets further down the pipeline as if they were valid captures.
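The fix can be reduced to a small decision function; a sketch (the project's actual `ProbeResult` wiring is omitted) in which a probe counts as reachable only when the GET also succeeds:

```python
from typing import Optional, Tuple

def assess_reachability(head_status: int, get_status: Optional[int]) -> Tuple[bool, int]:
    """Return (reachable, reported_status); get_status=None means the GET raised."""
    if get_status is None:
        return False, head_status      # GET failed outright; keep HEAD status for the record
    if get_status >= 400:
        return False, get_status       # HEAD 200 but GET 403/503: blocked/broken target
    return True, get_status
```

This keeps a HEAD-200/GET-403 target out of the downstream pipeline instead of treating it as a valid capture.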
Context (`scripts/validate.py`):

    return (bright + 0.05) / (dim + 0.05)
...
_INLINE_COLOR_RE = re.compile(r"color\s*:\s*(#[0-9a-fA-F]{3,6})", re.I)
Prevent color regex from matching background-color
The foreground regex also matches the color: suffix inside background-color, so elements that only define background-color are incorrectly treated as having both foreground and background colors. This can manufacture contrast failures (often 1.0 ratio) and incorrectly fail the validation gate on otherwise valid captures.
Actionable comments posted: 6
🧹 Nitpick comments (6)
tests/test_probe.py (1)
61-92: Minor: `monkeypatch` parameter is unused.
`test_success_path` and `test_4xx_marks_unreachable_but_records_status` accept `monkeypatch` but never use it — the session is injected directly. Drop the parameter to keep the fixture surface honest.

Proposed change:
-def test_success_path(self, monkeypatch):
+def test_success_path(self):
 ...
-def test_4xx_marks_unreachable_but_records_status(self, monkeypatch):
+def test_4xx_marks_unreachable_but_records_status(self):

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_probe.py` around lines 61 - 92, The tests test_success_path and test_4xx_marks_unreachable_but_records_status declare an unused monkeypatch parameter; remove the unused fixture parameter from both function signatures (i.e., update def test_success_path(self, monkeypatch): -> def test_success_path(self): and def test_4xx_marks_unreachable_but_records_status(self, monkeypatch): -> def test_4xx_marks_unreachable_but_records_status(self):) so the test signatures reflect the actual fixtures used and avoid unused-parameter warnings.

scripts/validate.py (2)
140-158: Minor: duplicates `assets/`-ref walking logic from `scripts/post_process.py`.
`_iter_asset_refs` in `scripts/post_process.py` (lines 56–64) does almost exactly this iteration. Consider extracting it to a small shared helper (e.g., `scripts/_html_assets.py`) so the two stages stay in lockstep when the prefix rules evolve.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/validate.py` around lines 140 - 158, The HTML asset-ref iteration in check_asset_refs duplicates logic already implemented in _iter_asset_refs (scripts/post_process.py); refactor by extracting that shared iteration into a small helper (e.g., scripts/_html_assets.py) and have check_asset_refs call the new helper instead of duplicating the loop; update check_asset_refs to accept or construct the BeautifulSoup or Path input needed, remove its local for attr in ("src","href") loop, and import/ reuse _iter_asset_refs (or the new name) so both check_asset_refs and the existing post-processing function use the single canonical implementation.
79-134: Optional: replace the long if-elif chain with a dispatch table.
`_judge_category` is 13 mutually-exclusive branches that all share the "compute count → apply thresholds → format evidence" shape. A dispatch dict (or per-category callables) would make adding a new DTCG category a one-line edit and cut the cyclomatic complexity here. Not a blocker — current code is correct and easy to follow.

🤖 Prompt for AI Agents
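A minimal sketch of the dispatch-table shape; the handler logic and thresholds here are illustrative, not the project's actual rules:

```python
from typing import Callable, Dict, Tuple

Judgment = Tuple[str, str]  # (status, evidence)

def _judge_color(inv: dict) -> Judgment:
    n = len(inv.get("arbitrary_colors", [])) + len(inv.get("named_color_tokens", []))
    if n >= 2:
        return "full", f"{n} hex values"
    return ("partial", "1 hex value") if n == 1 else ("missing", "no colors extracted")

def _judge_shadow(inv: dict) -> Judgment:
    n = len(inv.get("shadows", []))
    return ("full", f"{n} shadows") if n else ("missing", "no shadows extracted")

CATEGORY_HANDLERS: Dict[str, Callable[[dict], Judgment]] = {
    "color": _judge_color,
    "shadow": _judge_shadow,
    # ...one entry per DTCG category
}

def judge_category(cat: str, inv: dict) -> Judgment:
    handler = CATEGORY_HANDLERS.get(cat)
    return handler(inv) if handler else ("missing", "unknown category")
```

Adding a category becomes one dict entry plus one small handler, and each handler can be unit-tested in isolation.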
Verify each finding against the current code and only fix it if needed. In `@scripts/validate.py` around lines 79 - 134, The _judge_category function is a long if/elif chain; replace it with a dispatch table mapping category names to small handler callables to reduce cyclomatic complexity and make adding categories one-line. Create a dict (e.g., CATEGORY_HANDLERS) whose keys are the category strings and values are functions/lambdas that accept inv and return the same (status, evidence) tuple used now; move per-category logic for "color", "dimension", "fontFamily", "fontWeight", "duration", "cubicBezier", "number", "typography", "shadow", "gradient", "transition", "strokeStyle", and "border" into those handlers, then change _judge_category(cat, inv) to lookup CATEGORY_HANDLERS.get(cat) and call it (or return ("missing","unknown category") when absent). Ensure each handler preserves the existing threshold logic and evidence formatting used by the current branches.

scripts/post_process.py (1)
47-52: Minor: align event names with `snake_case_verb_noun` convention.
The structured-log event names emit "asset_audit" and "inline_small_assets" (noun_noun). The guideline asks for verb_noun for greppability — e.g., `audit_assets_complete` / `inlined_small_assets`. As per coding guidelines: "Use snake_case_verb_noun for event names in log messages for greppability".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/post_process.py` around lines 47 - 52, Update the structured log event names emitted via the log.info calls that currently use noun_noun (e.g., "asset_audit" and "inline_small_assets") to follow the snake_case_verb_noun convention; replace "asset_audit" with a verb_noun form such as "audit_assets_complete" in the log.info call that passes capture_dir/count/total, and change "inline_small_assets" (the similar log.info block around lines ~95-100) to a verb_noun form such as "inlined_small_assets" so all event names follow the guideline.tests/test_validate.py (1)
60-68: Tighten `test_partial_typography` — accepting "full" defeats the test.
h1only (1 heading level, well below thelevels >= 3threshold) plus 2 paragraphs,_judge_category("typography", …)will deterministically return"partial". Allowing"full"in the assertion means a regression that flips this branch tofullwould slip through silently — exactly the bug a "partial" test should catch.🧪 Proposed fix
def test_partial_typography(self): # Only h1 + paragraphs present → partial inv = { "headings": {"h1": ["Hello"], "h2": [], "h3": [], "h4": []}, "paragraphs": ["a", "b"], } score = coverage_scorecard(inv) typ = next(s for s in score if s["category"] == "typography") - assert typ["status"] in ("partial", "full") + assert typ["status"] == "partial"Same nit applies to
test_full_when_color_evidence_present(line 31):"2" in color["evidence"]is so loose it would pass on"a 2-hex string"or any digit. Keep just the"2 hex"substring check to actually exercise the evidence formatting.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_validate.py` around lines 60 - 68, The test test_partial_typography currently allows "full" which defeats the assertion; tighten it by asserting typ["status"] == "partial" to reflect that coverage_scorecard with headings {"h1": ["Hello"], ...} and two paragraphs should deterministically yield "partial" (see _judge_category handling of "typography"); also adjust the related color evidence check in test_full_when_color_evidence_present to look for the exact substring "2 hex" in color["evidence"] rather than the loose presence of the character "2" so the test actually verifies the formatted evidence output.scripts/generate_design_system_v2.py (1)
224-230: Tautological conditional — remove the no-op ternary.
{status if status != "full" else "full"}always evaluates tostatus. Looks like a leftover from an earlier refactor; it just adds noise.♻️ Proposed cleanup
row.append( BeautifulSoup( f'<div class="cat">{cat}</div>' - f'<div class="status {status if status != "full" else "full"}">' + f'<div class="status {status}">' f"{'✓' if status == 'full' else '◐' if status == 'partial' else '✗'} {status}</div>" f'<div class="evidence">{evidence}</div>', "html.parser", ) )As a minor side note, the icon mapping (
✓ / ◐ / ✗) silently degrades to✗for any unknown future status returned bycoverage_scorecard. That’s acceptable today (only three statuses are possible), just worth keeping in mind if_judge_categoryever grows new return values.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@scripts/generate_design_system_v2.py` around lines 224 - 230, The ternary in the status class f-string is a no-op — replace the expression f'<div class="status {status if status != "full" else "full"}">' with a direct insertion of the status variable (use {status}) so the class becomes simply status plus the status value; update the same f-string that builds the HTML (the lines constructing cat, status and evidence) to remove the unnecessary conditional and leave the icon logic unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/AUDIT.md`:
- Line 57: The P2-8 table row contains unescaped pipe characters in the text
"(full|partial|missing + evidence string)" which breaks the Markdown table;
update that cell in the P2-8 row (the row mentioning
`scripts/validate.py:coverage_scorecard` and the phrase "(full|partial|missing +
evidence string)") to escape each pipe as `\|` or replace them with a non-pipe
delimiter (e.g., "/" or "or") so the row stays a single cell and the table
column count matches the header.
In `@scripts/generate_design_system_v2.py`:
- Line 214: The import of coverage_scorecard in generate_design_system_v2.py
fails when the script is run from the capture directory (where scripts/ is not
on sys.path); update the top of generate_design_system_v2.py to robustly import
coverage_scorecard by first trying the normal package import (from
scripts.validate import coverage_scorecard) and, on ImportError, adding a small
fallback that appends the repository root or the parent directory containing the
scripts package to sys.path (or attempts a local import like from validate
import coverage_scorecard) so the subsequent call coverage_scorecard(inv) works
whether the file is executed from repo root or copied into the capture
directory; keep the change minimal and clearly reference coverage_scorecard and
generate_design_system_v2.py so reviewers can find it.
In `@scripts/post_process.py`:
- Around line 121-133: Replace the three print() calls with structured logging
using the existing structlog logger (named log) and mirror to stderr if CLI
human output is still desired: use log.info for the audit summary ("assets:
count..."), log.error when src not found, and log.info for the "wrote ...
(rewrote ... refs)" message after inline_small_assets; locate the prints around
the audit handling and the inline_small_assets call (references: audit, args,
src, out, inline_small_assets) and emit the same text as the current print
messages via log.*, optionally keeping the current print(...) to stderr in
addition only if a human-facing message is required.
In `@scripts/probe.py`:
- Around line 162-168: Replace direct print calls in the args.output branch so
they use the project structlog logger instead: when args.output == "-" either
emit payload via logger.info(...) (or if a stdout JSON contract must be
preserved, keep printing payload to stdout but log side-channel events via
logger.info), and when writing to file after
Path(args.output).write_text(payload, encoding="utf-8") replace print(f"wrote
{args.output}", file=sys.stderr) with logger.info("wrote output",
path=args.output) (use an existing logger or obtain one with
structlog.get_logger() named logger).
- Around line 99-120: The probe currently returns unreachable immediately on any
HEAD status >=400, causing false negatives for sites that disallow HEAD; change
the logic around the sess.head handling so that on head.status_code >= 400 you
do not early-return but instead attempt a GET (using the same sess, url,
timeout) and use the GET response to determine reachability/status/headers/CSP;
if GET raises an exception, return a ProbeResult with reachable=False and
include the GET exception details (fall back to HEAD info only if GET fails),
and update the test test_4xx_marks_unreachable_but_records_status to expect the
new GET fallback behavior; look for the code around sess.head, ProbeResult, and
summarize_csp to implement this flow.
In `@scripts/validate.py`:
- Around line 163-170: The current _PLACEHOLDER_PATTERNS tuple includes a broad
pattern re.compile(r"\b(foo|bar|baz)\b", re.I) that produces false positives;
update _PLACEHOLDER_PATTERNS to either remove that single-token pattern or
replace it with stricter multi-token patterns (e.g., only match canonical
metasyntactic sequences like "foo bar" or "bar baz") so it no longer flags
common words like "bar"; edit the tuple where _PLACEHOLDER_PATTERNS is defined
and replace the offending re.compile entry accordingly.
---
Nitpick comments:
In `@scripts/generate_design_system_v2.py`:
- Around line 224-230: The ternary in the status class f-string is a no-op —
replace the expression f'<div class="status {status if status != "full" else
"full"}">' with a direct insertion of the status variable (use {status}) so the
class becomes simply status plus the status value; update the same f-string that
builds the HTML (the lines constructing cat, status and evidence) to remove the
unnecessary conditional and leave the icon logic unchanged.
In `@scripts/post_process.py`:
- Around line 47-52: Update the structured log event names emitted via the
log.info calls that currently use noun_noun (e.g., "asset_audit" and
"inline_small_assets") to follow the snake_case_verb_noun convention; replace
"asset_audit" with a verb_noun form such as "audit_assets_complete" in the
log.info call that passes capture_dir/count/total, and change
"inline_small_assets" (the similar log.info block around lines ~95-100) to a
verb_noun form such as "inlined_small_assets" so all event names follow the
guideline.
In `@scripts/validate.py`:
- Around line 140-158: The HTML asset-ref iteration in check_asset_refs
duplicates logic already implemented in _iter_asset_refs
(scripts/post_process.py); refactor by extracting that shared iteration into a
small helper (e.g., scripts/_html_assets.py) and have check_asset_refs call the
new helper instead of duplicating the loop; update check_asset_refs to accept or
construct the BeautifulSoup or Path input needed, remove its local for attr in
("src","href") loop, and import/ reuse _iter_asset_refs (or the new name) so
both check_asset_refs and the existing post-processing function use the single
canonical implementation.
- Around line 79-134: The _judge_category function is a long if/elif chain;
replace it with a dispatch table mapping category names to small handler
callables to reduce cyclomatic complexity and make adding categories one-line.
Create a dict (e.g., CATEGORY_HANDLERS) whose keys are the category strings and
values are functions/lambdas that accept inv and return the same (status,
evidence) tuple used now; move per-category logic for "color", "dimension",
"fontFamily", "fontWeight", "duration", "cubicBezier", "number", "typography",
"shadow", "gradient", "transition", "strokeStyle", and "border" into those
handlers, then change _judge_category(cat, inv) to lookup
CATEGORY_HANDLERS.get(cat) and call it (or return ("missing","unknown category")
when absent). Ensure each handler preserves the existing threshold logic and
evidence formatting used by the current branches.
In `@tests/test_probe.py`:
- Around line 61-92: The tests test_success_path and
test_4xx_marks_unreachable_but_records_status declare an unused monkeypatch
parameter; remove the unused fixture parameter from both function signatures
(i.e., update def test_success_path(self, monkeypatch): -> def
test_success_path(self): and def
test_4xx_marks_unreachable_but_records_status(self, monkeypatch): -> def
test_4xx_marks_unreachable_but_records_status(self):) so the test signatures
reflect the actual fixtures used and avoid unused-parameter warnings.
In `@tests/test_validate.py`:
- Around line 60-68: The test test_partial_typography currently allows "full"
which defeats the assertion; tighten it by asserting typ["status"] == "partial"
to reflect that coverage_scorecard with headings {"h1": ["Hello"], ...} and two
paragraphs should deterministically yield "partial" (see _judge_category
handling of "typography"); also adjust the related color evidence check in
test_full_when_color_evidence_present to look for the exact substring "2 hex" in
color["evidence"] rather than the loose presence of the character "2" so the
test actually verifies the formatted evidence output.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 85710bd8-8795-4fee-87d7-5ea1931bcd8c
📒 Files selected for processing (11)
- ROADMAP.md
- TODO.md
- docs/AUDIT.md
- scripts/__init__.py
- scripts/generate_design_system_v2.py
- scripts/post_process.py
- scripts/probe.py
- scripts/validate.py
- tests/test_post_process.py
- tests/test_probe.py
- tests/test_validate.py
| P2-6 | `pyproject.toml:gunicorn>=21.2.0` | **CVE-2024-1135** (HTTP request smuggling) fixed in 22.0.0. Bump floor to `gunicorn>=22.0.0`. |
| P2-7 | `app.py:80,188` | **Module-level side effects** (`cleanup_downloads_folder()` + `threading.Thread.start()`) make app untestable. Fix: extract to `create_app(start_janitor=True)` factory. |
| P2-8 | `scripts/generate_design_system_v2.py:282-285` | **"80.8/100 DTCG coverage" is a tautology** — score computed from hardcoded `DTCG_CATEGORIES` list. Same number for any site. Fix: drive scorecard from `_inventory.json` actual contents OR reframe as "audit checklist", not "coverage score". |
| ~~P2-8~~ | `scripts/validate.py:coverage_scorecard` + `scripts/generate_design_system_v2.py` | ✅ **RESOLVED 2026-04-27** — Phase 5. `coverage_scorecard(inventory)` now judges each of the 13 W3C DTCG categories against inventory evidence (full|partial|missing + evidence string). The literal `DTCG_CATEGORIES` list is gone; both the rendered scorecard and the total score are computed at runtime. Honest consequence: legacy NexusFlow capture's score drops from 80.8 (tautology) to a genuine number reflecting what inventory.py actually extracts. |
Unescaped pipes in the P2-8 row break the table column count.
The description contains (full|partial|missing + evidence string). Those literal | characters are interpreted as column separators by Markdown, expanding this row to 5 cells against the table's 3-column header — which is exactly what markdownlint MD056 is reporting. Escape them as \| (or replace with /) to render correctly.
Proposed fix
-| ~~P2-8~~ | `scripts/validate.py:coverage_scorecard` + `scripts/generate_design_system_v2.py` | ✅ **RESOLVED 2026-04-27** — Phase 5. `coverage_scorecard(inventory)` now judges each of the 13 W3C DTCG categories against inventory evidence (full|partial|missing + evidence string). The literal `DTCG_CATEGORIES` list is gone; both the rendered scorecard and the total score are computed at runtime. Honest consequence: legacy NexusFlow capture's score drops from 80.8 (tautology) to a genuine number reflecting what inventory.py actually extracts. |
+| ~~P2-8~~ | `scripts/validate.py:coverage_scorecard` + `scripts/generate_design_system_v2.py` | ✅ **RESOLVED 2026-04-27** — Phase 5. `coverage_scorecard(inventory)` now judges each of the 13 W3C DTCG categories against inventory evidence (full / partial / missing + evidence string). The literal `DTCG_CATEGORIES` list is gone; both the rendered scorecard and the total score are computed at runtime. Honest consequence: legacy NexusFlow capture's score drops from 80.8 (tautology) to a genuine number reflecting what inventory.py actually extracts. |

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-| ~~P2-8~~ | `scripts/validate.py:coverage_scorecard` + `scripts/generate_design_system_v2.py` | ✅ **RESOLVED 2026-04-27** — Phase 5. `coverage_scorecard(inventory)` now judges each of the 13 W3C DTCG categories against inventory evidence (full|partial|missing + evidence string). The literal `DTCG_CATEGORIES` list is gone; both the rendered scorecard and the total score are computed at runtime. Honest consequence: legacy NexusFlow capture's score drops from 80.8 (tautology) to a genuine number reflecting what inventory.py actually extracts. |
+| ~~P2-8~~ | `scripts/validate.py:coverage_scorecard` + `scripts/generate_design_system_v2.py` | ✅ **RESOLVED 2026-04-27** — Phase 5. `coverage_scorecard(inventory)` now judges each of the 13 W3C DTCG categories against inventory evidence (full / partial / missing + evidence string). The literal `DTCG_CATEGORIES` list is gone; both the rendered scorecard and the total score are computed at runtime. Honest consequence: legacy NexusFlow capture's score drops from 80.8 (tautology) to a genuine number reflecting what inventory.py actually extracts. |
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 57-57: Table column count
Expected: 3; Actual: 5; Too many cells, extra data will be missing
(MD056, table-column-count)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/AUDIT.md` at line 57, The P2-8 table row contains unescaped pipe
characters in the text "(full|partial|missing + evidence string)" which breaks
the Markdown table; update that cell in the P2-8 row (the row mentioning
`scripts/validate.py:coverage_scorecard` and the phrase "(full|partial|missing +
evidence string)") to escape each pipe as `\|` or replace them with a non-pipe
delimiter (e.g., "/" or "or") so the row stays a single cell and the table
column count matches the header.
Context (`scripts/generate_design_system_v2.py`):

# baked to the original NexusFlow capture, producing the same 80.8/100 score
# for any site. validate.coverage_scorecard reads inventory keys and judges
# each of the 13 categories against actual extracted evidence.)
from scripts.validate import coverage_scorecard  # noqa: E402
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# 1) Find every place this generator is invoked.
rg -nP -C2 'generate_design_system_v2(\.py)?\b'
# 2) Check whether scripts/ is a proper package (so `python -m scripts....` works).
fd -a '__init__.py' scripts 2>/dev/null || true
# 3) Confirm the README / Makefile invocation pattern.
fd -a -i 'makefile|readme.md|^run.*\.sh$'

Repository: fbmoulin/kratos-clone
Length of output: 8095
🏁 Script executed:
# Check if validate.py is meant to be copied alongside (mentioned in README/CLAUDE)
rg -n "validate\.py" README.md CLAUDE.md docs/
# Also check if there's any conditional import logic or sys.path manipulation in the script
head -50 scripts/generate_design_system_v2.py | grep -E "sys\.path|import|from"

Repository: fbmoulin/kratos-clone
Length of output: 1614
🏁 Script executed:
# Get context around line 214 where the import is
sed -n '210,220p' scripts/generate_design_system_v2.py
# Check if coverage_scorecard is actually used in the script
rg -n "coverage_scorecard" scripts/generate_design_system_v2.py

Repository: fbmoulin/kratos-clone
Length of output: 771
Documented invocation pattern (`cd capture && python generate_design_system_v2.py`) breaks the import.
The README and CLAUDE.md show copying `generate_design_system_v2.py` and `inventory.py` into the capture directory, then running the script from there. With that invocation, `from scripts.validate import coverage_scorecard` will fail because `scripts/` is not accessible from the capture directory.
The import is active (line 216 calls `coverage_scorecard(inv)`), so this is not a dead-code issue. To fix: either update the docs to invoke via `python -m scripts.generate_design_system_v2` from the repo root, or add a fallback import (try/except) that allows the script to run when copied into an isolated directory.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/generate_design_system_v2.py` at line 214, the import of `coverage_scorecard` fails when the script is run from the capture directory, where `scripts/` is not on `sys.path`. Update the top of `generate_design_system_v2.py` to import robustly: first try the normal package import (`from scripts.validate import coverage_scorecard`) and, on `ImportError`, fall back by appending the repository root (or whichever parent directory contains the `scripts` package) to `sys.path`, or by attempting a local import (`from validate import coverage_scorecard`). That way the subsequent `coverage_scorecard(inv)` call works whether the file is executed from the repo root or copied into the capture directory. Keep the change minimal and clearly reference `coverage_scorecard` and `generate_design_system_v2.py` so reviewers can find it.
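The try-chain the prompt describes can be factored into a small helper. This is a hedged sketch, not project code: `import_first` is a hypothetical name, and the candidate module list is assumed from the review above.

```python
import importlib


def import_first(*candidates):
    """Return the first attribute that resolves from a list of
    (module_name, attribute_name) pairs, skipping modules that
    fail to import. Raises ImportError if none resolve."""
    for modname, attr in candidates:
        try:
            return getattr(importlib.import_module(modname), attr)
        except ImportError:
            continue
    raise ImportError(f"none of {candidates!r} could be imported")


# generate_design_system_v2.py could then resolve its dependency like:
# coverage_scorecard = import_first(
#     ("scripts.validate", "coverage_scorecard"),  # run from the repo root
#     ("validate", "coverage_scorecard"),          # copied into a capture dir
# )

# Self-contained demonstration using a stdlib module as the fallback:
fn = import_first(("no_such_module_xyz", "nothing"), ("json", "dumps"))
print(fn.__name__)  # dumps
```

A `sys.path.insert(...)` last resort (as the prompt suggests) could be appended after the loop if copying the whole `scripts` package alongside is also a supported layout.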
```python
print(
    f"assets: count={audit.count} total_bytes={audit.total_bytes}",
    file=sys.stderr,
)
if args.no_inline or audit.count == 0:
    return 0
src = args.capture_dir / "index.html"
if not src.exists():
    print(f"error: {src} not found", file=sys.stderr)
    return 2
out = args.capture_dir / "index.inlined.html"
n = inline_small_assets(src, out, max_bytes=args.inline_max_bytes)
print(f"wrote {out} (rewrote {n} refs)", file=sys.stderr)
```
🛠️ Refactor suggestion | 🟠 Major
Replace `print()` with structlog calls per project convention.
The coding guideline for `**/*.py` requires `logger.info`/`.warning`/`.error` from structlog and disallows `print()`. The audit/wrote messages here should be emitted via the existing `log` logger (you already use it for `inline_small_assets`). If you need to keep human-readable stderr output for the CLI, do it in addition to (not in place of) structured logging, or at minimum mirror the logged event.
Proposed change

```diff
     audit = audit_assets(args.capture_dir)
-    print(
-        f"assets: count={audit.count} total_bytes={audit.total_bytes}",
-        file=sys.stderr,
-    )
+    log.info(
+        "post_process_audit",
+        count=audit.count,
+        total_bytes=audit.total_bytes,
+    )
     if args.no_inline or audit.count == 0:
         return 0
     src = args.capture_dir / "index.html"
     if not src.exists():
-        print(f"error: {src} not found", file=sys.stderr)
+        log.error("post_process_index_missing", path=str(src))
         return 2
     out = args.capture_dir / "index.inlined.html"
     n = inline_small_assets(src, out, max_bytes=args.inline_max_bytes)
-    print(f"wrote {out} (rewrote {n} refs)", file=sys.stderr)
+    log.info("post_process_inlined", out=str(out), rewritten=n)
     return 0
```

As per coding guidelines: "Always use logger.info(), .warning(), .error() from structlog — never print()".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/post_process.py` around lines 121 - 133, replace the three `print()` calls with structured logging via the existing structlog logger (named `log`): use `log.info` for the audit summary ("assets: count..."), `log.error` when `src` is not found, and `log.info` for the "wrote ... (rewrote ... refs)" message after `inline_small_assets`. Locate the prints around the audit handling and the `inline_small_assets` call (references: `audit`, `args`, `src`, `out`, `inline_small_assets`) and emit the same information via `log.*`, optionally keeping the current `print(...)` to stderr in addition only if a human-facing message is required.
```python
try:
    head = sess.head(url, allow_redirects=True, timeout=timeout)
except Exception as exc:
    log.warning("probe_head_failed", url=url, error=str(exc))
    return ProbeResult(
        url=url,
        status=None,
        reachable=False,
        framework="unknown",
        error=f"{type(exc).__name__}: {exc}",
    )

headers = {k.lower(): v for k, v in head.headers.items()}
if head.status_code >= 400:
    return ProbeResult(
        url=url,
        status=head.status_code,
        reachable=False,
        framework="unknown",
        csp=summarize_csp(headers),
        headers=headers,
    )
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, let's examine the scripts/probe.py file to understand the full context
fd probe.py
```

🏁 Script executed:

```shell
# Get the full run_probe function and surrounding context
rg -A 50 "def run_probe" scripts/probe.py
```

🏁 Script executed:

```shell
# Search for test files related to probe functionality
fd test.*probe
```

🏁 Script executed:

```shell
# Check if there are any tests for 4xx HEAD handling
rg "test_4xx_marks_unreachable" .
```

🏁 Script executed:

```shell
# Let's examine the full test file
cat -n tests/test_probe.py
```

🏁 Script executed:

```shell
# Get the full implementation of the test that was mentioned in the review
rg -A 20 "def test_4xx_marks_unreachable_but_records_status" tests/test_probe.py
```

🏁 Script executed:

```shell
# Check the full run_probe function to see if there's a fallback to GET on 4xx
rg -A 60 "def run_probe" scripts/probe.py | head -80
```
HEAD-only failure mode misses sites that disallow HEAD.
Many real-world targets (some CDNs, WAFs, and API gateways) reject HEAD with 403/405/501 while GET succeeds. The current implementation returns `reachable=False` immediately when the HEAD status is >= 400, without attempting a GET. This causes false negatives for sites that only allow certain HTTP methods.
Consider falling through to GET on 4xx/5xx HEAD responses, especially 405 (Method Not Allowed) and 5xx errors. Note that `test_4xx_marks_unreachable_but_records_status` explicitly validates the current early-return behavior and will need updating.
Suggested approach

```diff
     headers = {k.lower(): v for k, v in head.headers.items()}
-    if head.status_code >= 400:
+    head_blocked = head.status_code in (403, 405, 501) or head.status_code >= 500
+    if head.status_code >= 400 and not head_blocked:
         return ProbeResult(
             url=url,
             status=head.status_code,
             reachable=False,
             framework="unknown",
             csp=summarize_csp(headers),
             headers=headers,
         )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/probe.py` around lines 99 - 120, the probe currently returns unreachable immediately on any HEAD status >= 400, causing false negatives for sites that disallow HEAD. Change the logic around the `sess.head` handling so that on `head.status_code >= 400` you do not early-return but instead attempt a GET (using the same `sess`, `url`, `timeout`) and use the GET response to determine reachability, status, headers, and CSP. If GET raises an exception, return a `ProbeResult` with `reachable=False` that includes the GET exception details (falling back to HEAD info only when GET fails), and update `test_4xx_marks_unreachable_but_records_status` to expect the new GET-fallback behavior. Look for the code around `sess.head`, `ProbeResult`, and `summarize_csp` to implement this flow.
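The control flow being asked for can be sketched with plain callables standing in for `sess.head`/`sess.get`. Both `should_retry_with_get` and `probe_status` are hypothetical helper names, not the project's API; the status-code set mirrors the suggested approach above.

```python
def should_retry_with_get(head_status):
    """Decide whether a failed HEAD warrants a GET retry.

    403/405/501 usually mean "HEAD not allowed" rather than
    "resource unreachable"; a 5xx from a WAF or gateway can
    also be method-specific."""
    return head_status in (403, 405, 501) or head_status >= 500


def probe_status(url, head, get):
    """Sketch of HEAD-then-GET: `head` and `get` are callables
    returning a status code, standing in for sess.head / sess.get."""
    status = head(url)
    if status >= 400 and should_retry_with_get(status):
        status = get(url)  # fall back: some servers only allow GET
    return status


# A 405 on HEAD is retried with GET; a plain 404 is trusted as-is.
print(probe_status("https://example.com", head=lambda u: 405, get=lambda u: 200))  # 200
print(probe_status("https://example.com", head=lambda u: 404, get=lambda u: 200))  # 404
```

Keeping the retry predicate separate from the request flow makes the status-code policy unit-testable without any network I/O, which matters for the test update the prompt calls out.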
```python
if args.output == "-":
    print(payload)
else:
    from pathlib import Path

    Path(args.output).write_text(payload, encoding="utf-8")
    print(f"wrote {args.output}", file=sys.stderr)
```
🛠️ Refactor suggestion | 🟠 Major
Replace `print()` with structlog per project convention.
Same finding as `scripts/post_process.py`: emit the JSON payload and the `wrote <path>` confirmation through the structlog logger (or, if you must keep a stdout JSON contract, route only the JSON to stdout and log the side-channel events via `log.info`).
As per coding guidelines: "Always use logger.info(), .warning(), .error() from structlog — never print()".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/probe.py` around lines 162 - 168, replace the direct `print` calls in the `args.output` branch with the project structlog logger. When `args.output == "-"`, either emit the payload via `logger.info(...)` or, if a stdout JSON contract must be preserved, keep printing the payload to stdout but log side-channel events via `logger.info`. When writing to a file, after `Path(args.output).write_text(payload, encoding="utf-8")`, replace `print(f"wrote {args.output}", file=sys.stderr)` with `logger.info("wrote output", path=args.output)` (use an existing logger or obtain one with `structlog.get_logger()`).
```python
_PLACEHOLDER_PATTERNS = (
    re.compile(r"\blorem\s+ipsum\b", re.I),
    re.compile(r"\byour\s+(headline|copy|text|tagline)\s+here\b", re.I),
    re.compile(r"\bplaceholder\b", re.I),
    re.compile(r"\btodo[: ]", re.I),
    re.compile(r"\bfixme\b", re.I),
    re.compile(r"\b(foo|bar|baz)\b", re.I),
)
```
`\b(foo|bar|baz)\b` will produce false positives on real sites.
"bar" is a common English word ("snack bar", "navigation bar", "menu bar", "side bar"), and "foo" appears in some product/brand names. Running this regex against rendered HTML text will almost certainly flag legitimate content as a placeholder. Either drop this pattern or restrict it to multi-token contexts ("foo bar", "bar baz") so it only fires on the canonical metasyntactic sequence.
Proposed tightening

```diff
-    re.compile(r"\b(foo|bar|baz)\b", re.I),
+    # Match the metasyntactic sequence, not standalone "bar" which appears
+    # in normal English text (navigation bar, snack bar, etc.).
+    re.compile(r"\bfoo\s+bar(?:\s+baz)?\b", re.I),
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@scripts/validate.py` around lines 163 - 170, the `_PLACEHOLDER_PATTERNS` tuple includes a broad pattern, `re.compile(r"\b(foo|bar|baz)\b", re.I)`, that produces false positives. Either remove that single-token pattern or replace it with stricter multi-token patterns (e.g., only match canonical metasyntactic sequences like "foo bar" or "bar baz") so it no longer flags common words like "bar". Edit the tuple where `_PLACEHOLDER_PATTERNS` is defined and replace the offending `re.compile` entry accordingly.
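The difference between the loose and tightened patterns is easy to check directly. A small demonstration (the sample strings are made up for illustration):

```python
import re

# Original single-token pattern vs. the proposed multi-token tightening.
loose = re.compile(r"\b(foo|bar|baz)\b", re.I)
tight = re.compile(r"\bfoo\s+bar(?:\s+baz)?\b", re.I)

real_copy = "Open the navigation bar to browse the snack bar menu."
placeholder = "Foo bar baz goes here until the copy lands."

print(bool(loose.search(real_copy)))    # True  - false positive on "bar"
print(bool(tight.search(real_copy)))    # False - legitimate text passes
print(bool(tight.search(placeholder)))  # True  - canonical sequence still caught
```

The tightened pattern keeps the placeholder check useful while removing the single most common source of false positives in rendered page text.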
Closes Phase 5 of ROADMAP.md. Three new pipeline-stage scripts make the previously aspirational stages runnable, plus the hardcoded DTCG scorecard is finally data-driven.
Summary
Closes audit P2-8
The 80.8/100 DTCG coverage score was a tautology — same number for any site because DTCG_CATEGORIES was a hardcoded list of statuses. The new coverage_scorecard(inventory) judges each of the 13 W3C DTCG categories against actual inventory keys. Honest consequence: the legacy NexusFlow capture's score will drop to a genuine number (likely ~30-50) reflecting that inventory.py extracts only 7 of 13 categories today.
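In that spirit, a hedged sketch of what a data-driven scorecard can look like. The category names are taken partly from the commit message and partly assumed; the 3-item "full" threshold and the full/partial weights are illustrative, not the project's actual scoring rules:

```python
def coverage_scorecard(inventory):
    """Judge each DTCG category against extracted inventory evidence.

    Returns (rows, score): one row per category with a full/partial/missing
    status plus an evidence string, and a 0-100 aggregate score."""
    categories = (
        "headings", "paragraphs", "buttons", "arbitrary_colors",
        "named_color_tokens", "containers", "v_rhythm", "font_families",
        "durations", "shadows", "gradients", "transitions", "borders",
    )
    rows = []
    for cat in categories:
        evidence = inventory.get(cat) or []
        # Illustrative rule: 3+ extracted items = full, any = partial.
        status = "full" if len(evidence) >= 3 else ("partial" if evidence else "missing")
        rows.append({
            "category": cat,
            "status": status,
            "evidence": f"{len(evidence)} extracted item(s)",
        })
    weights = {"full": 1.0, "partial": 0.5, "missing": 0.0}
    score = 100 * sum(weights[r["status"]] for r in rows) / len(rows)
    return rows, round(score, 1)
```

The key property P2-8 demanded is visible here: an empty inventory scores 0, and the score moves only when `inventory.py` actually extracts more categories, so the number can no longer be the same for every site.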
Out of scope (deferred)
Test plan
Commits
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Tests