Hardened SPA site cloner + DTCG design system extractor + OpenAI personalization spec.
Fork of asimov-academy/Website-Downloader
with substantial additions on top of the original Flask UI.
Status: MVP operational. Three implementation flaws identified in `docs/AUDIT.md` are tracked in `ROADMAP.md` (Phase 2). Two upstream-overlapping modules coexist (`downloader.py` = original, `kratos_clone/` = new hardened module). Tests are the largest gap (see Phase 1).

| Layer | Module | Purpose |
|---|---|---|
| Hardened capture | `kratos_clone/` | Playwright module with 5 patches (IO pre-fire, DOM-stable, 3-pass scroll, shadow walker, computed-style snapshot) for SPA-heavy sites where the original downloader missed content. CLI: `python -m kratos_clone <url>`. |
| Design-system extraction | `scripts/inventory.py` + `scripts/generate_design_system_v{1,2}.py` | Parse a captured HTML file and emit a self-contained `design-system.html` showcase (typography, colors, components, motion, icons) with embedded DTCG token JSON. |
| Observability | `app.py` + `templates/index.html` | structlog backend + inline browser logger (`window.onerror`, `unhandledrejection`, `console.error`, slow fetch, SSE close) → `POST /api/client-errors` → same log stream. |
| Architecture specs | `docs/PROMPT_v2.md`, `docs/WORKFLOW.md`, `docs/PERSONALIZATION.md` | Optimized LLM prompt for design-system extraction, 6-stage workflow plan, and OpenAI Responses-API personalization architecture (spec only, not yet implemented). |
| Original tool | `app.py` (UI) + `downloader.py` (legacy capture) | Preserved from upstream. The Flask UI at http://localhost:5001 still uses `downloader.py`; `kratos_clone/` is invoked via CLI today. |

    uv sync
    uv run playwright install chromium

    uv run python -m kratos_clone https://nexusflow-saas.aura.build/ \
        --output-dir ./capture

Knobs (all overridable via `KCD_*` env vars; full reference below):

    --passes {1,2,3}     # scroll passes (default 3)
    --viewport WxH       # default 1920x1080
    --headed             # visible browser (for WebGL/Spline)
    --no-styles          # skip computed-style snapshot
    --no-shadow          # skip shadow-DOM walker
    --no-io-polyfill     # disable IntersectionObserver pre-fire (debug)

Output: `<dir>/index.html`, `<dir>/styles.json`, `<dir>/manifest.json`, `<dir>/assets/*`.

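
For scripted runs, the CLI call above can be assembled from Python. The following is a minimal sketch, not part of the repo: `build_capture_command` is a hypothetical helper, and it only uses the flags listed above. It builds the argument list and environment without launching anything.

```python
import os
import sys


def build_capture_command(url, output_dir, passes=3, viewport="1920x1080", headed=False):
    """Return (argv, env) for a kratos_clone run; does not launch anything."""
    if passes not in (1, 2, 3):
        raise ValueError("passes must be 1, 2, or 3")
    argv = [
        sys.executable, "-m", "kratos_clone", url,
        "--output-dir", output_dir,
        "--passes", str(passes),
        "--viewport", viewport,
    ]
    if headed:
        argv.append("--headed")
    # Copy the environment so any KCD_* overrides stay local to this run.
    env = dict(os.environ)
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)`; set `env["KCD_..."]` entries beforehand to override individual tunables for that run only.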
Capture-time tunables (kratos_clone/capture.py):
| Variable | Default | Purpose |
|---|---|---|
| `KCD_VIEWPORT_WIDTH` | 1920 | Browser viewport width (px) |
| `KCD_VIEWPORT_HEIGHT` | 1080 | Browser viewport height (px) |
| `KCD_USER_AGENT` | Chrome 120 | UA string sent on requests |
| `KCD_NAV_TIMEOUT` | 90000 | `page.goto` timeout (ms) |
| `KCD_DOM_STABLE_MS` | 1500 | Mutation-quiet window before capture (ms) |
| `KCD_NETIDLE_BUFFER` | 5000 | Networkidle settle (ms) |
| `KCD_SCROLL_PASSES` | 3 | Three-pass scroll: forward fast / forward slow / backward slow |
| `KCD_MAX_SCROLL_S` | 120 | Wall-clock budget for scroll loop (Phase 3, P2-2) |
| `KCD_MAX_TOTAL_MB` | 200 | Hard cap on total asset bytes (Phase 3, P1-E) |
| `KCD_MAX_ASSETS` | 500 | Hard cap on number of assets (Phase 3, P1-E) |
| `KCD_HEADED` | 0 | `1` to launch a visible browser |
| `KCD_CAPTURE_COMPUTED_STYLES` | 1 | Patch E: emit `styles.json` |
| `KCD_IO_POLYFILL` | 1 | Patch A: IntersectionObserver pre-fire |
| `KCD_SHADOW_WALKER` | 1 | Patch D: Declarative Shadow DOM serializer |
| `KCD_BLOCK_ANALYTICS` | 0 | Drop common analytics URLs at the route layer |
| `KCD_IFRAME_MIN_RATIO` | 0.5 | Length ratio for iframe srcdoc inclusion (Phase 2) |
| `KCD_NO_IFRAME_SRCDOC` | 0 | `1` to skip iframe srcdoc capture |

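
To illustrate how such tunables are typically resolved, here is a minimal sketch of an environment reader with typed defaults. The variable names and default values come from the table above; `DEFAULTS` and `read_tunable` are hypothetical and are not claimed to match what `kratos_clone/capture.py` actually does.

```python
import os

# Names and defaults copied from the table; the helper itself is hypothetical.
DEFAULTS = {
    "KCD_VIEWPORT_WIDTH": 1920,
    "KCD_VIEWPORT_HEIGHT": 1080,
    "KCD_NAV_TIMEOUT": 90000,
    "KCD_DOM_STABLE_MS": 1500,
    "KCD_SCROLL_PASSES": 3,
    "KCD_HEADED": False,
    "KCD_IFRAME_MIN_RATIO": 0.5,
}


def read_tunable(name, environ=None):
    """Return the value of a KCD_* variable, coerced to its default's type."""
    environ = os.environ if environ is None else environ
    default = DEFAULTS[name]
    raw = environ.get(name)
    if raw is None or raw == "":
        return default
    # Check bool first: bool is a subclass of int, and bool("0") would be True.
    if isinstance(default, bool):
        return raw == "1"
    return type(default)(raw)
```

Passing a plain dict as `environ` keeps the helper easy to test without touching the real process environment.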
Server-side (app.py):
| Variable | Default | Purpose |
|---|---|---|
| `LOG_LEVEL` | INFO | structlog level |
| `LOG_FORMAT` | console | `json` for production-ready structured output |
| `TRUST_PROXY` | 0 | `1` enables `ProxyFix(x_for=1, x_proto=1)` for `X-Forwarded-*` |
| `RATE_LIMIT_STORAGE_URI` | `memory://` | Flask-Limiter backend (`redis://...` for prod) |
| `CLIENT_ERRORS_RATE_LIMIT` | 60 per minute | Per-IP cap on `/api/client-errors` |
| `PERSONALIZE_STRUCTURE_RATE_LIMIT` | 5 per minute | Per-IP cap on `/api/personalize/structure` |
| `PERSONALIZE_RUN_RATE_LIMIT` | 2 per minute | Per-IP cap on `/api/personalize/run` |
| `OPENAI_API_KEY` | (required for personalize) | Loaded via python-dotenv from `.env` |

    cd ./capture
    cp ../scripts/inventory.py .
    cp ../scripts/generate_design_system_v2.py .
    python inventory.py > _inventory.json
    python generate_design_system_v2.py
    # Open design-system.html in a browser

The generators currently use hardcoded indices into the inventory: they work end-to-end on the NexusFlow template but raise `IndexError` on arbitrary sites. Tracked as P1-C in `docs/AUDIT.md` and Phase 2 in `ROADMAP.md`.

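
One way to remove this failure mode is to replace bare indexing with a lookup that falls back to a default. The sketch below is only an illustration: `pick` is a hypothetical helper, and the inventory shape shown is an assumption, since the real `_inventory.json` layout may differ.

```python
def pick(items, index, default=None):
    """Return items[index], or default when the list is missing or too short."""
    try:
        return items[index]
    except (IndexError, TypeError):
        return default


# Hypothetical inventory shape, for illustration only.
inventory = {"fonts": ["Inter"], "colors": ["#0b0b0f", "#ffffff"]}

# Index 0 exists, so the captured value is used.
primary_font = pick(inventory.get("fonts", []), 0, default="system-ui")
# Index 4 does not exist on this site, so the default is used instead of raising.
accent_color = pick(inventory.get("colors", []), 4, default="#000000")
```

A generator written this way degrades to placeholder tokens on sites with fewer fonts or colors than the template it was developed against.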
    uv run python app.py
    # http://localhost:5001

The UI uses `downloader.py` (original, unchanged). The new structlog observability layer captures backend events and receives browser errors at `/api/client-errors`.

kratos-clone/
├── app.py # Flask UI (upstream) + structlog + /api/client-errors
├── downloader.py # Upstream capture (used by Flask UI)
├── templates/
│ └── index.html # UI + inline browser logger
├── kratos_clone/ # NEW — hardened Playwright capture module
│ ├── __init__.py
│ ├── __main__.py # CLI entry
│ ├── capture.py # 5 patches A-E
│ └── post.py # HTML rewrite + orphan CSS injection
├── scripts/ # NEW — design-system extractors
│ ├── inventory.py
│ ├── generate_design_system_v1.py
│ └── generate_design_system_v2.py
├── docs/ # NEW — architecture + audit
│ ├── AUDIT.md # Multi-agent audit findings
│ ├── PROMPT_v2.md # Optimized LLM extraction prompt
│ ├── WORKFLOW.md # 6-stage pipeline plan
│ └── PERSONALIZATION.md # OpenAI personalization spec (NOT YET IMPLEMENTED)
├── ROADMAP.md # Phased plan, derived from audit
├── TODO.md # Short-term actionable items
├── CLAUDE.md # Guidance for Claude Code on this repo
├── LICENSE # MIT (our additions)
├── NOTICE # Upstream attribution
├── pyproject.toml
└── .github/workflows/ci.yml # Ruff + smoke (extends to E2E in Phase 1)
| Patch | What | Status | File:line |
|---|---|---|---|
| A: IntersectionObserver pre-fire polyfill | Forces every observer to fire `isIntersecting:true` immediately. Solves lazy-load capture on Aura/Webflow/Framer-style sites. | ✅ Working | `capture.py:38-77` |
| B: networkidle + DOM-stable predicate | MutationObserver-based: resolves only after the DOM hasn't mutated for `KCD_DOM_STABLE_MS` (default 1500). | ✅ Working | `capture.py:107-117` |
| C: Three-pass scroll | Forward fast → forward slow → backward slow. Detects + disables Lenis. No wall-clock budget yet (P2-2). | 🟡 Working, missing time guard | `capture.py:444-481` |
| D: Shadow DOM walker | Recursive walk emitting Declarative Shadow DOM `<template shadowrootmode="open">`. | 🔴 Broken: `cloneNode` doesn't copy shadow roots, so the walker visits a clone where every `shadowRoot` is null. P1-A in audit. | `capture.py:78-101` |
| E: Computed-style snapshot | Per-element `getComputedStyle` capture → `styles.json` for downstream design-system extraction. | ✅ Working | `capture.py:_capture_computed_styles` |

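
The missing time guard on Patch C could take the shape sketched below. This is not the code in `capture.py`: `run_scroll_passes` is a hypothetical function, the scroll increments are modeled as plain callables, and the default budget mirrors the `KCD_MAX_SCROLL_S` default of 120 seconds. The clock is injectable so the cutoff can be exercised without waiting.

```python
import time


def run_scroll_passes(passes, budget_s=120.0, clock=time.monotonic):
    """Run each pass's steps in order until done or the budget is spent.

    `passes` is a list of (name, steps) pairs; each step is a zero-argument
    callable standing in for one scroll increment. Returns the names of the
    passes that ran to completion and whether the budget cut the run short.
    """
    deadline = clock() + budget_s
    completed = []
    for name, steps in passes:
        for step in steps:
            # Check the deadline before every increment, not once per pass.
            if clock() >= deadline:
                return completed, True
            step()
        completed.append(name)
    return completed, False
```

Checking the deadline per increment rather than per pass bounds the overrun to a single scroll step, which matters on pages that grow while being scrolled.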
Plus: post.py orphan-CSS injection (recovered the 440 KB Tailwind bundle from
the iframe-srcdoc wrapper page on Aura sites — likely the highest-impact line of
the entire fork).
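
The general idea of orphan-CSS injection can be sketched as follows. This is an assumption-laden illustration rather than the logic in `post.py`: `inject_orphan_css` and the `data-orphan` attribute are hypothetical, and the CSS is assumed to have already been recovered as plain strings.

```python
import re


def inject_orphan_css(html, css_chunks):
    """Insert the given CSS strings as <style> blocks just before </head>."""
    if not css_chunks:
        return html
    block = "".join(
        '<style data-orphan="1">\n' + css + "\n</style>\n" for css in css_chunks
    )
    match = re.search(r"</head\s*>", html, re.IGNORECASE)
    if match is None:
        # No closing head tag found: prepend so the rules are still present.
        return block + html
    return html[: match.start()] + block + html[match.start():]
```

Placing the block at the end of `<head>` means the recovered rules come after any stylesheets already linked there, so they win ties in cascade order.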
CI (.github/workflows/ci.yml) runs ruff lint + format + a smoke test that imports
modules and round-trips the /api/client-errors endpoint. No tests/ directory
exists yet. Highest-priority Phase 1 item — see ROADMAP.md.
- `docs/AUDIT.md`: current state of the codebase, prioritized findings
- `ROADMAP.md`: phased plan to address the audit + extend functionality
- `TODO.md`: actionable next-sprint items
- `docs/WORKFLOW.md`: 6-stage pipeline architecture (Stages 1, 6 are aspirational)
- `docs/PROMPT_v2.md`: LLM prompt template if extracting via Claude/Opus
- `docs/PERSONALIZATION.md`: proposed OpenAI personalization layer (spec only)
- `CLAUDE.md`: guidance for Claude Code agents working on this repo

- Our additions (`kratos_clone/`, `scripts/`, `docs/`, observability patches to `app.py`, `templates/index.html`, CI, etc.): MIT. See `LICENSE`.
- Upstream code (`downloader.py`, original `app.py` skeleton, `templates/index.html` base, `Dockerfile`, deploy configs): retains the original "personal and educational use" terms per the upstream README. See `NOTICE`.

For commercial use of upstream-derived code, contact Asimov Academy directly.
`main` is protected (squash/rebase merges only, requires CI green). Open a PR; CodeRabbit + Gemini + Code Review Doctor auto-review every PR.