SiteSentry

SiteSentry is a production-grade website audit engine for SEO, technical, performance, and conversion analysis — built for evidence-based detection and executive-ready reporting.


🎯 Features

  • Full Website Crawling – Unlimited pages/depth with concurrent crawling
  • JavaScript Rendering – Playwright-powered rendering for SPAs and dynamic content
  • Subdomain Discovery – Automatic detection and crawling of subdomains
  • 8 Accuracy Guardrails – Evidence-based issue detection designed to minimize false positives
  • Dual Scoring System – Site Health Score (0-100) + Revenue Score (0-100)
  • Full-Stack Exports – 12 output files including 3 executive PDFs
  • Technology Detection – Identifies CMS, frameworks, analytics, CDN, and more

📦 Installation

Prerequisites

  • Python 3.9+
  • pip

Setup

```bash
# Clone the repository
git clone https://github.com/BalaShankar9/parcellab-audit-toolkit.git
cd parcellab-audit-toolkit

# Create virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Install Playwright browsers (required for JS rendering)
playwright install chromium
```

🚀 Quick Start

```bash
# Basic audit (50 pages, auto rendering)
python audit.py --url https://example.com --max-pages 50

# Full unlimited audit with JS rendering
python audit.py --url https://example.com --max-pages 0 --max-depth 0 --render always

# Quick scan without rendering (faster)
python audit.py --url https://example.com --max-pages 100 --render never
```

⚙️ Command Line Options

| Option | Default | Description |
|--------|---------|-------------|
| `--url` | (required) | Target website URL |
| `--max-pages` | `0` | Max pages to crawl (`0` = unlimited) |
| `--max-depth` | `0` | Max crawl depth (`0` = unlimited) |
| `--render` | `auto` | JS rendering: `always`, `never`, `auto` |
| `--output` | `./audit-output` | Output directory |
| `--audit-date` | current date | Date string for reports |
| `--workers` | `10` | Concurrent crawl workers |
| `--timeout` | `30` | Request timeout (seconds) |
| `--accuracy` | `strict` | Accuracy mode: `strict` or `normal` |
| `--no-journeys` | flag | Skip journey tests |

🔧 Rendering Modes

| Mode | Speed | Use Case |
|------|-------|----------|
| `never` | ⚡ Fast (10-20 pages/sec) | Static HTML sites, quick scans |
| `auto` | 🔄 Adaptive | Renders only when JS detected |
| `always` | 🐢 Slow (2-5 pages/sec) | SPAs, React/Vue/Angular sites |
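
The `auto` mode's "renders only when JS detected" behavior could be approximated by a heuristic like the one below. This is a minimal sketch in the project's implementation language; the marker list and the word-count threshold are illustrative assumptions, not the toolkit's actual detection logic.

```python
import re

# Markers that suggest a client-rendered SPA (hypothetical list; the
# toolkit's real heuristics may differ).
SPA_MARKERS = [
    r'<div[^>]+id=["\'](root|app|__next)["\']',  # React / Vue / Next.js mount points
    r'window\.__NUXT__',                         # Nuxt payload
    r'ng-version=',                              # Angular
]

def needs_rendering(html: str) -> bool:
    """Return True when the raw HTML looks like a JS-rendered shell."""
    # A body with almost no visible text, or a known SPA marker, is a
    # strong signal that Playwright rendering is needed.
    text = re.sub(r'<[^>]+>', ' ', html)
    sparse = len(text.split()) < 50
    has_marker = any(re.search(p, html) for p in SPA_MARKERS)
    return has_marker or sparse

print(needs_rendering('<div id="root"></div><script src="/app.js"></script>'))  # True
```

A crawler using this check would fetch the static HTML first and fall back to the slower rendered path only when the function returns `True`.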

🛡️ Accuracy Guardrails (STRICT Mode)

The toolkit uses 8 evidence-based rules to ensure accuracy:

  1. Evidence Required – Every issue must have HTML/data proof
  2. Confidence Levels – HIGH/MEDIUM/LOW with appropriate weighting
  3. Valid Pages Only – Only status 200 pages are scored
  4. No Inference – Never assume; only report what's verifiable
  5. Verification Steps – Each issue includes reproducible steps
  6. Category Caps – Prevents score from hitting 0 unfairly
  7. Manual Validation Flag – Low-confidence issues marked for review
  8. Source Attribution – Every finding cites its data source
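
Taken together, the rules imply that every finding carries its proof with it. An issue record satisfying rules 1, 2, 5, 7, and 8 might look like the following sketch; the field names and schema are illustrative, not the toolkit's actual data model.

```python
from dataclasses import dataclass, field

@dataclass
class Issue:
    """One detected issue, carrying the evidence the guardrails require."""
    issue_type: str
    url: str
    evidence: str                                            # rule 1: HTML/data proof
    confidence: str                                          # rule 2: HIGH / MEDIUM / LOW
    verification_steps: list = field(default_factory=list)   # rule 5: reproducible steps
    source: str = ""                                         # rule 8: data source attribution

    @property
    def needs_manual_review(self) -> bool:
        # Rule 7: low-confidence findings are flagged for human review.
        return self.confidence == "LOW"

issue = Issue(
    issue_type="missing_meta_description",
    url="https://example.com/pricing",
    evidence='<head> contains no <meta name="description">',
    confidence="HIGH",
    verification_steps=["View page source", "Search for a meta description tag"],
    source="raw_html",
)
print(issue.needs_manual_review)  # False: HIGH confidence
```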

📊 Output Files

Each audit generates a timestamped folder with:

CSV Exports

| File | Description |
|------|-------------|
| `pages.csv` | One row per crawled page with metadata |
| `issues.csv` | All issues with affected URLs |
| `issues_summary.csv` | One row per issue type |
| `internal_links.csv` | Link graph (source → target) |
| `redirects.csv` | Redirect chains |
| `errors.csv` | Timeouts, failures, HTTP errors |
| `fix_backlog.csv` | Jira-ready task list |
| `audit_data.csv` | Comprehensive single-file export |
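
The CSV exports are meant for downstream analysis; for example, ranking issue types from `issues_summary.csv` takes only the standard library. The column names below are assumptions for illustration; check the actual header row of your export.

```python
import csv
import io

# Stand-in for audit-output/<timestamp>/issues_summary.csv; the real
# columns may differ, so adjust the field names to the actual header row.
sample = io.StringIO(
    "issue_type,pages_affected,severity\n"
    "missing_meta_description,42,MEDIUM\n"
    "broken_internal_link,7,HIGH\n"
)

rows = list(csv.DictReader(sample))
# Worst offenders first, by number of affected pages.
rows.sort(key=lambda r: int(r["pages_affected"]), reverse=True)
for r in rows:
    print(f'{r["issue_type"]}: {r["pages_affected"]} pages ({r["severity"]})')
```

To run against a real export, replace `sample` with `open("audit-output/<timestamp>/issues_summary.csv")`.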

JSON

| File | Description |
|------|-------------|
| `run_manifest.json` | Run config, counts, durations, scores |

PDF Reports

| File | Description |
|------|-------------|
| `Executive_Summary.pdf` | 3-5 page leadership briefing |
| `Full_Audit_Report.pdf` | 20-40 page detailed analysis |
| `Appendix.pdf` | Raw data tables and URL lists |

📈 Scoring System

Site Health Score (0-100)

Measures overall website technical health with category-capped penalties:

| Category | Max Penalty |
|----------|-------------|
| Technical | 20 pts |
| SEO | 20 pts |
| Performance | 15 pts |
| UX-CRO | 15 pts |
| Content | 10 pts |
| Tracking | 8 pts |
| Security | 5 pts |
| Accessibility | 3 pts |
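
Category capping means each category's raw penalties are clamped to the maxima above before being subtracted from 100, so no single category can sink the score on its own. A minimal sketch of that arithmetic (the toolkit's exact implementation may differ):

```python
# Max penalty per category, from the Site Health table above.
CAPS = {
    "Technical": 20, "SEO": 20, "Performance": 15, "UX-CRO": 15,
    "Content": 10, "Tracking": 8, "Security": 5, "Accessibility": 3,
}

def site_health_score(raw_penalties: dict) -> int:
    """Clamp each category's raw penalty to its cap, then subtract from 100."""
    deducted = sum(min(raw_penalties.get(cat, 0), cap) for cat, cap in CAPS.items())
    return max(0, 100 - deducted)

# Even a site with 60 raw SEO penalty points only loses the 20-point SEO cap.
print(site_health_score({"SEO": 60, "Performance": 5}))  # 75
```

The same capping scheme applies to the Revenue Score below, just with its own category table.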

Revenue Score (0-100)

Focuses on conversion-critical pages (pricing, demo, contact):

| Category | Max Penalty |
|----------|-------------|
| Conversion Journey | 35 pts |
| Tracking & Attribution | 20 pts |
| Trust & Compliance | 15 pts |
| Performance (Money Pages) | 15 pts |
| UX Friction | 10 pts |
| SEO Readiness | 5 pts |

🔍 Detected Technologies

The toolkit automatically detects:

  • CMS: WordPress, HubSpot, Contentful, Webflow, etc.
  • Frameworks: React, Vue, Angular, Next.js, jQuery
  • Analytics: GA4, Amplitude, Heap, FullStory, Mixpanel
  • Tag Managers: GTM, Segment, Tealium
  • CDN: Cloudflare, Fastly, Akamai, CloudFront
  • Marketing: HubSpot, Marketo, Pardot, Intercom, Drift
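
Detection of this kind typically works by matching known signatures against the page source and script URLs. The snippet below is a simplified illustration; the signatures shown are representative, not the toolkit's actual ruleset.

```python
import re

# A few representative signatures; a real detector covers far more
# technologies and also inspects headers and cookies.
SIGNATURES = {
    "WordPress": r"wp-content|wp-includes",
    "React": r"data-reactroot|__REACT_DEVTOOLS",
    "GA4": r"gtag\(\s*['\"]config['\"]",
    "Cloudflare": r"cdn\.cloudflare|__cf_bm",
    "HubSpot": r"js\.hs-scripts\.com",
}

def detect_technologies(html: str) -> list:
    """Return the names of technologies whose signature appears in the HTML."""
    return [name for name, pattern in SIGNATURES.items()
            if re.search(pattern, html)]

html = ('<script src="https://js.hs-scripts.com/123.js"></script>'
        '<link href="/wp-content/themes/x.css">')
print(detect_technologies(html))  # ['WordPress', 'HubSpot']
```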

📁 Project Structure

```text
parcellab-audit-toolkit/
├── audit.py              # Main entry point
├── score_audit.py        # Standalone scoring CLI
├── requirements.txt      # Python dependencies
├── src/
│   ├── crawler/          # Crawl engine & URL normalization
│   ├── render/           # Playwright renderer
│   ├── analyze/          # Issue detection & tech detection
│   ├── scoring/          # Health & Revenue scoring
│   ├── reporting/        # PDF generator
│   ├── outputs/          # Output manager
│   └── journeys/         # User journey tests
└── audit-output/         # Generated reports (gitignored)
```

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (`git checkout -b feature/amazing-feature`)
  3. Commit your changes (`git commit -m 'Add amazing feature'`)
  4. Push to the branch (`git push origin feature/amazing-feature`)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

Built with ❤️ for website optimization professionals
