Tktirth/ai-web-vulnerability-scanner


AI-Powered Web Vulnerability Scanner

Crawl. Detect. Classify. Actually understand what's broken.


I built this because every free scanner I tried had the same annoying problem — it floods you with 60 findings and treats a missing X-XSS-Protection header with the same urgency as a raw SQL injection. You end up spending more time sorting the output than actually fixing anything.

So I wrote my own. This one crawls your target first, runs five detection modules, and then feeds every finding into a trained ML model that scores severity based on what the vulnerability actually is — not some hardcoded priority list. The whole thing lives inside a Streamlit dashboard with a live terminal so you can watch it work. When it finishes, you get a clean JSON report you can keep, share, or pipe into whatever workflow you have.


What it tests

The scanner doesn't just ping a URL. It crawls the site first using a BFS traversal, discovers internal pages up to your depth limit, then fans out across everything it found.
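
A minimal sketch of that crawl phase, assuming a same-host restriction and fragment stripping for deduplication. `bfs_crawl` and the `get_links` callable are hypothetical names used for illustration, not the actual API of crawler.py; the toy `SITE` map stands in for real HTTP fetches.

```python
from collections import deque
from urllib.parse import urljoin, urldefrag, urlparse


def bfs_crawl(start_url, get_links, max_depth=2):
    """Breadth-first crawl of same-host pages, up to max_depth link hops."""
    host = urlparse(start_url).netloc
    start = urldefrag(start_url)[0]
    seen = {start}                 # URL deduplication
    order = []                     # pages in the order they were visited
    queue = deque([(start, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth >= max_depth:
            continue               # at the depth limit: visit, but do not expand
        for href in get_links(url):
            nxt = urldefrag(urljoin(url, href))[0]
            if urlparse(nxt).netloc != host or nxt in seen:
                continue           # skip external links and repeats
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return order


# Toy link graph standing in for fetched pages (hypothetical data).
SITE = {
    "http://example.test/": ["/a", "/b", "http://other.test/x"],
    "http://example.test/a": ["/c", "/#top"],
    "http://example.test/b": ["/a"],
    "http://example.test/c": ["/d"],
}

pages = bfs_crawl("http://example.test/", lambda u: SITE.get(u, []), max_depth=2)
print(pages)
```

With a depth limit of 2, `/c` is visited but its link to `/d` is never followed.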

Cross-Site Scripting (XSS)
Throws 10 payloads at every URL parameter and HTML form input it finds. It's not just checking for raw reflection — it also catches partial-encoding bypasses using pattern matching. Both GET parameters and POST form fields are covered independently.
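
To make the raw-versus-partial distinction concrete, here is a rough sketch of one way to do it with pattern matching. The entity table, `reflection_pattern`, and `classify_reflection` are assumptions made for illustration; the real logic in xss_detector.py may differ.

```python
import re

# HTML-special characters and the entity forms a server might emit for them.
ENTITIES = {
    "<": ["&lt;"],
    ">": ["&gt;"],
    '"': ["&quot;", "&#34;"],
    "'": ["&#x27;", "&#39;"],
    "&": ["&amp;"],
}


def reflection_pattern(payload):
    """Build a regex that matches the payload with each special char raw or encoded."""
    parts = []
    for ch in payload:
        if ch in ENTITIES:
            alts = [re.escape(ch)] + [re.escape(e) for e in ENTITIES[ch]]
            parts.append("(" + "|".join(alts) + ")")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def classify_reflection(payload, body):
    """Return 'raw', 'partial', 'encoded', or 'absent'."""
    if payload in body:
        return "raw"
    match = reflection_pattern(payload).search(body)
    if match is None:
        return "absent"
    # Each group holds how one special character came back in the response.
    survived = [g for g in match.groups() if g in ENTITIES]
    return "partial" if survived else "encoded"


payload = '"><img src=x onerror=alert(1)>'
print(classify_reflection(payload, 'value="&quot;>&lt;img src=x onerror=alert(1)>"'))
```

A "partial" result means some special characters came back unencoded, which is the case worth a closer manual look.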

SQL Injection
Two strategies running together. Error-based detection listens for database error signatures from MySQL, PostgreSQL, MSSQL, Oracle, and SQLite — over 25 patterns in total. Boolean-based detection compares response lengths between 1=1 and 1=2 conditions to catch cases where the server stays quiet but still behaves differently. URL params and form inputs both get tested.
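
Both strategies can be sketched in a few lines. The five signatures below are a small illustrative subset of well-known database error strings, not the scanner's actual pattern list, and the `min_ratio` threshold is an assumed value, not one taken from sql_detector.py.

```python
import re

# One representative error signature per database engine (illustrative subset).
ERROR_SIGNATURES = [
    re.compile(r"you have an error in your sql syntax", re.I),                 # MySQL
    re.compile(r"unterminated quoted string", re.I),                           # PostgreSQL
    re.compile(r"unclosed quotation mark after the character string", re.I),   # MSSQL
    re.compile(r"ORA-\d{5}"),                                                  # Oracle
    re.compile(r'near ".*": syntax error', re.I),                              # SQLite
]


def has_db_error(body):
    """Error-based check: does the response leak a database error message?"""
    return any(sig.search(body) for sig in ERROR_SIGNATURES)


def boolean_based_suspect(true_body, false_body, min_ratio=0.1):
    """Boolean-based check: do the 1=1 and 1=2 responses differ noticeably in length?"""
    longer = max(len(true_body), len(false_body), 1)
    return abs(len(true_body) - len(false_body)) / longer >= min_ratio


print(has_db_error("Warning: You have an error in your SQL syntax; check the manual"))
print(boolean_based_suspect("row " * 250, "no results"))
```

Length comparison alone is noisy on pages with dynamic content, so a real implementation would likely want a baseline request as well.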

Security Headers
Checks eight headers: X-Frame-Options, X-Content-Type-Options, X-XSS-Protection, Content-Security-Policy, Strict-Transport-Security, Referrer-Policy, Permissions-Policy, and Cache-Control. Also flags the ones that leak information — Server, X-Powered-By, X-Generator — which hand attackers your full stack without them having to do anything.
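
The header check boils down to a case-insensitive lookup against two lists. A sketch, where `audit_headers` is a hypothetical helper and not the signature of the real detector:

```python
REQUIRED = [
    "X-Frame-Options",
    "X-Content-Type-Options",
    "X-XSS-Protection",
    "Content-Security-Policy",
    "Strict-Transport-Security",
    "Referrer-Policy",
    "Permissions-Policy",
    "Cache-Control",
]
LEAKY = ["Server", "X-Powered-By", "X-Generator"]


def audit_headers(headers):
    """Return (missing security headers, information-leaking headers that are present)."""
    # HTTP header names are case-insensitive, so compare in lowercase.
    present = {name.lower(): value for name, value in headers.items()}
    missing = [h for h in REQUIRED if h.lower() not in present]
    leaked = {h: present[h.lower()] for h in LEAKY if h.lower() in present}
    return missing, leaked


missing, leaked = audit_headers({
    "content-security-policy": "default-src 'self'",
    "X-Frame-Options": "DENY",
    "Server": "nginx/1.18.0",
})
print(missing)
print(leaked)
```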

Open Redirect
Tests over 20 redirect-flavored parameter names (url, next, return, goto, redirect, callback, destination, and more) against 9 payloads. Catches server-side 3xx redirects and client-side ones hiding in meta-refresh tags or window.location calls.
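
Roughly, the server-side and client-side cases can be told apart like this. The two regexes and the function names are assumptions for illustration, and `evil.example` is a placeholder for whatever external host the payload points at.

```python
import re
from urllib.parse import urlparse

META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]*url=([^"\'>\s]+)', re.I
)
JS_LOCATION = re.compile(
    r'(?:window\.)?location(?:\.href)?\s*=\s*["\']([^"\']+)["\']', re.I
)


def redirect_target(status, headers, body):
    """Return (kind, target) for the first redirect found, or None."""
    lowered = {k.lower(): v for k, v in headers.items()}
    if 300 <= status < 400 and "location" in lowered:
        return "server", lowered["location"]
    match = META_REFRESH.search(body)
    if match:
        return "meta-refresh", match.group(1)
    match = JS_LOCATION.search(body)
    if match:
        return "javascript", match.group(1)
    return None


def is_open_redirect(status, headers, body, payload_host):
    """True if the response redirects to the host injected by the payload."""
    found = redirect_target(status, headers, body)
    return found is not None and urlparse(found[1]).netloc == payload_host


print(is_open_redirect(302, {"Location": "//evil.example/"}, "", "evil.example"))
```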

Directory and File Discovery
Probes 40+ paths that people routinely forget to lock down: admin panels, .env files, .git/config, raw database dumps, backup zips, Swagger UI, server-status pages, upload directories. Anything that comes back with a 200, 301, or 302 gets flagged.
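
The flagging rule itself is simple. In this sketch, `probe_paths` and the injected `fetch_status` callable are hypothetical; a real run would issue HTTP requests through the shared session instead of reading from a dictionary.

```python
FLAGGED_STATUSES = {200, 301, 302}


def probe_paths(base_url, paths, fetch_status):
    """Return (url, status) for every probed path whose status is flagged."""
    hits = []
    for path in paths:
        url = base_url.rstrip("/") + "/" + path.lstrip("/")
        status = fetch_status(url)
        if status in FLAGGED_STATUSES:
            hits.append((url, status))
    return hits


# Canned responses standing in for a live server (hypothetical data).
FAKE_SERVER = {
    "http://example.test/.env": 200,
    "http://example.test/admin/": 302,
    "http://example.test/.git/config": 404,
}

hits = probe_paths(
    "http://example.test",
    [".env", "admin/", ".git/config", "backup.zip"],
    lambda url: FAKE_SERVER.get(url, 404),
)
print(hits)
```

A 301 or 302 only shows that the server reacted to the path, so flagged redirects still need a manual look.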

Every single finding goes through the AI module and comes out tagged as Critical, High, Medium, or Low.


Getting started

Python 3.9 or higher. That's the only real requirement.

cd ai-web-vulnerability-scanner

pip install -r requirements.txt

streamlit run app.py

It opens at http://localhost:8501.


Running a scan

  1. Drop your target URL into the input field at the top — http:// or https://, it handles both
  2. Pick which modules you want to run from the left sidebar (everything's on by default)
  3. Set your crawl depth, delay, and timeout if the defaults don't fit
  4. Press ▶ SCAN
  5. Watch the live terminal output as each phase runs
  6. Once results are in, use the filters to cut down to what you care about
  7. Grab the JSON report from the export panel at the bottom

One thing worth paying attention to: the Request delay slider in the sidebar controls how fast the scanner hits the server. Don't set it to zero just because you can.


Project layout

ai-web-vulnerability-scanner/
│
├── app.py                     ← Dashboard UI, state management, results rendering
├── scanner_engine.py          ← Runs each phase in order, wires progress callbacks
├── crawler.py                 ← BFS link crawler with depth control and URL deduplication
├── requirements.txt
│
├── ai/
│   └── vulnerability_ai.py   ← RandomForest classifier, feature extraction, severity scoring
│
├── detectors/
│   ├── xss_detector.py        ← XSS via URL params and form fields
│   ├── sql_detector.py        ← Error-based and boolean-based SQLi
│   ├── header_detector.py     ← Missing headers, weak values, server info leakage
│   ├── redirect_detector.py   ← Open redirect parameter injection
│   └── directory_detector.py  ← Sensitive path and exposed file detection
│
├── utils/
│   ├── request_manager.py     ← Shared HTTP session, retry handling, SSL fallback
│   └── payloads.py            ← Every payload in one place, easy to extend
│
└── reports/
    └── report_generator.py    ← JSON report structure, risk scoring, export logic

Every detector is completely independent. If you want to run just the header check against a single endpoint, import detect_missing_headers and call it directly — no need to drag in the whole engine.


Sample report output

{
  "report_metadata": {
    "tool": "AI Web Vulnerability Scanner",
    "version": "1.0.0",
    "generated_at": "20260316_142205",
    "target": "http://testphp.vulnweb.com",
    "total_findings": 8
  },
  "executive_summary": {
    "risk_level": "CRITICAL",
    "total_vulnerabilities": 8,
    "pages_scanned": 14,
    "requests_made": 387,
    "scan_duration_seconds": 42.1,
    "severity_breakdown": {
      "Critical": 2,
      "High": 2,
      "Medium": 2,
      "Low": 2
    }
  },
  "vulnerabilities": [
    {
      "id": "VULN-0001",
      "type": "SQLi",
      "subtype": "Error-based SQL Injection",
      "severity": "Critical",
      "severity_score": 4,
      "url": "http://testphp.vulnweb.com/listproducts.php",
      "parameter": "cat",
      "http_method": "GET",
      "payload_used": "' OR 1=1 --",
      "evidence": "Database error message exposed in response",
      "description": "SQL injection confirmed in parameter 'cat'. The app returned a raw database error, meaning user input is going straight into the query without any sanitization.",
      "remediation": "Switch to parameterized queries or prepared statements. Kill verbose error messages in production — they're free recon for attackers."
    },
    {
      "id": "VULN-0002",
      "type": "XSS",
      "subtype": "Reflected XSS",
      "severity": "High",
      "severity_score": 3,
      "url": "http://testphp.vulnweb.com/search.php",
      "parameter": "q",
      "http_method": "GET",
      "payload_used": "<script>alert('XSS')</script>",
      "evidence": "Payload reflected in response body without encoding",
      "remediation": "Encode all output before it touches HTML. Add a Content-Security-Policy header while you're at it."
    }
  ],
  "remediation_priority": [
    {
      "vulnerability_type": "SQLi",
      "severity": "Critical",
      "count": 2,
      "remediation": "Use parameterized queries. Remove raw database errors from responses."
    },
    {
      "vulnerability_type": "XSS",
      "severity": "High",
      "count": 2,
      "remediation": "Encode user output. Implement Content-Security-Policy."
    }
  ]
}
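
Since the report is plain JSON, piping it into another workflow takes only a few lines. A sketch that keeps High and Critical findings, using only field names from the sample above; `top_findings` is a hypothetical helper and the inline report is trimmed for brevity.

```python
import json


def top_findings(report, min_score=3):
    """Return findings with severity_score >= min_score, most severe first."""
    vulns = report.get("vulnerabilities", [])
    picked = [v for v in vulns if v.get("severity_score", 0) >= min_score]
    return sorted(picked, key=lambda v: v["severity_score"], reverse=True)


# Trimmed-down report in the same shape as the sample output.
raw = """
{
  "vulnerabilities": [
    {"id": "VULN-0002", "type": "XSS", "severity_score": 3, "parameter": "q"},
    {"id": "VULN-0003", "type": "Header", "severity_score": 1, "parameter": null},
    {"id": "VULN-0001", "type": "SQLi", "severity_score": 4, "parameter": "cat"}
  ]
}
"""

report = json.loads(raw)
for finding in top_findings(report):
    print(finding["id"], finding["type"], finding["parameter"])
```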

How the AI classifier works

The model is a RandomForest trained on synthetic feature vectors. Each vulnerability gets converted into nine numeric features before classification:

  • Type score — base risk weight of the vulnerability class (SQLi = 4, XSS = 3, headers = 1, etc.)
  • Subtype score — more granular risk for specific variants
  • Method score — POST carries more weight than GET
  • Has payload — whether an active injection payload was used to trigger the finding
  • Has evidence — whether concrete confirmation was captured (error message, reflection, redirect)
  • Is injection — binary flag for XSS and SQLi class
  • Is header — binary flag for header-based findings
  • Is redirect — binary flag for redirect-based findings
  • Is disclosure — binary flag for information exposure and directory findings

This approach means the classifier isn't just matching a type name to a severity. It weighs how the vulnerability was found and confirmed: a blind SQLi with captured evidence should outrank a theoretical one, and missing headers stay low unless they're Strict-Transport-Security or Content-Security-Policy.
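
A sketch of how the nine features might be assembled from a finding shaped like the sample report. Only the SQLi, XSS, and header type weights come from this README; every other number and type name (the subtype table, the POST/GET weights, "Open Redirect", "Directory") is an assumption for illustration, not what vulnerability_ai.py actually uses.

```python
# Type weights: SQLi, XSS, Header are from the README; the rest are assumed.
TYPE_SCORES = {"SQLi": 4, "XSS": 3, "Header": 1, "Open Redirect": 2, "Directory": 2}
# Subtype weights are entirely assumed.
SUBTYPE_SCORES = {"Error-based SQL Injection": 4, "Reflected XSS": 3}
DISCLOSURE_TYPES = {"Directory", "Information Disclosure"}  # assumed type names


def extract_features(finding):
    """Turn one finding dict into the nine-element numeric feature vector."""
    vtype = finding.get("type", "")
    method = (finding.get("http_method") or "GET").upper()
    return [
        TYPE_SCORES.get(vtype, 1),                           # type score
        SUBTYPE_SCORES.get(finding.get("subtype", ""), 1),   # subtype score
        2 if method == "POST" else 1,                        # method score (assumed weights)
        int(bool(finding.get("payload_used"))),              # has payload
        int(bool(finding.get("evidence"))),                  # has evidence
        int(vtype in {"SQLi", "XSS"}),                       # is injection
        int(vtype == "Header"),                              # is header
        int(vtype == "Open Redirect"),                       # is redirect
        int(vtype in DISCLOSURE_TYPES),                      # is disclosure
    ]


sqli = {
    "type": "SQLi",
    "subtype": "Error-based SQL Injection",
    "http_method": "GET",
    "payload_used": "' OR 1=1 --",
    "evidence": "Database error message exposed in response",
}
print(extract_features(sqli))
```

A vector like this is what would be handed to the RandomForest's `predict` call; the training itself is left out here.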


Stack

| Library | What it does here |
| --- | --- |
| Python | Everything |
| Streamlit | Dashboard UI and live state management |
| Requests | HTTP session, retries, SSL fallback |
| BeautifulSoup | HTML parsing for links and form extraction |
| scikit-learn | RandomForest classifier for severity scoring |
| NumPy | Feature vector construction |
| Pandas | Results table in the UI |

Legal practice targets

Don't scan anything you don't control or have written permission to test. These are specifically designed to be broken:

| Target | What's useful about it |
| --- | --- |
| http://testphp.vulnweb.com | Acunetix's deliberately vulnerable PHP app. Has SQLi, XSS, and open redirects baked in. Best place to start. |
| https://ginandjuice.shop | PortSwigger's vulnerable shop. Good for testing redirect and injection detection. |
| http://zero.webappsecurity.com | Demo banking app. Useful for header auditing. |

Honest limitations

It doesn't authenticate. It won't find stored XSS, IDOR, broken access control, or any logic-layer vulnerability. Think of it as a surface scan: a solid starting point, not a substitute for a real pentest.

The AI model is trained on synthetic data. It's not pulling from a CVE database or live exploit feeds. It reasons about feature patterns, which works well for prioritization but won't give you CVSS scores or CWE mappings.

SSL issues are handled automatically. If a certificate is expired or self-signed, the request manager falls back to skipping verification. The scan continues either way.

Rate limiting is on for a reason. There's a built-in request delay. It keeps the scanner from looking like a DDoS, prevents getting your IP blocked, and is just the professional way to run a tool like this.
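
The idea behind the delay is easy to sketch: remember when the last request went out and sleep off whatever is left of the interval. `Throttle` is a hypothetical class for illustration, not the one in request_manager.py; the clock and sleep functions are injectable so the behaviour can be checked without actually waiting.

```python
import time


class Throttle:
    """Enforce a minimum delay between consecutive requests."""

    def __init__(self, delay, clock=time.monotonic, sleep=time.sleep):
        self.delay = delay
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `delay` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.delay - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


throttle = Throttle(delay=0.01)
for path in ["/", "/a", "/b"]:
    throttle.wait()  # call this right before each outgoing request
    print("would request", path)
```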


Legal

Scanning systems without explicit written permission is illegal. This applies in the US (Computer Fraud and Abuse Act), UK (Computer Misuse Act), India (IT Act), and most other jurisdictions. This project exists for authorized security testing, CTF practice, and learning. How you use it is entirely on you.


Author

Tirth — IT undergrad, IIT Delhi ethical hacking certified, IIT Guwahati AI/ML track in progress.
GitHub: @Tktirth

About

Python-based web vulnerability scanner with ML severity classification and a live Streamlit dashboard. Detects XSS, SQLi, open redirects, missing headers, and exposed directories.
