Skip to content

ML-powered YouTube scanner detecting potential doxxing content targeting K-pop idols (TF-IDF + Flask dashboard)

License

Notifications You must be signed in to change notification settings

NagisaSano/KpopDoxHunter

Repository files navigation

KpopDoxHunter

Hybrid ML + regex anti-doxxing detector for K-pop idols. It runs suspicious YouTube queries, scores titles/descriptions with TF-IDF plus explicit regex patterns (GPS, addresses, stalking terms), and serves the latest results via a color-coded Flask dashboard.


Why this project?

After stumbling on a video leaking an idol’s presumed address, the goal was to:

  • Detect similar content quickly (before it spreads)
  • Demonstrate ethical use of API + NLP + lightweight rules
  • Provide a small portfolio project mixing API, ML, and a web dashboard

Features (v2.0)

  • 30+ doxxing examples for TF-IDF semantic matching
  • 6 regex categories: GPS, Korean address, home indicators, distances, stalking terms, dox keywords
  • Composite scoring: 50% ML + 50% regex, severity badges (LOW/MEDIUM/HIGH/CRITICAL)
  • Flask dashboard with sortable table and severity colors
  • Timestamped CSV reports in reports/

Tech stack

  • Python 3.12
  • YouTube Data API v3
  • pandas, numpy
  • scikit-learn (TF-IDF + cosine similarity)
  • Flask (dashboard)

Install dependencies:

pip install -r requirements.txt

Project structure

KpopDoxHunter/
├─ scan_kpop_doxhunter.py   # Hybrid ML + regex scanner
├─ dashboard.py             # Flask app (serves latest report)
├─ templates/
│  └─ index.html            # Dashboard HTML with severity colors
├─ reports/                 # Generated CSV reports (git-ignored)
├─ run_all.bat              # Windows helper script
├─ requirements.txt
├─ tests/test_scan.py       # Unit tests
└─ SECURITY.md

Setup & usage

  1. Clone the repo
git clone https://github.com/NagisaSano/KpopDoxHunter.git
cd KpopDoxHunter
  1. (Optional) Virtualenv
python -m venv .venv
.\.venv\Scripts\activate  # Windows
  1. Install deps
python -m pip install -r requirements.txt
  1. Configure your YouTube API key (read at runtime)
$env:YOUTUBE_API_KEY = "YOUR_YOUTUBE_API_KEY"   # PowerShell
  1. Run scan + dashboard
.\run_all.bat

This runs the scanner, writes a CSV in reports/, then serves the dashboard at http://127.0.0.1:5000.

  1. Run tests
python -m unittest discover -s tests -p "test*.py" -v

Notes & limitations

  • Educational prototype; corpus biased toward Felix/Stray Kids.
  • Threshold MIN_DOX_SCORE defaults to 0.25 (adjust in scan_kpop_doxhunter.py).
  • Flask runs with debug=False; use a real WSGI server if you deploy.
  • On 403/429 (quota), partial results are saved then the scan stops with a clear error.

Detection methodology

  • ML (TF-IDF + cosine): semantic similarity against 30+ doxxing examples → ml_score
  • Regex rules: 6 pattern categories (GPS, address, home terms, distance, stalking, dox keywords) → rule_score
  • Composite: dox_score = 0.5 * ml_score + 0.5 * rule_score
  • Severity: LOW / MEDIUM / HIGH / CRITICAL based on dox_score or rule_score thresholds

Responsible use

For ethical monitoring only. Do not use this to harass, target individuals, leak private information, or violate YouTube’s Terms of Service. See SECURITY.md for reporting issues.

About

ML-powered YouTube scanner detecting potential doxxing content targeting K-pop idols (TF-IDF + Flask dashboard)

Topics

Resources

License

Security policy

Stars

Watchers

Forks