Massively parallel async web scraper that routes through a fleet of Tor instances for IP diversity. Built for scraping Cloudflare-protected sites where cloud IPs get instantly blocked.
I originally built this to scrape Wowhead for TrinityCore private server data. The 39 entity parsers and ID list generator are Wowhead-specific, but the scraping engine works for any target — see Adapting to Other Targets.
Cloudflare blocklists all major cloud IP ranges (AWS, GCP, Azure, etc.). I tried Lambda first: 3,000 concurrent Lambdas with perfect browser TLS fingerprints still hit a 90% WAF block rate. The same fingerprints through Tor? Under 1%.
Tor exit nodes run on residential ISPs, university networks, and volunteer hosts. Cloudflare doesn't blocklist them the way it does datacenters.
Measured performance: 230K pages/hr peak, 193K sustained at 400 Tor instances × 5 workers on a single machine. 1.2M Wowhead pages scraped across 39 entity types. Under 1% Cloudflare block rate.
- HTTP/2 multiplexing — shared session per Tor instance, workers send concurrent streams on one connection
- Async engine — asyncio + curl_cffi, no GIL bottleneck
- Browser TLS fingerprints — 7 profiles (Chrome, Edge, Safari, Firefox) via curl_cffi
- Adaptive WAF throttling — adjusts delay based on real-time block rate
- Circuit rotation — fresh exit IP every 150 requests per instance
- HTML caching — gzip-compressed raw HTML for re-parsing without re-scraping
- 39 Wowhead entity parsers — spells, items, NPCs, quests, transmog, and more
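The circuit rotation feature listed above can be sketched as a request counter plus a `SIGNAL NEWNYM` sent to Tor's control port. This is an illustrative sketch, not the code in tor_army.py: the class and function names are hypothetical, and it assumes the control port accepts an empty `AUTHENTICATE` (no password or cookie configured).

```python
import socket


class CircuitBudget:
    """Counts requests on one Tor instance and reports when to rotate."""

    def __init__(self, per_circuit: int = 150) -> None:
        self.per_circuit = per_circuit
        self.used = 0

    def record_request(self) -> bool:
        """Count one request; return True when the budget is used up."""
        self.used += 1
        if self.used >= self.per_circuit:
            self.used = 0
            return True
        return False


def request_new_circuit(control_port: int, host: str = "127.0.0.1") -> bool:
    """Ask Tor for fresh circuits via the control port.

    Assumption: no control-port authentication is configured, so an
    empty AUTHENTICATE succeeds. Returns True if Tor answered 250 twice.
    """
    with socket.create_connection((host, control_port), timeout=10) as conn:
        conn.sendall(b'AUTHENTICATE ""\r\n')
        if not conn.recv(1024).startswith(b"250"):
            return False
        conn.sendall(b"SIGNAL NEWNYM\r\n")
        return conn.recv(1024).startswith(b"250")
```

A worker would call `record_request()` after each fetch and call `request_new_circuit()` when it returns True. Tor may rate-limit NEWNYM signals, so a rotation request is not guaranteed to yield a new exit IP immediately.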
pip install -e .
Requires Python 3.11+. On Windows, use python not python3 (the Windows Store alias uses separate site-packages).
Windows: Download the Tor Expert Bundle (not the Browser). Extract so you have tor/tor.exe.
Linux: sudo apt install tor (Debian/Ubuntu) or sudo dnf install tor (Fedora).
macOS: brew install tor
Tell the scraper where Tor is:
# Option A: put the Expert Bundle next to the script
tor-army/
  tor_army.py
  tor/              # Tor Expert Bundle extracted here
    tor/tor.exe     # or tor/tor on Linux/Mac
    data/geoip
    data/geoip6
# Option B: flag or env var
python tor_army.py --tor-dir /path/to/tor ...
export TOR_DIR=/path/to/tor
The scraper reads ID lists from id_lists/{target}.txt (one ID per line). For Wowhead, generate these from Wago DB2 CSV exports:
python generate_id_lists.py --csv-dir /path/to/wago-csvs/
python generate_id_lists.py --csv-dir /path/to/wago-csvs/ --stats # preview only
For non-Wowhead targets, just create the text files yourself — one ID per line.
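If you write the ID lists by hand, a loader along these lines shows the expected shape of the file. It is a sketch under assumptions: `load_ids` is a hypothetical helper, and skipping blank lines and duplicates is a convenience added here, not documented behavior of the scraper.

```python
from pathlib import Path


def load_ids(path: Path) -> list[str]:
    """Read one ID per line, dropping blank lines and duplicates, keeping order."""
    seen: set[str] = set()
    ids: list[str] = []
    for line in path.read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if entry and entry not in seen:
            seen.add(entry)
            ids.append(entry)
    return ids
```

For example, `load_ids(Path("id_lists/npc.txt"))` would return the NPC IDs in file order.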
# Default: 400 Tor instances x 5 workers = 2,000 concurrent
python tor_army.py --start-tor --targets spell,item,npc,quest
# Smoke test
python tor_army.py --start-tor --workers 5 --smoke 10 --targets npc
# Aggressive
python tor_army.py --start-tor --workers 600 --multiplier 8 --targets all
# Check progress
python tor_army.py --list-targets
# Re-parse cached HTML offline
python tor_army.py --targets npc --reparse
# Kill leftover Tor processes
python tor_army.py --kill-tor
Output goes to wowhead_data/{target}/raw/ (JSON) and wowhead_data/{target}/html/ (gzip HTML). Already-scraped IDs are skipped automatically.
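The gzip cache and the skip-if-done behavior could look roughly like the sketch below. The `{id}.html.gz` file naming and all function names are assumptions for illustration; check the actual output directory for the naming the tool uses.

```python
import gzip
from pathlib import Path

SUFFIX = ".html.gz"  # assumed naming scheme, not confirmed by the tool


def save_html(root: Path, target: str, entity_id: str, html: str) -> Path:
    """Write raw HTML as gzip under {root}/{target}/html/."""
    path = root / target / "html" / f"{entity_id}{SUFFIX}"
    path.parent.mkdir(parents=True, exist_ok=True)
    with gzip.open(path, "wt", encoding="utf-8") as fh:
        fh.write(html)
    return path


def load_html(path: Path) -> str:
    """Read a cached page back for offline re-parsing."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return fh.read()


def pending_ids(root: Path, target: str, ids: list[str]) -> list[str]:
    """Return the IDs that have no cached HTML yet, preserving order."""
    html_dir = root / target / "html"
    done: set[str] = set()
    if html_dir.is_dir():
        done = {p.name.removesuffix(SUFFIX) for p in html_dir.glob(f"*{SUFFIX}")}
    return [i for i in ids if i not in done]
```

Deriving the "done" set from files on disk means an interrupted run can resume without a separate progress database.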
Each Tor instance gets one shared HTTP/2 connection. Multiple workers send concurrent requests as HTTP/2 streams on that connection. This gives you N workers per instance but only 1 file descriptor per instance.
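The worker layout for a single instance can be sketched with plain asyncio. `FakeSession` is a stand-in for the shared HTTP/2 session (the real engine uses curl_cffi, which is not imported here), and the URL strings are placeholders; the point is only that several workers share one session object and keep multiple requests in flight at once.

```python
import asyncio


class FakeSession:
    """Stand-in for one shared HTTP/2 session bound to a single Tor instance."""

    def __init__(self) -> None:
        self.in_flight = 0
        self.peak_in_flight = 0

    async def get(self, url: str) -> str:
        self.in_flight += 1
        self.peak_in_flight = max(self.peak_in_flight, self.in_flight)
        await asyncio.sleep(0.01)  # simulated network latency
        self.in_flight -= 1
        return f"<html>{url}</html>"


async def worker(session: FakeSession, queue: asyncio.Queue, results: dict) -> None:
    """Pull URLs until the queue is empty, reusing the shared session."""
    while True:
        try:
            url = queue.get_nowait()
        except asyncio.QueueEmpty:
            return
        results[url] = await session.get(url)


async def run_instance(urls: list[str], multiplier: int = 5):
    """Run `multiplier` workers against one session; return results and peak concurrency."""
    session = FakeSession()
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: dict[str, str] = {}
    await asyncio.gather(*(worker(session, queue, results) for _ in range(multiplier)))
    return results, session.peak_in_flight
```

Running `asyncio.run(run_instance(urls, multiplier=5))` keeps up to five requests in flight on the one session, which mirrors the N-streams-per-connection idea.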
Five throttling layers prevent WAF blocks:
- Per-instance rate limiter — minimum interval between requests from the same exit IP
- Adaptive delay — backs off based on WAF hits per minute
- Circuit rotation — new exit IP every 150 requests
- Jittered exponential backoff — on consecutive errors
- WAF-triggered rotation — immediate circuit reset on 403
| Flag | Default | Description |
|---|---|---|
| --workers | 400 | Tor instances (~25 MB RAM each) |
| --multiplier | 5 | Workers per instance (HTTP/2 streams) |
| --delay | 0.15 | Base delay between requests (seconds) |
| --per-circuit | 150 | Requests before circuit rotation |
| --cache-html | true | Cache raw HTML as gzip |
| --targets | npc | Comma-separated entity types |
| --tor-dir | ./tor/ | Tor installation path (or TOR_DIR env var) |
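Running hundreds of Tor instances on one machine means each one needs its own SOCKS port, control port, and data directory. The sketch below generates a minimal torrc per instance using the standard `SocksPort`, `ControlPort`, and `DataDirectory` options. The port bases, directory naming, and function name are hypothetical; how tor_army.py actually lays out its fleet may differ.

```python
from pathlib import Path


def torrc_for_instance(index: int, data_root: Path,
                       socks_base: int = 9050, control_base: int = 10050) -> str:
    """Build torrc text giving instance `index` its own ports and data directory.

    The port bases are illustrative; with these defaults the two ranges
    do not collide for fewer than 1,000 instances.
    """
    data_dir = data_root / f"tor_{index:04d}"
    lines = [
        f"SocksPort 127.0.0.1:{socks_base + index}",
        f"ControlPort 127.0.0.1:{control_base + index}",
        f"DataDirectory {data_dir.as_posix()}",
    ]
    return "\n".join(lines) + "\n"
```

Each generated file can then be passed to a separate process with `tor -f <path>`.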
| Config | Workers | RAM | Measured |
|---|---|---|---|
| 240x2 | 480 | ~6 GB | 160-198K/hr |
| 400x5 | 2,000 | ~10 GB | 193-230K/hr |
| 600x8 | 4,800 | ~15 GB | untested |
All numbers are from real runs scraping Wowhead (1.2M pages total). The ceiling is Tor exit node diversity, not hardware: there are only ~1,000-1,500 Tor exit nodes globally, so 400 instances already overlap significantly.
Returns diminish past ~600 instances. Beyond that point, more and more instances share exit IPs and hit the same Cloudflare rate limits.
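The sizing figures above follow from simple arithmetic, and the exit overlap can be roughed out the same way. This is a back-of-the-envelope sketch: the function names are hypothetical, and the overlap estimate assumes each instance picks an exit uniformly at random from 1,500 nodes, which is a simplification of how Tor chooses relays.

```python
def fleet_estimate(instances: int, multiplier: int,
                   ram_mb_per_instance: float = 25.0) -> dict[str, float]:
    """Concurrency and RAM for a fleet, using the ~25 MB per instance figure."""
    return {
        "concurrent_workers": instances * multiplier,
        "ram_gb": instances * ram_mb_per_instance / 1000,
    }


def expected_distinct_exits(instances: int, exit_nodes: int = 1500) -> float:
    """Expected number of distinct exits if each instance chose one uniformly.

    Uniform choice is a simplifying assumption, not Tor's real selection rule.
    """
    return exit_nodes * (1 - (1 - 1 / exit_nodes) ** instances)
```

Under that assumption, 400 instances land on roughly 350 distinct exits and 600 instances on roughly 495. Tor weights relay selection by bandwidth, so real overlap is likely higher than this estimate.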
The Wowhead parsers and ID generator are specific to my use case. The scraping engine is not. To scrape a different site:
- TARGET_CONFIGS in tor_army.py — change URL patterns
- parsers.py — replace with your own HTML parsers, or skip parsing and just cache raw HTML. See parser_template.py for a working skeleton
- id_lists/{target}.txt — create your own ID/URL lists (one per line)
- WAF detection — the code looks for HTTP 403 and cf-challenge in the response, which is Cloudflare-specific. Adjust async_worker() for other WAFs
Everything else — Tor fleet management, HTTP/2 multiplexing, circuit rotation, rate limiting, the live dashboard — is target-agnostic.
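When adapting the WAF check, it can help to keep the block signature as data rather than hard-coded conditions. The sketch below is one way to do that, not the logic inside `async_worker()`: `WafSignature` is a hypothetical name, and treating either a 403 or the `cf-challenge` marker as a block is an assumption about how the two signals combine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WafSignature:
    """Status codes and lowercase body markers that indicate a WAF block."""

    block_statuses: frozenset[int] = frozenset({403})
    body_markers: tuple[str, ...] = ("cf-challenge",)

    def matches(self, status: int, body: str) -> bool:
        """Return True if the response looks like a block page."""
        if status in self.block_statuses:
            return True
        lowered = body.lower()
        return any(marker in lowered for marker in self.body_markers)


# Default values mirror the Cloudflare checks described above.
CLOUDFLARE = WafSignature()
```

For another WAF, you would construct a `WafSignature` with the status codes and markers you observe in its block pages.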
See examples/ for sample output from real scrapes.
- Python 3.11+
- Windows, Linux, or macOS
- Tor
- ~25 MB RAM per Tor instance
MIT