Production-grade web crawler with Google and Bing search support, featuring proxy rotation, stealth mode, and true deep data extraction using Headless Chrome.
- ✅ Google & Bing Search - First page results with exact match/verbatim support
- ✅ True Deep Crawl - Uses Headless Chrome for both SERP and target website extraction (executes JS, bypasses Cloudflare)
- ✅ Resilient Extraction - Automatic retries (up to 3x) for Google blocks/CAPTCHAs
- ✅ Rich Data Extraction - Captures:
- Main Text (Readability + Fallbacks)
- Metadata (Description, Authors, Keywords)
- Schema.org (JSON-LD)
- Open Graph Tags
- Emails & Phone Numbers
- Images & Outbound Links
- ✅ Stealth Mode - Bypasses webdriver detection, canvas fingerprinting, WebGL
- ✅ Visual Interface - Dark-themed dashboard at http://localhost:3000
- ✅ Live Monitoring - View crawl status, results, and extracted details in real time
- ✅ Authenticated proxies - Support for the `user:pass@host:port` format
- ✅ 4 Rotation Strategies - RoundRobin, LeastUsed, Random, Weighted
- ✅ Health tracking - Auto-disables proxies after consecutive failures
- ✅ Runtime management - Add/remove/enable proxies via API
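As an illustration of the simplest strategy, RoundRobin just cycles through the list by request index. This is a sketch only; the crawler implements rotation internally (see `proxy.rs`), and the variable names here are illustrative:

```shell
# Round-robin sketch: request N gets proxy (N mod number-of-proxies).
PROXY_LIST="proxy1.com:8080,proxy2.com:3128"
IFS=',' read -r -a PROXIES <<< "$PROXY_LIST"
N=${#PROXIES[@]}
for REQ in 0 1 2; do
  echo "request $REQ -> ${PROXIES[$((REQ % N))]}"
done
# request 0 -> proxy1.com:8080
# request 1 -> proxy2.com:3128
# request 2 -> proxy1.com:8080
```

LeastUsed and Weighted additionally need per-proxy counters and health state, which is what the failure tracking above feeds into.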
cp .env.example .env
# Edit .env with your DATABASE_URL
cd /home/guest/tzdump/crawling
docker-compose up -d
cd rust-crawler
source .env
cargo run

Open your browser to: http://localhost:3000
# Bing Search
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"keyword": "Top 5 Dota2 Players", "engine": "bing"}'
# Google Search (now with automatic retry)
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"keyword": "Top 5 Dota2 Players", "engine": "google"}'

| Variable | Description | Default |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string | Required |
| `PROXY_LIST` | Comma-separated proxies | (empty = direct) |
| `PROXY_ROTATION` | `roundrobin`, `leastused`, `random`, `weighted` | `roundrobin` |
| `PROXY_MAX_FAILS` | Failures before a proxy is disabled | 3 |
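Putting the table together, a complete `.env` might look like this (all values are illustrative placeholders, not real credentials):

```shell
# Example .env -- placeholder values
DATABASE_URL="postgres://crawler:secret@localhost:5432/crawler"
PROXY_LIST="proxy1.com:8080,proxy2.com:3128"
PROXY_ROTATION="leastused"
PROXY_MAX_FAILS="3"
```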
# Simple
PROXY_LIST="proxy1.com:8080,proxy2.com:3128"
# With authentication
PROXY_LIST="user:pass@premium-proxy.com:8080,user2:pass2@backup.com:3128"

The crawler stores rich JSON data in the database:
{
"results": [
{
"title": "Example Result",
"link": "https://example.com",
"snippet": "Description text..."
}
],
"people_also_ask": ["Question 1?", "Question 2?"],
"related_searches": ["Topic A", "Topic B"],
"total_results": "About 1,000,000 results"
}

The deep-extract payload additionally contains full text, HTML, and contacts extracted via Headless Chrome.
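To poke at a stored SERP blob from the shell, a short pipeline works (a sketch assuming `python3` is available; the field names come from the sample above):

```shell
# Pull just the links out of a SERP result blob (sample shape from above)
BLOB='{"results":[{"title":"Example Result","link":"https://example.com","snippet":"Description text..."}]}'
echo "$BLOB" | python3 -c 'import json,sys; [print(r["link"]) for r in json.load(sys.stdin)["results"]]'
# -> https://example.com
```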
rust-crawler/
├── src/
│ ├── main.rs # API server and routes
│ ├── api.rs # API handlers & Dashboard Endpoint
│ ├── crawler.rs # Core Logic (Google/Bing + Deep Extract)
│ ├── db.rs # Database operations
│ └── proxy.rs # Proxy rotation module
├── static/ # Dashboard HTML/CSS/JS
├── debug/ # Debug screenshots and HTML
├── logs/ # Application logs
├── crawl-results/ # Output files
└── Cargo.toml
- Refactor: Deep Extractor now uses Headless Chrome (JS-Enabled)
- Feature: Google Search Retry Mechanism (Exponential Backoff)
- UI: Added Dark Mode Dashboard (/tasks)
- Fix: Resolved "Null" data issues on complex sites using browser-based extraction
- Stealth: Enhanced canvas/WebGL fingerprinting protection
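The retry mechanism above (up to 3 attempts with exponential backoff) boils down to a loop like the following sketch, where `do_request` is a mock standing in for the actual Google SERP fetch:

```shell
# Mock request: fails twice, then succeeds (stands in for the Google fetch)
TRIES=0
do_request() {
  TRIES=$((TRIES + 1))
  [ "$TRIES" -ge 3 ]
}

attempt=1
delay=1
until do_request; do
  if [ "$attempt" -ge 3 ]; then echo "giving up"; break; fi
  echo "retry $attempt after ${delay}s"
  # sleep "$delay"   # real backoff; skipped so the sketch runs instantly
  delay=$((delay * 2))
  attempt=$((attempt + 1))
done
# retry 1 after 1s
# retry 2 after 2s
```

Doubling the delay between attempts gives a blocked crawler time to rotate to a healthier proxy before retrying.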
For a comprehensive deep dive into the Architecture, Dependency Graph, Internal Modules, and AI Tech Stack, please refer to our Official Technical Documentation.
graph TD
%% Nodes
Client["Client (Browser / API)"]
Caddy["Caddy Reverse Proxy"]
Crawler["Rust Crawler Service"]
Adminer["Adminer DB GUI"]
DB[("PostgreSQL 15")]
Chrome["Headless Chrome Instance"]
Storage["Filesystem / Results"]
%% Styles
%% Professional Cool/Warm Palette
style Client fill:#E1F5FE,stroke:#0277BD,stroke-width:2px,color:#01579B
style Caddy fill:#B3E5FC,stroke:#0277BD,stroke-width:2px,color:#01579B
style Crawler fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
style Adminer fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
style DB fill:#FFF9C4,stroke:#FBC02D,stroke-width:2px,shape:cylinder,color:#F57F17
style Chrome fill:#F5F5F5,stroke:#616161,stroke-width:1px,stroke-dasharray: 5 5,color:#212121
style Storage fill:#E0E0E0,stroke:#616161,stroke-width:2px,color:#212121
%% Connections
Client -->|HTTPS :443| Caddy
Caddy -->|/crawl :3000| Crawler
Caddy -->|/proxies :3000| Crawler
Caddy -->|/adminer :8080| Adminer
Crawler -->|SQL Queries| DB
Crawler -->|CDP Protocol| Chrome
Crawler -->|Write JSON/HTML| Storage
Adminer -->|Manage DB| DB
MIT