
Rust Crawler

Production-grade web crawler with Google and Bing search support, featuring proxy rotation, stealth mode, and true deep data extraction using Headless Chrome.

Features

Core Capabilities

  • Google & Bing Search - First page results with exact match/verbatim support
  • True Deep Crawl - Uses Headless Chrome for both SERP and target website extraction (executes JS, bypasses Cloudflare)
  • Resilient Extraction - Automatic retries (up to 3x) with exponential backoff for Google blocks/CAPTCHAs (see the sketch after this list)
  • Rich Data Extraction - Captures:
    • Main Text (Readability + Fallbacks)
    • Metadata (Description, Authors, Keywords)
    • Schema.org (JSON-LD)
    • Open Graph Tags
    • Emails & Phone Numbers
    • Images & Outbound Links
  • Stealth Mode - Bypasses webdriver detection and masks canvas/WebGL fingerprinting
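
The retry loop referenced above can be pictured with a short sketch. This is a minimal illustration, not the crawler's actual code: fetch_with_retry and its fetch closure are hypothetical stand-ins for the SERP request; only the 3-attempt exponential backoff shape mirrors the documented behaviour.

use std::{thread, time::Duration};

// Hypothetical sketch of retry with exponential backoff.
fn fetch_with_retry<F>(mut fetch: F, max_attempts: u32) -> Option<String>
where
    F: FnMut() -> Result<String, String>,
{
    for attempt in 0..max_attempts {
        match fetch() {
            Ok(html) => return Some(html),
            Err(block) => {
                // Back off exponentially between attempts: 1s, 2s, 4s, ...
                eprintln!("attempt {} blocked ({block}); backing off", attempt + 1);
                thread::sleep(Duration::from_secs(1u64 << attempt));
            }
        }
    }
    None
}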

Dashboard 📊

  • Visual Interface: Dark-themed dashboard at http://localhost:3000
  • Live Monitoring: View crawl status, results, and extraction details in real time.

Proxy Rotation (Production-Grade)

  • Authenticated proxies - Support for user:pass@host:port format
  • 4 Rotation Strategies - RoundRobin, LeastUsed, Random, Weighted (sketched after this list)
  • Health tracking - Auto-disables proxies after consecutive failures
  • Runtime management - Add/remove/enable proxies via API
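
A rough sketch of what strategy selection looks like. Types here are illustrative only; the real implementation lives in src/proxy.rs and also tracks health and weights.

enum RotationStrategy {
    RoundRobin,
    LeastUsed,
    Random,
    Weighted,
}

struct ProxyPool {
    proxies: Vec<String>, // e.g. "user:pass@host:port"
    uses: Vec<u64>,       // pick counts, parallel to `proxies`
    cursor: usize,        // round-robin position
}

impl ProxyPool {
    fn next(&mut self, strategy: &RotationStrategy) -> Option<&str> {
        if self.proxies.is_empty() {
            return None; // empty list = crawl directly
        }
        let i = match strategy {
            RotationStrategy::RoundRobin => {
                let i = self.cursor % self.proxies.len();
                self.cursor += 1;
                i
            }
            // Pick the proxy used least often so far.
            RotationStrategy::LeastUsed => {
                (0..self.proxies.len()).min_by_key(|&i| self.uses[i]).unwrap()
            }
            // Random and Weighted elided; they only change how `i` is chosen.
            _ => 0,
        };
        self.uses[i] += 1;
        Some(&self.proxies[i])
    }
}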

Quick Start

1. Environment Setup

cp .env.example .env
# Edit .env with your DATABASE_URL

2. Run with Docker Compose

cd crawling  # repository root
docker-compose up -d

3. Run Locally (Development)

cd rust-crawler
source .env
cargo run

4. Access Dashboard

Open your browser to: http://localhost:3000

5. API Testing

# Bing Search
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"keyword": "Top 5 Dota2 Players", "engine": "bing"}'

# Google Search (now with automatic retry)
curl -X POST http://localhost:3000/crawl \
  -H "Content-Type: application/json" \
  -d '{"keyword": "Top 5 Dota2 Players", "engine": "google"}'

Configuration

Environment Variables

Variable           Description                                        Default
DATABASE_URL       PostgreSQL connection string                       (required)
PROXY_LIST         Comma-separated proxy list                         (empty = direct connection)
PROXY_ROTATION     One of roundrobin, leastused, random, weighted     roundrobin
PROXY_MAX_FAILS    Consecutive failures before a proxy is disabled    3
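
A sketch of reading these variables at startup. Illustrative only; the defaults match the table, but the crawler's actual config loader may differ.

use std::env;

struct Config {
    database_url: String,
    proxy_list: Vec<String>,
    proxy_rotation: String,
    proxy_max_fails: u32,
}

fn load_config() -> Config {
    Config {
        // Required: fail fast if missing.
        database_url: env::var("DATABASE_URL").expect("DATABASE_URL is required"),
        // Empty or unset list means crawl without proxies.
        proxy_list: env::var("PROXY_LIST")
            .unwrap_or_default()
            .split(',')
            .filter(|s| !s.is_empty())
            .map(str::to_owned)
            .collect(),
        proxy_rotation: env::var("PROXY_ROTATION").unwrap_or_else(|_| "roundrobin".into()),
        proxy_max_fails: env::var("PROXY_MAX_FAILS")
            .ok()
            .and_then(|v| v.parse().ok())
            .unwrap_or(3),
    }
}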

Proxy Format Examples

# Simple
PROXY_LIST="proxy1.com:8080,proxy2.com:3128"

# With authentication
PROXY_LIST="user:pass@premium-proxy.com:8080,user2:pass2@backup.com:3128"

Data Structure

The crawler stores rich JSON data in the database.

Crawl Result JSON (results_json)

{
  "results": [
    {
      "title": "Example Result",
      "link": "https://example.com",
      "snippet": "Description text..."
    }
  ],
  "people_also_ask": ["Question 1?", "Question 2?"],
  "related_searches": ["Topic A", "Topic B"],
  "total_results": "About 1,000,000 results"
}
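
For code that consumes results_json, this shape maps directly onto serde structs. A minimal sketch (struct names are illustrative; field names mirror the JSON above):

use serde::Deserialize;

#[derive(Debug, Deserialize)]
struct SearchResult {
    title: String,
    link: String,
    snippet: String,
}

#[derive(Debug, Deserialize)]
struct CrawlResult {
    results: Vec<SearchResult>,
    people_also_ask: Vec<String>,
    related_searches: Vec<String>,
    total_results: String,
}

// let parsed: CrawlResult = serde_json::from_str(&results_json)?;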

Deep Extracted Content

Contains full text, HTML, and contacts extracted via Headless Chrome.
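
In crate terms, a browser-based extraction pass looks roughly like the sketch below, assuming the headless_chrome crate; the actual code in src/crawler.rs may use a different CDP client and adds stealth patches, proxying, and readability/contact extraction on top.

use headless_chrome::Browser;

fn deep_extract(url: &str) -> anyhow::Result<String> {
    let browser = Browser::default()?; // launches a local headless Chrome
    let tab = browser.new_tab()?;
    tab.navigate_to(url)?;
    tab.wait_until_navigated()?;       // let JS-rendered content settle
    let html = tab.get_content()?;     // full post-JS DOM as HTML
    Ok(html)
}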


Directory Structure

rust-crawler/
├── src/
│   ├── main.rs       # API server and routes
│   ├── api.rs        # API handlers & Dashboard Endpoint
│   ├── crawler.rs    # Core Logic (Google/Bing + Deep Extract)
│   ├── db.rs         # Database operations
│   └── proxy.rs      # Proxy rotation module
├── static/           # Dashboard HTML/CSS/JS
├── debug/            # Debug screenshots and HTML
├── logs/             # Application logs
├── crawl-results/    # Output files
└── Cargo.toml

Changelog

2025-12-13

  • Refactor: Deep Extractor now uses Headless Chrome (JS-Enabled)
  • Feature: Google Search Retry Mechanism (Exponential Backoff)
  • UI: Added Dark Mode Dashboard (/tasks)
  • Fix: Resolved "Null" data issues on complex sites using browser-based extraction
  • Stealth: Enhanced canvas/WebGL fingerprinting protection

🏗️ Deep Technical Architecture

For a comprehensive deep dive into the Architecture, Dependency Graph, Internal Modules, and AI Tech Stack, please refer to our Official Technical Documentation.

Tech Stack Gist

📄 View Full Technical Stack Documentation (Gist)

Architecture Diagram

graph TD
    %% Nodes
    Client["Client (Browser / API)"]
    Caddy["Caddy Reverse Proxy"]
    Crawler["Rust Crawler Service"]
    Adminer["Adminer DB GUI"]
    DB[("PostgreSQL 15")]
    Chrome["Headless Chrome Instance"]
    Storage["Filesystem / Results"]

    %% Styles
    %% Professional Cool/Warm Palette
    style Client fill:#E1F5FE,stroke:#0277BD,stroke-width:2px,color:#01579B
    style Caddy fill:#B3E5FC,stroke:#0277BD,stroke-width:2px,color:#01579B
    style Crawler fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    style Adminer fill:#C8E6C9,stroke:#2E7D32,stroke-width:2px,color:#1B5E20
    style DB fill:#FFF9C4,stroke:#FBC02D,stroke-width:2px,shape:cylinder,color:#F57F17
    style Chrome fill:#F5F5F5,stroke:#616161,stroke-width:1px,stroke-dasharray: 5 5,color:#212121
    style Storage fill:#E0E0E0,stroke:#616161,stroke-width:2px,color:#212121

    %% Connections
    Client -->|HTTPS :443| Caddy
    Caddy -->|/crawl :3000| Crawler
    Caddy -->|/proxies :3000| Crawler
    Caddy -->|/adminer :8080| Adminer
    
    Crawler -->|SQL Queries| DB
    Crawler -->|CDP Protocol| Chrome
    Crawler -->|Write JSON/HTML| Storage
    Adminer -->|Manage DB| DB

License

MIT
