An objective comparison framework for web crawling implementations. Benchmarks Python (httpx), Node.js (undici), Scrapy, and Scrapy-lite across performance profiles to identify the optimal crawler for company data extraction.
Core Question: Which web crawler performs best for extracting structured data from websites?
This project provides:
- Objective benchmarks across 4 crawler implementations
- Profile-based testing (aggressive, balanced, conservative)
- Automated reporting with side-by-side metrics
| Crawler | Runtime | HTTP Client | Lines of Code | Tests | Status |
|---|---|---|---|---|---|
| Python | Python 3.11+ | httpx (async) | 400 | 99 | ✅ Complete |
| Node.js | Node.js 20+ | undici | 450 | 59 | ✅ Complete |
| Scrapy | Python 3.11+ | Twisted | 150 | Framework | ✅ Complete |
| Scrapy-lite | Python 3.11+ | Twisted | 150 | Framework | ✅ Complete |
All crawlers extract phones (E.164), social media URLs, and physical addresses, with robots.txt compliance and user-agent rotation.
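As a rough illustration of the shared regex-based approach, a minimal extractor might look like the sketch below. The patterns and the `extract_contacts` helper are hypothetical; the project's actual patterns live in `src/common/extraction_utils.py` and are likely more thorough.

```python
import re

# Hypothetical patterns, for illustration only.
E164_RE = re.compile(r"\+[1-9]\d{1,14}")  # E.164: "+" then up to 15 digits
SOCIAL_RE = re.compile(
    r"https?://(?:www\.)?(?:facebook|twitter|linkedin|instagram)\.com/[\w\-./]+"
)

def extract_contacts(html: str) -> dict:
    """Pull phone numbers and social links out of raw page text."""
    return {
        "phones": sorted(set(E164_RE.findall(html))),
        "social": sorted(set(SOCIAL_RE.findall(html))),
    }
```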
Prerequisites: Docker & Docker Compose
make demo # Spins up services and runs python crawler pipeline end to end with reports
make benchmark # Run all crawlers with default profile
make benchmark BENCH_CONFIGS="python:balanced scrapy:aggressive" # Compare specific configurations
make evaluate # Generate comparison reports
make crawl-python PROFILE=aggressive # Python httpx implementation
make crawl-node PROFILE=balanced # Node.js undici implementation
make crawl-scrapy PROFILE=conservative # Scrapy native extraction
make crawl-scrapy-lite PROFILE=balanced # Scrapy with shared regex utils

Profiles control timeout, concurrency, retry logic, and crawl delay:
- aggressive: Fast, high concurrency (50 workers), minimal delays
- balanced: Default, respectful crawling (25 workers)
- conservative: Slow, very polite (10 workers), maximum delays
Defined in `configs/profiles/*.yaml`
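A profile file could plausibly look like the sketch below. The key names shown here are illustrative assumptions, not the project's actual schema; check the files under `configs/profiles/` for the real fields.

```yaml
# Hypothetical sketch of a "balanced" profile; actual keys may differ.
name: balanced
concurrency: 25        # parallel workers
timeout_s: 15          # per-request timeout
retries: 3             # attempts before giving up
crawl_delay_s: 1.0     # pause between requests to the same host
```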
After running benchmarks, compare crawlers side-by-side:
make evaluate # Generates reports in data/reports/

Key Metrics:
- Coverage: % of sites successfully crawled
- Speed: Total runtime and sites/second
- Data Quality: Extraction accuracy for phones, social links, addresses
- Politeness: Robots.txt compliance, request spacing
Results help identify the optimal crawler for your use case (speed vs. politeness tradeoff).
See Architecture Documentation for a detailed overview.
Input (CSV) → Crawler → Extraction → Output (NDJSON)
Shared Components (ensures fair comparison):
- Extraction logic: `src/common/extraction_utils.py` (regex-based)
- Robots.txt: `src/common/robots_parser.py` (24h cache)
- User-agent rotation: `src/common/user_agent_rotation.py` (7 browser UAs)
- Configuration: `src/common/config_loader.py` (YAML profiles)
Crawler-Specific:
- HTTP client implementation
- Concurrency model (asyncio, threads, Twisted)
- Error handling strategies
- Retry logic
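Retry logic is one of the crawler-specific pieces; the general pattern each implementation applies in its own runtime is exponential backoff with jitter. The sketch below shows that pattern generically and is not the project's exact code.

```python
import random
import time

def fetch_with_backoff(fetch, url, retries=3, base_delay=0.5):
    """Retry a fetch callable with exponential backoff and jitter.

    Generic sketch of the retry pattern, not the project's exact code.
    """
    for attempt in range(retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries:
                raise  # out of attempts, propagate the last error
            # Double the wait each attempt; jitter avoids synchronized retries.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```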
Crawl → ETL → Elasticsearch → Matching API
The project includes a complete data pipeline, but the core focus is crawler benchmarking.
configs/
├── profiles/          # Performance profiles (aggressive/balanced/conservative)
├── crawl.policy.yaml  # Robots.txt, UA, timeout configs
└── weights.yaml       # Matching algorithm weights (for API)
src/
├── crawlers/
│   ├── python/        # httpx async implementation
│   ├── node/          # undici TypeScript implementation
│   ├── scrapy/        # Scrapy native extraction
│   └── scrapy-lite/   # Scrapy with shared regex utils
├── common/            # Shared utilities (extraction, robots.txt, UA rotation)
├── eval/              # Benchmark comparison and reporting
├── etl/               # ETL pipeline (normalize, merge, dedupe, load)
└── api/               # FastAPI matching service (optional)
data/
├── inputs/            # Sample CSV datasets
├── outputs/           # Crawler results (NDJSON)
└── reports/           # Benchmark comparison reports
tests/ # pytest (Python) and mocha (Node.js) test suites
Python (httpx): Modern async/await, native Python ecosystem integration
Node.js (undici): Official Node.js HTTP client, 5-10× faster than axios
Scrapy: Industry-standard framework, battle-tested at scale
Scrapy-lite: Scrapy performance with shared extraction logic for fair comparison
All crawler implementations include:
- ✅ Robots.txt compliance: 24h cache, fail-open strategy
- ✅ User-agent rotation: 7 realistic browser UAs with ethical ID
- ✅ Data extraction: Phones (E.164), social URLs, addresses (regex-based)
- ✅ Error handling: Exponential backoff, HTTP fallback, timeout management
- ✅ Configurable profiles: Tune performance/politeness tradeoffs
- ✅ NDJSON output: Streaming-friendly format for large datasets
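The "24h cache, fail-open" robots.txt behaviour can be sketched with the standard library as below. This is an illustrative reimplementation, not the code in `src/common/robots_parser.py`; the cache structure and function name are assumptions.

```python
import time
import urllib.robotparser

# Illustrative cache: robots.txt URL -> (fetch time, parsed rules).
_CACHE: dict[str, tuple[float, urllib.robotparser.RobotFileParser]] = {}
TTL = 24 * 3600  # re-fetch robots.txt after 24 hours

def can_fetch(robots_url: str, user_agent: str, page_url: str) -> bool:
    now = time.time()
    cached = _CACHE.get(robots_url)
    if cached is None or now - cached[0] > TTL:
        parser = urllib.robotparser.RobotFileParser(robots_url)
        try:
            parser.read()
        except OSError:
            return True  # fail open: if robots.txt is unreachable, allow
        _CACHE[robots_url] = (now, parser)
        cached = _CACHE[robots_url]
    return cached[1].can_fetch(user_agent, page_url)
```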
make test # All tests in Docker
pytest # Python tests (native)
make test-node # Node.js tests (native)

# Docker (recommended)
make crawl-python INPUT=data/inputs/sample.csv PROFILE=balanced
# Native
python src/crawlers/python/main.py --input data/inputs/sample.csv --profile configs/profiles/balanced.yaml
node src/crawlers/node/dist/index.js --input data/inputs/sample.csv --profile configs/profiles/balanced.yaml

- Create directory in `src/crawlers/<name>/`
- Implement extraction logic using existing examples (TODO: to be implemented as a shared utility)
- Output NDJSON format: `{"url": "...", "phones": [...], ...}`
- Add Makefile target: `make crawl-<name>`
- Add to benchmark configs in `Makefile`
- Run comparative benchmark: `make benchmark`
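Producing the NDJSON output amounts to writing one JSON object per line. A minimal sketch, assuming record fields beyond `url` and `phones` are illustrative:

```python
import json

def write_ndjson(records, fh):
    """Write each record as one JSON object per line (NDJSON)."""
    for rec in records:
        fh.write(json.dumps(rec, ensure_ascii=False) + "\n")
```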
The project includes ETL and API components for end-to-end demonstrations:
make up # Elasticsearch + API
make down # Stop services

curl -X POST http://localhost:8000/match \
  -H "Content-Type: application/json" \
  -d '{"company_name": "Acme Corp", "website": "acme.com"}'

See API Documentation for details.
Benchmarking & Comparison:
- Configuration - Profile system explained
- Improvements - Coverage optimization techniques
Full Pipeline (optional):
- Architecture - System design overview
- API Reference - REST API documentation
- Scalability - Billion-scale deployment strategy
Implementation Details:
- Robots.txt - Compliance implementation
- User-Agent Rotation - UA rotation details
- Solution - Design decisions and lessons learned
Benchmarking Focus: New features should help compare crawler performance or improve extraction accuracy across all implementations.
Guidelines:
- Shared utilities go in
src/common/ - All features require tests (pytest or mocha)
- Keep it simple: regex over complex parsers
MIT License - Copyright (c) 2025 Vlad Balan
TL;DR: Run `make demo` to start all services and run one crawler pipeline, or `make up && make benchmark` to compare the 4 web crawler implementations.
