I built this project for fun and learning: the goal was to design and implement a production-grade distributed web crawler from first principles, exploring core distributed systems concepts along the way. This document chronicles that design and implementation. The project evolved through three distinct phases, each introducing progressively more complex distributed systems concepts while keeping the engineering decisions pragmatic.
Final Architecture:
- Coordinator-worker pattern with Redis as shared state
- Bloom filter-based deduplication (memory-efficient)
- Quality-first data collection for LLM training
- Deployed across multiple AWS EC2 instances with ElastiCache
Phase 1 Objective: Understand core crawling mechanics without concurrency complexity.
Architecture:
┌─────────────┐
│ Main │
│ Process │
├─────────────┤
│ Queue (slice)│
│ Visited (map)│
└─────────────┘
│
├──> Fetch URL
├──> Parse HTML
├──> Quality Filter
├──> Save Data
└──> Add Links to Queue
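To make the loop concrete, here is a minimal sketch of the Phase 1 flow. The fetchAndParse, passesQualityFilter, and savePage helpers are hypothetical placeholders for the real fetch, parse, filter, and storage steps, and the seed URL is just an example.

```go
// Phase 1 sketch: single-threaded crawl loop with a FIFO queue and a visited map.
package main

import (
	"fmt"
	"time"
)

func fetchAndParse(url string) (text string, links []string, err error) {
	// Placeholder: the real code performs an HTTP GET and parses the HTML here.
	return "", nil, nil
}

func passesQualityFilter(text string) bool {
	// Placeholder: text-to-HTML ratio, language detection, link density checks.
	return len(text) >= 1000
}

func savePage(url, text string) {
	// Placeholder: the real code writes accepted pages to disk.
}

func main() {
	const maxPages = 500
	queue := []string{"https://en.wikipedia.org/wiki/Web_crawler"} // example seed
	visited := map[string]bool{}                                   // ~50 bytes per URL
	saved := 0

	for saved < maxPages && len(queue) > 0 { // max-pages stopping condition
		url := queue[0]
		queue = queue[1:] // FIFO: breadth-first exploration
		if visited[url] {
			continue
		}
		visited[url] = true

		text, links, err := fetchAndParse(url)
		if err != nil {
			continue
		}
		if passesQualityFilter(text) {
			savePage(url, text)
			saved++
		}
		for _, l := range links {
			if !visited[l] {
				queue = append(queue, l) // enqueue newly discovered links
			}
		}
		time.Sleep(1 * time.Second) // politeness delay
	}
	fmt.Printf("crawl finished: %d pages saved\n", saved)
}
```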
Critical Design Decisions:
- Stopping Condition: Max Pages vs Max Depth
- Problem: Web graphs are effectively infinite
- Options considered:
- Queue empty + no workers (fails: queue never truly empties)
- Max depth (fails: unbounded total pages)
- Max pages (selected)
- Decision: Max pages provides bounded execution
- Rationale: Depth-first can exhaust resources; max pages guarantees termination
- Queue Data Structure: FIFO vs LIFO
- Options:
- FIFO (breadth-first): explores widely before going deep
- LIFO (depth-first): follows one path completely
- Decision: FIFO (breadth-first)
- Rationale: For LLM training data, diversity matters more than depth in any single topic
- URL Deduplication: Initial Approach
- Implementation: map[string]bool
- Memory cost: ~50 bytes per URL
- For 1M URLs: ~50MB (acceptable for Phase 1)
- Known limitation: doesn't scale to 100M+ URLs
- Quality Filtering Strategy
- Initial metrics considered:
- Text-to-HTML ratio
- Link density
- Image count
- Domain reputation
- Content deduplication
- Selected for Phase 1:
- Text-to-HTML ratio (easy to implement, high signal)
- Language detection (filters out non-English content)
- Link density (spam indicator)
- Deferred to later:
- Domain reputation (requires external data)
- Content deduplication (too expensive for Phase 1)
Key Learnings:
The single-threaded implementation revealed the fundamental challenge: parsing is CPU-bound, but fetching is I/O-bound. Sequential execution wastes CPU cycles while waiting for network responses.
Measured bottleneck: 1 page per 3-5 seconds (fetch: 2-4s, parse: 0.1s, filter: 0.01s, sleep: 1s)
Phase 2 Objective: Maximize single-machine throughput through concurrency.
Architecture:
┌──────────────────────────────────────┐
│ Manager Goroutine │
│ ┌────────┐ ┌──────────┐ │
│ │ Queue │ │ Visited │ │
│ │(slice) │ │ (map) │ │
│ └────────┘ └──────────┘ │
└──────┬───────────────────────────────┘
│
├─────> Job Channel (buffered: 100)
│
┌───┴────────────────────────┐
│ Worker Pool (25 goroutines)│
└───┬────────────────────────┘
│
└─────> Result Channel (buffered: 100)
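A compressed sketch of this manager-worker setup, using the buffer sizes and worker count from the diagram. processPage is a hypothetical placeholder for the fetch/parse/filter work each worker performs.

```go
// Phase 2 sketch: manager-worker pattern with buffered channels.
// The manager goroutine is the sole owner of the queue and visited set,
// so no mutexes are needed.
package main

import "fmt"

type result struct {
	url   string
	links []string
}

func processPage(url string) result {
	// Placeholder: HTTP fetch, HTML parse, quality filter, politeness delay.
	return result{url: url}
}

func main() {
	const numWorkers = 25
	const maxPages = 500
	jobs := make(chan string, 100)    // buffered: prevents the manager blocking on bursts
	results := make(chan result, 100) // buffered: prevents workers blocking

	// Worker pool: each goroutine pulls URLs and reports results.
	for i := 0; i < numWorkers; i++ {
		go func() {
			for url := range jobs {
				results <- processPage(url)
			}
		}()
	}

	// Manager: single owner of the queue and visited map.
	queue := []string{"https://en.wikipedia.org/wiki/Web_crawler"} // example seed
	visited := map[string]bool{}
	inFlight := 0
	crawled := 0

	for crawled < maxPages && (len(queue) > 0 || inFlight > 0) {
		// Dispatch while there is work and channel capacity.
		for len(queue) > 0 && inFlight < cap(jobs) {
			url := queue[0]
			queue = queue[1:]
			if visited[url] {
				continue
			}
			visited[url] = true
			jobs <- url
			inFlight++
		}
		if inFlight == 0 {
			break // all remaining URLs were already visited
		}
		r := <-results // collect one result, then loop to dispatch more
		inFlight--
		crawled++
		for _, l := range r.links {
			if !visited[l] {
				queue = append(queue, l)
			}
		}
	}
	close(jobs)
	fmt.Printf("crawled %d pages\n", crawled)
}
```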
Critical Design Decisions:
- Concurrency Pattern: Manager-Worker vs Shared State
- Options:
- Shared state with mutexes (workers access queue directly)
- Manager-worker (single goroutine owns state)
- Lock-free data structures
- Decision: Manager-worker
- Rationale:
- No race conditions (single owner of queue/visited)
- Clear separation of concerns
- Easier to reason about
- Manager becomes bottleneck only at very high scales
- Channel Buffer Sizing
- Job channel: 100 (prevents manager blocking)
- Result channel: 100 (prevents worker blocking)
- Rationale: Bursts occur when pages have many links; buffering smooths throughput
- Worker Pool Size
- Options tested: 5, 10, 25, 50
- Selected: 25
- Rationale:
- Below 25: CPU underutilized during network waits
- Above 50: diminishing returns, risk of Wikipedia rate limiting
- 25: Sweet spot for single machine
- Bloom Filter Introduction
- Problem: The visited map costs ~50 bytes per URL, which does not scale to 100M+ URLs
- Bloom filter characteristics:
- Space: O(n) with a small constant (about 1.44 * log2(1/p) bits per element, roughly 9.6 bits at 1% FPR)
- Time: O(k), where k is the number of hash functions (10 in this configuration)
- Tradeoff: False positives acceptable (slightly reduced coverage)
- Configuration:
- Capacity: 1M URLs
- False positive rate: 1%
- Memory: 14MB vs 50MB for map
Formula: m = -n * ln(p) / (ln(2)^2), with optimal k = (m/n) * ln(2), where n = capacity and p = FPR
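The sizing math can be checked with a few lines of Go. This sketch only computes the theoretical bit-array size and the optimal hash count for the stated capacity and FPR; a real library adds overhead on top of the bit array, and an implementation may pick a different k (the configuration above uses 10).

```go
// Bloom filter sizing sketch: derives the bit-array size m and hash count k
// from the capacity n and target false-positive rate p, using the formula above.
package main

import (
	"fmt"
	"math"
)

func bloomParams(n, p float64) (mBits float64, k int) {
	mBits = -n * math.Log(p) / (math.Ln2 * math.Ln2) // m = -n*ln(p)/(ln 2)^2
	k = int(math.Ceil((mBits / n) * math.Ln2))       // optimal k = (m/n)*ln 2
	return mBits, k
}

func main() {
	m, k := bloomParams(1_000_000, 0.01) // capacity 1M URLs, 1% FPR
	fmt.Printf("bit array: %.0f bits (~%.1f MB), hash functions: %d\n", m, m/8/1e6, k)
	// Prints roughly: 9585059 bits (~1.2 MB) and k = 7.
	// This is the raw bit-array size only, not total process memory.
}
```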
Performance Analysis:
Without Bloom filter (map):
- 203 pages in 49 seconds = 4.11 pages/second
- Memory: ~10MB for visited set
With Bloom filter:
- 199 pages in 32 seconds = 6.20 pages/second
- Memory: ~14MB total
Speedup: 51% improvement (primarily from reduced memory pressure, better cache locality)
Key Learnings:
- Politeness delay (1 second) dominated total time
- Concurrency improved throughput but couldn't eliminate network latency
- Bloom filter provided minimal memory savings at this scale, but it becomes critical for the distributed deployment
Phase 3 Objective: Scale horizontally across multiple machines while maintaining coordination.
Architecture:
┌─────────────────────────────────────────────────┐
│ AWS Region │
│ │
│ ┌────────────────┐ │
│ │ Coordinator │ │
│ │ (EC2) │ │
│ │ - Monitor │ │
│ │ - Signal STOP │ │
│ └────────┬───────┘ │
│ │ │
│ │ │
│ ┌────▼─────────────────────────┐ │
│ │ Redis (ElastiCache) │ │
│ │ │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ URL Queue (List) │ │ │
│ │ │ LPUSH/RPOP │ │ │
│ │ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ Bloom Filter │ │ │
│ │ │ BF.ADD/BF.EXISTS │ │ │
│ │ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ Stats (Hash) │ │ │
│ │ │ HINCRBY │ │ │
│ │ └─────────────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────────┐ │ │
│ │ │ Control (String) │ │ │
│ │ │ SET/GET │ │ │
│ │ └─────────────────────┘ │ │
│ └──────────┬───────────────────┘ │
│ │ │
│ ┌─────────────┼─────────────┐ │
│ │ │ │ │
│ ┌─▼──┐ ┌─▼──┐ ┌─▼──┐ │
│ │ W1 │ │ W2 │ │ W3 │ ... (5x) │
│ │EC2 │ │EC2 │ │EC2 │ │
│ └────┘ └────┘ └────┘ │
│ │
└─────────────────────────────────────────────────┘
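A sketch of what one worker's main loop might look like against the shared Redis state, assuming the go-redis client and the RedisBloom module. The endpoint address and processPage are placeholders; the key names follow the shared-state design listed below.

```go
// Phase 3 sketch: a worker loop that consumes the shared Redis queue,
// deduplicates through RedisBloom, and reports stats atomically.
package main

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

func processPage(url string) []string {
	// Placeholder: fetch, parse, quality filter, save; returns discovered links.
	return nil
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis.internal:6379"}) // example ElastiCache endpoint

	for {
		// Stop when the coordinator flips the control key.
		if cmd, _ := rdb.Get(ctx, "crawl:control").Result(); cmd == "STOP" {
			return
		}

		url, err := rdb.RPop(ctx, "crawl:queue").Result()
		if err == redis.Nil { // queue momentarily empty
			time.Sleep(500 * time.Millisecond)
			continue
		} else if err != nil {
			continue
		}

		links := processPage(url)
		rdb.HIncrBy(ctx, "crawl:stats", "pages_crawled", 1) // atomic counter

		for _, l := range links {
			// Note: EXISTS-then-ADD is not atomic; see the race condition
			// discussion (and Lua sketch) below.
			seen, _ := rdb.Do(ctx, "BF.EXISTS", "crawl:visited", l).Int()
			if seen == 0 {
				rdb.Do(ctx, "BF.ADD", "crawl:visited", l)
				rdb.LPush(ctx, "crawl:queue", l)
			}
		}

		time.Sleep(1 * time.Second) // politeness delay per worker
	}
}
```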
Critical Design Decisions:
- Coordinator vs Self-Coordinating Workers
- Options:
- Coordinator + Workers (selected)
- Pure self-coordination (distributed consensus)
- Decision: Coordinator pattern
- Rationale:
- Simpler implementation
- Coordinator is lightweight (just monitors Redis)
- Can be made stateless (all state in Redis)
- Production systems (K8s, Spark) use this pattern
- Self-coordination adds complexity without clear benefit at this scale
- Shared State Design
- Redis data structures:
crawl:queue   → List (LPUSH/RPOP)
crawl:visited → RedisBloom (BF.ADD/BF.EXISTS)
crawl:stats   → Hash (HINCRBY for atomic increments)
crawl:control → String (RUNNING/STOP signal)
crawl:workers → Hash (heartbeat tracking)
- Key design principle: Atomic operations prevent race conditions
- Bloom Filter: Local vs Remote
- Options:
- Serialize entire filter (14MB network transfer per check)
- RedisBloom module (atomic operations)
- Sharded filters (complex coordination)
- Decision: RedisBloom
- Rationale:
- Atomic BF.ADD and BF.EXISTS operations
- No serialization overhead
- Sub-millisecond latency within VPC
- Race Condition in URL Discovery
- Problem:
Worker A: BF.EXISTS("url") → false
Worker B: BF.EXISTS("url") → false (simultaneous!)
Worker A: BF.ADD("url"), LPUSH queue
Worker B: BF.ADD("url"), LPUSH queue → DUPLICATE
- Options:
- Lua script for atomic check-and-set
- Accept small duplicate percentage
- Decision Phase 1: Accept duplicates (pragmatic)
- Rationale: Bloom filter already allows false positives; small duplicate rate acceptable for learning phase
- Future improvement: Lua script for production (see the sketch below)
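A rough sketch of that Lua-script approach, assuming go-redis and RedisBloom: the existence check and the add/enqueue run inside one script, so Redis executes them atomically and closes the window between BF.EXISTS and BF.ADD.

```go
// Sketch of the atomic check-and-enqueue via a Lua script (future improvement).
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

// Returns 1 if the URL was new and enqueued, 0 if it was already seen.
var checkAndEnqueue = redis.NewScript(`
if redis.call('BF.EXISTS', KEYS[1], ARGV[1]) == 0 then
  redis.call('BF.ADD', KEYS[1], ARGV[1])
  redis.call('LPUSH', KEYS[2], ARGV[1])
  return 1
end
return 0
`)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis.internal:6379"}) // example endpoint

	added, err := checkAndEnqueue.Run(ctx, rdb,
		[]string{"crawl:visited", "crawl:queue"},
		"https://en.wikipedia.org/wiki/Go_(programming_language)").Int()
	if err != nil {
		panic(err)
	}
	fmt.Println("newly enqueued:", added == 1)
}
```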
- Deployment Strategy
- Infrastructure as Code: Terraform
- Rationale:
- Reproducible infrastructure
- Version controlled
- Easy teardown (cost control)
- Industry standard
Network Topology:
VPC: 10.0.0.0/16
│
└── Subnet: 10.0.1.0/24 (ap-south-1a)
│
├── Coordinator: 10.0.1.x
├── Workers (5x): 10.0.1.y
└── Redis: 10.0.1.z
Security Group Rules:
- Ingress: SSH (port 22) from operator IP
- Ingress: All TCP within VPC (10.0.0.0/16)
- Egress: All traffic (for crawling external sites)
Key Learnings:
- Coordinator as Single Point of Failure (Acceptable)
- Coordinator crash impact: Workers continue processing current URLs
- Recovery: Restart coordinator, resume monitoring
- Mitigation: Coordinator is stateless (can run on any machine)
- Politeness vs Throughput
- 1-second delay per worker = theoretical max 5 pages/second with 5 workers
- Actual: 0.60 pages/second (network latency, parsing overhead)
- Improvement opportunity: Domain-specific rate limiting (not implemented)
- Cost vs Performance
- t2.micro instances: Sufficient for I/O-bound crawling
- ElastiCache t3.micro: Adequate for 5 workers
- Total cost: ~$0.10/hour (see the cost breakdown below)
- Cost optimization: Spot instances (not implemented for simplicity)
Quality Filter Pipeline:
Raw HTML
│
├─> Strip <script>, <style>, <noscript>
│
├─> Extract text from <p>, <article>, <div>
│
├─> Language Detection (parse <html lang="">)
│ └─> Reject if not "en"
│
├─> Text Length Check
│ └─> Reject if < 1000 characters
│
├─> Link Density
│ └─> links_per_1000_chars = (num_links / text_length) * 1000
│ └─> Reject if > 50
│
├─> Image Count
│ └─> Reject if > 50 (likely gallery)
│
└─> Special Page Filter
└─> Reject URLs containing:
- Wikipedia:
- Special:
- Help:
- Template:
- Talk:
- User:
- File:
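A condensed sketch of the length, link-density, image-count, and special-page checks using the thresholds above. Counting tags with strings.Count stands in for the real HTML parsing, and language detection is omitted.

```go
// Quality filter sketch using the thresholds from the pipeline above.
package main

import (
	"fmt"
	"strings"
)

func passesQualityFilter(text, html string) bool {
	numLinks := strings.Count(html, "<a ")    // crude link count
	numImages := strings.Count(html, "<img ") // crude image count

	if len(text) < 1000 { // too little extractable text
		return false
	}
	linkDensity := float64(numLinks) / float64(len(text)) * 1000 // links per 1000 chars
	if linkDensity > 50 {                                        // spam / navigation indicator
		return false
	}
	if numImages > 50 { // likely an image gallery
		return false
	}
	return true
}

func isSpecialPage(url string) bool {
	// Wikipedia namespace pages carry little training value.
	for _, prefix := range []string{"Wikipedia:", "Special:", "Help:", "Template:", "Talk:", "User:", "File:"} {
		if strings.Contains(url, prefix) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isSpecialPage("https://en.wikipedia.org/wiki/Special:Random"))    // true
	fmt.Println(passesQualityFilter(strings.Repeat("word ", 3000), "<p>...</p>")) // true
}
```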
Quality Metrics (52 pages crawled):
- Acceptance rate: 71% (52 accepted / 73 total attempted)
- Average text length: ~15,000 characters
- Average links per page: ~300
- Link density: ~20 links per 1000 characters
Design Rationale:
For LLM training, quality trumps quantity. Better to have 10K high-quality pages than 100K pages with 90K being navigation/spam. The filtering thresholds were calibrated based on empirical observation of Wikipedia content structure.
Alternative considered: worker self-coordination (no coordinator).
Concept:
Workers coordinate via Redis without coordinator:
- active_workers counter (INCR/DECR)
- pending_urls counter
- Exit when: pending_urls == 0 AND active_workers == 1
Why not selected:
- Adds complexity (WATCH/MULTI transactions)
- Risk of premature termination (race conditions)
- Coordinator is not a bottleneck at this scale
- Coordinator failure mode is acceptable (workers continue, just need restart)
Alternative considered: domain-based sharding.
Concept:
Each worker "owns" specific domains:
- hash(domain) % num_workers = assigned_worker
- Better politeness (one worker per domain)
- Natural load balancing
Why not selected:
- More complex work distribution
- Requires domain extraction and routing
- Benefit minimal for Wikipedia (single domain)
- Better suited for multi-domain crawls
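For illustration only (this alternative was not implemented), the ownership rule is just a stable hash of the domain modulo the number of workers:

```go
// Sketch of the domain-sharding rule: hash(domain) % num_workers.
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
)

func assignedWorker(rawURL string, numWorkers int) (int, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	h := fnv.New32a()
	h.Write([]byte(u.Hostname())) // stable hash of the domain
	return int(h.Sum32()) % numWorkers, nil
}

func main() {
	w, _ := assignedWorker("https://en.wikipedia.org/wiki/Web_crawler", 5)
	fmt.Println("owned by worker", w) // every en.wikipedia.org URL maps to the same worker
}
```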
Alternative considered: priority-based URL ordering.
Concept:
Score URLs based on:
- Domain authority
- Page depth
- Content type hints
Crawl high-priority pages first
Why not selected:
- Requires scoring function
- Adds complexity to Redis queue
- Marginal benefit for bounded crawl (500 pages)
- Worthwhile for unbounded crawls targeting specific content
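For illustration only (also not implemented), the Redis list could be swapped for a sorted set whose score encodes priority; this sketch assumes the go-redis client and a hypothetical crawl:pqueue key.

```go
// Sketch of the priority-queue alternative: a Redis sorted set where
// lower scores are crawled first.
package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "redis.internal:6379"}) // example endpoint

	// Enqueue with a priority score (e.g. derived from depth or domain authority).
	rdb.ZAdd(ctx, "crawl:pqueue", redis.Z{Score: 2.0, Member: "https://example.org/deep/page"})
	rdb.ZAdd(ctx, "crawl:pqueue", redis.Z{Score: 0.5, Member: "https://example.org/"})

	// Pop the highest-priority (lowest-score) URL.
	items, err := rdb.ZPopMin(ctx, "crawl:pqueue", 1).Result()
	if err == nil && len(items) > 0 {
		fmt.Println("next URL:", items[0].Member)
	}
}
```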
Language: Go. Rationale:
- Native concurrency primitives (goroutines, channels)
- Fast compilation and execution
- Strong standard library (net/http, html parsing)
- Explicit error handling (reduces bugs)
Coordination and shared state: Redis. Rationale:
- Atomic operations (LPUSH/RPOP, HINCRBY)
- RedisBloom module for space-efficient deduplication
- Low latency (<1ms within VPC)
- Simple mental model (key-value + data structures)
Infrastructure: AWS with Terraform. Rationale:
- Terraform: Infrastructure as code, reproducible, version controlled
- EC2: Simple compute, predictable costs
- ElastiCache: Managed Redis (no ops overhead)
- VPC: Network isolation, low-latency inter-instance communication
Single Machine (Phase 2):
- Workers: 25
- Throughput: 6.20 pages/second
- Bottleneck: Network latency + politeness delay
- Max theoretical: ~25 pages/second (if no delay)
Distributed (Phase 3):
- Workers: 5 (across 5 machines)
- Throughput: 0.60 pages/second
- Bottleneck: Politeness delay (1s per request)
- Max theoretical: 5 pages/second (if no delay)
Scaling projection:
Workers | Pages/sec (1s delay) | Pages/hour | Pages/day
----------|---------------------|------------|----------
5 | 0.60 | 2,160 | 51,840
25 | 3.00 | 10,800 | 259,200
100 | 12.00 | 43,200 | 1,036,800
500 | 60.00 | 216,000 | 5,184,000
Assumes linear scaling (valid until Redis becomes bottleneck at ~1000 workers).
AWS Infrastructure (per hour):
Component | Type | Cost/hour
-----------------------|----------------|----------
Coordinator EC2 | t2.micro | $0.0116
Worker EC2 (5x) | t2.micro | $0.0580
Redis ElastiCache | t3.micro | $0.0170
Data Transfer (est) | Outbound | $0.0100
-----------------------|----------------|----------
Total | | $0.0966
Monthly cost (24/7 operation): ~$70
Per-page cost (@ 0.60 pages/sec, i.e. 2,160 pages/hour): ~$0.000045
Cost optimization opportunities:
- Spot instances: 70% savings on EC2
- Reserved instances: 40% savings for long-term
- Larger instance types: Better throughput per dollar at scale
Building single-threaded first revealed core challenges without concurrency noise. Each phase built on previous learnings.
No amount of distributed architecture can overcome rate limiting. Respectful crawling requires delays, which fundamentally limits throughput.
While distributed consensus is academically interesting, coordinator-worker is simpler, debuggable, and sufficient for most use cases. Production systems agree (Kubernetes, Spark, Hadoop all use coordinator patterns).
At 1M+ URLs, memory becomes the constraint. Bloom filters trade slight accuracy for massive space savings (97% reduction).
Aggressive filtering (rejecting 29% of attempted pages in this run) is correct. Better to train on 10K high-quality pages than 100K pages of which 90K are noise.
Terraform enabled rapid iteration, cost control (easy teardown), and reproducibility. Manual infrastructure management doesn't scale.
Logging, metrics, and monitoring scripts (monitor.sh) were essential for understanding distributed behavior. Without visibility, debugging is impossible.
This project demonstrated that building distributed systems requires careful consideration of tradeoffs at every level. The final architecture balances simplicity, scalability, and pragmatism.
Core insight: Distributed systems are about managing tradeoffs. Perfect consistency, infinite scale, and zero latency are mutually exclusive. The art lies in choosing which constraints to relax based on the problem domain.
For web crawling: eventual consistency (Bloom filter false positives), bounded scale (politeness limits), and acceptable latency (1-2 second delays) proved to be the right tradeoffs.
Acknowledgements:
- Use of AI tools for README drafting and editing.
- Inspiration from open-source crawlers and distributed systems literature.