
Scalable distributed web crawler with quality-focused data collection for LLM training. Implements concurrent workers, RedisBloom deduplication, and IaC deployment.


Distributed Web Crawler

I've built this project just for fun and learning. The goal was to design and implement a production-grade distributed web crawler from first principles, exploring various distributed systems concepts along the way.

Executive Summary

This document chronicles the design and implementation of a production-grade distributed web crawler built from first principles. The project evolved through three distinct phases, each introducing progressively complex distributed systems concepts while maintaining pragmatic engineering decisions.

Final Architecture:

  • Coordinator-worker pattern with Redis as shared state
  • Bloom filter-based deduplication (memory-efficient)
  • Quality-first data collection for LLM training
  • Deployed across multiple AWS EC2 instances with ElastiCache

Project Evolution

Phase 1: Single-Threaded Foundation

Objective: Understand core crawling mechanics without concurrency complexity.

Architecture:

┌──────────────┐
│     Main     │
│   Process    │
├──────────────┤
│ Queue (slice)│
│ Visited (map)│
└──────────────┘
       │
       ├──> Fetch URL
       ├──> Parse HTML
       ├──> Quality Filter
       ├──> Save Data
       └──> Add Links to Queue

Critical Design Decisions:

  1. Stopping Condition: Max Pages vs Max Depth

    • Problem: Web graphs are effectively infinite
    • Options considered:
      • Queue empty + no workers (fails: queue never truly empties)
      • Max depth (fails: unbounded total pages)
      • Max pages (selected)
    • Decision: Max pages provides bounded execution
    • Rationale: Depth-first can exhaust resources; max pages guarantees termination
  2. Queue Data Structure: FIFO vs LIFO

    • Options:
      • FIFO (breadth-first): explores widely before going deep
      • LIFO (depth-first): follows one path completely
    • Decision: FIFO (breadth-first)
    • Rationale: For LLM training data, diversity matters more than depth in any single topic
  3. URL Deduplication: Initial Approach

    • Implementation: map[string]bool
    • Memory cost: ~50 bytes per URL
    • For 1M URLs: ~50MB (acceptable for Phase 1)
    • Known limitation: doesn't scale to 100M+ URLs
  4. Quality Filtering Strategy

    • Initial metrics considered:
      • Text-to-HTML ratio
      • Link density
      • Image count
      • Domain reputation
      • Content deduplication
    • Selected for Phase 1:
      • Text-to-HTML ratio (easy to implement, high signal)
      • Language detection (prevents non-English content)
      • Link density (spam indicator)
    • Deferred to later:
      • Domain reputation (requires external data)
      • Content deduplication (too expensive for Phase 1)

Key Learnings:

The single-threaded implementation revealed the fundamental challenge: parsing is CPU-bound, but fetching is I/O-bound. Sequential execution wastes CPU cycles while waiting for network responses.

Measured bottleneck: 1 page per 3-5 seconds (fetch: 2-4s, parse: 0.1s, filter: 0.01s, sleep: 1s)
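
For reference, a minimal, self-contained sketch of the Phase 1 loop described above: FIFO queue, map-based dedup, a max-pages stop condition, and the 1-second politeness delay. The regexp-based link extraction and the length-only "quality filter" are simplifications for illustration, not the repository's actual code.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"time"
)

var hrefRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

func fetch(url string) (string, error) {
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	b, err := io.ReadAll(resp.Body)
	return string(b), err
}

func main() {
	const maxPages = 50 // stopping condition: bounded page count, not depth
	queue := []string{"https://en.wikipedia.org/wiki/Distributed_computing"} // FIFO: breadth-first
	visited := map[string]bool{queue[0]: true}                               // ~50 bytes per URL at this scale

	for pages := 0; pages < maxPages && len(queue) > 0; {
		url := queue[0]
		queue = queue[1:]

		body, err := fetch(url) // I/O-bound: typically 2-4s
		if err != nil {
			continue
		}
		if len(body) < 1000 { // stand-in for the real quality filters
			continue
		}
		pages++
		fmt.Printf("saved %s (%d bytes)\n", url, len(body))

		for _, m := range hrefRe.FindAllStringSubmatch(body, -1) {
			if link := m[1]; !visited[link] {
				visited[link] = true
				queue = append(queue, link)
			}
		}
		time.Sleep(1 * time.Second) // politeness delay
	}
}
```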


Phase 2: Concurrent Single-Machine

Objective: Maximize single-machine throughput through concurrency.

Architecture:

┌──────────────────────────────────────┐
│         Manager Goroutine            │
│  ┌────────┐  ┌──────────┐           │
│  │ Queue  │  │ Visited  │           │
│  │(slice) │  │  (map)   │           │
│  └────────┘  └──────────┘           │
└──────┬───────────────────────────────┘
       │
       ├─────> Job Channel (buffered: 100)
       │
   ┌───┴─────────────────────────────┐
   │   Worker Pool (25 goroutines)   │
   └───┬─────────────────────────────┘
       │
       └─────> Result Channel (buffered: 100)
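
A minimal Go sketch of this wiring, using the numbers from the diagram (25 workers, channel buffers of 100). The manager goroutine is the single owner of the queue and visited set; workers only see channels. The helper names (fetchAndParse, Result) are illustrative, not the repository's API.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"regexp"
	"time"
)

type Result struct {
	URL   string
	Links []string
	OK    bool
}

var hrefRe = regexp.MustCompile(`href="(https?://[^"]+)"`)

// fetchAndParse stands in for the real fetch/parse/quality pipeline.
func fetchAndParse(url string) Result {
	resp, err := http.Get(url)
	if err != nil {
		return Result{URL: url}
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	time.Sleep(1 * time.Second) // per-worker politeness delay

	r := Result{URL: url, OK: resp.StatusCode == 200}
	for _, m := range hrefRe.FindAllStringSubmatch(string(body), -1) {
		r.Links = append(r.Links, m[1])
	}
	return r
}

func main() {
	const workers, maxPages = 25, 200
	jobs := make(chan string, 100)    // buffered so the manager doesn't block on bursts
	results := make(chan Result, 100) // buffered so workers don't block on the manager

	for i := 0; i < workers; i++ {
		go func() {
			for url := range jobs {
				results <- fetchAndParse(url)
			}
		}()
	}

	// The manager (here: main) is the single owner of queue and visited,
	// so no mutexes are needed.
	queue := []string{"https://en.wikipedia.org/wiki/Distributed_computing"}
	visited := map[string]bool{queue[0]: true}
	inFlight, pages := 0, 0

	for pages < maxPages && (len(queue) > 0 || inFlight > 0) {
		for len(queue) > 0 && inFlight < cap(jobs) { // dispatch without blocking
			jobs <- queue[0]
			queue = queue[1:]
			inFlight++
		}
		r := <-results
		inFlight--
		if r.OK {
			pages++
			fmt.Println("crawled:", r.URL)
		}
		for _, link := range r.Links {
			if !visited[link] {
				visited[link] = true
				queue = append(queue, link)
			}
		}
	}
	close(jobs)
}
```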

Critical Design Decisions:

  1. Concurrency Pattern: Manager-Worker vs Shared State

    • Options:
      • Shared state with mutexes (workers access queue directly)
      • Manager-worker (single goroutine owns state)
      • Lock-free data structures
    • Decision: Manager-worker
    • Rationale:
      • No race conditions (single owner of queue/visited)
      • Clear separation of concerns
      • Easier to reason about
      • Manager becomes bottleneck only at very high scales
  2. Channel Buffer Sizing

    • Job channel: 100 (prevents manager blocking)
    • Result channel: 100 (prevents worker blocking)
    • Rationale: Bursts occur when pages have many links; buffering smooths throughput
  3. Worker Pool Size

    • Options tested: 5, 10, 25, 50
    • Selected: 25
    • Rationale:
      • Below 25: CPU underutilized during network waits
      • Above 50: diminishing returns, risk of Wikipedia rate limiting
      • 25: Sweet spot for single machine
  4. Bloom Filter Introduction

    • Problem: Map scales linearly with URL count
    • Bloom filter characteristics:
      • Space: O(n) with a small constant (about 9.6 bits per element at 1% FPR)
      • Time: O(k) where k=10 hash functions
      • Tradeoff: False positives acceptable (slightly reduced coverage)
    • Configuration:
      • Capacity: 1M URLs
      • False positive rate: 1%
      • Memory: 14MB vs 50MB for map
    • Formula: m = -n * ln(p) / (ln(2)^2) where n=capacity, p=FPR
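
Plugging the stated parameters (n = 1,000,000 URLs, p = 0.01) into that formula, as a quick check:

```go
package main

import (
	"fmt"
	"math"
)

func main() {
	n := 1_000_000.0 // expected number of URLs (capacity)
	p := 0.01        // target false-positive rate

	m := -n * math.Log(p) / math.Pow(math.Ln2, 2) // total bits in the filter
	k := (m / n) * math.Ln2                       // optimal number of hash functions

	fmt.Printf("bits: %.0f  (%.2f bits/element, ~%.2f MB)\n", m, m/n, m/8/1e6)
	fmt.Printf("optimal k: %.1f\n", k)
	// Prints roughly: bits: 9585059  (9.59 bits/element, ~1.20 MB), optimal k: 6.6
}
```

Equivalently, m/n = 1.44 × log₂(1/p), which works out to about 9.6 bits per element at p = 1%. (The ~14 MB measured above appears to be total process memory; the bit array itself is closer to 1.2 MB, which is also where the "97% reduction" cited later in Lessons Learned comes from.)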

Performance Analysis:

Without Bloom filter (map):

  • 203 pages in 49 seconds = 4.11 pages/second
  • Memory: ~10MB for visited set

With Bloom filter:

  • 199 pages in 32 seconds = 6.20 pages/second
  • Memory: ~14MB total

Speedup: 51% improvement (primarily from reduced memory pressure, better cache locality)

Key Learnings:

  1. Politeness delay (1 second) dominated total time
  2. Concurrency improved throughput but couldn't eliminate network latency
  3. The Bloom filter provided minimal memory savings at this scale, but it is critical for the distributed deployment

Phase 3: Distributed Architecture (4 hours)

Objective: Scale horizontally across multiple machines while maintaining coordination.

Architecture:

┌─────────────────────────────────────────────────┐
│                 AWS Region                       │
│                                                  │
│  ┌────────────────┐                             │
│  │  Coordinator   │                             │
│  │   (EC2)        │                             │
│  │  - Monitor     │                             │
│  │  - Signal STOP │                             │
│  └────────┬───────┘                             │
│           │                                      │
│           │                                      │
│      ┌────▼─────────────────────────┐           │
│      │   Redis (ElastiCache)        │           │
│      │                               │           │
│      │  ┌─────────────────────┐     │           │
│      │  │ URL Queue (List)    │     │           │
│      │  │ LPUSH/RPOP          │     │           │
│      │  └─────────────────────┘     │           │
│      │                               │           │
│      │  ┌─────────────────────┐     │           │
│      │  │ Bloom Filter        │     │           │
│      │  │ BF.ADD/BF.EXISTS    │     │           │
│      │  └─────────────────────┘     │           │
│      │                               │           │
│      │  ┌─────────────────────┐     │           │
│      │  │ Stats (Hash)        │     │           │
│      │  │ HINCRBY             │     │           │
│      │  └─────────────────────┘     │           │
│      │                               │           │
│      │  ┌─────────────────────┐     │           │
│      │  │ Control (String)    │     │           │
│      │  │ SET/GET             │     │           │
│      │  └─────────────────────┘     │           │
│      └──────────┬───────────────────┘           │
│                 │                                │
│   ┌─────────────┼─────────────┐                │
│   │             │             │                │
│ ┌─▼──┐       ┌─▼──┐       ┌─▼──┐             │
│ │ W1 │       │ W2 │       │ W3 │  ... (5x)   │
│ │EC2 │       │EC2 │       │EC2 │             │
│ └────┘       └────┘       └────┘             │
│                                                 │
└─────────────────────────────────────────────────┘
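
A minimal sketch of the coordinator's monitor/stop loop against the Redis keys described below (crawl:stats, crawl:queue, crawl:control). The github.com/redis/go-redis/v9 client and the pages_crawled field name are assumptions, not the repository's code; all state lives in Redis, so the coordinator itself stays stateless.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"time"

	"github.com/redis/go-redis/v9"
)

func main() {
	ctx := context.Background()
	// ElastiCache endpoint taken from the environment; any Redis with the
	// RedisBloom module loaded will do.
	rdb := redis.NewClient(&redis.Options{Addr: os.Getenv("REDIS_ADDR")})

	const maxPages = 500
	rdb.Set(ctx, "crawl:control", "RUNNING", 0)

	for {
		crawled, _ := rdb.HGet(ctx, "crawl:stats", "pages_crawled").Int64()
		queued, _ := rdb.LLen(ctx, "crawl:queue").Result()
		fmt.Printf("crawled=%d queued=%d\n", crawled, queued)

		if crawled >= maxPages {
			rdb.Set(ctx, "crawl:control", "STOP", 0) // workers poll this key and exit
			return
		}
		time.Sleep(5 * time.Second)
	}
}
```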

Critical Design Decisions:

  1. Coordinator vs Self-Coordinating Workers

    • Options:
      • Coordinator + Workers (selected)
      • Pure self-coordination (distributed consensus)
    • Decision: Coordinator pattern
    • Rationale:
      • Simpler implementation
      • Coordinator is lightweight (just monitors Redis)
      • Can be made stateless (all state in Redis)
      • Production systems (K8s, Spark) use this pattern
      • Self-coordination adds complexity without clear benefit at this scale
  2. Shared State Design

    • Redis data structures:
      crawl:queue          → List (LPUSH/RPOP)
      crawl:visited        → RedisBloom (BF.ADD/BF.EXISTS)
      crawl:stats          → Hash (HINCRBY for atomic increments)
      crawl:control        → String (RUNNING/STOP signal)
      crawl:workers        → Hash (heartbeat tracking)
      
    • Key design principle: Atomic operations prevent race conditions (see the worker-loop sketch after this list)
  3. Bloom Filter: Local vs Remote

    • Options:
      • Serialize entire filter (14MB network transfer per check)
      • RedisBloom module (atomic operations)
      • Sharded filters (complex coordination)
    • Decision: RedisBloom
    • Rationale:
      • Atomic BF.ADD and BF.EXISTS operations
      • No serialization overhead
      • Sub-millisecond latency within VPC
  4. Race Condition in URL Discovery

    • Problem:
      Worker A: BF.EXISTS("url") → false
      Worker B: BF.EXISTS("url") → false (simultaneous!)
      Worker A: BF.ADD("url"), LPUSH queue
      Worker B: BF.ADD("url"), LPUSH queue → DUPLICATE
      
    • Options:
      • Lua script for atomic check-and-set
      • Accept small duplicate percentage
    • Decision Phase 1: Accept duplicates (pragmatic)
    • Rationale: Bloom filter already allows false positives; small duplicate rate acceptable for learning phase
    • Future improvement: Lua script for production
  5. Deployment Strategy

    • Infrastructure as Code: Terraform
    • Rationale:
      • Reproducible infrastructure
      • Version controlled
      • Easy teardown (cost control)
      • Industry standard
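
A sketch of a worker's main loop against this shared state, assuming github.com/redis/go-redis/v9 and routing the RedisBloom commands through the generic Do() call. Key names match the list above; fetchAndExtract and the endpoint handling are placeholders, not the repository's actual code.

```go
package main

import (
	"context"
	"os"
	"time"

	"github.com/redis/go-redis/v9"
)

// fetchAndExtract stands in for the fetch/parse/quality-filter pipeline.
func fetchAndExtract(url string) (links []string, ok bool) { return nil, false }

func workerLoop(ctx context.Context, rdb *redis.Client) error {
	for {
		// Stop when the coordinator flips crawl:control to STOP.
		if ctrl, _ := rdb.Get(ctx, "crawl:control").Result(); ctrl == "STOP" {
			return nil
		}

		// Pull the next URL from the shared queue; back off if it is empty.
		url, err := rdb.RPop(ctx, "crawl:queue").Result()
		if err == redis.Nil {
			time.Sleep(500 * time.Millisecond)
			continue
		} else if err != nil {
			return err
		}

		links, ok := fetchAndExtract(url)
		if !ok {
			rdb.HIncrBy(ctx, "crawl:stats", "rejected", 1)
			continue
		}
		rdb.HIncrBy(ctx, "crawl:stats", "pages_crawled", 1)

		for _, link := range links {
			// Non-atomic check-then-add: two workers can race here and both
			// enqueue the same URL (the accepted-duplicates tradeoff from
			// decision 4). A small Lua script doing BF.ADD + LPUSH atomically
			// would close that gap.
			exists, _ := rdb.Do(ctx, "BF.EXISTS", "crawl:visited", link).Bool()
			if !exists {
				rdb.Do(ctx, "BF.ADD", "crawl:visited", link)
				rdb.LPush(ctx, "crawl:queue", link)
			}
		}
		time.Sleep(1 * time.Second) // politeness delay
	}
}

func main() {
	// ElastiCache endpoint taken from the environment (placeholder).
	rdb := redis.NewClient(&redis.Options{Addr: os.Getenv("REDIS_ADDR")})
	_ = workerLoop(context.Background(), rdb)
}
```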

Network Topology:

VPC: 10.0.0.0/16
│
└── Subnet: 10.0.1.0/24 (ap-south-1a)
    │
    ├── Coordinator:     10.0.1.x
    ├── Workers (5x):    10.0.1.y
    └── Redis:           10.0.1.z

Security Group Rules:
- Ingress: SSH (port 22) from operator IP
- Ingress: All TCP within VPC (10.0.0.0/16)
- Egress: All traffic (for crawling external sites)

Key Learnings:

  1. Coordinator as Single Point of Failure (Acceptable)

    • Coordinator crash impact: Workers continue processing current URLs
    • Recovery: Restart coordinator, resume monitoring
    • Mitigation: Coordinator is stateless (can run on any machine)
  2. Politeness vs Throughput

    • 1-second delay per worker = theoretical max 5 pages/second with 5 workers
    • Actual: 0.60 pages/second (network latency, parsing overhead)
    • Improvement opportunity: Domain-specific rate limiting (not implemented)
  3. Cost vs Performance

    • t2.micro instances: Sufficient for I/O-bound crawling
    • ElastiCache t3.micro: Adequate for 5 workers
    • Total cost: ~$0.10/hour (see the Cost Analysis below)
    • Cost optimization: Spot instances (not implemented for simplicity)

Data Quality Architecture

Quality Filter Pipeline:

Raw HTML
    │
    ├─> Strip <script>, <style>, <noscript>
    │
    ├─> Extract text from <p>, <article>, <div>
    │
    ├─> Language Detection (parse <html lang="">)
    │   └─> Reject if not "en"
    │
    ├─> Text Length Check
    │   └─> Reject if < 1000 characters
    │
    ├─> Link Density
    │   └─> links_per_1000_chars = (num_links / text_length) * 1000
    │   └─> Reject if > 50
    │
    ├─> Image Count
    │   └─> Reject if > 50 (likely gallery)
    │
    └─> Special Page Filter
        └─> Reject URLs containing:
            - Wikipedia:
            - Special:
            - Help:
            - Template:
            - Talk:
            - User:
            - File:
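
The same pipeline reduced to its decision logic, as a sketch. The thresholds are the ones in the diagram; PageFeatures and the field extraction are illustrative, since the real pipeline derives these values from the parsed HTML.

```go
package main

import (
	"fmt"
	"strings"
)

// PageFeatures holds the fields the real pipeline extracts from parsed HTML.
type PageFeatures struct {
	URL       string
	Lang      string // from <html lang="">
	Text      string // text after stripping <script>, <style>, <noscript>
	NumLinks  int
	NumImages int
}

var specialPrefixes = []string{
	"Wikipedia:", "Special:", "Help:", "Template:", "Talk:", "User:", "File:",
}

func passesQualityFilter(p PageFeatures) bool {
	if p.Lang != "en" {
		return false // non-English content
	}
	if len(p.Text) < 1000 {
		return false // too short to be useful training text
	}
	if density := float64(p.NumLinks) / float64(len(p.Text)) * 1000; density > 50 {
		return false // link farm / navigation page
	}
	if p.NumImages > 50 {
		return false // likely a gallery
	}
	for _, prefix := range specialPrefixes {
		if strings.Contains(p.URL, prefix) {
			return false // Wikipedia meta/special page
		}
	}
	return true
}

func main() {
	// Roughly the average accepted page from the metrics below:
	// 15,000 characters and 300 links, i.e. 20 links per 1,000 characters.
	p := PageFeatures{
		URL:      "https://en.wikipedia.org/wiki/Web_crawler",
		Lang:     "en",
		Text:     strings.Repeat("x", 15000),
		NumLinks: 300,
	}
	fmt.Println(passesQualityFilter(p)) // true
}
```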

Quality Metrics (52 pages crawled):

  • Acceptance rate: 71% (52 accepted / 73 total attempted)
  • Average text length: ~15,000 characters
  • Average links per page: ~300
  • Link density: ~20 links per 1000 characters

Design Rationale:

For LLM training, quality trumps quantity. Better to have 10K high-quality pages than 100K pages with 90K being navigation/spam. The filtering thresholds were calibrated based on empirical observation of Wikipedia content structure.


Architectural Alternatives Considered But Not Implemented

1. Self-Coordinating Workers (Distributed Consensus)

Concept:

Workers coordinate via Redis without coordinator:
- active_workers counter (INCR/DECR)
- pending_urls counter
- Exit when: pending_urls == 0 AND active_workers == 1
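
For concreteness, a sketch of that exit check (not implemented in this project). The crawl:-prefixed key names and go-redis client are assumptions carried over from the Phase 3 section.

```go
package main

import (
	"context"

	"github.com/redis/go-redis/v9"
)

// shouldExit sketches the self-coordination check: stop when no URLs are
// pending and this worker is the only one still marked active. The two reads
// are separate commands and are not atomic with respect to another worker
// pushing newly discovered URLs, which is the premature-termination race
// noted below; closing it needs WATCH/MULTI or a Lua script.
func shouldExit(ctx context.Context, rdb *redis.Client) bool {
	pending, _ := rdb.Get(ctx, "crawl:pending_urls").Int64()
	active, _ := rdb.Get(ctx, "crawl:active_workers").Int64()
	return pending == 0 && active == 1
}

func main() {} // placeholder so the sketch compiles standalone
```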

Why not selected:

  • Adds complexity (WATCH/MULTI transactions)
  • Risk of premature termination (race conditions)
  • Coordinator is not a bottleneck at this scale
  • Coordinator failure mode is acceptable (workers continue, just need restart)

2. Domain Sharding

Concept:

Each worker "owns" specific domains:
- hash(domain) % num_workers = assigned_worker
- Better politeness (one worker per domain)
- Natural load balancing
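
A sketch of that assignment rule (illustrative only; this was not implemented). FNV is an arbitrary choice of stable hash; any hash that all workers agree on works.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"net/url"
)

// assignedWorker maps a URL's domain to exactly one worker, so per-domain
// politeness falls out of the routing itself.
func assignedWorker(rawURL string, numWorkers int) (int, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return 0, err
	}
	h := fnv.New32a()
	h.Write([]byte(u.Hostname()))
	return int(h.Sum32()) % numWorkers, nil
}

func main() {
	w, _ := assignedWorker("https://en.wikipedia.org/wiki/Web_crawler", 5)
	fmt.Println("assigned worker:", w) // every en.wikipedia.org URL maps here
}
```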

Why not selected:

  • More complex work distribution
  • Requires domain extraction and routing
  • Benefit minimal for Wikipedia (single domain)
  • Better suited for multi-domain crawls

3. Priority Queue

Concept:

Score URLs based on:
- Domain authority
- Page depth
- Content type hints
Crawl high-priority pages first

Why not selected:

  • Requires scoring function
  • Adds complexity to Redis queue
  • Marginal benefit for bounded crawl (500 pages)
  • Worthwhile for unbounded crawls targeting specific content

Technology Stack Decisions

Language: Go

Rationale:

  • Native concurrency primitives (goroutines, channels)
  • Fast compilation and execution
  • Strong standard library (net/http, html parsing)
  • Explicit error handling (reduces bugs)

State Store: Redis

Rationale:

  • Atomic operations (LPUSH/RPOP, HINCRBY)
  • RedisBloom module for space-efficient deduplication
  • Low latency (<1ms within VPC)
  • Simple mental model (key-value + data structures)

Infrastructure: AWS + Terraform

Rationale:

  • Terraform: Infrastructure as code, reproducible, version controlled
  • EC2: Simple compute, predictable costs
  • ElastiCache: Managed Redis (no ops overhead)
  • VPC: Network isolation, low-latency inter-instance communication

Performance Characteristics

Scalability Analysis

Single Machine (Phase 2):

  • Workers: 25
  • Throughput: 6.20 pages/second
  • Bottleneck: Network latency + politeness delay
  • Max theoretical: ~25 pages/second (1 page per worker per second, if the politeness delay were the only cost)

Distributed (Phase 3):

  • Workers: 5 (across 5 machines)
  • Throughput: 0.60 pages/second
  • Bottleneck: Politeness delay (1s per request)
  • Max theoretical: 5 pages/second (1 page per worker per second, if the politeness delay were the only cost)

Scaling projection:

Workers   | Pages/sec (1s delay) | Pages/hour | Pages/day
----------|---------------------|------------|----------
5         | 0.60                | 2,160      | 51,840
25        | 3.00                | 10,800     | 259,200
100       | 12.00               | 43,200     | 1,036,800
500       | 60.00               | 216,000    | 5,184,000

Assumes linear scaling (valid until Redis becomes bottleneck at ~1000 workers).

Cost Analysis

AWS Infrastructure (per hour):

Component              | Type           | Cost/hour
-----------------------|----------------|----------
Coordinator EC2        | t2.micro       | $0.0116
Worker EC2 (5x)        | t2.micro       | $0.0580
Redis ElastiCache      | t3.micro       | $0.0170
Data Transfer (est)    | Outbound       | $0.0100
-----------------------|----------------|----------
Total                  |                | $0.0966

Monthly cost (24/7 operation): ~$70

Per-page cost (at 0.60 pages/second, i.e. 2,160 pages/hour): $0.0966 / 2,160 ≈ $0.000045

Cost optimization opportunities:

  • Spot instances: 70% savings on EC2
  • Reserved instances: 40% savings for long-term
  • Larger instance types: Better throughput per dollar at scale

Lessons Learned

1. Start Simple, Add Complexity Incrementally

Building single-threaded first revealed core challenges without concurrency noise. Each phase built on previous learnings.

2. Politeness is a Hard Constraint

No amount of distributed architecture can overcome rate limiting. Respectful crawling requires delays, which fundamentally limits throughput.

3. Coordinator Pattern Wins for Pragmatism

While distributed consensus is academically interesting, coordinator-worker is simpler, debuggable, and sufficient for most use cases. Production systems agree (Kubernetes, Spark, Hadoop all use coordinator patterns).

4. Bloom Filters for Large-Scale Deduplication

At 1M+ URLs, memory becomes the constraint. Bloom filters trade slight accuracy for massive space savings (97% reduction).

5. Quality Over Quantity for LLM Training

Aggressive filtering (a 29% rejection rate even on well-structured Wikipedia content) is correct. Better to train on 10K high-quality pages than 100K pages of which 90K is noise.

6. Infrastructure as Code is Non-Negotiable

Terraform enabled rapid iteration, cost control (easy teardown), and reproducibility. Manual infrastructure management doesn't scale.

7. Observability from Day One

Logging, metrics, and monitoring scripts (monitor.sh) were essential for understanding distributed behavior. Without visibility, debugging is impossible.


Conclusion

This project demonstrated that building distributed systems requires careful consideration of tradeoffs at every level. The final architecture balances simplicity, scalability, and pragmatism.

Core insight: Distributed systems are about managing tradeoffs. Perfect consistency, infinite scale, and zero latency are mutually exclusive. The art lies in choosing which constraints to relax based on the problem domain.

For web crawling: eventual consistency (Bloom filter false positives), bounded scale (politeness limits), and acceptable latency (1-2 second delays) proved to be the right tradeoffs.

Credits

  • Use of AI tools for README drafting and editing.
  • Inspiration from open-source crawlers and distributed systems literature.
