lauracabtay/web-crawler
Web Crawler

Description

Concurrent single-domain web crawler in Java that efficiently discovers and maps internal site structure starting from a seed URL.

Starting from a seed URL, it:

  • Fetches the page HTML
  • Extracts links
  • Canonicalises them
  • Deduplicates & validates (same host, allowed by robots, valid http/https, not PDF, ...)
  • Enqueues new URLs
  • Repeats until no new pages remain
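The canonicalise-and-validate steps above might look roughly like the sketch below. `LinkFilter`, `canonicalise`, and `accept` are hypothetical names (not taken from the repository), and the real crawler's rules — robots.txt checks in particular — are richer than this:

```java
import java.net.URI;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class LinkFilter {
    private final String seedHost;
    // Thread-safe "seen" set backs the deduplication step.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    public LinkFilter(String seedUrl) {
        this.seedHost = URI.create(seedUrl).getHost();
    }

    /** Canonicalise a raw link: lower-case scheme/host, drop the fragment
     *  and trailing slash (query strings are dropped too, for simplicity). */
    public String canonicalise(String raw) {
        URI u = URI.create(raw).normalize();
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        if (path.length() > 1 && path.endsWith("/")) {
            path = path.substring(0, path.length() - 1);
        }
        return u.getScheme().toLowerCase() + "://" + u.getHost().toLowerCase() + path;
    }

    /** Accept only unseen, same-host http(s) URLs that are not PDFs. */
    public boolean accept(String canonical) {
        URI u = URI.create(canonical);
        return (u.getScheme().equals("http") || u.getScheme().equals("https"))
                && seedHost.equalsIgnoreCase(u.getHost())
                && !u.getPath().endsWith(".pdf")
                && seen.add(canonical); // add() returns false for duplicates
    }
}
```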

At the end of the crawl, it prints summary stats to the terminal and generates two files:

  • sitemap.json: the sitemap store, with metadata for every crawled URL
  • sitemap.txt: a plain-text listing of all URLs and their children

Architecture

Diagram

Crawler architecture diagram

Design summary

The crawler is designed as a 2-stage pipeline separating I/O latency (fetch) from CPU work (page processing).

  • Concurrency: producer–consumer with two blocking queues providing backpressure
  • State: URL metadata of processed pages stored in a ConcurrentHashMap keyed by canonical URL
  • Distribution: separate thread pools for the UrlFetcher and page processor, for flexibility
  • Politeness: fixed inter-request delay between fetches
  • Completion: atomic counter of in-flight URLs, plus a poison message for shutdown
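A minimal sketch of the two-stage pipeline described above — two bounded blocking queues between a fetch stage and a processing stage, with a poison pill for shutdown. The class and field names (`TwoStagePipeline`, `urlQueue`, `pageQueue`) are illustrative, not the repository's, and fetching/parsing are simulated:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class TwoStagePipeline {
    private static final String POISON = "__STOP__";
    // Bounded queues: when one stage outpaces the other, put() blocks — backpressure.
    private final BlockingQueue<String> urlQueue  = new LinkedBlockingQueue<>(64);
    private final BlockingQueue<String> pageQueue = new LinkedBlockingQueue<>(64);

    /** Push the seeds through both stages; returns the number of pages processed. */
    public int run(List<String> seeds) throws InterruptedException {
        AtomicInteger processed = new AtomicInteger();

        Thread fetcher = new Thread(() -> {
            try {
                String url;
                while (!(url = urlQueue.take()).equals(POISON)) {
                    // Stage 1 (I/O-bound): fetch the page body — simulated here.
                    pageQueue.put("<html>" + url + "</html>");
                }
                pageQueue.put(POISON); // forward the shutdown signal downstream
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        Thread processor = new Thread(() -> {
            try {
                while (!pageQueue.take().equals(POISON)) {
                    // Stage 2 (CPU-bound): parse links, update the sitemap — simulated here.
                    processed.incrementAndGet();
                }
            } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        });

        fetcher.start();
        processor.start();
        for (String s : seeds) urlQueue.put(s);
        urlQueue.put(POISON); // poison pill: no more URLs are coming
        fetcher.join();
        processor.join();
        return processed.get();
    }
}
```

Separating the stages lets the I/O-bound fetch pool and the CPU-bound processing pool be sized independently.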

inFlight is an AtomicInteger tracking how many URLs are currently being fetched or processed. It increments when we discover a new unique URL, and decrements once the fetcher finishes handling that URL. When this reaches 0, and both queues are empty, we know there is no active work left, i.e. crawling is complete.
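The completion check could be sketched as follows, assuming hypothetical names (`CompletionTracker` and its methods are not from the repository):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class CompletionTracker {
    private final AtomicInteger inFlight = new AtomicInteger(0);

    /** Called when a new unique URL is discovered and enqueued. */
    public void onUrlDiscovered() {
        inFlight.incrementAndGet();
    }

    /** Called once a URL has been fully handled; returns true when the crawl
     *  is complete, i.e. nothing is in flight and both queues are drained. */
    public boolean onUrlFinished(BlockingQueue<?> urlQueue, BlockingQueue<?> pageQueue) {
        return inFlight.decrementAndGet() == 0
                && urlQueue.isEmpty()
                && pageQueue.isEmpty();
    }
}
```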

Running

Requirements

  • Java 11
  • Maven 3.9+

Build

mvn clean package

Run locally

java -jar target/web-crawler-1.0-SNAPSHOT.jar

or run the CrawlingApp class from your IDE.

Test

mvn test

Config

Config lives in /src/main/resources/application.properties
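An illustrative example of what the properties file might contain — these keys are invented for illustration and map onto the design above (seed URL, separate pool sizes, politeness delay); the actual property names are defined in the repository:

```properties
# Illustrative keys only — check application.properties in the repo for the real names.
seed.url=https://example.com/
fetcher.threads=4
processor.threads=2
politeness.delay.ms=500
```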
