A concurrent single-domain web crawler in Java that efficiently discovers and maps a site's internal structure.
Starting from a seed URL, it:
- Fetches the page HTML
- Extracts links
- Canonicalises them
- Deduplicates and validates them (same host, allowed by robots.txt, valid http/https scheme, not a PDF, ...)
- Enqueues new URLs
- Repeats until no new pages remain
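The canonicalisation step deserves a concrete illustration, since it is what makes deduplication reliable: two spellings of the same page must map to one key. The rules below (lowercase scheme and host, drop default ports and fragments, normalise the path) are a plausible sketch, not the crawler's actual implementation.

```java
import java.net.URI;

public class Canonicalizer {
    // Illustrative canonicalisation: the real crawler's rules may differ.
    static String canonicalize(String url) {
        URI u = URI.create(url).normalize();            // resolves "." and ".." in the path
        String scheme = u.getScheme().toLowerCase();
        String host = u.getHost().toLowerCase();
        int port = u.getPort();
        // Default ports carry no information, so they are dropped.
        boolean defaultPort = port == -1
                || ("http".equals(scheme) && port == 80)
                || ("https".equals(scheme) && port == 443);
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        StringBuilder sb = new StringBuilder(scheme).append("://").append(host);
        if (!defaultPort) sb.append(':').append(port);
        sb.append(path);
        if (u.getQuery() != null) sb.append('?').append(u.getQuery());
        // The fragment (#...) is intentionally discarded: it never changes the fetched page.
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(canonicalize("HTTP://Example.com:80/a/../b#frag"));
        // → http://example.com/b
    }
}
```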
When crawling finishes, it prints summary statistics to the terminal and generates two files:
- sitemap.json: the sitemap store, with metadata for every URL
- sitemap.txt: a plain-text listing of every URL and its children
The crawler is designed as a 2-stage pipeline separating I/O latency (fetch) from CPU work (page processing).
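The two-stage shape can be sketched with a pair of bounded blocking queues, one feeding each stage; a full queue blocks the upstream stage, which is where the backpressure comes from. Class and sentinel names, queue sizes, and the canned "fetch" below are illustrative, not the crawler's actual code.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PipelineSketch {
    static final String POISON = "__STOP__";  // shutdown sentinel (illustrative name)

    // Runs the two-stage pipeline over the seed URLs and returns what stage 2 processed.
    static List<String> run(List<String> seeds) {
        // Bounded queues: when one fills up, the upstream stage blocks -> backpressure.
        BlockingQueue<String> urlQueue = new LinkedBlockingQueue<>(16);
        BlockingQueue<String> htmlQueue = new LinkedBlockingQueue<>(16);
        List<String> processed = Collections.synchronizedList(new ArrayList<>());

        // Stage 1: fetcher (I/O-bound). A canned string stands in for the HTTP GET.
        Thread fetcher = new Thread(() -> {
            try {
                String url;
                while (!(url = urlQueue.take()).equals(POISON)) {
                    htmlQueue.put("<html>" + url + "</html>");
                }
                htmlQueue.put(POISON);  // forward the shutdown signal downstream
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Stage 2: processor (CPU-bound). Here it just records what it received.
        Thread processor = new Thread(() -> {
            try {
                String html;
                while (!(html = htmlQueue.take()).equals(POISON)) {
                    processed.add(html);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        fetcher.start();
        processor.start();
        try {
            for (String seed : seeds) {
                urlQueue.put(seed);
            }
            urlQueue.put(POISON);
            fetcher.join();
            processor.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("https://example.com/")));
    }
}
```

Keeping the two stages on separate queues means a slow origin server stalls only the fetchers, while page processing keeps draining whatever HTML has already arrived.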
| Concern | Implementation |
|---|---|
| Concurrency | Producer–consumer with 2 blocking queues providing backpressure |
| State | Metadata for each processed page stored in a ConcurrentHashMap keyed by canonical URL |
| Distribution | Separate thread pools for the UrlFetcher and the page processor, so each stage can be sized independently |
| Politeness | Fixed inter-request delay |
| Completion | Atomic counter for in-flight URLs + poison message for shutdown |
`inFlight` is an AtomicInteger tracking how many URLs are currently being fetched or processed.
It is incremented when a new unique URL is discovered and decremented once the pipeline finishes handling that URL.
When the counter reaches 0 and both queues are empty, no active work remains, i.e. crawling is complete.
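The counting discipline can be boiled down to two hooks: one fired on discovery, one on completion, with the decrement's return value deciding whether to trigger shutdown. Method names below are illustrative, not the crawler's API.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CompletionSketch {
    static final AtomicInteger inFlight = new AtomicInteger();

    // Called when a new unique URL is discovered, before it is enqueued.
    static void onDiscovered() {
        inFlight.incrementAndGet();
    }

    // Called when the pipeline finishes a URL; returns true when the crawl is done.
    static boolean onFinished() {
        return inFlight.decrementAndGet() == 0;
    }

    public static void main(String[] args) {
        onDiscovered();                    // seed URL enters the pipeline
        onDiscovered();                    // one link found on the seed page
        System.out.println(onFinished());  // seed done, link still in flight -> false
        System.out.println(onFinished());  // link done -> true: time to enqueue the poison message
    }
}
```

Incrementing at discovery time (rather than at dequeue time) is what makes the zero check safe: the counter can never momentarily hit 0 while a discovered URL is still waiting in a queue.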
- Java 11
- Maven 3.9+
mvn clean package
Run with `java -jar target/web-crawler-1.0-SNAPSHOT.jar`, or run the CrawlingApp class from your IDE.
mvn test
Configuration lives in `src/main/resources/application.properties`.
