Project Arachne 🕷

Description

A high-performance, distributed web crawler built in Go that leverages browser automation to crawl and parse dynamic web content. The crawler supports depth-limited crawling, handles JavaScript-rendered pages, respects robots.txt, and stores crawled data in a graph database for efficient querying and analysis.

Features

  • Browser Automation: Uses Rod (a Go browser-automation library built on the Chrome DevTools Protocol, comparable to Puppeteer) to render JavaScript-heavy websites and extract dynamic content.
  • Multi-Format Parsing: Extracts links from HTML, JavaScript, and JSON responses.
    • JavaScript parsing relies heavily on goja, a pure-Go JavaScript engine
  • Depth-Limited Crawling: Configurable maximum depth and link limits per crawl run.
  • Robots.txt Compliance: Automatically checks and respects robots.txt files.
  • Caching: Redis-based caching for pages and robots.txt to improve performance and reduce redundant requests.
  • Graph Database Storage: Stores crawled pages and their relationships in Neo4j for advanced querying.
  • Message Queue: Uses Apache Kafka for distributed task processing and run management.
  • Screenshot Capture: Optional screenshot functionality for visual page archiving.
  • Concurrent Processing: Configurable number of concurrent workers for efficient crawling.
  • Dockerized Deployment: Complete containerized setup with Docker Compose.
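The multi-format link extraction above can be illustrated with a minimal sketch. This simplified version uses a single regular expression over raw content; the actual project parses HTML, JavaScript (via goja), and JSON responses with dedicated logic, so treat this as an assumption-laden stand-in rather than the real parser:

```go
package main

import (
	"fmt"
	"regexp"
)

// urlPattern matches absolute http(s) URLs up to the first character
// that cannot appear in one (whitespace, quotes, angle brackets).
var urlPattern = regexp.MustCompile(`https?://[^\s"'<>]+`)

// extractLinks pulls URL candidates out of raw page content,
// whether it arrived as HTML markup, inline JavaScript, or JSON.
func extractLinks(content string) []string {
	return urlPattern.FindAllString(content, -1)
}

func main() {
	html := `<a href="https://example.com/docs">docs</a>`
	js := `fetch("https://api.example.com/items");`
	fmt.Println(extractLinks(html + js))
}
```

A regex pass like this misses relative links and URLs built at runtime, which is exactly why the project renders pages in a browser and evaluates JavaScript with goja instead.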

Architecture

The crawler consists of several key components:

  • Crawler: Main crawling logic using browser automation
  • Processor: Kafka-based message processing for tasks and runs
  • Page Parser: Extracts links from various content types
  • Networker: Handles HTTP requests and browser interactions
  • Page Repository: Manages data persistence in Neo4j
  • Cache: Redis-based caching layer
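One way to picture how these components fit together is as a set of small interfaces. The names and signatures below are hypothetical, sketched from the component list rather than taken from the project's actual API; the in-memory cache stands in for Redis purely for illustration:

```go
package main

import "fmt"

// Networker fetches a page, optionally rendering JavaScript in a browser.
type Networker interface {
	Fetch(url string) (body string, err error)
}

// PageParser extracts outgoing links from fetched content.
type PageParser interface {
	Links(body string) []string
}

// PageRepository persists pages and their link relationships
// (backed by Neo4j in this project).
type PageRepository interface {
	Save(url string, links []string) error
}

// Cache avoids redundant fetches of pages and robots.txt
// (backed by Redis in this project).
type Cache interface {
	Get(key string) (string, bool)
	Set(key, value string)
}

// memCache is a toy in-memory Cache implementation for the sketch.
type memCache struct{ m map[string]string }

func (c *memCache) Get(k string) (string, bool) { v, ok := c.m[k]; return v, ok }
func (c *memCache) Set(k, v string)             { c.m[k] = v }

func main() {
	var c Cache = &memCache{m: map[string]string{}}
	c.Set("robots:example.com", "User-agent: *")
	v, _ := c.Get("robots:example.com")
	fmt.Println(v)
}
```

Keeping the crawler's dependencies behind interfaces like these is what makes it possible to swap Redis or Neo4j for in-memory fakes in tests.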

Data Flow

  1. Run configurations are sent to Kafka
  2. Processor creates initial tasks
  3. Crawler workers process tasks concurrently
  4. Pages are parsed for links and stored in Neo4j
  5. New tasks are generated for discovered links (within depth limits)
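The steps above can be sketched as a single-threaded loop with in-memory stand-ins: a slice plays the role of the Kafka task queue, a map plays the role of the Neo4j store, and `linksOf` simulates fetching and parsing a page. This is an illustrative assumption, not the project's actual processor:

```go
package main

import "fmt"

// Task mirrors step 2: a URL to crawl plus its depth in the crawl tree.
type Task struct {
	URL   string
	Depth int
}

// crawlRun walks steps 1-5: seed a queue, process tasks, store each
// page's links, and enqueue newly discovered links within maxDepth.
func crawlRun(seed string, maxDepth int, linksOf func(string) []string) map[string][]string {
	store := map[string][]string{}         // stands in for Neo4j
	queue := []Task{{URL: seed, Depth: 0}} // stands in for Kafka
	seen := map[string]bool{seed: true}

	for len(queue) > 0 {
		t := queue[0]
		queue = queue[1:]
		links := linksOf(t.URL)
		store[t.URL] = links // step 4: persist the page and its links
		if t.Depth >= maxDepth {
			continue // step 5: no new tasks beyond the depth limit
		}
		for _, l := range links {
			if !seen[l] {
				seen[l] = true
				queue = append(queue, Task{URL: l, Depth: t.Depth + 1})
			}
		}
	}
	return store
}

func main() {
	graph := map[string][]string{"a": {"b", "c"}, "b": {"d"}}
	linksOf := func(u string) []string { return graph[u] }
	store := crawlRun("a", 1, linksOf)
	fmt.Println(len(store)) // a, b, and c are crawled; d is beyond depth 1
}
```

In the real system the queue operations become Kafka produce/consume calls and the loop body runs across concurrent workers, but the depth accounting is the same.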

License

This project is licensed under the MIT License - see the LICENSE file for details.

About

A web-crawler project with Kafka queues, Redis cache support, a Neo4j map builder, strict robots.txt compliance, a one-of-a-kind static JS parser built on goja, some cool features from the Rod library, and more! No updates for a while, but the work will be resumed soon.
