Scraper Module Structure

The scraper module is designed with a modular architecture so that team members can work on specific parts independently. Although this requires a bit more upfront effort, it greatly reduces refactoring later and minimizes the impact of changes on other parts of the system.

`run.py` (High-Level Execution)

Purpose: This is the entry point for the entire scraping process. It’s located outside the scraper module.
Usage: It calls the runner.py class to run the scrapers.
Note: You should only interact with the module through runner.py at this level.

`runner.py` (Orchestration & Aggregation)

Purpose: Orchestrates the execution of multiple scrapers (engines) and handles overall process control.
Usage: Sets up the Scrapy process, collects items from each engine, and saves aggregated results.

`scraper_engine.py` (Scraper Configuration & Execution)

Purpose: Translates the user-defined SpiderConfig into steps that the Scrapy spider can execute.
Usage: Initializes and runs the main spider (StepSpider) with the appropriate settings and tasks.

`engine_spider.py` (Core Scrapy Spider)

Purpose: Implements the StepSpider class, which handles the actual crawling.
Usage: Manages requests (including pagination and Playwright integration for dynamic content) and processes each defined task (e.g., find, follow, click).
Note: This is where the core logic of your scraping – like pagination handling and element extraction – takes place.

`config.py` (Scraper Configuration Definitions)

Purpose: Defines the data classes and types (like Find, Follow, and PaginationConfig) for configuring the scrapers.
Usage: Provides a structured way to define scraper behavior that is both reusable and easy to update.

`helpers.py` (Utility Functions)

Purpose: Contains utility functions for common tasks such as element selection, text extraction, and URL normalization.
Usage: Keeps the core spider code clean and DRY (Don’t Repeat Yourself) by centralizing reusable logic.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Scraper Module Structure

`run.py` (High-Level Execution)

`runner.py` (Orchestration & Aggregation)

`scraper_engine.py` (Scraper Configuration & Execution)

`engine_spider.py` (Core Scrapy Spider)

`config.py` (Scraper Configuration Definitions)

`helpers.py` (Utility Functions)

FilesExpand file tree

usage.md

Latest commit

History

usage.md

File metadata and controls

Scraper Module Structure

run.py (High-Level Execution)

runner.py (Orchestration & Aggregation)

scraper_engine.py (Scraper Configuration & Execution)

engine_spider.py (Core Scrapy Spider)

config.py (Scraper Configuration Definitions)

helpers.py (Utility Functions)

`run.py` (High-Level Execution)

`runner.py` (Orchestration & Aggregation)

`scraper_engine.py` (Scraper Configuration & Execution)

`engine_spider.py` (Core Scrapy Spider)

`config.py` (Scraper Configuration Definitions)

`helpers.py` (Utility Functions)