Scraper Module

The Scraper Module is a highly modular web scraping framework built on Scrapy. It simplifies the development of site-specific scrapers by introducing structured configurations, pagination handling, and optional Playwright integration for dynamic content.

Table of Contents

  • Overview
  • Architecture
  • Requirements
  • Installation
  • Usage
  • Configuration System
  • Project Structure
  • License

Overview

The Scraper Module is designed for scalability, reusability, and ease of use.

Key Features:

  • Dynamic Configuration System: Define spiders using structured Python configuration files.
  • Modular Task Execution: Supports Find, Follow, and Pagination operations.
  • Efficient Pagination Handling: Automatically handles recursive and link-based pagination.
  • Playwright Integration: Enables scraping of JavaScript-heavy websites when needed.
  • Duplicate Prevention & Canonicalization: Avoids redundant requests and improves efficiency.

Architecture

The module consists of the following key components:

  • ScraperEngine (scraper_module/scraper_lib/scraper_engine.py):
    Configures and initializes each scraper from a structured configuration.

  • StepSpider (scraper_module/scraper_lib/engine_spider.py):
    The Scrapy-based spider that executes configured tasks (Find, Follow) and handles pagination.

  • RunAllEngines (scraper_module/scraper_lib/runner.py):
    Orchestrates multiple scrapers and ensures efficient execution with duplicate filtering.

  • Helper Functions (scraper_module/scraper_lib/helpers.py):
    Includes URL canonicalization, XPath/CSS selection, and pagination utilities.
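
As a rough illustration of the canonicalization idea behind duplicate filtering (the actual helper in helpers.py may use a different name and signature; this is a hypothetical sketch):

from urllib.parse import urlparse, urlunparse

def canonicalize_url(url: str) -> str:
    # Hypothetical sketch: lowercase the scheme and host, drop the fragment,
    # and strip trailing slashes so equivalent URLs compare equal.
    parts = urlparse(url)
    path = parts.path.rstrip("/") or "/"
    return urlunparse((parts.scheme.lower(), parts.netloc.lower(), path, "", parts.query, ""))

assert canonicalize_url("https://Example.edu/catalog/") == canonicalize_url("https://example.edu/catalog")

Requests whose canonical forms match can be skipped, which is how redundant fetches are avoided across pagination.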

Requirements

Dependencies are listed in requirements.txt and are installed in step 4 of the installation below. The framework is built on Scrapy; if you enable use_playwright for JavaScript-heavy sites, note that Playwright also needs its browser binaries, installed once with playwright install.

Installation

  1. Clone the repository:

    The Scraper Module is meant to be a reusable framework that plugs into your own project.

    git clone https://github.com/IsaacWeber1/scraper_module.git
  2. Integrate the framework into your project:

    Place the scraper_module/ directory in your project folder. The directory structure for your project should look something like this:

    your_project/
    ├── scraper_module/       # Cloned Scraper Module framework
    ├── configs/              # Your spider configurations (you create this folder)
    ├── run.py                # Main script to orchestrate spiders (you create this file)
    ├── data_output/          # Folder for scraped data (you create this folder)
    └── ...                   # Other project files
    
  3. Create and activate a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  4. Install dependencies:

    pip install -r scraper_module/requirements.txt

Usage

Setting Up the Framework

  1. Create the configs/ folder: This folder will contain the configuration files for your scrapers. Each scraper should have its own configuration file written in Python.

    Example working configuration file (configs/brown_university.py):

    from scraper_module.config import *
    
    config = SpiderConfig(
        name="brown_university",
        start_url="https://bulletin.brown.edu/",
        use_playwright=False,
        pagination=Search_Links(
            search_space='xpath://div[@id="cl-menu"]/ul[@id="/"]',
            link_selector="xpath:li/a/@href",
            target_page_selector='xpath://div[@id="tabs"]/ul/li[@id="courseinventorytab"]/a/@href' # Target page to scrape
        ),
        tasks=[
            Find(
                task_name="courses",
                search_space='xpath://*[@id="courseinventorycontainer"]/div/div',
                repeating_selector="div",
                fields={
                    "title": 'xpath:p[@class="courseblocktitle"]/strong//text()',
                    "description": 'xpath:p[@class="courseblockdesc"]//text()join'
                },
                num_required=1
            )
        ]
    )
  2. Create the run.py file: This script orchestrates the execution of all spiders.

    import os
    import glob
    import importlib.util
    from scraper_module.scraper_lib.runner import RunAllEngines
    from scraper_module.scraper_lib.scraper_engine import ScraperEngine
    
    def clean_output():
        # Remove any existing JSON output files to start fresh.
        for file in glob.glob("data_output/*.json"):
            os.remove(file)
    
    EXCLUDE_ENGINES = {"example_exclude"}  # Config file names (without .py) to skip
    
    if __name__ == "__main__":
        clean_output()
        config_files = [
            file for file in glob.glob("configs/*.py")
            if os.path.basename(file).replace(".py", "") not in EXCLUDE_ENGINES
        ]
    
        engines = []
        for config_file in config_files:
            spec = importlib.util.spec_from_file_location(
                "config", config_file
            )
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            engines.append(ScraperEngine(module.config))
    
        runner = RunAllEngines(engines=engines)
        runner.run_all()
        runner.save_all()
  3. Create the data_output/ folder: This folder will store the JSON output files generated by your spiders.

Running All Spiders

To execute all spiders defined in the configs/ folder, run:

python run.py

Running a Single Spider

To run a specific spider programmatically:

from scraper_module.scraper_lib.scraper_engine import ScraperEngine
from configs.example_university import config

engine = ScraperEngine(config)
engine.run()

Adding New Scrapers

  1. Create a new config file in configs/ (e.g., configs/new_university.py).
  2. In that file, define the scraper's tasks and pagination using SpiderConfig (a minimal skeleton is sketched below).
  3. Run the framework (python run.py).
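
A minimal skeleton for such a file, adapted from the Brown University example above (the URL and selectors are placeholders to replace; pagination is omitted for brevity, so add a Search_Links or Listed_Links block as in the Brown example if the target site paginates):

# configs/new_university.py -- placeholder values throughout
from scraper_module.config import *

config = SpiderConfig(
    name="new_university",
    start_url="https://example.edu/catalog/",  # placeholder URL
    use_playwright=False,
    tasks=[
        Find(
            task_name="courses",
            search_space='xpath://div[@class="course-list"]',  # placeholder selector
            repeating_selector="div",
            fields={
                "title": 'xpath:h2/text()'
            },
            num_required=1
        )
    ]
)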

Configuration System

Each scraper is fully configurable using SpiderConfig. Below are the available configuration options:

SpiderConfig

Parameter        Type               Description
name             str                Unique name for the scraper
start_url        str                URL where scraping starts
use_playwright   bool               Enable Playwright for JavaScript-heavy sites
pagination       PaginationConfig   Defines pagination strategy
tasks            List[TaskConfig]   Defines what data to extract

Pagination & Task Configuration

Pagination Types

Type           Description
Search_Links   Finds and follows links dynamically
Listed_Links   Uses a predefined list of pagination links
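
For example, the Search_Links pagination from the Brown University configuration above collects the links under search_space via link_selector, follows each one, and uses target_page_selector to reach the page the tasks run against:

pagination=Search_Links(
    search_space='xpath://div[@id="cl-menu"]/ul[@id="/"]',
    link_selector="xpath:li/a/@href",
    target_page_selector='xpath://div[@id="tabs"]/ul/li[@id="courseinventorytab"]/a/@href'
)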

Task Types

Type     Description
Find     Extracts structured data using XPath/CSS
Follow   Follows links and extracts additional fields

Example Find task:

Find(
    task_name="courses",
    search_space='xpath://div[@class="course-list"]',
    repeating_selector="div",
    fields={
        "title": 'xpath:h2/text()',
        "description": 'xpath:p[@class="summary"]//text()join'
    },
    num_required=1
)
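
Note the join suffix at the end of the description selector (it also appears in the Brown University example): judging by the name, it tells the framework to merge the matched text nodes into a single string instead of returning them as a list.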

Project Structure

your_project/
├── run.py                    # Orchestrates spider execution
├── configs/                  # Configuration files for spiders
│   └── example_spider.py
├── scraper_module/           # Imported scraping framework
│   ├── config.py             # Dataclasses for scraper configs
│   └── scraper_lib/
│       ├── runner.py         # Runs multiple scrapers
│       ├── scraper_engine.py # Handles spider execution
│       ├── engine_spider.py  # Core Scrapy spider
│       └── helpers.py        # Utility functions
└── data_output/              # Scraped data output

License

This project is released under the MIT License.
