Universal Scraper Agent

Easily extract data from the web with large language models (LLMs) by specifying the output format through JSON files.
Report Bug · Request Feature

Table of Contents
  1. About The Project
  2. Getting Started
  3. Usage
  4. Roadmap
  5. Contributing
  6. License
  7. Contact

About The Project

Universal Scraper Agent is a versatile and powerful tool for scraping websites and processing images of webpages to extract specific data using Selenium and multimodal large language models (LLMs). Users define their scraping requirements through a JSON configuration, making it easy to customize and automate data extraction tasks.

Supported Models

The initial release supports:

  • GPT models from OpenAI
  • Claude 3 models from Anthropic
  • Gemini models from Google
  • Llama 3 models served through Groq

Key Features

  • Automated Web Scraping: Uses Selenium to navigate websites and scrape data.
  • Image Processing: Captures screenshots of webpages and processes them with multimodal LLMs (see the sketch after this list).
  • Customizable Data Extraction: Users can specify the data to be scraped using a JSON configuration.
  • Dynamic Content Handling: Capable of handling dynamic web content and AJAX-loaded elements.
  • Multi-Platform Support: Compatible with Windows, macOS, and Linux.
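
For a concrete picture of the capture step, here is a minimal Selenium sketch. It is an illustrative simplification, not the repository's actual code; the URL and output path are placeholders.

from selenium import webdriver

# Illustrative sketch only -- the project's own Selenium logic may differ.
driver = webdriver.Chrome()  # Selenium 4+ resolves a matching ChromeDriver automatically
driver.get("https://example.com/")  # placeholder URL
driver.save_screenshot("images/example.png")  # captured screenshots later feed the multimodal LLM
driver.quit()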

(back to top)

Getting Started

First, set up your API keys: OpenAI for the GPT models, Anthropic for the Claude 3 models, Google for the Gemini models, and Groq for the Llama 3 models. A key for just one of these providers is enough to begin automated scraping.

Prerequisites

  1. Get an API key for OpenAI at: https://platform.openai.com/api-keys
  2. Follow the directions for generating an Anthropic API key at: https://docs.anthropic.com/claude/reference/getting-started-with-the-api
  3. Get an API key for Google at: https://aistudio.google.com/app/apikey
  4. Create a free account and get an API key for Groq at: https://console.groq.com/keys

Next, create four text files named OPENAI_API_KEY.txt, ANTHROPIC_API_KEY.txt, GOOGLE_API_KEY.txt, and GROQ_API_KEY.txt, and paste the corresponding API key into each file.
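
As a rough illustration of how these key files can be consumed, the snippet below reads one from disk. This is a hypothetical helper shown for orientation; the repository's own key-loading code may differ.

import os

# Hypothetical helper -- not the repository's actual loading code.
def load_api_key(filename: str) -> str | None:
    """Return the stripped key from the given file, or None if it is missing."""
    if not os.path.isfile(filename):
        return None
    with open(filename, "r", encoding="utf-8") as f:
        return f.read().strip()

openai_key = load_api_key("OPENAI_API_KEY.txt")  # repeat for the other three files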

Installation

  1. Create and activate a Python virtual environment (on Windows, activate with scraper-agent-env\Scripts\activate instead of source)
    python3.10 -m venv scraper-agent-env
    source scraper-agent-env/bin/activate
  2. Clone the repo
    git clone https://github.com/mdsunbeam/universal-scraper-agent.git
    cd universal-scraper-agent
  3. Install Python packages
    pip install -r requirements.txt 
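
As a quick sanity check after installing, you can confirm that the key dependencies import cleanly; this assumes selenium and opencv-python are among the pinned requirements, which the Usage example below relies on.

    python -c "import selenium, cv2; print(selenium.__version__, cv2.__version__)"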

(back to top)

Usage

  1. Define your scraping configuration by creating a JSON file that specifies the data you want to extract, for example:
{
    "Description" : "detailed description of the contents of the webpage",
    "Interesting Feature" : ["list of any peculiar, interesting, or unique features about the page"]
}
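
For illustration, a result for a single page might then look like the following (hypothetical values):

{
    "Description" : "Personal portfolio homepage with a short bio and a list of projects",
    "Interesting Feature" : ["animated landing header", "dark-mode toggle"]
}
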
  2. Run the scraper agent.
from llms import GPT
from utils import load_json_as_dict, save_json_to_file, explore_links
import cv2
import os


if __name__ == "__main__":

    # Shallow exploration of links on the website
    website_url = 'https://mdsunbeam.com/'  # Replace with your target website
    explore_links(website_url)

    MODELS = {
        "OpenAI": ["gpt-4-turbo", "gpt-4o", "gpt-3.5-turbo"]
    }

    # The output structure defined in your JSON configuration
    desired_format = load_json_as_dict("specific_output.json")

    system_message = f"""You are a web-scraping agent that can decide how to scrape information
    from webpages. Please organize the JSON scraping in the following format: \n
    {desired_format}
    """

    directory_path = "images"
    os.makedirs("scraped_results", exist_ok=True)  # make sure the output folder exists
    for filename in os.listdir(directory_path):
        file_path = os.path.join(directory_path, filename)
        if os.path.isfile(file_path):
            webpage_image = cv2.imread(file_path)
            # A fresh model instance per screenshot keeps each page's context isolated
            gpt4o = GPT(model_name=MODELS["OpenAI"][1], system_message=system_message)  # "gpt-4o"
            gpt4o.add_user_message(frame=webpage_image, user_msg="Please give me results in the desired JSON form.")
            stem = os.path.splitext(filename)[0]  # drop the image extension from the output name
            save_json_to_file(gpt4o.generate_response(), f"scraped_results/result_{stem}.json")

or run the script directly:

python main.py
  3. View scraping results in the scraped_results/ folder.

(back to top)

Roadmap

  • Add a key to the results that saves the exact URL
  • Add error handling during shallow exploration

See the open issues for a full list of proposed features (and known issues).

(back to top)

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/new-feature)
  3. Commit your Changes (git commit -m 'added new feature')
  4. Push to the Branch (git push origin feature/new-feature)
  5. Open a Pull Request

(back to top)

License

Distributed under the MIT License. See LICENSE for more information.

(back to top)

Contact

@MdSunbeam - mdsunbeam3.14@gmail.com

Project Link: https://github.com/mdsunbeam/universal-scraper-agent

(back to top)
