Easily extract data from the web with large language models (LLMs) by specifying the desired output format in a JSON file.
Universal Scraping Agent is a versatile and powerful tool for scraping websites and processing images of webpages to extract specific data using Selenium and multimodal large language models (LLMs). This project allows users to define their scraping requirements through a JSON configuration, making it easy to customize and automate data extraction tasks.
In the initial release, we support the use of:
- GPT models from OpenAI
- Claude 3 models from Anthropic
- Gemini models from Google
- Llama 3 models from Groq
- Automated Web Scraping: Uses Selenium to navigate and scrape data from websites.
- Image Processing: Captures screenshots of webpages and processes them using LLMs.
- Customizable Data Extraction: Users can specify the data to be scraped using a JSON configuration.
- Dynamic Content Handling: Capable of handling dynamic web content and AJAX-loaded elements (see the sketch after this list).
- Multi-Platform Support: Compatible with Windows, macOS, and Linux.
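How the agent handles dynamic pages internally isn't spelled out here, but with Selenium this is typically done with explicit waits. The following is a minimal sketch, assuming a Chrome driver and using a placeholder URL and CSS selector, of waiting for AJAX-loaded content before capturing a screenshot for the LLM; it illustrates the technique, not the project's exact code.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Wait up to 10 seconds for AJAX-loaded content to appear before capturing.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "#content"))  # placeholder selector
)

# Capture the rendered page as an image for the multimodal LLM.
driver.save_screenshot("images/example.png")
driver.quit()
```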
The first thing to set up is your OpenAI, Anthropic, Google, and Groq API keys. You need OpenAI for the GPT models, Anthropic for the Claude 3 models, Google for the Gemini models, and Groq for the Llama 3 models. You only need a key from one provider to begin automated scraping.
- Get an API key for OpenAI at: https://platform.openai.com/api-keys
- Get directions for how to generate an API key for Anthropic at: https://docs.anthropic.com/claude/reference/getting-started-with-the-api
- Get an API key for Google at: https://aistudio.google.com/app/apikey
- Create a free account and get an API key for Groq at: https://console.groq.com/keys
Next, create four text files called OPENAI_API_KEY.txt, ANTHROPIC_API_KEY.txt, GOOGLE_API_KEY.txt, and GROQ_API_KEY.txt. Paste your respective API keys into the text files.
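The repo's exact key-loading code isn't reproduced here, but reading a key from one of these files amounts to a one-liner. In this sketch, `read_api_key` is a hypothetical helper, not part of the repo:

```python
import os

def read_api_key(filename: str) -> str:
    """Hypothetical helper: read an API key from a text file, stripping whitespace."""
    with open(filename, "r") as f:
        return f.read().strip()

# Example: expose the OpenAI key the way most SDKs expect it.
os.environ["OPENAI_API_KEY"] = read_api_key("OPENAI_API_KEY.txt")
```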
- Make a Python virtual environment

  ```sh
  python3.10 -m venv scraper-agent-env
  source scraper-agent-env/bin/activate
  ```

- Clone the repo

  ```sh
  git clone https://github.com/mdsunbeam/universal-scraper-agent.git
  cd universal-scraper-agent
  ```

- Install Python packages

  ```sh
  pip install -r requirements.txt
  ```
- Define your scraping configuration by creating a JSON file to specify what data you want to extract.

  ```json
  {
      "Description": "detailed description of the contents of the webpage",
      "Interesting Feature": ["list of any peculiar, interesting, or unique features about the page"]
  }
  ```

- Run the scraper agent.
  ```python
  from llms import GPT
  from utils import load_json_as_dict, save_json_to_file, explore_links
  import cv2
  import os

  if __name__ == "__main__":
      # Shallow exploration of links on the website
      website_url = 'https://mdsunbeam.com/'  # Replace with your target website
      explore_links(website_url)

      MODELS = {
          "OpenAI": ["gpt-4-turbo", "gpt-4o", "gpt-3.5-turbo"]
      }

      desired_format = load_json_as_dict("specific_output.json")
      system_message = f"""You are a web-scraping agent that can decide how to scrape information
      from webpages. Please organize the JSON scraping in the following format: \n
      {desired_format}
      """

      # Process each captured webpage screenshot and save one JSON result per image.
      directory_path = "images"
      for filename in os.listdir(directory_path):
          file_path = os.path.join(directory_path, filename)
          if os.path.isfile(file_path):
              webpage_image = cv2.imread(file_path)
              gpt4o = GPT(model_name=MODELS["OpenAI"][1], system_message=system_message)
              gpt4o.add_user_message(frame=webpage_image, user_msg="Please give me results in the desired JSON form.")
              save_json_to_file(gpt4o.generate_response(), f"scraped_results/result_{filename}.json")
  ```

  or

  ```sh
  python main.py
  ```

- View scraping results in the `scraped_results/` folder.
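The helpers imported from `utils` ship with the repo; their exact implementations aren't shown here, but `load_json_as_dict` and `save_json_to_file` plausibly amount to thin wrappers around the standard `json` module, as in this sketch:

```python
import json
import os

def load_json_as_dict(path: str) -> dict:
    """Read a JSON file into a Python dict."""
    with open(path, "r") as f:
        return json.load(f)

def save_json_to_file(data, path: str) -> None:
    """Write data to a JSON file, creating the parent directory if needed."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "w") as f:
        json.dump(data, f, indent=4)
```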
- Add a key in the results that saves the exact URL
- Add error handling during shallow exploration
See the open issues for a full list of proposed features (and known issues).
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
- Fork the Project
- Create your Feature Branch (`git checkout -b feature/new-feature`)
- Commit your Changes (`git commit -m 'added new feature'`)
- Push to the Branch (`git push origin feature/new-feature`)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
@MdSunbeam - mdsunbeam3.14@gmail.com
Project Link: https://github.com/mdsunbeam/universal-scraper-agent