rdsaad/ScrapeGenius.AI

ScrapeGenius.AI

ScrapeGenius.AI is an AI-powered web scraping tool written in Python. It uses Streamlit for the front end, LangChain to connect an AI model that interprets the scraped data, and Selenium to drive the scraping itself. Dependencies are installed into a Python virtual environment (venv, set up in VS Code) from a single dependency file.

The code lives in three Python files: main.py, which provides the Streamlit front end; genius.py, which handles the scraping, cleans the DOM content, and deals with CAPTCHAs; and extract.py, which uses Ollama to parse the content extracted from the target website. Bright Data serves as the scraping browser, handling CAPTCHAs, blocked websites, and IP bans.
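To illustrate the "cleaning DOM content" step, here is a minimal sketch that uses only the Python standard library. It is not the code in genius.py, which may well rely on a different parser; the class name `_TextExtractor` and the function name `clean_dom_content` are hypothetical. The idea is to drop script and style contents and keep only the visible text, one non-empty line at a time.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Put each text node on its own line so adjacent elements
        # (for example a heading followed by a paragraph) do not run together.
        return "\n".join(self._parts)


def clean_dom_content(html):
    """Return the visible text of an HTML string, one non-empty line per row.

    Hypothetical helper: an assumption about what "cleaning" involves,
    not the project's actual implementation.
    """
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in parser.text().splitlines())
    return "\n".join(line for line in lines if line)
```

Removing markup before the text reaches the model keeps the prompt small and avoids spending context on JavaScript and CSS that carry no extractable information.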

With a local ChromeDriver installed, activate the virtual environment with "source ScrapeGenius/bin/activate" and then run "streamlit run main.py" to launch the scraping site locally. In short, ScrapeGenius.AI scrapes a website, retrieves its DOM content, and uses AI to extract the information the user asks for in a prompt.
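The last stage of that flow, handing DOM content to a model together with the user's prompt, could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function names, the prompt wording, and the 6000-character chunk size are hypothetical, and `model` stands in for whatever callable wraps the Ollama model in extract.py. Splitting the text into chunks is one common way to stay within a model's context window.

```python
# Hypothetical prompt wording; the project's real prompt is not shown in the README.
PROMPT_TEMPLATE = (
    "You are extracting information from the following text:\n"
    "{dom_content}\n\n"
    "Return only the information matching this description: {parse_description}\n"
    "If nothing matches, return an empty string."
)


def split_dom_content(dom_content, max_length=6000):
    """Split cleaned DOM text into chunks of at most max_length characters."""
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [
        dom_content[i:i + max_length]
        for i in range(0, len(dom_content), max_length)
    ]


def parse_chunks(chunks, parse_description, model):
    """Send each chunk to the model and join the non-empty answers.

    `model` is any callable that takes a prompt string and returns a string.
    In the real project this role would be played by the Ollama-backed model;
    here it is left abstract so the sketch runs without external services.
    """
    results = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(
            dom_content=chunk,
            parse_description=parse_description,
        )
        answer = model(prompt).strip()
        if answer:
            results.append(answer)
    return "\n".join(results)
```

Because the model is passed in as a plain callable, the chunking and prompt logic can be exercised with a stub function before a local Ollama model is wired in.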

About

Python-powered AI web scraping tool
