ScrapeGenius.AI is an AI-powered web scraping tool written in Python. It uses Streamlit for the front end, LangChain to integrate the AI model that interprets the scraped data, and Selenium to drive the scraping itself. Dependencies are installed into a virtual environment created with venv in VSCode, so setup comes down to a single dependency file.
The code lives in three Python files: main.py, which powers the website's front-end interface; genius.py, which handles the actual web scraping, DOM content cleaning, and CAPTCHA bypassing; and extract.py, which uses Ollama to parse the content extracted from the provided website. Bright Data serves as the scraping browser, handling CAPTCHAs, blocked websites, and IP bans.
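As a rough illustration of the DOM cleaning step attributed to genius.py, the sketch below strips script and style contents and collapses the remaining text into non-empty lines. It is not the project's actual code: the names clean_dom_content and _TextExtractor are hypothetical, and it uses only the standard library's html.parser so that it runs without extra dependencies, whereas the project may rely on a different parsing library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents.

    Hypothetical helper for illustration; not taken from genius.py.
    """

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Put each text node on its own line, then drop blank lines
        # and surrounding whitespace.
        joined = "\n".join(self._parts)
        lines = (line.strip() for line in joined.splitlines())
        return "\n".join(line for line in lines if line)


def clean_dom_content(html):
    """Return the readable text of an HTML document (assumed helper name)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Given a page whose body holds a heading and a paragraph, the function returns just those two pieces of text, one per line, which keeps the input passed to the AI model small.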
With a local ChromeDriver in place, users activate the environment with "source ScrapeGenius/bin/activate" and then run "streamlit run main.py" to launch the local AI-powered web scraping site. In short, ScrapeGenius.AI scrapes a website, retrieves its DOM content, and uses AI to extract targeted information based on user prompts.
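The last step of that flow, handing cleaned DOM content and a user prompt to the model, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the contents of extract.py: the function names, the prompt wording, and the 6000-character chunk size are all hypothetical, and the model is passed in as a plain callable (ask_model) so the example runs without Ollama installed.

```python
# Hypothetical prompt wording; the project's real template may differ.
PROMPT_TEMPLATE = (
    "Extract information from the following text content:\n"
    "{dom_content}\n\n"
    "Return only the data that matches this description: {description}\n"
    "If nothing matches, return an empty string."
)


def split_dom_content(text, max_length=6000):
    """Split text into chunks of at most max_length characters.

    The chunk size is an assumed value chosen to stay within a
    model's context window.
    """
    return [text[i:i + max_length] for i in range(0, len(text), max_length)]


def extract_with_model(chunks, description, ask_model):
    """Ask the model about each chunk and join the non-empty answers.

    ask_model is any callable that takes a prompt string and returns
    the model's reply as a string.
    """
    results = []
    for chunk in chunks:
        prompt = PROMPT_TEMPLATE.format(
            dom_content=chunk, description=description
        )
        answer = ask_model(prompt).strip()
        if answer:
            results.append(answer)
    return "\n".join(results)
```

In the real application, ask_model would wrap the Ollama-backed model that LangChain exposes. Keeping it as a parameter also makes the extraction logic easy to check with a stand-in function.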