Web scraper component for the project. The module contains three versions of scrapers:
- Training scrapers: Used for bulk data collection to train machine learning models.
- Apify prototypes: Cloud-based versions used for testing infrastructure and anti-bot bypass.
- Production scrapers: Stable versions integrated with the main application for real-time data fetching.
The project is built with:
- Python
- Python libraries:
  - Selenium
  - Undetected Chromedriver
  - BeautifulSoup
  - Requests
  - Apify Client (for the prototypes)
  - python-dotenv (for environment variables)
To run this project, you need:
- A code editor (e.g., PyCharm or Visual Studio Code)
- Python 3.x with the libraries listed above installed
- Google Chrome browser
- An Apify API token (for the prototype versions)
## Installation

- **Clone the repository**

  Download the project files to your local machine:

  ```shell
  git clone https://github.com/anituqe-analyzer/web_scraper.git
  cd web_scraper
  ```

- **Set up a virtual environment (Windows)**

  It is recommended to create a virtual environment to isolate project dependencies:

  ```shell
  python -m venv .venv
  .venv\Scripts\activate
  ```

- **Install dependencies**

  Install the required Python libraries using pip:

  ```shell
  pip install selenium undetected-chromedriver beautifulsoup4 requests apify-client python-dotenv
  ```
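If you prefer a pinned dependency file, the same libraries can be listed in a `requirements.txt` (such a file is not mentioned in the repository, so this is an optional convenience; versions are left unpinned here):

```
selenium
undetected-chromedriver
beautifulsoup4
requests
apify-client
python-dotenv
```

With that file in place, `pip install -r requirements.txt` installs everything in one step.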
- **Configuration**

  To use the Apify prototypes, create a `.env` file in the root directory and add your API token:

  ```
  APIFY_TOKEN=your_api_token_here
  ```
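For reference, here is a minimal sketch of how a prototype script might read this token. The helper name is hypothetical; the actual scripts may structure this differently:

```python
import os


def get_apify_token() -> str:
    """Return the Apify API token from the environment.

    python-dotenv's load_dotenv() copies .env entries into os.environ;
    if the package is missing, plain environment variables still work.
    """
    try:
        from dotenv import load_dotenv  # provided by python-dotenv
        load_dotenv()  # reads .env from the current working directory
    except ImportError:
        pass  # fall back to variables already set in the environment
    token = os.getenv("APIFY_TOKEN", "")
    if not token:
        raise RuntimeError("APIFY_TOKEN is not set; add it to your .env file")
    return token
```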
## Usage

To run a specific version of the scraper, execute one of the following commands in your terminal:

**Training scrapers:**

- For OLX:

  ```shell
  python scrapers/training/web_scraper_olx.py
  ```

- For Allegro:

  ```shell
  python scrapers/training/web_scraper_allegro.py
  ```

- For eBay:

  ```shell
  python scrapers/training/web_scraper_ebay.py
  ```
**Prototypes (Apify):**

- For eBay:

  ```shell
  python scrapers/prototypes/web_scraper_ebay_apify.py
  ```

- For Allegro:

  ```shell
  python scrapers/prototypes/web_scraper_allegro_apify.py
  ```
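In rough outline, the prototypes drive a cloud actor through `apify-client`. The sketch below shows that flow; the actor ID and the input fields are placeholders, not the project's actual values:

```python
def build_run_input(start_url: str, max_items: int = 50) -> dict:
    """Assemble the actor input dict; the field names here are illustrative."""
    return {"startUrls": [{"url": start_url}], "maxItems": max_items}


def run_actor(token: str, actor_id: str, start_url: str) -> list:
    """Start an Apify actor run and collect its dataset items."""
    from apify_client import ApifyClient  # imported lazily; needs apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=build_run_input(start_url))
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())
```

A call like `run_actor(token, "someuser/allegro-scraper", "https://allegro.pl/...")` would block until the run finishes and return the scraped items; `"someuser/allegro-scraper"` is a made-up actor ID.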
The `data/` directory contains the datasets collected with these scrapers, which are used to train the models.
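The core of that collection step is turning listing HTML into structured records. Below is a dependency-free sketch of the idea using only the standard library; the real scrapers use BeautifulSoup, and the tag and class names here are made up for illustration:

```python
from html.parser import HTMLParser


class ListingTitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element.

    Illustrative only: real listing pages use site-specific markup.
    """

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


page = '<div><h2 class="title">Vintage lamp</h2><h2 class="title">Old clock</h2></div>'
parser = ListingTitleParser()
parser.feed(page)
print(parser.titles)  # ['Vintage lamp', 'Old clock']
```

With BeautifulSoup the same extraction collapses to roughly `[h.get_text(strip=True) for h in soup.find_all("h2", class_="title")]`.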