anituqe-analyzer/web-scraper

Web scraper

Description

Web scraper component of the project. The module contains three groups of scrapers:

  • Training scrapers: Used for bulk data collection to train machine learning models.
  • Apify prototypes: Cloud-based versions used for testing infrastructure and anti-bot bypass.
  • Production scrapers: Stable versions integrated with the main application for real-time data fetching.

Technologies

The project is built with:

  • Python
  • Python libraries:
    • Selenium
    • Undetected Chromedriver
    • BeautifulSoup
    • Requests
    • Apify Client (for prototypes)
    • Python-dotenv (for environment variables)

Prerequisites

To run this project, you need:

  • Code editor (e.g., PyCharm or Visual Studio Code)
  • Python 3.x and the libraries listed above
  • Google Chrome browser
  • Apify API token (for the Apify prototypes)

Installation & Setup

  1. Clone the repository

Download the project files to your local machine:

git clone https://github.com/anituqe-analyzer/web_scraper.git
cd web_scraper

  2. Set up a virtual environment (Windows)

It is recommended to create a virtual environment to isolate project dependencies:

python -m venv .venv
.venv\Scripts\activate
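
Whether the environment is actually active can be checked from Python itself. The following is a minimal sketch that uses only the standard library; the helper name in_virtualenv is illustrative and not part of the repository.

```python
import sys


def in_virtualenv():
    """Return True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
```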

  3. Install dependencies

Install the required Python libraries using pip:

pip install selenium undetected-chromedriver beautifulsoup4 requests apify-client python-dotenv
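
After installation, a quick check can confirm that every library is importable. This is a sketch, not a script from the repository: the missing_modules helper is hypothetical, and the list uses the import names, which differ from some of the pip package names.

```python
import importlib.util

# Import names differ from some pip package names:
# beautifulsoup4 -> bs4, python-dotenv -> dotenv, apify-client -> apify_client.
REQUIRED_MODULES = [
    "selenium",
    "undetected_chromedriver",
    "bs4",
    "requests",
    "apify_client",
    "dotenv",
]


def missing_modules(names):
    """Return the module names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```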
  4. Configuration

To use the Apify prototypes, create a .env file in the root directory and add your API token:

APIFY_TOKEN=your_api_token_here
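
A script can read this token roughly as follows. This is a sketch under assumptions: the helper names get_apify_token and main are hypothetical, and the way the repository's own prototypes load the token may differ. Only the variable name APIFY_TOKEN and the use of python-dotenv come from this README.

```python
import os

# Placeholder value from the .env example; treated as "not configured".
PLACEHOLDER = "your_api_token_here"


def get_apify_token(environ=os.environ):
    """Return APIFY_TOKEN from the given mapping, or raise a clear error."""
    token = environ.get("APIFY_TOKEN", "").strip()
    if not token or token == PLACEHOLDER:
        raise RuntimeError(
            "APIFY_TOKEN is missing; add it to the .env file in the project root."
        )
    return token


def main():
    # Imported lazily so the helper above works without python-dotenv installed.
    from dotenv import load_dotenv

    load_dotenv()  # loads variables from a .env file into os.environ
    token = get_apify_token()
    print("Apify token loaded, length:", len(token))
```

Calling main() from the project root loads the .env file and validates the token without printing it. The returned value could then be passed to the Apify client, for example ApifyClient(token) from the apify_client package.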

Usage

To run a specific version of the scraper, execute one of the following commands in your terminal:

Training Scrapers:

  • For OLX: python scrapers/training/web_scraper_olx.py
  • For Allegro: python scrapers/training/web_scraper_allegro.py
  • For eBay: python scrapers/training/web_scraper_ebay.py

Prototypes (Apify):

  • For eBay: python scrapers/prototypes/web_scraper_ebay_apify.py
  • For Allegro: python scrapers/prototypes/web_scraper_allegro_apify.py
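
The commands above can also be wrapped in a small launcher. The sketch below is illustrative only: SCRIPTS, build_command, and run_scraper are hypothetical names, the script paths are copied from the commands listed above, and the working directory is assumed to be the repository root.

```python
import subprocess
import sys

# Script paths copied from the commands listed above; the site keys are
# taken from the script file names.
SCRIPTS = {
    ("training", "olx"): "scrapers/training/web_scraper_olx.py",
    ("training", "allegro"): "scrapers/training/web_scraper_allegro.py",
    ("training", "ebay"): "scrapers/training/web_scraper_ebay.py",
    ("prototype", "ebay"): "scrapers/prototypes/web_scraper_ebay_apify.py",
    ("prototype", "allegro"): "scrapers/prototypes/web_scraper_allegro_apify.py",
}


def build_command(version, site):
    """Return the argv list that runs the requested scraper with the current interpreter."""
    try:
        script = SCRIPTS[(version, site)]
    except KeyError:
        raise ValueError(f"no scraper for version={version!r}, site={site!r}") from None
    return [sys.executable, script]


def run_scraper(version, site):
    """Run the scraper from the repository root and return its exit code."""
    return subprocess.run(build_command(version, site)).returncode
```

With this in place, run_scraper("training", "olx") would be equivalent to the first training command. Using sys.executable keeps the child process on the same interpreter, and therefore the same virtual environment, as the launcher.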

Scraped Data

The data/ directory contains datasets collected using these web scrapers to train models.
