Web scraper component for the project. The module contains three versions of scrapers:
- Training scrapers: Used for bulk data collection to train machine learning models.
- Apify prototypes: Cloud-based versions used for testing infrastructure and anti-bot bypass.
- Production scrapers: Stable versions integrated with the main application for real-time data fetching.
The project is built with:
- Python
- Python libraries:
  - Selenium
  - Undetected Chromedriver
  - BeautifulSoup
  - Requests
  - Apify Client (for the prototypes)
  - python-dotenv (for environment variables)
To run this project, you need:
- A code editor (e.g., PyCharm or Visual Studio Code)
- Python 3.x with the libraries listed above installed
- Google Chrome browser
- An Apify API token (for the prototype versions)
## Installation

- **Clone the repository**

  Download the project files to your local machine:

  ```shell
  git clone https://github.com/anituqe-analyzer/web_scraper.git
  cd web_scraper
  ```

- **Set up a virtual environment (Windows)**

  It is recommended to create a virtual environment to isolate project dependencies:

  ```shell
  python -m venv .venv
  .venv\Scripts\activate
  ```

- **Install dependencies**

  Install the required Python libraries using pip:

  ```shell
  pip install selenium undetected-chromedriver beautifulsoup4 requests apify-client python-dotenv
  ```
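If you prefer a pinned dependency file, the same libraries can be listed in a `requirements.txt` (such a file is not mentioned in the repository, so this is an optional convenience; versions are left unpinned here):

```
selenium
undetected-chromedriver
beautifulsoup4
requests
apify-client
python-dotenv
```

With that file in place, `pip install -r requirements.txt` installs everything in one step.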
- **Configuration**

  To use the Apify prototypes, create a `.env` file in the root directory and add your API token:

  ```
  APIFY_TOKEN=your_api_token_here
  ```
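For reference, here is a minimal sketch of how a prototype script might read this token. The helper name is hypothetical; the actual scripts may structure this differently:

```python
import os


def get_apify_token() -> str:
    """Return the Apify API token from the environment.

    python-dotenv's load_dotenv() copies .env entries into os.environ;
    if the package is missing, plain environment variables still work.
    """
    try:
        from dotenv import load_dotenv  # provided by python-dotenv
        load_dotenv()  # reads .env from the current working directory
    except ImportError:
        pass  # fall back to variables already set in the environment
    token = os.getenv("APIFY_TOKEN", "")
    if not token:
        raise RuntimeError("APIFY_TOKEN is not set; add it to your .env file")
    return token
```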
## Usage

To run a specific version of the scraper, execute one of the following commands in your terminal:

**Training scrapers:**

- For OLX:

  ```shell
  python scrapers/training/web_scraper_olx.py
  ```

- For Allegro:

  ```shell
  python scrapers/training/web_scraper_allegro.py
  ```

- For eBay:

  ```shell
  python scrapers/training/web_scraper_ebay.py
  ```
**Prototypes (Apify):**

- For eBay:

  ```shell
  python scrapers/prototypes/web_scraper_ebay_apify.py
  ```

- For Allegro:

  ```shell
  python scrapers/prototypes/web_scraper_allegro_apify.py
  ```
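In rough outline, the prototypes drive a cloud actor through `apify-client`. The sketch below shows that flow; the actor ID and the input fields are placeholders, not the project's actual values:

```python
def build_run_input(start_url: str, max_items: int = 50) -> dict:
    """Assemble the actor input dict; the field names here are illustrative."""
    return {"startUrls": [{"url": start_url}], "maxItems": max_items}


def run_actor(token: str, actor_id: str, start_url: str) -> list:
    """Start an Apify actor run and collect its dataset items."""
    from apify_client import ApifyClient  # imported lazily; needs apify-client

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=build_run_input(start_url))
    dataset = client.dataset(run["defaultDatasetId"])
    return list(dataset.iterate_items())
```

A call like `run_actor(token, "someuser/allegro-scraper", "https://allegro.pl/...")` would block until the run finishes and return the scraped items; `"someuser/allegro-scraper"` is a made-up actor ID.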
The `data/` directory contains the datasets collected with these scrapers, which are used to train the models.
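The core of that collection step is turning listing HTML into structured records. Below is a dependency-free sketch of the idea using only the standard library; the real scrapers use BeautifulSoup, and the tag and class names here are made up for illustration:

```python
from html.parser import HTMLParser


class ListingTitleParser(HTMLParser):
    """Collect the text of every <h2 class="title"> element.

    Illustrative only: real listing pages use site-specific markup.
    """

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "title") in attrs:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip():
            self.titles.append(data.strip())


page = '<div><h2 class="title">Vintage lamp</h2><h2 class="title">Old clock</h2></div>'
parser = ListingTitleParser()
parser.feed(page)
print(parser.titles)  # ['Vintage lamp', 'Old clock']
```

With BeautifulSoup the same extraction collapses to roughly `[h.get_text(strip=True) for h in soup.find_all("h2", class_="title")]`.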