auto.ria.com Scraper

📖 Russian version of the documentation: README_RU.md

High-performance asynchronous application for collecting used car data from the auto.ria.com platform.

📋 Description

AutoRia Scraper is an efficient tool for collecting car data from the auto.ria.com website. The application uses an asynchronous approach based on httpx+BeautifulSoup4 for maximum performance and resource efficiency.
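
Conceptually, the collection loop boils down to fetching a page asynchronously with httpx and parsing it with BeautifulSoup. A minimal sketch, assuming illustrative names (the real selector lives in app/scraper/parsers/search_page.py):

```python
import asyncio

import httpx
from bs4 import BeautifulSoup

SEARCH_URL = "https://auto.ria.com/uk/car/used/"  # matches SCRAPER_START_URL below

async def fetch_listing_links(url: str) -> list[str]:
    """Download one search page and pull out links to individual car pages."""
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.get(url)
        response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Placeholder selector; the project's actual parsing logic may differ.
    return [a["href"] for a in soup.select("a.address") if a.get("href")]

if __name__ == "__main__":
    links = asyncio.run(fetch_listing_links(SEARCH_URL))
    print(f"Found {len(links)} listings")
```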

Collected data:

  • 💰 Price information in USD
  • 🔍 Car characteristics (mileage, VIN code, license plate)
  • 👤 Seller contact information (name, phone)
  • 🖼️ Photo and media information
  • 📊 Date and time when the listing was discovered by the scraper

Advantages:

  • ⚡ High performance — asynchronous HTTP requests with httpx
  • 🔄 Resilience — automatic retry attempts on errors (see the sketch after this list)
  • 📈 Scalability — configurable number of concurrent requests
  • 🧠 Intelligent data collection — two-stage collection process (main data + phone)
  • 📝 Detailed logging — tracking all stages of data collection
  • 🗃️ Automatic backups — regular database backups
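
As an illustration of the resilience point, a retry helper might look like the sketch below. This is not the project's actual code (which lives in app/scraper/base.py); the attempt count and backoff values are assumptions:

```python
import asyncio

import httpx

async def get_with_retries(
    client: httpx.AsyncClient,
    url: str,
    attempts: int = 3,   # assumed retry budget
    backoff: float = 2.0,  # assumed base delay in seconds
) -> httpx.Response:
    """Retry transient HTTP failures with a simple linear backoff."""
    for attempt in range(1, attempts + 1):
        try:
            response = await client.get(url)
            response.raise_for_status()
            return response
        except (httpx.TransportError, httpx.HTTPStatusError):
            if attempt == attempts:
                raise  # out of retries, let the caller handle it
            await asyncio.sleep(backoff * attempt)
```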

🔧 Technologies

  • Python 3.10 — modern version of the language
  • PostgreSQL — reliable relational database for data storage
  • SQLAlchemy — powerful ORM for database operations (see the model sketch after this list)
  • httpx — next-generation asynchronous HTTP client
  • BeautifulSoup4 — efficient HTML page parser
  • asyncio — library for asynchronous programming
  • Celery — distributed task queue for process automation
  • Docker & Docker Compose — containerization for easy deployment
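
For a sense of how these pieces fit together, here is a minimal sketch of a SQLAlchemy model for a scraped listing. The actual model lives in app/core/models.py; every column name below is an assumption based on the "Collected data" list above:

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Car(Base):
    """One scraped listing; columns mirror the 'Collected data' list (assumed names)."""
    __tablename__ = "cars"

    id = Column(Integer, primary_key=True)
    url = Column(String, unique=True, nullable=False)
    price_usd = Column(Integer)
    odometer = Column(Integer)      # mileage in km
    vin = Column(String)
    car_number = Column(String)     # license plate
    username = Column(String)       # seller name
    phone_number = Column(String)
    image_url = Column(String)
    images_count = Column(Integer)
    datetime_found = Column(DateTime, default=datetime.utcnow)
```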

📂 Project Structure

├── .dockerignore
├── .env                # Environment variables (create manually)
├── .env.example        # Example .env file
├── .gitignore
├── Dockerfile          # Application Docker image
├── README.md           # Documentation
├── README_RU.md        # Russian documentation
├── docker-compose.yml  # Docker Compose configuration
├── requirements.txt    # Python dependencies
├── tests/              # Tests
├── logs/               # Application logs
│   └── scraper.log
├── dumps/              # Database dumps
│   └── autoria_dump_YYYY-MM-DD_HH-MM-SS.sql
└── app/                # Main application code
    ├── __init__.py
    ├── main.py         # Entry point
    ├── core/           # Database and models
    │   ├── __init__.py
    │   ├── database.py
    │   └── models.py
    ├── config/         # Configuration
    │   ├── __init__.py
    │   ├── celery_config.py
    │   └── settings.py
    ├── utils/          # Utilities
    │   ├── __init__.py
    │   ├── db_dumper.py
    │   ├── db_utils.py
    │   └── logger.py
    ├── scraper/        # Parsing logic
    │   ├── __init__.py
    │   ├── autoria.py  # Main scraper
    │   ├── base.py     # Base scraper class
    │   └── parsers/
    │       ├── car_page.py    # Car page parser
    │       └── search_page.py # Search page parser
    └── tasks/          # Celery tasks
        ├── __init__.py
        ├── backup.py   # Backup tasks
        └── scraping.py # Data collection tasks

🚀 Installation and Launch

Via Docker (recommended)

  1. Clone the repository:
     git clone https://github.com/ursaloper/auto.ria-scraper
     cd auto.ria-scraper
  2. Create .env file based on .env.example:
     cp .env.example .env
  3. Configure environment variables in .env:
     nano .env
  4. Launch the application:
     docker-compose up -d
  5. View logs:
     docker-compose logs -f

Local Installation

  1. Create a virtual environment:
     python -m venv venv
     source venv/bin/activate  # Linux/macOS
     # or
     venv\Scripts\activate     # Windows
  2. Install dependencies:
     pip install -r requirements.txt
  3. Configure the .env file:
     cp .env.example .env
     nano .env
  4. Launch the application:
     python -m app.main

🤖 Celery Management

For manual task execution and queue monitoring, use the following commands:

Scraping and Backup Tasks

  • Create database dump manually:
    docker-compose exec celery_worker celery -A app call app.tasks.backup.manual_backup
  • Run scraping manually:
    docker-compose exec celery_worker celery -A app call app.tasks.scraping.manual_scrape
  • Run scraping from a specific URL (the task's rough shape is sketched after this list):
    docker-compose exec celery_worker celery -A app call app.tasks.scraping.manual_scrape --args='["https://auto.ria.com/uk/car/mercedes-benz/"]'
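
For orientation, the manual scraping task invoked above could be shaped roughly as follows. This is a sketch only: the decorator usage matches Celery's public API, but the class, constructor, and settings names are assumptions, not the project's actual code:

```python
import asyncio

from celery import shared_task

# Assumed names; the real definitions live in app/tasks/scraping.py.
from app.config.settings import SCRAPER_START_URL
from app.scraper.autoria import AutoRiaScraper

@shared_task(name="app.tasks.scraping.manual_scrape")
def manual_scrape(start_url: str | None = None) -> None:
    """Run a full scrape, optionally starting from a caller-supplied search URL."""
    scraper = AutoRiaScraper(start_url or SCRAPER_START_URL)  # assumed constructor
    asyncio.run(scraper.run())                                # assumed coroutine
```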

Celery Monitoring

  • Show registered tasks:
    docker-compose exec celery_worker celery -A app inspect registered
  • Show queued tasks:
    docker-compose exec celery_worker celery -A app inspect reserved
  • Show active tasks:
    docker-compose exec celery_worker celery -A app inspect active
  • Show revoked tasks:
    docker-compose exec celery_worker celery -A app inspect revoked

⚙️ Configuration

Main settings are located in the .env file:

| Parameter | Description | Example |
|-----------|-------------|---------|
| DATABASE_URL | PostgreSQL connection URL | postgresql://user:password@postgres:5432/autoria |
| SCRAPER_START_TIME | Daily data collection start time | 12:00 |
| DUMP_TIME | Daily database dump creation time | 00:00 |
| SCRAPER_START_URL | Starting page for data collection | https://auto.ria.com/uk/car/used/ |
| MAX_PAGES_TO_PARSE | Maximum number of search pages to parse | 10 |
| MAX_CARS_TO_PROCESS | Maximum number of cars to process | 100 |
| SCRAPER_CONCURRENCY | Maximum number of concurrent requests | 5 |
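
SCRAPER_START_TIME and DUMP_TIME drive Celery beat's daily schedule. A sketch of how app/config/celery_config.py might translate them into crontab entries; the scheduled task names below are assumptions (only the manual_* tasks are documented above):

```python
import os

from celery.schedules import crontab

def _as_crontab(hh_mm: str) -> crontab:
    """Turn an 'HH:MM' string from .env into a daily crontab entry."""
    hour, minute = hh_mm.split(":")
    return crontab(hour=int(hour), minute=int(minute))

beat_schedule = {
    "daily-scrape": {
        "task": "app.tasks.scraping.scheduled_scrape",  # assumed task name
        "schedule": _as_crontab(os.getenv("SCRAPER_START_TIME", "12:00")),
    },
    "daily-db-dump": {
        "task": "app.tasks.backup.scheduled_backup",    # assumed task name
        "schedule": _as_crontab(os.getenv("DUMP_TIME", "00:00")),
    },
}
```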

🚄 Performance

Scraping speed depends on the SCRAPER_CONCURRENCY parameter, which sets the number of concurrent requests. In practice, rate limits and server-side delays on auto.ria.com mean that actual throughput can fall short of the theoretical maximum.
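
Capping concurrency is the kind of thing asyncio's Semaphore expresses directly. A sketch of how SCRAPER_CONCURRENCY can bound in-flight requests (illustrative, not the project's exact code):

```python
import asyncio

import httpx

async def fetch_all(urls: list[str], concurrency: int = 5) -> list[str]:
    """Fetch pages concurrently, never exceeding `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async with httpx.AsyncClient(timeout=30) as client:

        async def fetch(url: str) -> str:
            async with semaphore:  # at most `concurrency` requests at once
                response = await client.get(url)
                response.raise_for_status()
                return response.text

        return await asyncio.gather(*(fetch(u) for u in urls))
```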

Test Results:

  • Processed: 500 cars
  • Added to DB: 495-496 new records
  • Execution time: ~6-7 minutes (360-380 seconds)
  • Efficiency: 99% (percentage of successfully processed listings)

Important:

  • Raising SCRAPER_CONCURRENCY above 5-7 yields almost no additional speed-up, because of rate limits and server-side delays on auto.ria.com.
  • Values that are too high may lead to temporary blocking of your IP address.
  • Values of 5-7 are recommended for stable and safe operation.

💾 Database Dumps

  • Dumps are created automatically every day at the time set by DUMP_TIME
  • Dumps are stored in the dumps/ directory
  • Filename format: autoria_dump_YYYY-MM-DD_HH-MM-SS.sql
  • Old dumps are deleted automatically, with a 30-day retention by default (see the sketch below)
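
A dump routine consistent with these points might shell out to pg_dump and prune old files, roughly as follows. This is a sketch; the actual implementation lives in app/utils/db_dumper.py and may differ:

```python
import os
import subprocess
import time
from datetime import datetime
from pathlib import Path

DUMP_DIR = Path("dumps")
RETENTION_DAYS = 30  # default retention mentioned above

def create_dump() -> Path:
    """Write a timestamped SQL dump matching the documented filename format."""
    DUMP_DIR.mkdir(exist_ok=True)
    stamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    dump_path = DUMP_DIR / f"autoria_dump_{stamp}.sql"
    # pg_dump reads the connection string from DATABASE_URL (set in .env)
    subprocess.run(
        ["pg_dump", os.environ["DATABASE_URL"], "-f", str(dump_path)],
        check=True,
    )
    return dump_path

def delete_old_dumps() -> None:
    """Remove dumps older than the retention window."""
    cutoff = time.time() - RETENTION_DAYS * 86400
    for dump in DUMP_DIR.glob("autoria_dump_*.sql"):
        if dump.stat().st_mtime < cutoff:
            dump.unlink()
```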

📊 Logging

The logging system provides detailed information about application operation:

  • All logs are written to the logs/scraper.log file
  • Log rotation is configured (maximum file size: 10 MB); see the sketch after this list
  • Separate logging for each module
  • Logging levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
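
A logger factory consistent with the points above (10 MB rotation, one logger per module) could look like this. A sketch of app/utils/logger.py, with the backup count and format string assumed:

```python
import logging
from logging.handlers import RotatingFileHandler

def get_logger(name: str) -> logging.Logger:
    """Per-module logger writing to logs/scraper.log with 10 MB rotation."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            "logs/scraper.log",
            maxBytes=10 * 1024 * 1024,  # 10 MB, as documented above
            backupCount=5,              # assumed backup count
        )
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```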

🛠️ Development

Code Style

The project uses Black for code formatting:

# Format code
black app/

# Check formatting
black --check app/

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add some amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

📞 Support

If you have questions or need help, please open an issue in the repository.

⭐ Star History

If this project helped you, please give it a star! ⭐
