High-performance asynchronous application for collecting used car data from the auto.ria.com platform.
AutoRia Scraper is an efficient tool for collecting car data from the auto.ria.com website. It is built on an asynchronous httpx + BeautifulSoup4 stack, giving high throughput with modest resource usage.
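As a minimal illustration of this pattern (not the project's actual code; the real parsers live in `app/scraper/`), fetching and parsing a page asynchronously looks roughly like this:

```python
import asyncio

import httpx
from bs4 import BeautifulSoup


async def fetch_title(url: str) -> str | None:
    # A shared client reuses connections across requests.
    async with httpx.AsyncClient(timeout=30.0, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()
    # BeautifulSoup parses the HTML; parsing itself is synchronous.
    soup = BeautifulSoup(response.text, "html.parser")
    title = soup.find("title")
    return title.get_text(strip=True) if title else None


if __name__ == "__main__":
    print(asyncio.run(fetch_title("https://auto.ria.com/uk/car/used/")))
```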
- Price information in USD
- Car characteristics (mileage, VIN code, license plate)
- Seller contact information (name, phone)
- Photo and media information
- Date and time when the listing was discovered by the scraper
- High performance – asynchronous HTTP requests with httpx
- Resilience – automatic retries on errors
- Scalability – configurable number of concurrent requests
- Intelligent data collection – two-stage collection process (main data + phone; see the sketch after this list)
- Detailed logging – tracking of all stages of data collection
- Automatic backups – regular database dumps
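The two-stage flow is roughly the following (a hedged sketch: the selector and the phone endpoint here are placeholders, not the site's actual ones):

```python
import httpx
from bs4 import BeautifulSoup


async def scrape_car(client: httpx.AsyncClient, url: str) -> dict:
    # Stage 1: fetch the listing page and pull the main fields.
    page = await client.get(url)
    page.raise_for_status()
    soup = BeautifulSoup(page.text, "html.parser")
    title_tag = soup.select_one("h1")  # placeholder selector
    data = {
        "url": url,
        "title": title_tag.get_text(strip=True) if title_tag else None,
    }

    # Stage 2: on the real site the phone number is loaded by a separate
    # request; this endpoint is purely illustrative.
    phone_resp = await client.get(url + "/phone")  # hypothetical endpoint
    if phone_resp.status_code == 200:
        data["phone"] = phone_resp.text.strip()
    return data
```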
- Python 3.10 – modern programming language version
- PostgreSQL – reliable relational database for data storage
- SQLAlchemy – powerful ORM for database operations
- httpx – next-generation asynchronous HTTP client
- BeautifulSoup4 – efficient HTML page parser
- asyncio – library for asynchronous programming
- Celery – distributed task queue for process automation
- Docker & Docker Compose – containerization for easy deployment
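For orientation, a SQLAlchemy model covering the fields listed above might look like the following. This is an illustrative sketch, not necessarily the exact schema in `app/core/models.py`; the field names are assumptions.

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Car(Base):
    """One scraped listing; column names are illustrative."""

    __tablename__ = "cars"

    id = Column(Integer, primary_key=True)
    url = Column(String, unique=True, nullable=False)
    title = Column(String)
    price_usd = Column(Integer)
    odometer = Column(Integer)          # mileage
    vin = Column(String)
    license_plate = Column(String)
    seller_name = Column(String)
    seller_phone = Column(String)
    image_url = Column(String)
    found_at = Column(DateTime, default=datetime.utcnow)  # when the scraper saw it
```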
```
├── .dockerignore
├── .env                   # Environment variables (create manually)
├── .env.example           # Example .env file
├── .gitignore
├── Dockerfile             # Application Docker image
├── README.md              # Documentation
├── README_RU.md           # Russian documentation
├── docker-compose.yml     # Docker Compose configuration
├── requirements.txt       # Python dependencies
├── tests/                 # Tests
├── logs/                  # Application logs
│   └── scraper.log
├── dumps/                 # Database dumps
│   └── autoria_dump_YYYY-MM-DD_HH-MM-SS.sql
└── app/                   # Main application code
    ├── __init__.py
    ├── main.py            # Entry point
    ├── core/              # Database and models
    │   ├── __init__.py
    │   ├── database.py
    │   └── models.py
    ├── config/            # Configuration
    │   ├── __init__.py
    │   ├── celery_config.py
    │   └── settings.py
    ├── utils/             # Utilities
    │   ├── __init__.py
    │   ├── db_dumper.py
    │   ├── db_utils.py
    │   └── logger.py
    ├── scraper/           # Parsing logic
    │   ├── __init__.py
    │   ├── autoria.py     # Main scraper
    │   ├── base.py        # Base scraper class
    │   └── parsers/
    │       ├── car_page.py    # Car page parser
    │       └── search_page.py # Search page parser
    └── tasks/             # Celery tasks
        ├── __init__.py
        ├── backup.py      # Backup tasks
        └── scraping.py    # Data collection tasks
```
Running with Docker:

- Clone the repository:

```bash
git clone https://github.com/ursaloper/auto.ria-scraper
cd auto.ria-scraper
```

- Create a .env file based on .env.example:

```bash
cp .env.example .env
```

- Configure environment variables in .env:

```bash
nano .env
```

- Launch the application:

```bash
docker-compose up -d
```

- View logs:

```bash
docker-compose logs -f
```

Running locally:

- Create a virtual environment:

```bash
python -m venv venv
source venv/bin/activate   # Linux/MacOS
# or
venv\Scripts\activate      # Windows
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Configure the .env file:

```bash
cp .env.example .env
nano .env
```

- Launch the application:

```bash
python -m app.main
```

For manual task execution and queue monitoring use:
- Create a database dump manually:

```bash
docker-compose exec celery_worker celery -A app call app.tasks.backup.manual_backup
```

- Run scraping manually:

```bash
docker-compose exec celery_worker celery -A app call app.tasks.scraping.manual_scrape
```

- Run scraping from a specific URL:

```bash
docker-compose exec celery_worker celery -A app call app.tasks.scraping.manual_scrape --args='["https://auto.ria.com/uk/car/mercedes-benz/"]'
```

- Show registered tasks:

```bash
docker-compose exec celery_worker celery -A app inspect registered
```

- Show queued (reserved) tasks:

```bash
docker-compose exec celery_worker celery -A app inspect reserved
```

- Show active tasks:

```bash
docker-compose exec celery_worker celery -A app inspect active
```

- Show revoked tasks:

```bash
docker-compose exec celery_worker celery -A app inspect revoked
```
Main settings are located in the .env file:
| Parameter | Description | Example |
|---|---|---|
| `DATABASE_URL` | PostgreSQL connection URL | `postgresql://user:password@postgres:5432/autoria` |
| `SCRAPER_START_TIME` | Data collection start time | `12:00` (daily at 12:00) |
| `DUMP_TIME` | Database dump creation time | `00:00` (daily at 00:00) |
| `SCRAPER_START_URL` | Starting page for data collection | `https://auto.ria.com/uk/car/used/` |
| `MAX_PAGES_TO_PARSE` | Maximum number of pages to parse | `10` |
| `MAX_CARS_TO_PROCESS` | Maximum number of cars to process | `100` |
| `SCRAPER_CONCURRENCY` | Maximum number of concurrent requests | `5` |
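Times such as SCRAPER_START_TIME and DUMP_TIME typically end up in a Celery beat schedule, presumably wired up in `app/config/celery_config.py`. A sketch of what that might look like; the task names and the parsing helper are assumptions, not the project's actual identifiers:

```python
from celery.schedules import crontab


def parse_hhmm(value: str) -> crontab:
    # "12:00" -> crontab(hour=12, minute=0); illustrative helper.
    hour, minute = value.split(":")
    return crontab(hour=int(hour), minute=int(minute))


# Assigned to app.conf.beat_schedule in the Celery app configuration.
beat_schedule = {
    "daily-scrape": {
        "task": "app.tasks.scraping.scheduled_scrape",  # hypothetical task name
        "schedule": parse_hhmm("12:00"),                # SCRAPER_START_TIME
    },
    "daily-dump": {
        "task": "app.tasks.backup.create_dump",         # hypothetical task name
        "schedule": parse_hhmm("00:00"),                # DUMP_TIME
    },
}
```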
Parser speed depends on the SCRAPER_CONCURRENCY parameter, which sets the number of concurrent requests. In practice, rate limits and server-side delays on auto.ria.com mean actual throughput may fall short of the theoretical maximum.
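Bounding concurrency usually comes down to an `asyncio.Semaphore` around each request. A minimal sketch of that pattern under assumed names, not the project's exact code:

```python
import asyncio

import httpx

SCRAPER_CONCURRENCY = 5  # from .env


async def fetch_all(urls: list[str]) -> list[httpx.Response]:
    semaphore = asyncio.Semaphore(SCRAPER_CONCURRENCY)

    async def fetch(client: httpx.AsyncClient, url: str) -> httpx.Response:
        # At most SCRAPER_CONCURRENCY coroutines pass this point at once.
        async with semaphore:
            return await client.get(url)

    async with httpx.AsyncClient(timeout=30.0) as client:
        return await asyncio.gather(*(fetch(client, u) for u in urls))
```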
Test Results:
- Processed: 500 cars
- Added to DB: 495-496 new records
- Execution time: ~6-7 minutes (360-380 seconds)
- Efficiency: 99% (percentage of successfully processed listings)
Important:
- Increasing SCRAPER_CONCURRENCY above 5-7 yields little additional speed because of rate limits and delays on the auto.ria.com side.
- Values that are too high may lead to temporary IP address blocking.
- Values of 5-7 are recommended for stable and safe operation.
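Staying under the site's limits is also why retries should back off rather than hammer the server. A simple pattern (a sketch, not the project's exact retry code):

```python
import asyncio

import httpx


async def get_with_retries(client: httpx.AsyncClient, url: str,
                           attempts: int = 3) -> httpx.Response:
    for attempt in range(attempts):
        try:
            response = await client.get(url)
            response.raise_for_status()
            return response
        except (httpx.HTTPStatusError, httpx.TransportError):
            if attempt == attempts - 1:
                raise
            # Exponential backoff: wait 1s, 2s, 4s, ... between attempts.
            await asyncio.sleep(2 ** attempt)
```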
- Dumps are created automatically every day at the configured time
- Stored in the `dumps/` directory
- Filename format: `autoria_dump_YYYY-MM-DD_HH-MM-SS.sql`
- Old dumps are deleted automatically (kept for 30 days by default)
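A sketch of how such a dump-and-prune job can be implemented with `pg_dump` (illustrative only; the project's actual logic lives in `app/utils/db_dumper.py`):

```python
import subprocess
import time
from datetime import datetime
from pathlib import Path

DUMPS_DIR = Path("dumps")
RETENTION_DAYS = 30


def create_dump(database_url: str) -> Path:
    DUMPS_DIR.mkdir(exist_ok=True)
    timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
    dump_path = DUMPS_DIR / f"autoria_dump_{timestamp}.sql"
    # pg_dump accepts a connection URL directly as the database argument.
    with dump_path.open("w") as out:
        subprocess.run(["pg_dump", database_url], stdout=out, check=True)
    return dump_path


def prune_old_dumps() -> None:
    # Delete dumps whose modification time is older than the retention window.
    cutoff = time.time() - RETENTION_DAYS * 86400
    for dump in DUMPS_DIR.glob("autoria_dump_*.sql"):
        if dump.stat().st_mtime < cutoff:
            dump.unlink()
```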
The logging system provides detailed information about application operation:
- All logs are written to `logs/scraper.log`
- Log rotation is configured (maximum file size: 10 MB)
- Each module logs separately
- Logging levels: DEBUG, INFO, WARNING, ERROR, CRITICAL
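A minimal sketch of that kind of setup (assumed details such as the backup count; the project's configuration is in `app/utils/logger.py`):

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def get_logger(name: str) -> logging.Logger:
    """Per-module logger writing to logs/scraper.log, rotated at 10 MB."""
    Path("logs").mkdir(exist_ok=True)
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = RotatingFileHandler(
            "logs/scraper.log", maxBytes=10 * 1024 * 1024, backupCount=5
        )
        handler.setFormatter(logging.Formatter(
            "%(asctime)s %(name)s %(levelname)s %(message)s"
        ))
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)
    return logger
```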
The project uses Black for code formatting:

```bash
# Format code
black app/

# Check formatting
black --check app/
```

This project is licensed under the MIT License - see the LICENSE file for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
If you have questions or need help:
- Create an Issue
- Check the Russian documentation
If this project helped you, please give it a star!