A scalable FastAPI-based web crawler service for extracting metadata, content, and topics from web pages. It is designed to handle large-scale crawling operations, with support for batch processing and efficient data storage.
- Single URL Crawling: Extract metadata from individual URLs
- Batch Processing: Process multiple URLs from text files
- Metadata Extraction: Title, description, keywords, Open Graph data
- Topic Classification: Automatic topic extraction from content
- Crawl History: Track and retrieve crawl history with pagination
- Search Capabilities: Search crawled pages by topic
- MongoDB Storage: Persistent storage with efficient indexing
- Docker Support: Easy deployment with Docker Compose
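
As an illustration of the metadata extraction feature listed above (title, description, keywords, Open Graph data), here is a minimal sketch. It uses Python's standard-library `html.parser` as a stand-in so that it runs without extra dependencies; the service itself lists BeautifulSoup4 for HTML parsing. The names `MetadataParser` and `extract_metadata` and the shape of the returned dictionary are hypothetical, not taken from the project.

```python
from html.parser import HTMLParser


class MetadataParser(HTMLParser):
    """Collects the <title> text, selected <meta name=...> tags, and og:* properties."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}        # e.g. {"description": "...", "keywords": "..."}
        self.open_graph = {}  # e.g. {"og:title": "..."}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            content = attributes.get("content") or ""
            name = (attributes.get("name") or "").lower()
            prop = (attributes.get("property") or "").lower()
            if prop.startswith("og:"):
                self.open_graph[prop] = content
            elif name in ("description", "keywords"):
                self.meta[name] = content

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Only text inside <title> is accumulated.
        if self._in_title:
            self.title += data


def extract_metadata(html):
    """Return a dict of page metadata (hypothetical shape, for illustration only)."""
    parser = MetadataParser()
    parser.feed(html)
    parser.close()
    raw_keywords = parser.meta.get("keywords", "")
    keywords = [k.strip() for k in raw_keywords.split(",") if k.strip()]
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": keywords,
        "open_graph": parser.open_graph,
    }
```

Calling `extract_metadata(page_html)` on a fetched page returns one dictionary per page, which is the kind of document that could then be stored in MongoDB.
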
- FastAPI: Modern, fast web framework for building APIs
- MongoDB: Document database for storing crawl metadata
- Motor: Async MongoDB driver
- httpx: Async HTTP client for web requests
- BeautifulSoup4: HTML parsing and content extraction
- Pydantic: Data validation and serialization
- Python 3.8+
- MongoDB (or use Docker Compose)
- pip
- Build and run with Docker Compose:
docker-compose up -d
This will start:
- MongoDB on port 27017
- Web Crawler API on port 8000
- View API documentation:
Visit http://localhost:8000/docs for interactive API documentation
GET /api/v1/crawl?url=<URL>&force_refresh=false
POST /api/v1/crawl/batch
Content-Type: multipart/form-data
Body: file (text file with one URL per line)
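
Before uploading a file to the batch endpoint, it can help to clean the URL list locally. The sketch below is one possible way to do that under stated assumptions: the helper name `parse_url_lines` is hypothetical, and the rules it applies (skip blank lines, keep only `http`/`https` URLs, drop duplicates) are illustrative choices, not the service's documented validation.

```python
from urllib.parse import urlsplit


def parse_url_lines(text):
    """Return unique http(s) URLs from text with one URL per line, in input order."""
    seen = set()
    urls = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        try:
            parts = urlsplit(candidate)
        except ValueError:
            continue  # malformed URL, e.g. an unbalanced IPv6 bracket
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # assumption: only absolute http(s) URLs are worth submitting
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls
```

The cleaned list can be written back with `"\n".join(urls)` to produce a file with one URL per line.
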
GET /api/v1/history?limit=10&skip=0
GET /api/v1/crawl/{crawl_id}
GET /api/v1/search/topic?topic=<topic>&limit=10
GET /api/v1/stats
GET /health
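
The query-string endpoints above can be called from any HTTP client. As a minimal sketch, the helpers below build correctly encoded request URLs for the crawl, history, and topic search endpoints. The function names are hypothetical, the base URL assumes the Docker Compose setup on port 8000, and treating `skip` as a record offset (page number times page size) is an assumption about the pagination semantics.

```python
from urllib.parse import urlencode

# Assumption: the API is reachable locally on port 8000, as in the Docker Compose setup.
BASE_URL = "http://localhost:8000"


def crawl_url(url, force_refresh=False, base_url=BASE_URL):
    """Build the GET URL for crawling a single page; the target URL is percent-encoded."""
    query = urlencode({"url": url, "force_refresh": str(force_refresh).lower()})
    return f"{base_url}/api/v1/crawl?{query}"


def history_url(page, page_size=10, base_url=BASE_URL):
    """Build the GET URL for one page of crawl history (page numbers start at 0)."""
    query = urlencode({"limit": page_size, "skip": page * page_size})
    return f"{base_url}/api/v1/history?{query}"


def topic_search_url(topic, limit=10, base_url=BASE_URL):
    """Build the GET URL for searching crawled pages by topic."""
    query = urlencode({"topic": topic, "limit": limit})
    return f"{base_url}/api/v1/search/topic?{query}"
```

Encoding the target URL matters because it may itself contain `?` and `&`, which would otherwise be read as part of the API request's own query string.
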
web_crawler_api/
├── controllers/ # API route handlers
├── services/ # Business logic
├── models/ # Pydantic schemas
├── database/ # Database connection and setup
├── docs/ # Design and implementation documentation
├── main.py # FastAPI application entry point
├── requirements.txt # Python dependencies
├── docker-compose.yml # Docker Compose configuration
└── Dockerfile # Docker image definition
- Design Documentation: See docs/DESIGN.md for architecture and scalability design
- Implementation Plan: See docs/ENGINEERING_IMPLEMENTATION_PLAN.md for detailed implementation guide
- API Documentation: Interactive docs available at /docs when the server is running
[MIT]