A powerful tool for downloading historical versions of websites from the Internet Archive's Wayback Machine. This scraper processes a CSV file containing URLs and deal dates, then downloads website snapshots from two specific time periods for each URL.
- Batch Processing: Process multiple URLs from a CSV file
- Time-based Downloads: Automatically calculates two download dates for each URL:
  - 6 months before the deal date
  - 12 months after the deal date
- Resume Capability: Can resume interrupted downloads using state tracking
- Comprehensive Logging: Detailed logs for debugging and monitoring
- Docker Support: Easy deployment using Docker containers
- Progress Tracking: State file to track completed downloads
- Docker and Docker Compose
- A CSV file with your URLs and deal dates
1. Prepare your data file: Create a `data.csv` file with the following format:

   ```csv
   URL,Deal Date
   https://example.com,2016-09-30
   https://another-site.com,2017-03-15
   ```

2. Run the scraper:

   ```bash
   ./run.sh
   ```

3. Check results: Your downloaded websites will be available in the `downloads/` directory.
```bash
# Deploy with default version (1.0.0)
chmod +x deploy.sh
./deploy.sh

# Deploy with specific version
./deploy.sh 2.1.0
```

The deploy script:

- Builds and pushes the Docker image to Docker Hub (latest + version tag)
- Creates a `wayback_scraper` artifact with:
  - `docker-compose.yml` (configured image with version)
  - `run.sh` (execution script)
  - `README.md` (quick start)
- Publishes a GitHub Release with the downloadable artifact
- Docker and Docker Hub account
- GitHub CLI (optional, for automated releases)
- Update `DOCKERHUB_USERNAME` in `deploy.sh`
```bash
# Login to Docker Hub
docker login

# Make script executable
chmod +x deploy.sh
```

Your CSV file must contain these columns:

- `URL`: The website URL to scrape
- `Deal Date`: The reference date in YYYY-MM-DD format
Example:

```csv
URL,Deal Date
https://example.com,2016-09-30
https://company.com,2017-03-15
```
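If you want to sanity-check this file before a long run, here is a minimal sketch using only the Python standard library. The column names match the constants documented in the configuration section below; the validation step is illustrative, not part of the scraper itself:

```python
# Sketch: read and validate the input CSV. Column names match the
# WEBSITE_URL_COLUMN / DEAL_DATE_COLUMN constants documented below;
# the date validation is illustrative, not the scraper's exact code.
import csv
from datetime import datetime

def read_deals(path: str = "data.csv") -> list[tuple[str, str]]:
    """Load (url, deal_date) pairs, failing fast on malformed dates."""
    rows = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            url, deal_date = row["URL"], row["Deal Date"]
            datetime.strptime(deal_date, "%Y-%m-%d")  # raises ValueError if malformed
            rows.append((url, deal_date))
    return rows
```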
For each URL in your CSV file, the scraper will:

1. Calculate download dates:
   - First date: 6 months before the deal date
   - Second date: 12 months after the deal date

2. Download website snapshots:
   - Uses the `wayback-machine-downloader` tool
   - Downloads HTML files and main pages
   - Skips media files (images, CSS, JS) for faster downloads

3. Organize results:
   - Creates separate folders for each URL and date
   - Naming format: `{domain}_up_to_{YYYYMMDD}`
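A rough sketch of these three steps is below. It assumes `python-dateutil` is installed; `--to` and `--directory` are documented `wayback_machine_downloader` options, but this is an illustration, not the scraper's exact internal code:

```python
# Sketch: compute the two cutoff dates, build the folder name, and
# invoke wayback_machine_downloader. Illustrative only; see
# wayback_scraper.py for the real implementation.
import subprocess
from datetime import datetime
from urllib.parse import urlparse

from dateutil.relativedelta import relativedelta

MONTHS_BEFORE_DEAL = 6
MONTHS_AFTER_DEAL = 12

def snapshot_cutoffs(deal_date: str) -> list[str]:
    """Return the two YYYYMMDD cutoff timestamps for a deal date."""
    deal = datetime.strptime(deal_date, "%Y-%m-%d")
    before = deal - relativedelta(months=MONTHS_BEFORE_DEAL)
    after = deal + relativedelta(months=MONTHS_AFTER_DEAL)
    return [d.strftime("%Y%m%d") for d in (before, after)]

def folder_name(url: str, cutoff: str) -> str:
    """Build the {domain}_up_to_{YYYYMMDD} folder name."""
    domain = urlparse(url).netloc.replace(".", "_")
    return f"{domain}_up_to_{cutoff}"

def download(url: str, cutoff: str, out_dir: str) -> None:
    """Fetch snapshots captured up to the cutoff into out_dir."""
    # The 900 s timeout mirrors the 15-minute per-download limit
    # noted in the troubleshooting section.
    subprocess.run(
        ["wayback_machine_downloader", url,
         "--to", cutoff, "--directory", out_dir],
        check=True, timeout=900,
    )

print(snapshot_cutoffs("2016-09-30"))  # ['20160330', '20170930']
```

For a 2016-09-30 deal date, the last line prints `['20160330', '20170930']`, matching the folder names in the example output below.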
After running the scraper, your `downloads/` directory will look like:

```
downloads/
├── example_com_up_to_20160330/
│   ├── download.log
│   └── [downloaded files]
├── example_com_up_to_20170930/
│   ├── download.log
│   └── [downloaded files]
└── logs/
    └── wayback_scraper.log
```
You can modify the time periods in `wayback_scraper.py`:

```python
MONTHS_BEFORE_DEAL = 6   # Months before deal date
MONTHS_AFTER_DEAL = 12   # Months after deal date
```

If your CSV uses different column names, update these constants:

```python
WEBSITE_URL_COLUMN = 'URL'
DEAL_DATE_COLUMN = 'Deal Date'
```

Build the image:
```bash
docker compose build
```

Run the scraper:

```bash
docker compose up
```

If you prefer to run without Docker:
1. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

2. Install `wayback-machine-downloader` (Ruby gem)

3. Run the script:

   ```bash
   python3 wayback_scraper.py data.csv --output downloads
   ```
```bash
python3 wayback_scraper.py <csv_file> [options]

Options:
  --output DIR   Output directory for downloads (default: downloads)
  --resume       Resume from previous state
  --help         Show help message
```

The scraper maintains a state file (`wayback_scraper_state.json`) that tracks:
- Completed downloads
- Download timestamps
- Folder locations
This allows you to:
- Resume interrupted downloads
- Skip already completed downloads
- Track progress across multiple runs
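A minimal sketch of how resume logic can work against such a state file follows. The JSON layout shown here is an assumption for illustration; inspect your own `wayback_scraper_state.json` for the actual schema:

```python
# Sketch: resume logic against a state file. The JSON layout is
# assumed for illustration and may differ from wayback_scraper.py's
# real schema.
import json
from pathlib import Path

STATE_FILE = Path("wayback_scraper_state.json")

def load_state() -> dict:
    """Return saved state, or an empty structure on first run."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"completed": {}}

def is_done(state: dict, folder: str) -> bool:
    """True if this folder finished downloading in a previous run."""
    return folder in state["completed"]

def mark_done(state: dict, folder: str, finished_at: str) -> None:
    """Record a finished download so later runs can skip it."""
    state["completed"][folder] = {"finished_at": finished_at, "folder": folder}
    STATE_FILE.write_text(json.dumps(state, indent=2))
```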
The scraper provides comprehensive logging:
- Main log: `downloads/logs/wayback_scraper.log`
- Download logs: individual `download.log` files in each download folder
- Console output: Real-time progress updates
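A standard-library sketch of this file-plus-console pattern (paths and format are illustrative, not the scraper's exact configuration):

```python
# Sketch: log to both a file and the console, as described above.
# Paths and format are illustrative assumptions.
import logging
from pathlib import Path

def setup_logging(output_dir: str = "downloads") -> logging.Logger:
    log_dir = Path(output_dir) / "logs"
    log_dir.mkdir(parents=True, exist_ok=True)

    logger = logging.getLogger("wayback_scraper")
    logger.setLevel(logging.INFO)

    fmt = logging.Formatter("%(asctime)s %(levelname)s %(message)s")
    for handler in (
        logging.FileHandler(log_dir / "wayback_scraper.log"),  # main log
        logging.StreamHandler(),                               # console output
    ):
        handler.setFormatter(fmt)
        logger.addHandler(handler)
    return logger
```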
Common issues:

1. Docker not running:

   ```bash
   # Start Docker Desktop or the Docker daemon
   sudo systemctl start docker  # Linux
   ```

2. CSV file not found:
   - Ensure `data.csv` exists in the project root
   - Check file permissions

3. Download timeouts:
   - The scraper has a 15-minute timeout per download
   - Large websites may take longer
   - Check logs for specific errors

4. Resume downloads:

   ```bash
   # The scraper automatically resumes from the state file.
   # To force a fresh start, delete wayback_scraper_state.json:
   rm wayback_scraper_state.json
   ```
For detailed debugging, check the logs:

```bash
# View main log
tail -f downloads/logs/wayback_scraper.log

# View a specific download log
tail -f downloads/example_com_up_to_20160330/download.log
```

- Large datasets: Process in smaller batches
- Network issues: The scraper will retry failed downloads (see the retry sketch after this list)
- Storage: Ensure sufficient disk space for downloads
- Memory: Monitor Docker container memory usage
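One way to implement the retry behaviour mentioned above is exponential backoff around each download attempt. This sketch is illustrative and not necessarily the scraper's exact logic:

```python
# Sketch: retry an action with exponential backoff. Retry counts and
# delays are illustrative; wayback_scraper.py may use different values.
import subprocess
import time
from typing import Callable

def with_retries(action: Callable[[], None], attempts: int = 3) -> bool:
    """Run action, retrying downloader failures with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            action()
            return True
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired):
            if attempt == attempts:
                return False
            time.sleep(30 * 2 ** (attempt - 1))  # 30 s, then 60 s, ...
    return False
```

It can wrap the hypothetical `download()` helper sketched earlier, e.g. `with_retries(lambda: download(url, cutoff, out_dir))`.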
- Downloads are limited to HTML files and main pages
- Media files (images, CSS, JS) are excluded for performance
- Wayback Machine availability may vary by URL and date
- Rate limiting may apply for large-scale scraping
- Fork the repository
- Create a feature branch
- Make your changes
- Test thoroughly
- Submit a pull request
This project is open source. Please check the license file for details.
For issues and questions:
- Check the troubleshooting section
- Review the logs for error messages
- Open an issue on the repository
Note: This tool is designed for research and archival purposes. Please respect the Internet Archive's terms of service and robots.txt files when scraping websites.