Wayback Machine Scraper

A powerful tool for downloading historical versions of websites from the Internet Archive's Wayback Machine. The scraper reads a CSV file of URLs and deal dates, then downloads snapshots of each website from two points in time relative to its deal date.

Features

  • Batch Processing: Process multiple URLs from a CSV file
  • Time-based Downloads: Automatically calculates two download dates for each URL:
    • 6 months before the deal date
    • 12 months after the deal date
  • Resume Capability: Can resume interrupted downloads using state tracking
  • Comprehensive Logging: Detailed logs for debugging and monitoring
  • Docker Support: Easy deployment using Docker containers
  • Progress Tracking: State file to track completed downloads

Prerequisites

  • Docker and Docker Compose
  • A CSV file with your URLs and deal dates

Quick Start

  1. Prepare your data file: Create a data.csv file with the following format:

    URL,Deal Date
    https://example.com,2016-09-30
    https://another-site.com,2017-03-15
  2. Run the scraper:

    ./run.sh
  3. Check results: Your downloaded websites will be available in the downloads/ directory.

Deployment

Quick Deploy

# Deploy with default version (1.0.0)
chmod +x deploy.sh
./deploy.sh

# Deploy with specific version
./deploy.sh 2.1.0

What it does

  1. Builds and pushes Docker image to Docker Hub (latest + version tag)
  2. Creates a wayback_scraper artifact with:
    • docker-compose.yml (configured image with version)
    • run.sh (execution script)
    • README.md (quick start)
  3. Publishes GitHub Release with downloadable artifact

Deployment Prerequisites

  • Docker and Docker Hub account
  • GitHub CLI (optional, for automated releases)
  • Update DOCKERHUB_USERNAME in deploy.sh

Deployment Troubleshooting

# Login to Docker Hub
docker login

# Make script executable
chmod +x deploy.sh

CSV Format

Your CSV file must contain these columns:

  • URL: The website URL to scrape
  • Deal Date: The reference date in YYYY-MM-DD format

Example:

URL,Deal Date
https://example.com,2016-09-30
https://company.com,2017-03-15

How It Works

For each URL in your CSV file, the scraper will:

  1. Calculate download dates:

    • First date: 6 months before the deal date
    • Second date: 12 months after the deal date
  2. Download website snapshots:

    • Uses the wayback-machine-downloader tool
    • Downloads HTML files and main pages
    • Skips media files (images, CSS, JS) for faster downloads
  3. Organize results:

    • Creates separate folders for each URL and date
    • Naming format: {domain}_up_to_{YYYYMMDD}
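
Roughly, the per-URL workflow looks like the sketch below. This is a minimal illustration, not the actual code in wayback_scraper.py: the helper names are made up, and the exact filters and flags passed to wayback_machine_downloader are assumptions.

import re
import subprocess
from datetime import datetime
from dateutil.relativedelta import relativedelta

MONTHS_BEFORE_DEAL = 6
MONTHS_AFTER_DEAL = 12

def snapshot_dates(deal_date_str):
    """Return the two cutoff dates (YYYYMMDD) derived from a deal date."""
    deal_date = datetime.strptime(deal_date_str, "%Y-%m-%d")
    before = deal_date - relativedelta(months=MONTHS_BEFORE_DEAL)
    after = deal_date + relativedelta(months=MONTHS_AFTER_DEAL)
    return [before.strftime("%Y%m%d"), after.strftime("%Y%m%d")]

def folder_name(url, cutoff):
    """Build the {domain}_up_to_{YYYYMMDD} folder name."""
    domain = re.sub(r"^https?://", "", url).strip("/").replace(".", "_")
    return f"{domain}_up_to_{cutoff}"

def download_snapshot(url, cutoff, output_dir):
    """Invoke the wayback_machine_downloader gem for one URL/date pair."""
    subprocess.run(
        ["wayback_machine_downloader", url,
         "--to", cutoff,                               # only snapshots up to this date
         "--exclude", r"/\.(png|jpe?g|gif|css|js)$/i", # skip media files (assumed filter)
         "--directory", f"{output_dir}/{folder_name(url, cutoff)}"],
        check=True,
        timeout=15 * 60,                               # 15-minute timeout per download
    )

for cutoff in snapshot_dates("2016-09-30"):
    download_snapshot("https://example.com", cutoff, "downloads")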

Directory Structure

After running the scraper, your downloads/ directory will look like:

downloads/
├── example_com_up_to_20160330/
│   ├── download.log
│   └── [downloaded files]
├── example_com_up_to_20170930/
│   ├── download.log
│   └── [downloaded files]
└── logs/
    └── wayback_scraper.log

Configuration

Time Periods

You can modify the time periods in wayback_scraper.py:

MONTHS_BEFORE_DEAL = 6  # Months before deal date
MONTHS_AFTER_DEAL = 12  # Months after deal date

CSV Column Names

If your CSV uses different column names, update these constants:

WEBSITE_URL_COLUMN = 'URL'
DEAL_DATE_COLUMN = 'Deal Date'
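
As a reference for how those constants map onto the CSV headers, here is a small illustrative snippet using Python's built-in csv module (the helper name is made up; the actual parsing in wayback_scraper.py may differ):

import csv

WEBSITE_URL_COLUMN = 'URL'
DEAL_DATE_COLUMN = 'Deal Date'

def read_deals(csv_path):
    """Yield (url, deal_date) pairs from the input CSV."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            yield row[WEBSITE_URL_COLUMN], row[DEAL_DATE_COLUMN]

for url, deal_date in read_deals('data.csv'):
    print(url, deal_date)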

Advanced Usage

Manual Docker Commands

Build the image:

docker compose build

Run the scraper:

docker compose up

Direct Python Usage

If you prefer to run without Docker:

  1. Install dependencies:

    pip install -r requirements.txt
  2. Install the wayback-machine-downloader Ruby gem:

    gem install wayback_machine_downloader
  3. Run the script:

    python3 wayback_scraper.py data.csv --output downloads

Command Line Options

python3 wayback_scraper.py <csv_file> [options]

Options:
  --output DIR     Output directory for downloads (default: downloads)
  --resume         Resume from previous state
  --help           Show help message

State Management

The scraper maintains a state file (wayback_scraper_state.json) that tracks:

  • Completed downloads
  • Download timestamps
  • Folder locations

This allows you to:

  • Resume interrupted downloads
  • Skip already completed downloads
  • Track progress across multiple runs
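
A rough sketch of how such a state file can drive resume logic is shown below. The JSON layout used here is an assumption for illustration; inspect wayback_scraper_state.json to see the actual fields the script records.

import json
import os
from datetime import datetime, timezone

STATE_FILE = 'wayback_scraper_state.json'

def load_state():
    """Load previously completed downloads, or start fresh if no state file exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"completed": {}}  # assumed layout: {"completed": {folder_name: timestamp}}

def mark_done(state, folder):
    """Record a finished download and persist the state immediately."""
    state["completed"][folder] = datetime.now(timezone.utc).isoformat()
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f, indent=2)

state = load_state()
folder = 'example_com_up_to_20160330'
if folder not in state["completed"]:
    # ... perform the download here, then record it:
    mark_done(state, folder)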

Logging

The scraper provides comprehensive logging:

  • Main log: downloads/logs/wayback_scraper.log
  • Download logs: Individual logs in each download folder
  • Console output: Real-time progress updates

Troubleshooting

Common Issues

  1. Docker not running:

    # Start Docker Desktop or Docker daemon
    sudo systemctl start docker  # Linux
  2. CSV file not found:

    • Ensure data.csv exists in the project root
    • Check file permissions
  3. Download timeouts:

    • The scraper has a 15-minute timeout per download
    • Large websites may take longer
    • Check logs for specific errors
  4. Resume downloads:

    # The scraper automatically resumes from state file
    # To force fresh start, delete wayback_scraper_state.json

Debug Mode

For detailed debugging, check the logs:

# View main log
tail -f downloads/logs/wayback_scraper.log

# View specific download log
tail -f downloads/example_com_up_to_20160330/download.log

Performance Tips

  • Large datasets: Process in smaller batches
  • Network issues: The scraper will retry failed downloads
  • Storage: Ensure sufficient disk space for downloads
  • Memory: Monitor Docker container memory usage

Limitations

  • Downloads are limited to HTML files and main pages
  • Media files (images, CSS, JS) are excluded for performance
  • Wayback Machine availability may vary by URL and date
  • Rate limiting may apply for large-scale scraping

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

License

This project is open source. Please check the license file for details.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs for error messages
  3. Open an issue on the repository

Note: This tool is designed for research and archival purposes. Please respect the Internet Archive's terms of service and robots.txt files when scraping websites.
