Wayback Machine Scraper

A powerful tool for downloading historical versions of websites from the Internet Archive's Wayback Machine. The scraper reads a CSV file of URLs and deal dates, then downloads snapshots of each website from two points in time relative to its deal date.

Features

  • Batch Processing: Process multiple URLs from a CSV file
  • Time-based Downloads: Automatically calculates two download dates for each URL:
    • 6 months before the deal date
    • 12 months after the deal date
  • Resume Capability: Can resume interrupted downloads using state tracking
  • Comprehensive Logging: Detailed logs for debugging and monitoring
  • Docker Support: Easy deployment using Docker containers
  • Progress Tracking: State file to track completed downloads

Prerequisites

  • Docker and Docker Compose
  • A CSV file with your URLs and deal dates

Quick Start

  1. Prepare your data file: Create a data.csv file with the following format:

    URL,Deal Date
    https://example.com,2016-09-30
    https://another-site.com,2017-03-15
  2. Run the scraper:

    ./run.sh
  3. Check results: Your downloaded websites will be available in the downloads/ directory.

Deployment

Quick Deploy

# Deploy with default version (1.0.0)
chmod +x deploy.sh
./deploy.sh

# Deploy with specific version
./deploy.sh 2.1.0

What it does

  1. Builds and pushes Docker image to Docker Hub (latest + version tag)
  2. Creates a wayback_scraper artifact with:
    • docker-compose.yml (configured image with version)
    • run.sh (execution script)
    • README.md (quick start)
  3. Publishes GitHub Release with downloadable artifact

Deployment Prerequisites

  • Docker and Docker Hub account
  • GitHub CLI (optional, for automated releases)
  • Update DOCKERHUB_USERNAME in deploy.sh

Deployment Troubleshooting

# Login to Docker Hub
docker login

# Make script executable
chmod +x deploy.sh

CSV Format

Your CSV file must contain these columns:

  • URL: The website URL to scrape
  • Deal Date: The reference date in YYYY-MM-DD format

Example:

URL,Deal Date
https://example.com,2016-09-30
https://company.com,2017-03-15

How It Works

For each URL in your CSV file, the scraper will:

  1. Calculate download dates:

    • First date: 6 months before the deal date
    • Second date: 12 months after the deal date
  2. Download website snapshots:

    • Uses the wayback-machine-downloader tool
    • Downloads HTML files and main pages
    • Skips media files (images, CSS, JS) for faster downloads
  3. Organize results:

    • Creates separate folders for each URL and date
    • Naming format: {domain}_up_to_{YYYYMMDD}
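
Roughly, the per-URL workflow looks like the sketch below. This is a minimal illustration, not the actual code in wayback_scraper.py: the helper names are made up, and the exact filters and flags passed to wayback_machine_downloader are assumptions.

import re
import subprocess
from datetime import datetime
from dateutil.relativedelta import relativedelta

MONTHS_BEFORE_DEAL = 6
MONTHS_AFTER_DEAL = 12

def snapshot_dates(deal_date_str):
    """Return the two cutoff dates (YYYYMMDD) derived from a deal date."""
    deal_date = datetime.strptime(deal_date_str, "%Y-%m-%d")
    before = deal_date - relativedelta(months=MONTHS_BEFORE_DEAL)
    after = deal_date + relativedelta(months=MONTHS_AFTER_DEAL)
    return [before.strftime("%Y%m%d"), after.strftime("%Y%m%d")]

def folder_name(url, cutoff):
    """Build the {domain}_up_to_{YYYYMMDD} folder name."""
    domain = re.sub(r"^https?://", "", url).strip("/").replace(".", "_")
    return f"{domain}_up_to_{cutoff}"

def download_snapshot(url, cutoff, output_dir):
    """Invoke the wayback_machine_downloader gem for one URL/date pair."""
    subprocess.run(
        ["wayback_machine_downloader", url,
         "--to", cutoff,                               # only snapshots up to this date
         "--exclude", r"/\.(png|jpe?g|gif|css|js)$/i", # skip media files (assumed filter)
         "--directory", f"{output_dir}/{folder_name(url, cutoff)}"],
        check=True,
        timeout=15 * 60,                               # 15-minute timeout per download
    )

for cutoff in snapshot_dates("2016-09-30"):
    download_snapshot("https://example.com", cutoff, "downloads")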

Directory Structure

After running the scraper, your downloads/ directory will look like:

downloads/
├── example_com_up_to_20160330/
│   ├── download.log
│   └── [downloaded files]
├── example_com_up_to_20170930/
│   ├── download.log
│   └── [downloaded files]
└── logs/
    └── wayback_scraper.log

Configuration

Time Periods

You can modify the time periods in wayback_scraper.py:

MONTHS_BEFORE_DEAL = 6  # Months before deal date
MONTHS_AFTER_DEAL = 12  # Months after deal date

CSV Column Names

If your CSV uses different column names, update these constants:

WEBSITE_URL_COLUMN = 'URL'
DEAL_DATE_COLUMN = 'Deal Date'
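
As a reference for how those constants map onto the CSV headers, here is a small illustrative snippet using Python's built-in csv module (the helper name is made up; the actual parsing in wayback_scraper.py may differ):

import csv

WEBSITE_URL_COLUMN = 'URL'
DEAL_DATE_COLUMN = 'Deal Date'

def read_deals(csv_path):
    """Yield (url, deal_date) pairs from the input CSV."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        for row in csv.DictReader(f):
            yield row[WEBSITE_URL_COLUMN], row[DEAL_DATE_COLUMN]

for url, deal_date in read_deals('data.csv'):
    print(url, deal_date)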

Advanced Usage

Manual Docker Commands

Build the image:

docker compose build

Run the scraper:

docker compose up

Direct Python Usage

If you prefer to run without Docker:

  1. Install dependencies:

    pip install -r requirements.txt
  2. Install the wayback-machine-downloader Ruby gem:

    gem install wayback_machine_downloader
  3. Run the script:

    python3 wayback_scraper.py data.csv --output downloads

Command Line Options

python3 wayback_scraper.py <csv_file> [options]

Options:
  --output DIR     Output directory for downloads (default: downloads)
  --resume         Resume from previous state
  --help           Show help message

State Management

The scraper maintains a state file (wayback_scraper_state.json) that tracks:

  • Completed downloads
  • Download timestamps
  • Folder locations

This allows you to:

  • Resume interrupted downloads
  • Skip already completed downloads
  • Track progress across multiple runs
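
A rough sketch of how such a state file can drive resume logic is shown below. The JSON layout used here is an assumption for illustration; inspect wayback_scraper_state.json to see the actual fields the script records.

import json
import os
from datetime import datetime, timezone

STATE_FILE = 'wayback_scraper_state.json'

def load_state():
    """Load previously completed downloads, or start fresh if no state file exists."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"completed": {}}  # assumed layout: {"completed": {folder_name: timestamp}}

def mark_done(state, folder):
    """Record a finished download and persist the state immediately."""
    state["completed"][folder] = datetime.now(timezone.utc).isoformat()
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f, indent=2)

state = load_state()
folder = 'example_com_up_to_20160330'
if folder not in state["completed"]:
    # ... perform the download here, then record it:
    mark_done(state, folder)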

Logging

The scraper provides comprehensive logging:

  • Main log: downloads/logs/wayback_scraper.log
  • Download logs: Individual logs in each download folder
  • Console output: Real-time progress updates

Troubleshooting

Common Issues

  1. Docker not running:

    # Start Docker Desktop or Docker daemon
    sudo systemctl start docker  # Linux
  2. CSV file not found:

    • Ensure data.csv exists in the project root
    • Check file permissions
  3. Download timeouts:

    • The scraper has a 15-minute timeout per download
    • Large websites may take longer
    • Check logs for specific errors
  4. Resume downloads:

    # The scraper automatically resumes from state file
    # To force fresh start, delete wayback_scraper_state.json

Debug Mode

For detailed debugging, check the logs:

# View main log
tail -f downloads/logs/wayback_scraper.log

# View specific download log
tail -f downloads/example_com_up_to_20160330/download.log

Performance Tips

  • Large datasets: Process in smaller batches
  • Network issues: The scraper will retry failed downloads
  • Storage: Ensure sufficient disk space for downloads
  • Memory: Monitor Docker container memory usage

Limitations

  • Downloads are limited to HTML files and main pages
  • Media files (images, CSS, JS) are excluded for performance
  • Wayback Machine availability may vary by URL and date
  • Rate limiting may apply for large-scale scraping

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

License

This project is open source. Please check the license file for details.

Support

For issues and questions:

  1. Check the troubleshooting section
  2. Review the logs for error messages
  3. Open an issue on the repository

Note: This tool is designed for research and archival purposes. Please respect the Internet Archive's terms of service and robots.txt files when scraping websites.
