Historical Wayback Crawler

This script collects historical versions of webpages and extracts their raw text/HTML content for analysis, optionally saving the data to a JSON file. It uses the Wayback Machine and its CDX API to retrieve snapshots. It can also track how many times a given webpage changed its content within the specified time range (change tracking is disabled by default).

Cleaned up from the Data Provenance Initiative.

Install Requirements

pip install -r requirements.txt

Usage

NOTE: Your CSV file must contain a "URL" column of URLs to crawl. By default, the script crawls robots.txt files. If you want to crawl the main page/domain instead, pass --site-type main so the URL is not rewritten to include the /robots.txt path.
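
For example, a minimal input CSV (urls.csv is just an example filename) looks like this:

URL
patents.google.com
example.com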

python -m wayback.run \
    --input-path <in-path> \
    --snapshots-path snapshots \
    --output-json-path wayback_data.json \
    --start-date 20240419 \
    --end-date 20250203 \
    --frequency monthly \
    --site-type robots \
    --save-snapshots \
    --process-to-json
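
Under the hood, the script lists snapshots via the Wayback Machine CDX API. The sketch below illustrates the kind of query involved; it is not the script's actual code, and parameter choices such as collapse are assumptions:

import requests

# Ask the CDX API for snapshots of a robots.txt file within a date range.
# collapse=timestamp:6 keeps at most one snapshot per YYYYMM (i.e. monthly).
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "patents.google.com/robots.txt",
        "from": "20240419",
        "to": "20250203",
        "output": "json",
        "collapse": "timestamp:6",
    },
)
rows = resp.json()  # first row is the column header
for timestamp, original in ((r[1], r[2]) for r in rows[1:]):
    # The id_ endpoint serves the raw archived content without the Wayback banner.
    print(f"https://web.archive.org/web/{timestamp}id_/{original}")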

Arguments

  • --input-path (Path, required): Path to CSV file containing URLs (must include "URL" column).
  • --output-json-path (Path, default: ./wayback_data.json): Path to save the output JSON file with extracted text for all URLs.
  • --start-date (str, default: "20240419"): Start date in YYYYMMDD format.
  • --end-date (str, default: "20250203"): End date in YYYYMMDD format.
  • --frequency (str, default: "monthly", choices: ["daily", "monthly", "annually"]): Frequency of collecting snapshots.
  • --num-workers (int, default: multiprocessing.cpu_count() - 1): Number of worker threads.
  • --snapshots-path (Path, default: Path("snapshots")): Path to the folder where snapshots will be saved.
  • --stats-path (Path, default: Path("stats")): Path to the folder where rate of change stats will be saved.
  • --count-changes (flag, default: False): Track rate of change by counting the number of unique changes for each site in the date range.
  • --process-to-json (flag, default: False): Process the extracted snapshots and save them to a JSON file.
  • --save-snapshots (flag, default: False): Whether to save and process snapshots from the Wayback Machine.
  • --site-type (str, default: "robots", choices: ["tos", "robots", "main"]): Type of site to process (terms of service, robots.txt, or main page/domain).
  • --max-chunk-size (int, default: 5000): Chunk size (in MB) for saving data to the JSON file.

The only required argument is --input-path, the path to a CSV file of URLs.
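
Since every other argument has a default, the shortest valid invocation is simply (urls.csv standing in for your own file):

python -m wayback.run --input-path urls.csv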

Rate Limiting

To avoid overwhelming sites and to respect rate limits, this script uses the ratelimit library to cap requests at 2 per second.

If you need to adjust the rate limit, modify RATE_LIMIT_CALLS and RATE_LIMIT_PERIOD on the CDXEndpoint class in config.py.
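
These constants feed the standard ratelimit decorators. A minimal sketch of the pattern (fetch_snapshot is a hypothetical helper, not a function from this codebase):

import requests
from ratelimit import limits, sleep_and_retry

RATE_LIMIT_CALLS = 2   # max calls...
RATE_LIMIT_PERIOD = 1  # ...per this many seconds

@sleep_and_retry  # sleep until the next window instead of raising
@limits(calls=RATE_LIMIT_CALLS, period=RATE_LIMIT_PERIOD)
def fetch_snapshot(url: str) -> str:
    return requests.get(url, timeout=30).text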

Errors

Any errors / failed requests are saved to a file called failed_urls.txt in the root directory of this repo.
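
To retry failures, you can turn that file back into an input CSV and rerun the script. A sketch, assuming failed_urls.txt holds one URL per line (the exact format is an assumption):

import csv

# Rebuild a CSV with the required "URL" column from failed_urls.txt.
with open("failed_urls.txt") as f, open("retry.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["URL"])
    for line in f:
        if line.strip():
            writer.writerow([line.strip()])

Then rerun with: python -m wayback.run --input-path retry.csv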

Output JSON Format

When using --process-to-json, the script creates a JSON file with the following structure:

{
    "domain.com": {
        "YYYY-MM-DD": "content for this date",
        "YYYY-MM-DD": "content for this date",
        ...
    },
    "another-domain.com": {
        "YYYY-MM-DD": "content for this date",
        "YYYY-MM-DD": "content for this date",
        ...
    }
}

Example output for robots.txt files:

{
  "patents.google.com": {
    "2024-04-19": "User-agent: *\nDisallow: /*\nAllow: /$\nAllow: /advanced$\nAllow: /patent/\nAllow: /sitemap/",
    "2024-05-01": "User-agent: *\nDisallow: /*\nAllow: /$\nAllow: /advanced$\nAllow: /patent/\nAllow: /sitemap/",
    "2024-06-01": "User-agent: *\nDisallow: /*\nAllow: /$\nAllow: /advanced$\nAllow: /patent/\nAllow: /sitemap/"
  }
}

The JSON structure is:

  • Top level: Dictionary of domains
  • Second level: Dictionary of dates mapping to content
  • Content: Raw text/HTML content for that snapshot
  • Dates: In YYYY-MM-DD format
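
Because the output is plain nested dictionaries, consuming it takes only a few lines (wayback_data.json is the default --output-json-path from above):

import json

with open("wayback_data.json") as f:
    data = json.load(f)

for domain, snapshots in data.items():
    for date, content in sorted(snapshots.items()):
        print(f"{domain} @ {date}: {len(content)} chars")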
