This script collects historical versions of webpages, extracts their raw text/HTML content for analysis, and optionally saves the data to a JSON file. It uses the Wayback Machine CDX API to retrieve snapshots. It can also track how many times a given webpage changed its content within the specified time range (changes are not tracked by default).
Cleaned up from the Data Provenance Initiative.
Install Requirements
pip install -r requirements.txt
Usage
NOTE: Your CSV file should contain a "URL" column of URLs to crawl. The default is to crawl robots.txt files; if you want to crawl the main page/domain instead, use --site-type main so the URL is not modified to include the /robots.txt path.
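For example, a minimal input file (with hypothetical URLs) looks like:
URL
https://patents.google.com
https://example.com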
python -m wayback.run \
--input-path <in-path> \
--snapshots-path snapshots \
--output-json-path wayback_data.json \
--start-date 20240419 \
--end-date 20250203 \
--frequency monthly \
--site-type robots \
--save-snapshots \
--process-to-json
Arguments
--input-path (Path, required): Path to CSV file containing URLs (must include "URL" column).
--output-json-path (Path, default: ./wayback_data.json): Path to save the output JSON file with extracted text for all URLs.
--start-date (str, default: "20240419"): Start date in YYYYMMDD format.
--end-date (str, default: "20250203"): End date in YYYYMMDD format.
--frequency (str, default: "monthly", choices: ["daily", "monthly", "annually"]): Frequency of collecting snapshots.
--num-workers (int, default: multiprocessing.cpu_count() - 1): Number of worker threads.
--snapshots-path (Path, default: Path("snapshots")): Path to the folder where snapshots will be saved.
--stats-path (Path, default: Path("stats")): Path to the folder where rate-of-change stats will be saved.
--count-changes (flag, default: False): Track rate of change by counting the number of unique changes for each site in the date range.
--process-to-json (flag, default: False): Process the extracted snapshots and save them to a JSON file.
--save-snapshots (flag, default: False): Whether to save and process snapshots from the Wayback Machine.
--site-type (str, default: "robots", choices: ["tos", "robots", "main"]): Type of site to process (terms of service, robots.txt, or main page/domain).
--max-chunk-size (int, default: 5000): Chunk size (MB) for saving data to JSON file.
The only required argument is the input path to a CSV file with URLs.
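For example, this minimal invocation relies on the defaults listed above (urls.csv is a hypothetical input file):
python -m wayback.run --input-path urls.csv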
Rate Limiting
To avoid overwhelming sites and to respect rate limits, this script uses the ratelimit library to cap requests at 2 per second.
If you need to adjust the rate limit, modify the RATE_LIMIT_CALLS and RATE_LIMIT_PERIOD of the CDXEndpoint class in the config.py file.
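For reference, the ratelimit library enforces this with decorators. A minimal sketch of the pattern (an illustration, not the script's actual code, which keeps these constants on the CDXEndpoint class):
import requests
from ratelimit import limits, sleep_and_retry

RATE_LIMIT_CALLS = 2   # max requests...
RATE_LIMIT_PERIOD = 1  # ...per this many seconds

@sleep_and_retry
@limits(calls=RATE_LIMIT_CALLS, period=RATE_LIMIT_PERIOD)
def fetch(url):
    # Blocks until a request slot is free, then performs the GET.
    return requests.get(url, timeout=30)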
Errors
Any errors / failed requests are saved to a file called failed_urls.txt in the root directory of this repo.
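To retry failures, a minimal sketch (assuming failed_urls.txt contains one URL per line) could rebuild an input CSV for another run:
import pandas as pd

# Assumption: failed_urls.txt holds one failed URL per line.
with open("failed_urls.txt") as f:
    failed = [line.strip() for line in f if line.strip()]

# Write the failures back out in the input format (a "URL" column)
# so they can be re-crawled with another run of the script.
pd.DataFrame({"URL": failed}).to_csv("retry.csv", index=False)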
Output JSON Format
When using --process-to-json, the script creates a JSON file with the following structure:
{
"domain.com": {
"YYYY-MM-DD": "content for this date",
"YYYY-MM-DD": "content for this date",
...
},
"another-domain.com": {
"YYYY-MM-DD": "content for this date",
"YYYY-MM-DD": "content for this date",
...
}
}
Example output for robots.txt files:
{
"patents.google.com": {
"2024-04-19": "User-agent: *\nDisallow: /*\nAllow: /$\nAllow: /advanced$\nAllow: /patent/\nAllow: /sitemap/",
"2024-05-01": "User-agent: *\nDisallow: /*\nAllow: /$\nAllow: /advanced$\nAllow: /patent/\nAllow: /sitemap/",
"2024-06-01": "User-agent: *\nDisallow: /*\nAllow: /$\nAllow: /advanced$\nAllow: /patent/\nAllow: /sitemap/"
}
}
The JSON structure is:
- Top level: Dictionary of domains
- Second level: Dictionary of dates mapping to content
- Content: Raw text/HTML content for that snapshot
- Dates: In YYYY-MM-DD format
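A short sketch of consuming this file (assuming wayback_data.json is in the working directory):
import json

with open("wayback_data.json") as f:
    data = json.load(f)

# Iterate domains, then dated snapshots for each domain.
for domain, snapshots in data.items():
    for date, content in sorted(snapshots.items()):
        print(domain, date, len(content))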