Note: This was vibe coded with Gemini
This script queries the Wayback Machine's CDX API to find archived snapshots of a specific domain within a user-defined date range. It filters for successful captures (HTTP 200) and exports the results to a CSV file for auditing and analysis.
- Custom Date Filtering: Specify a start date and an optional end date for the audit.
- Status Filtering: Automatically filters for successful `200 OK` status codes.
- Deduplication: Uses the Wayback Machine's collapse parameter to ensure unique daily captures.
- CSV Export: Generates a structured report including:
- Formatted capture date (YYYY-MM-DD HH:MM)
- HTTP Status Code
- Original URL
- Direct link to the Wayback Machine archive
- Python 3.x
- `requests` library
To install the required library, run:
pip install requests
- Run the script:
python script_name.py
- Enter the Domain: Input the domain you wish to audit (e.g., `example.com`).
- Enter the Start Date: Provide the date in `YYYYMMDD` format (e.g., `20240101`).
- Enter the End Date: Provide the end date in `YYYYMMDD` format, or press Enter to default to the current date.
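The date prompts above could be validated along these lines. This is a minimal sketch, not the script's actual code: the helper name `parse_yyyymmdd` and its `default` argument are hypothetical, introduced only for illustration.

```python
from datetime import date, datetime


def parse_yyyymmdd(text, default=None):
    """Return a validated YYYYMMDD string (hypothetical helper).

    Blank input falls back to `default`, or to today's date if no
    default is given, mirroring the "press Enter" behaviour.
    """
    text = text.strip()
    if not text:
        return (default or date.today()).strftime("%Y%m%d")
    # strptime alone is lenient about digit counts, so check the shape first.
    if len(text) != 8 or not text.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {text!r}")
    # Raises ValueError for impossible dates such as 20240230.
    datetime.strptime(text, "%Y%m%d")
    return text
```

With a helper like this, the end-date prompt could be read as `parse_yyyymmdd(input("End date: "))`, and a malformed value would raise `ValueError` before any request is sent.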
The script interacts with the Internet Archive's CDX API. Below is a high-level overview of the data flow:
- Request: The script sends a GET request to `web.archive.org/cdx/search/cdx`.
- Parameters: It applies filters for the `url`, `statuscode`, and `timestamp`.
- Processing: The script converts the raw API timestamp (e.g., `20240101123000`) into a human-readable format.
- Output: Data is written row-by-row into a CSV file named `audit_[domain]_[start_date].csv`.
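The request and processing steps can be sketched roughly as follows. The exact parameter values the script sends are not shown in this README, so the choices here are assumptions: `from`/`to` for the date range, `filter=statuscode:200`, `collapse=timestamp:8` for one capture per day, and `output=json`. The function names are hypothetical.

```python
from datetime import datetime

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_params(domain, start, end):
    """Assemble CDX query parameters (assumed values, not the script's own)."""
    return {
        "url": domain,
        "from": start,                    # YYYYMMDD lower bound
        "to": end,                        # YYYYMMDD upper bound
        "output": "json",
        "fl": "timestamp,original,statuscode",
        "filter": "statuscode:200",       # keep successful captures only
        "collapse": "timestamp:8",        # collapse on the YYYYMMDD prefix
    }


def format_timestamp(raw):
    """Convert a 14-digit CDX timestamp into 'YYYY-MM-DD HH:MM'."""
    return datetime.strptime(raw, "%Y%m%d%H%M%S").strftime("%Y-%m-%d %H:%M")


def fetch_rows(domain, start, end):
    """Send the GET request and return the data rows (performs network I/O)."""
    import requests  # third-party; imported here so the helpers above stay stdlib-only

    response = requests.get(
        CDX_ENDPOINT, params=build_cdx_params(domain, start, end), timeout=30
    )
    response.raise_for_status()
    rows = response.json()
    return rows[1:]  # with output=json, the first row holds the field names
```

For example, `format_timestamp("20240101123000")` returns `"2024-01-01 12:30"`.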
The generated CSV follows this format:
| Date | Status | Original URL | View Archive Link |
|---|---|---|---|
| 2024-01-01 12:00 | 200 | http://example.com/ | https://web.archive.org/web/20240101120000/http://example.com/ |
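
One way to produce rows in this shape is with Python's built-in `csv` module. The sketch below is illustrative only: `to_csv_row` and `write_report` are hypothetical names, and it assumes each capture arrives as a `(timestamp, original, statuscode)` tuple.

```python
import csv
import io
from datetime import datetime

HEADER = ["Date", "Status", "Original URL", "View Archive Link"]


def to_csv_row(timestamp, original, statuscode):
    """Build one report row from a capture's raw fields."""
    date_text = datetime.strptime(timestamp, "%Y%m%d%H%M%S").strftime("%Y-%m-%d %H:%M")
    # Archive links follow the pattern web.archive.org/web/<timestamp>/<original URL>.
    link = f"https://web.archive.org/web/{timestamp}/{original}"
    return [date_text, statuscode, original, link]


def write_report(captures, handle):
    """Write the header and one row per capture to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(HEADER)
    for timestamp, original, statuscode in captures:
        writer.writerow(to_csv_row(timestamp, original, statuscode))


# Demonstration with an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_report([("20240101120000", "http://example.com/", "200")], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a handle opened with `open(path, "w", newline="", encoding="utf-8")`; `newline=""` keeps the `csv` module from emitting blank lines on Windows.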