A comprehensive web archiving toolkit with cryptographic verification, time-disciplined attestation, and reproducible workflows.
- Web Crawling: Uses Browsertrix Crawler for high-fidelity web archiving
- Media Download: Integrates yt-dlp and gallery-dl for multimedia content
- Cryptographic Sealing: Merkle trees, digital signatures, and timestamping
- Time Discipline: NTS-synchronized timestamps for non-repudiation
- Verification: End-to-end integrity checking and proof validation
- Reproducible: Docker-based, version-controlled, auditable workflows
- Privacy-Aware: Configurable redaction for public release
- Docker and Docker Compose
- Python 3.8+
- Git
- OpenSSL
git clone https://github.com/gavriilfakih/webarchive.git
cd webarchive
make bootstrap
make install

# Start services
make up
# Crawl a website
make crawl URL="https://example.com"
# Download media
make media URL="https://www.youtube.com/watch?v=abc123"
# Create cryptographic seals
make seal
# Verify integrity
make verify
# Start replay server
make replay
# Visit http://localhost:8080

webarchive/
├─ bin/ # Executable scripts
├─ src/webarchive/ # Python package
├─ conf/ # Configuration files
├─ docker/ # Docker services
├─ docs/ # Documentation
├─ test/ # Unit tests
├─ data/ # Runtime data (not in git)
│ ├─ archives/ # WARC files
│ ├─ media/ # Downloaded media
│ ├─ proofs/ # Cryptographic proofs
│ └─ logs/ # Service logs
└─ examples/ # Sample configurations
- Capture: Web crawling and media downloading
- Seal: Generate cryptographic proofs
- Verify: Validate integrity and authenticity
- Replay: Browse archived content locally
- Publish: Create redacted versions for sharing
make env-check # Check system requirements
make bootstrap # Initialize directories
make up # Start Docker services
make down # Stop services

# Web crawling
make crawl URL="https://example.com" CFG="conf/browsertrix/crawl.yml"
# Media download
make media URL="https://youtube.com/watch?v=abc"
make media FILE="urls.txt"
# Cryptographic sealing
make seal # hash → merkle → sign → timestamp
make verify # Verify all proofs
# Content redaction
make redact # Create public-safe copies

make test # Run unit tests
make lint # Code quality checks
make format # Auto-format code
make docs # Generate documentation

Copy .env.example to .env and customize:
cp .env.example .env
# Edit .env with your settings

Edit conf/browsertrix/crawl.yml:
workers: 2
limit: 50
behaviors:
  - autoplay
  - autofetch
exclude:
  - "*/ads/*"
  - "*/tracking/*"

Edit conf/policy.yaml for capture policies:
content_policy:
  max_file_size: 104857600 # 100MB
  exclude_patterns:
    - "*/private/*"
    - "*/admin/*"
The toolkit provides multiple layers of integrity protection:
- File Hashes: SHA-256 of all archived content
- Merkle Trees: Deterministic tree structure
- Digital Signatures: GPG signatures on attestations
- RFC 3161 Timestamps: Trusted timestamp authority
- OpenTimestamps: Blockchain-based timestamps
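To make the hash → merkle → sign → timestamp pipeline concrete, here is a minimal Merkle-root sketch over SHA-256 file hashes. It assumes sorted (deterministic) leaf ordering and last-node duplication on odd levels; the toolkit's actual construction lives in webarchive.merkle and may differ:

# Minimal Merkle-root sketch: SHA-256 leaves combined pairwise.
# Sorted input gives deterministic output; not the toolkit's exact
# tree construction (see webarchive.merkle).
import hashlib

def merkle_root(leaf_hashes: list[bytes]) -> bytes:
    level = sorted(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last node on odd-sized levels
            level.append(level[-1])
        level = [hashlib.sha256(a + b).digest()
                 for a, b in zip(level[::2], level[1::2])]
    return level[0]

leaves = [hashlib.sha256(data).digest() for data in (b"warc-1", b"warc-2")]
print(merkle_root(leaves).hex())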
# Comprehensive verification
make verify
# Manual verification steps
python -m webarchive.verify --base . --schema conf/attest.schema.json
python -m webarchive.merkle data/proofs/files.sha256 --root data/proofs/merkle/root.txt
gpg --verify data/proofs/ATT.yaml.sig data/proofs/ATT.yaml

The system uses Docker Compose with multiple profiles:
# Web archiving services
docker compose -f docker/compose.yml --profile archiving up -d
# Replay services
docker compose -f docker/compose.yml --profile replay up -d
# Time synchronization
docker compose -f docker/compose.yml --profile timeserver up -d

- browsertrix: Web crawler
- pywb: Archive replay
- chrony: NTS time synchronization
- postgres: Session database (optional)
from pathlib import Path

from webarchive import merkle, attest, verify

# Build merkle tree
tree = merkle.build_merkle_from_file_hashes(Path("files.sha256"))
print(f"Root: {tree.get_root()}")

# Generate attestation
attestation = attest.build_attestation(
    session_name="my-session",
    base_path=Path("."),
    notes="Example archiving session",
)

# Verify integrity
report = verify.run_comprehensive_verification(Path("."))
print(f"Status: {report['overall_status']}")

- NTS (Network Time Security) for authenticated time
- Chrony daemon with multiple time sources
- Persistent time logs for audit trails
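When auditing the time logs, the local clock can be cross-checked independently. The sketch below uses plain NTP via the third-party ntplib package as a stand-in for the NTS-secured chrony path; the server choice is illustrative:

# Cross-check the local clock against an NTP server.
# Plain NTP via ntplib (pip install ntplib); a stand-in for the
# toolkit's NTS-secured chrony setup, not a replacement for it.
import ntplib

client = ntplib.NTPClient()
response = client.request("pool.ntp.org", version=3)
print(f"Local clock offset: {response.offset:+.3f} s")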
- Automatic header redaction (cookies, auth tokens)
- URL parameter sanitization
- Configurable content filtering
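As an illustration of URL parameter sanitization, a sketch like the following strips common tracking and session parameters before URLs are published; the DROP set and sanitize_url helper are hypothetical, not the toolkit's redaction code:

# Hypothetical URL-parameter sanitization sketch; DROP and
# sanitize_url() are illustrative, not the toolkit's redaction code.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

DROP = {"utm_source", "utm_medium", "utm_campaign", "session", "token"}

def sanitize_url(url: str) -> str:
    """Remove tracking/session query parameters from a URL."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in DROP]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(sanitize_url("https://example.com/page?id=1&utm_source=mail&token=x"))
# -> https://example.com/page?id=1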
- Pinned Docker images
- Deterministic file ordering
- Version-controlled configurations
- Never archive authenticated content without permission
- Review robots.txt and terms of service
- Use redaction for public releases
- Verify timestamps independently
- Store private keys securely
- Fork the repository
- Create a feature branch
- Run tests: make test
- Check code quality: make lint
- Submit a pull request
MIT License - see LICENSE file.
- GitHub Issues: Bug reports and feature requests
- Documentation: docs/ directory
- Examples: examples/ directory
Note: This is experimental software. Test thoroughly before using in production environments.