Web Archive Tools

A comprehensive web archiving toolkit with cryptographic verification, time-disciplined attestation, and reproducible workflows.

Features

  • Web Crawling: Uses Browsertrix Crawler for high-fidelity web archiving
  • Media Download: Integrates yt-dlp and gallery-dl for multimedia content
  • Cryptographic Sealing: Merkle trees, digital signatures, and timestamping
  • Time Discipline: NTS-synchronized timestamps for non-repudiation
  • Verification: End-to-end integrity checking and proof validation
  • Reproducible: Docker-based, version-controlled, auditable workflows
  • Privacy-Aware: Configurable redaction for public release

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.8+
  • Git
  • OpenSSL

Installation

git clone https://github.com/gavriilfakih/webarchive.git
cd webarchive
make bootstrap
make install

Basic Usage

# Start services
make up

# Crawl a website
make crawl URL="https://example.com"

# Download media
make media URL="https://www.youtube.com/watch?v=abc123"

# Create cryptographic seals
make seal

# Verify integrity
make verify

# Start replay server
make replay
# Visit http://localhost:8080

Architecture

webarchive/
├─ bin/                    # Executable scripts
├─ src/webarchive/         # Python package
├─ conf/                   # Configuration files
├─ docker/                 # Docker services
├─ docs/                   # Documentation
├─ test/                   # Unit tests
├─ data/                   # Runtime data (not in git)
│  ├─ archives/            # WARC files
│  ├─ media/               # Downloaded media
│  ├─ proofs/              # Cryptographic proofs
│  └─ logs/                # Service logs
└─ examples/               # Sample configurations

Workflow

  1. Capture: Web crawling and media downloading
  2. Seal: Generate cryptographic proofs
  3. Verify: Validate integrity and authenticity
  4. Replay: Browse archived content locally
  5. Publish: Create redacted versions for sharing

Core Commands

Environment

make env-check    # Check system requirements
make bootstrap    # Initialize directories
make up           # Start Docker services
make down         # Stop services

Archiving

# Web crawling
make crawl URL="https://example.com" CFG="conf/browsertrix/crawl.yml"

# Media download
make media URL="https://youtube.com/watch?v=abc"
make media FILE="urls.txt"

# Cryptographic sealing
make seal         # hash → merkle → sign → timestamp
make verify       # Verify all proofs

# Content redaction
make redact       # Create public-safe copies

Development

make test         # Run unit tests
make lint         # Code quality checks
make format       # Auto-format code
make docs         # Generate documentation

Configuration

Environment Variables

Copy .env.example to .env and customize:

cp .env.example .env
# Edit .env with your settings

Crawl Configuration

Edit conf/browsertrix/crawl.yml:

workers: 2
limit: 50
behaviors:
  - autoplay
  - autofetch
exclude:
  - "*/ads/*"
  - "*/tracking/*"

Policy Configuration

Edit conf/policy.yaml for capture policies:

content_policy:
  max_file_size: 104857600  # 100MB
  exclude_patterns:
    - "*/private/*"
    - "*/admin/*"
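
The policy fields above can be illustrated with a small sketch. Note this is a hypothetical helper, not the toolkit's actual policy engine; the pattern-matching rules in `src/webarchive` may differ.

```python
from fnmatch import fnmatch

# Mirrors the conf/policy.yaml example above (hypothetical helper).
EXCLUDE_PATTERNS = ["*/private/*", "*/admin/*"]
MAX_FILE_SIZE = 104_857_600  # 100 MB

def should_capture(url: str, size_bytes: int) -> bool:
    """Return True if a resource passes the content policy."""
    if size_bytes > MAX_FILE_SIZE:
        return False
    return not any(fnmatch(url, pat) for pat in EXCLUDE_PATTERNS)

print(should_capture("https://example.com/public/page.html", 1024))    # True
print(should_capture("https://example.com/private/notes.html", 1024))  # False
```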

Cryptographic Verification

The toolkit provides multiple layers of integrity protection:

  1. File Hashes: SHA-256 of all archived content
  2. Merkle Trees: Deterministic tree structure
  3. Digital Signatures: GPG signatures on attestations
  4. RFC 3161 Timestamps: Trusted timestamp authority
  5. OpenTimestamps: Blockchain-based timestamps
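
Layers 1 and 2 can be sketched as follows. This is an illustrative Merkle construction only; the toolkit's actual leaf encoding and odd-node handling live in `src/webarchive` and may differ, so use `webarchive.merkle` for real verification.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes: list[str]) -> str:
    """Pairwise-hash sorted leaves until a single root remains."""
    level = sorted(leaf_hashes)          # deterministic ordering
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last node if odd
            level.append(level[-1])
        level = [sha256_hex((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

leaves = [sha256_hex(b"file-a"), sha256_hex(b"file-b"), sha256_hex(b"file-c")]
print(merkle_root(leaves))  # 64-char hex root
```

Because the leaves are sorted before pairing, the same set of file hashes always yields the same root regardless of input order.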

Verification Process

# Comprehensive verification
make verify

# Manual verification steps
python -m webarchive.verify --base . --schema conf/attest.schema.json
python -m webarchive.merkle data/proofs/files.sha256 --root data/proofs/merkle/root.txt
gpg --verify data/proofs/ATT.yaml.sig data/proofs/ATT.yaml

Docker Services

The system uses Docker Compose with multiple profiles:

# Web archiving services
docker compose -f docker/compose.yml up -d --profile archiving

# Replay services
docker compose -f docker/compose.yml up -d --profile replay

# Time synchronization
docker compose -f docker/compose.yml up -d --profile timeserver

Services Included

  • browsertrix: Web crawler
  • pywb: Archive replay
  • chrony: NTS time synchronization
  • postgres: Session database (optional)

Python API

from webarchive import merkle, attest, verify

# Build merkle tree
tree = merkle.build_merkle_from_file_hashes(Path("files.sha256"))
print(f"Root: {tree.get_root()}")

# Generate attestation
attestation = attest.build_attestation(
    session_name="my-session",
    base_path=Path("."),
    notes="Example archiving session"
)

# Verify integrity
report = verify.run_comprehensive_verification(Path("."))
print(f"Status: {report['overall_status']}")

Advanced Features

Time-Disciplined Archiving

  • NTS (Network Time Security) for authenticated time
  • Chrony daemon with multiple time sources
  • Persistent time logs for audit trails

Privacy and Redaction

  • Automatic header redaction (cookies, auth tokens)
  • URL parameter sanitization
  • Configurable content filtering
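
URL parameter sanitization can be sketched with the standard library. The parameter list below is a hypothetical example; the toolkit's real redaction rules are applied by `make redact` and may cover different names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of sensitive parameter names (illustration only).
SENSITIVE_PARAMS = {"token", "session", "auth", "api_key"}

def sanitize_url(url: str) -> str:
    """Drop query parameters whose names look sensitive."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SENSITIVE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(sanitize_url("https://example.com/page?id=7&token=secret"))
# https://example.com/page?id=7
```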

Reproducible Builds

  • Pinned Docker images
  • Deterministic file ordering
  • Version-controlled configurations

Security Considerations

  • Never archive authenticated content without permission
  • Review robots.txt and terms of service
  • Use redaction for public releases
  • Verify timestamps independently
  • Store private keys securely

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Run tests: make test
  4. Check code quality: make lint
  5. Submit a pull request

License

MIT License - see LICENSE file.

Support

  • GitHub Issues: Bug reports and feature requests
  • Documentation: docs/ directory
  • Examples: examples/ directory

Note: This is experimental software. Test thoroughly before using in production environments.
