Web Archive Tools

A comprehensive web archiving toolkit with cryptographic verification, time-disciplined attestation, and reproducible workflows.

Features

  • Web Crawling: Uses Browsertrix Crawler for high-fidelity web archiving
  • Media Download: Integrates yt-dlp and gallery-dl for multimedia content
  • Cryptographic Sealing: Merkle trees, digital signatures, and timestamping
  • Time Discipline: NTS-synchronized timestamps for non-repudiation
  • Verification: End-to-end integrity checking and proof validation
  • Reproducible: Docker-based, version-controlled, auditable workflows
  • Privacy-Aware: Configurable redaction for public release

Quick Start

Prerequisites

  • Docker and Docker Compose
  • Python 3.8+
  • Git
  • OpenSSL

Installation

git clone https://github.com/gavriilfakih/webarchive.git
cd webarchive
make bootstrap
make install

Basic Usage

# Start services
make up

# Crawl a website
make crawl URL="https://example.com"

# Download media
make media URL="https://www.youtube.com/watch?v=abc123"

# Create cryptographic seals
make seal

# Verify integrity
make verify

# Start replay server
make replay
# Visit http://localhost:8080

Architecture

webarchive/
├─ bin/                    # Executable scripts
├─ src/webarchive/         # Python package
├─ conf/                   # Configuration files
├─ docker/                 # Docker services
├─ docs/                   # Documentation
├─ test/                   # Unit tests
├─ data/                   # Runtime data (not in git)
│  ├─ archives/            # WARC files
│  ├─ media/               # Downloaded media
│  ├─ proofs/              # Cryptographic proofs
│  └─ logs/                # Service logs
└─ examples/               # Sample configurations

Workflow

  1. Capture: Web crawling and media downloading
  2. Seal: Generate cryptographic proofs
  3. Verify: Validate integrity and authenticity
  4. Replay: Browse archived content locally
  5. Publish: Create redacted versions for sharing

Core Commands

Environment

make env-check    # Check system requirements
make bootstrap    # Initialize directories
make up           # Start Docker services
make down         # Stop services

Archiving

# Web crawling
make crawl URL="https://example.com" CFG="conf/browsertrix/crawl.yml"

# Media download
make media URL="https://youtube.com/watch?v=abc"
make media FILE="urls.txt"

# Cryptographic sealing
make seal         # hash → merkle → sign → timestamp
make verify       # Verify all proofs

# Content redaction
make redact       # Create public-safe copies

Development

make test         # Run unit tests
make lint         # Code quality checks
make format       # Auto-format code
make docs         # Generate documentation

Configuration

Environment Variables

Copy .env.example to .env and customize:

cp .env.example .env
# Edit .env with your settings

Crawl Configuration

Edit conf/browsertrix/crawl.yml:

workers: 2
limit: 50
behaviors:
  - autoplay
  - autofetch
exclude:
  - "*/ads/*"
  - "*/tracking/*"

Policy Configuration

Edit conf/policy.yaml for capture policies:

content_policy:
  max_file_size: 104857600  # 100MB
  exclude_patterns:
    - "*/private/*"
    - "*/admin/*"
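
The policy fields above can be illustrated with a small sketch. Note this is a hypothetical helper, not the toolkit's actual policy engine; the pattern-matching rules in `src/webarchive` may differ.

```python
from fnmatch import fnmatch

# Mirrors the conf/policy.yaml example above (hypothetical helper).
EXCLUDE_PATTERNS = ["*/private/*", "*/admin/*"]
MAX_FILE_SIZE = 104_857_600  # 100 MB

def should_capture(url: str, size_bytes: int) -> bool:
    """Return True if a resource passes the content policy."""
    if size_bytes > MAX_FILE_SIZE:
        return False
    return not any(fnmatch(url, pat) for pat in EXCLUDE_PATTERNS)

print(should_capture("https://example.com/public/page.html", 1024))    # True
print(should_capture("https://example.com/private/notes.html", 1024))  # False
```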

Cryptographic Verification

The toolkit provides multiple layers of integrity protection:

  1. File Hashes: SHA-256 of all archived content
  2. Merkle Trees: Deterministic tree structure
  3. Digital Signatures: GPG signatures on attestations
  4. RFC 3161 Timestamps: Trusted timestamp authority
  5. OpenTimestamps: Blockchain-based timestamps
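
Layers 1 and 2 can be sketched as follows. This is an illustrative Merkle construction only; the toolkit's actual leaf encoding and odd-node handling live in `src/webarchive` and may differ, so use `webarchive.merkle` for real verification.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def merkle_root(leaf_hashes: list[str]) -> str:
    """Pairwise-hash sorted leaves until a single root remains."""
    level = sorted(leaf_hashes)          # deterministic ordering
    while len(level) > 1:
        if len(level) % 2:               # duplicate the last node if odd
            level.append(level[-1])
        level = [sha256_hex((level[i] + level[i + 1]).encode())
                 for i in range(0, len(level), 2)]
    return level[0]

leaves = [sha256_hex(b"file-a"), sha256_hex(b"file-b"), sha256_hex(b"file-c")]
print(merkle_root(leaves))  # 64-char hex root
```

Because the leaves are sorted before pairing, the same set of file hashes always yields the same root regardless of input order.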

Verification Process

# Comprehensive verification
make verify

# Manual verification steps
python -m webarchive.verify --base . --schema conf/attest.schema.json
python -m webarchive.merkle data/proofs/files.sha256 --root data/proofs/merkle/root.txt
gpg --verify data/proofs/ATT.yaml.sig data/proofs/ATT.yaml

Docker Services

The system uses Docker Compose with multiple profiles:

# Web archiving services
docker compose -f docker/compose.yml up -d --profile archiving

# Replay services
docker compose -f docker/compose.yml up -d --profile replay

# Time synchronization
docker compose -f docker/compose.yml up -d --profile timeserver

Services Included

  • browsertrix: Web crawler
  • pywb: Archive replay
  • chrony: NTS time synchronization
  • postgres: Session database (optional)

Python API

from webarchive import merkle, attest, verify

# Build merkle tree
tree = merkle.build_merkle_from_file_hashes(Path("files.sha256"))
print(f"Root: {tree.get_root()}")

# Generate attestation
attestation = attest.build_attestation(
    session_name="my-session",
    base_path=Path("."),
    notes="Example archiving session"
)

# Verify integrity
report = verify.run_comprehensive_verification(Path("."))
print(f"Status: {report['overall_status']}")

Advanced Features

Time-Disciplined Archiving

  • NTS (Network Time Security) for authenticated time
  • Chrony daemon with multiple time sources
  • Persistent time logs for audit trails

Privacy and Redaction

  • Automatic header redaction (cookies, auth tokens)
  • URL parameter sanitization
  • Configurable content filtering
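
URL parameter sanitization can be sketched with the standard library. The parameter list below is a hypothetical example; the toolkit's real redaction rules are applied by `make redact` and may cover different names.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical set of sensitive parameter names (illustration only).
SENSITIVE_PARAMS = {"token", "session", "auth", "api_key"}

def sanitize_url(url: str) -> str:
    """Drop query parameters whose names look sensitive."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
            if k.lower() not in SENSITIVE_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))

print(sanitize_url("https://example.com/page?id=7&token=secret"))
# https://example.com/page?id=7
```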

Reproducible Builds

  • Pinned Docker images
  • Deterministic file ordering
  • Version-controlled configurations

Security Considerations

  • Never archive authenticated content without permission
  • Review robots.txt and terms of service
  • Use redaction for public releases
  • Verify timestamps independently
  • Store private keys securely

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Run tests: make test
  4. Check code quality: make lint
  5. Submit a pull request

License

MIT License - see LICENSE file.

Support

  • GitHub Issues: Bug reports and feature requests
  • Documentation: docs/ directory
  • Examples: examples/ directory

Note: This is experimental software. Test thoroughly before using in production environments.
