Metadata Scrubber Tool

CarterPerez-dev edited this page Feb 11, 2026 · 1 revision

Privacy-focused CLI tool for stripping metadata from images, PDFs, and Office documents.

Overview

A command-line tool that removes privacy-sensitive metadata from files: GPS coordinates, camera serial numbers, author identities, timestamps, and revision history. It supports JPEG, PNG, PDF, Word, Excel, and PowerPoint formats, with batch processing and before/after verification.

Status: Complete | Difficulty: Beginner

Legal Disclaimer

This tool is for privacy protection and authorized security testing only. Metadata scrubbing is a legitimate privacy practice used by journalists, whistleblowers, and security researchers.

Tech Stack

| Technology  | Version | Purpose                              |
|-------------|---------|--------------------------------------|
| Python      | 3.10+   | Modern type hints, pattern matching  |
| Typer       | -       | CLI framework                        |
| Rich        | -       | Terminal UI, progress bars, tables   |
| Pillow      | -       | Image EXIF handling                  |
| openpyxl    | -       | Excel metadata                       |
| python-pptx | -       | PowerPoint metadata                  |
| python-docx | -       | Word document metadata               |
| PyPDF2      | -       | PDF metadata                         |

Features

Core Functionality

  • Read metadata from any supported file
  • Scrub (remove) metadata while preserving file functionality
  • Verify before/after metadata states with colored comparison
  • Batch processing with ThreadPoolExecutor for concurrent operations
  • Recursive directory processing with extension filtering
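
The last two items can be sketched with the standard library alone. This is a minimal illustration, not the project's actual BatchProcessor: `collect_files`, `scrub_one`, and `scrub_batch` are hypothetical names, and `scrub_one` is a stand-in for a real format handler.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def collect_files(root: Path, extensions: set[str], recursive: bool = True) -> list[Path]:
    """Return files under root whose suffix (without the dot) is in extensions."""
    pattern = "**/*" if recursive else "*"
    wanted = {ext.lower().lstrip(".") for ext in extensions}
    return sorted(
        p for p in root.glob(pattern)
        if p.is_file() and p.suffix.lower().lstrip(".") in wanted
    )


def scrub_one(path: Path) -> str:
    """Hypothetical stand-in for a real handler; it only reports the file name."""
    return f"scrubbed {path.name}"


def scrub_batch(paths: list[Path], max_workers: int = 4) -> dict[Path, str]:
    """Run scrub_one over paths concurrently, recording a result or error per file."""
    results: dict[Path, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scrub_one, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:  # one bad file should not abort the batch
                results[path] = f"failed: {exc}"
    return results
```

Threads suit this workload because most of the time is spent on file I/O. Capturing exceptions per file lets the batch finish and report failures at the end instead of stopping at the first unreadable file.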

Supported Formats

| Format             | Metadata Removed                                        |
|--------------------|---------------------------------------------------------|
| JPEG               | EXIF (GPS, camera model, timestamps, serial numbers)    |
| PNG                | tEXt chunks, EXIF data                                  |
| PDF                | Info dictionary (author, creator, timestamps)           |
| Word (.docx)       | Document properties (author, company, revision)         |
| Excel (.xlsx)      | Workbook properties                                     |
| PowerPoint (.pptx) | Presentation properties                                 |
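
To make the PNG row concrete, the sketch below walks the chunk structure of a PNG file and drops the textual and EXIF chunks. It uses only the standard library and is not how the tool itself works (the tool relies on Pillow); the function names and the exact set of chunk types removed are assumptions chosen for illustration.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# Ancillary chunk types that carry text or EXIF metadata (illustrative set).
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"eXIf"}


def iter_chunks(data: bytes):
    """Yield (chunk_type, chunk_data) pairs from raw PNG bytes."""
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos < len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        yield chunk_type, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def make_chunk(chunk_type: bytes, chunk_data: bytes) -> bytes:
    """Serialize one chunk; the CRC covers the type and data fields."""
    crc = zlib.crc32(chunk_type + chunk_data) & 0xFFFFFFFF
    return (
        struct.pack(">I", len(chunk_data))
        + chunk_type
        + chunk_data
        + struct.pack(">I", crc)
    )


def strip_png_metadata(data: bytes) -> bytes:
    """Return PNG bytes with every chunk listed in METADATA_CHUNKS removed."""
    out = [PNG_SIGNATURE]
    for chunk_type, chunk_data in iter_chunks(data):
        if chunk_type not in METADATA_CHUNKS:
            out.append(make_chunk(chunk_type, chunk_data))
    return b"".join(out)
```

Because these chunks are ancillary, a decoder does not need them to render the image, which is why they can be dropped without touching the pixel data.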

Security Relevance

  • John McAfee (2012): Located via EXIF GPS data in a journalist's photo
  • Anonymous (2012): Members doxxed through PDF author metadata
  • File metadata can leak identity, location, organization, and edit history

Architecture

User Command (read / scrub / verify)
    ↓
main.py (Typer CLI)
    ↓
┌───────────────────────────────────────────────┐
│              MetadataFactory                   │
│  Routes files to handlers by extension         │
└───────────────────┬───────────────────────────┘
                    ↓
┌──────────┬────────────┬───────────┬───────────┐
│  Image   │    PDF     │   Word    │  Excel/   │
│  Handler │  Handler   │  Handler  │  PPT      │
│  JPEG/PNG│  PyPDF2    │  docx     │  Handlers │
└──────────┴────────────┴───────────┴───────────┘
                    ↓
         BatchProcessor (ThreadPoolExecutor)
                    ↓
         ReportGenerator (verification)
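
The routing step in the diagram can be sketched as a small registry keyed by file extension. This is one plausible shape, not the repository's code: the class and method names mirror the diagram but their signatures are assumptions, the handlers are stubs, and the `processed_` output prefix is inferred from the Quick Start example.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class UnsupportedFormatError(ValueError):
    """Raised when no handler is registered for a file extension."""


class MetadataHandler(ABC):
    """Hypothetical common interface for all format handlers."""

    def __init__(self, path: Path) -> None:
        self.path = path

    @abstractmethod
    def read(self) -> dict[str, str]:
        """Return the file's metadata as key/value pairs."""

    def scrub(self, output_dir: Path) -> Path:
        """Stub: only computes the output path; a real handler would write the file."""
        return output_dir / f"processed_{self.path.name}"


class ImageHandler(MetadataHandler):
    def read(self) -> dict[str, str]:
        return {}  # stub; a real handler would parse EXIF here


class PdfHandler(MetadataHandler):
    def read(self) -> dict[str, str]:
        return {}  # stub; a real handler would read the Info dictionary


class MetadataFactory:
    """Routes a file to a handler class based on its extension."""

    _registry: dict[str, type[MetadataHandler]] = {
        ".jpg": ImageHandler,
        ".jpeg": ImageHandler,
        ".png": ImageHandler,
        ".pdf": PdfHandler,
    }

    @classmethod
    def get_handler(cls, path: Path) -> MetadataHandler:
        suffix = path.suffix.lower()
        handler_cls = cls._registry.get(suffix)
        if handler_cls is None:
            raise UnsupportedFormatError(f"unsupported file type: {suffix or path.name}")
        return handler_cls(path)
```

With this layout, supporting a new format means writing one handler class and adding one registry entry; the CLI commands and batch layer stay unchanged.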

Quick Start

cd PROJECTS/beginner/metadata-scrubber-tool

# Install dependencies
uv sync

# Read metadata from a file
uv run mst read photo.jpg

# Scrub metadata from a single file
uv run mst scrub photo.jpg --output ./cleaned

# Process an entire directory recursively
uv run mst scrub ./photos -r -ext jpg --output ./scrubbed

# Verify metadata was removed
uv run mst verify photo.jpg ./cleaned/processed_photo.jpg
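
Conceptually, the verify step compares the metadata of the original file with that of the processed copy. The function below is a hypothetical sketch of that comparison over plain dictionaries; it does not reproduce the tool's report format or its colored output.

```python
def compare_metadata(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Classify each field of the original metadata by what happened to it."""
    report: dict[str, list[str]] = {"removed": [], "changed": [], "preserved": []}
    for key in sorted(before):
        if key not in after:
            report["removed"].append(key)
        elif after[key] != before[key]:
            report["changed"].append(key)
        else:
            report["preserved"].append(key)
    return report


# Illustrative values only; real EXIF fields and values vary by camera.
before = {"GPSLatitude": "37.77", "Make": "ExampleCam", "Orientation": "1"}
after = {"Orientation": "1"}
report = compare_metadata(before, after)
```

In this sketch a successful scrub leaves every sensitive field under "removed"; fields that remain in "preserved" deserve a manual check before the file is shared.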

Project Structure

metadata-scrubber-tool/
├── src/
│   ├── commands/              # CLI command implementations
│   │   ├── read.py            # Display metadata
│   │   ├── scrub.py           # Remove metadata
│   │   └── verify.py          # Compare before/after
│   ├── core/                  # Format-specific processors
│   │   ├── jpeg_metadata.py   # EXIF handling for JPEG
│   │   └── png_metadata.py    # Textual chunk + EXIF for PNG
│   ├── services/              # Business logic
│   │   ├── metadata_handler.py    # Abstract base class
│   │   ├── image_handler.py       # Images
│   │   ├── pdf_handler.py         # PDFs
│   │   ├── excel_handler.py       # Excel
│   │   ├── powerpoint_handler.py  # PowerPoint
│   │   ├── worddoc_handler.py     # Word
│   │   ├── metadata_factory.py    # File routing
│   │   ├── batch_processor.py     # Concurrent processing
│   │   └── report_generator.py    # Verification reports
│   ├── utils/
│   └── main.py
└── tests/

Development

# Run tests
uv run pytest tests/ -v

# Linting
uv run ruff check .

# Format
uv run ruff format .

Source Code

View on GitHub
