Metadata Scrubber Tool
CarterPerez-dev edited this page Feb 11, 2026 · 1 revision
Privacy-focused CLI tool for stripping metadata from images, PDFs, and Office documents.
A command-line tool that removes privacy-sensitive metadata from files — GPS coordinates, camera serial numbers, author identities, timestamps, and revision history. Supports JPEG, PNG, PDF, Word, Excel, and PowerPoint formats with batch processing and verification.
Status: Complete | Difficulty: Beginner
This tool is for privacy protection and authorized security testing only. Metadata scrubbing is a legitimate privacy practice used by journalists, whistleblowers, and security researchers.
| Technology | Version | Purpose |
|---|---|---|
| Python | 3.10+ | Modern type hints, pattern matching |
| Typer | - | CLI framework |
| Rich | - | Terminal UI, progress bars, tables |
| Pillow | - | Image EXIF handling |
| openpyxl | - | Excel metadata |
| python-pptx | - | PowerPoint metadata |
| python-docx | - | Word document metadata |
| PyPDF2 | - | PDF metadata |
- Read metadata from any supported file
- Scrub (remove) metadata while preserving file functionality
- Verify before/after metadata states with colored comparison
- Batch processing with ThreadPoolExecutor for concurrent operations
- Recursive directory processing with extension filtering
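The recursive collection and concurrent batch steps listed above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the project's actual `batch_processor.py`; the names `collect_files` and `scrub_batch`, and the `worker` callable standing in for a per-file scrub, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable


def collect_files(root: Path, extension: str, recursive: bool = True) -> list[Path]:
    """Return files under root whose suffix matches extension (case-insensitive)."""
    wanted = "." + extension.lower().lstrip(".")
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(p for p in candidates if p.is_file() and p.suffix.lower() == wanted)


def scrub_batch(
    paths: list[Path],
    worker: Callable[[Path], object],  # hypothetical per-file scrub function
    max_workers: int = 4,
) -> dict[str, str]:
    """Run worker over paths concurrently and record a status per file."""
    results: dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(worker, p): p for p in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                future.result()
                results[str(path)] = "ok"
            except Exception as exc:  # one bad file should not abort the batch
                results[str(path)] = f"failed: {exc}"
    return results
```

Threads suit this workload because most of the time is spent on file I/O. Catching exceptions per future keeps a single corrupt file from stopping the rest of the batch.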
| Format | Metadata Removed |
|---|---|
| JPEG | EXIF (GPS, camera model, timestamps, serial numbers) |
| PNG | tEXt chunks, EXIF data |
| PDF | Info dictionary (author, creator, timestamps) |
| Word (.docx) | Document properties (author, company, revision) |
| Excel (.xlsx) | Workbook properties |
| PowerPoint (.pptx) | Presentation properties |
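To make the PNG row concrete, here is a hedged sketch of what removing textual and EXIF chunks involves at the byte level. The tool itself uses Pillow for images. This example swaps in a standard-library chunk walker purely for illustration, and the names `strip_png_metadata` and `METADATA_CHUNKS` are assumptions, not part of the project.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

# Ancillary chunks that carry text or EXIF data (assumed drop list for this sketch).
METADATA_CHUNKS = {b"tEXt", b"zTXt", b"iTXt", b"eXIf"}


def strip_png_metadata(data: bytes) -> bytes:
    """Return a copy of the PNG byte stream without text or EXIF chunks."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")

    out = bytearray(PNG_SIGNATURE)
    pos = len(PNG_SIGNATURE)
    while pos < len(data):
        # Each chunk is: 4-byte length, 4-byte type, payload, 4-byte CRC.
        if pos + 12 > len(data):
            raise ValueError("truncated PNG chunk")
        length, chunk_type = struct.unpack(">I4s", data[pos:pos + 8])
        end = pos + 12 + length
        if end > len(data):
            raise ValueError("truncated PNG chunk")
        if chunk_type not in METADATA_CHUNKS:
            out += data[pos:end]  # copy the chunk verbatim, CRC included
        pos = end
        if chunk_type == b"IEND":
            break  # ignore anything appended after the final chunk
    return bytes(out)
```

Because kept chunks are copied verbatim, their CRCs stay valid and the pixel data is untouched. Stopping at `IEND` also discards any bytes appended after the end of the image.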
- John McAfee (2012): Located via EXIF GPS coordinates embedded in a journalist's published photo
- Anonymous (2012): Members doxxed through PDF author metadata
- File metadata can leak identity, location, organization, and edit history
User Command (read / scrub / verify)
↓
main.py (Typer CLI)
↓
┌───────────────────────────────────────────────┐
│                MetadataFactory                │
│     Routes files to handlers by extension     │
└───────────────────┬───────────────────────────┘
↓
┌──────────┬────────────┬───────────┬───────────┐
│  Image   │    PDF     │   Word    │  Excel/   │
│ Handler  │  Handler   │  Handler  │    PPT    │
│ JPEG/PNG │   PyPDF2   │   docx    │ Handlers  │
└──────────┴────────────┴───────────┴───────────┘
↓
BatchProcessor (ThreadPoolExecutor)
↓
ReportGenerator (verification)
cd PROJECTS/beginner/metadata-scrubber-tool
# Install dependencies
uv sync
# Read metadata from a file
uv run mst read photo.jpg
# Scrub metadata from a single file
uv run mst scrub photo.jpg --output ./cleaned
# Process an entire directory recursively
uv run mst scrub ./photos -r -ext jpg --output ./scrubbed
# Verify metadata was removed
uv run mst verify photo.jpg ./cleaned/processed_photo.jpg
metadata-scrubber-tool/
├── src/
│ ├── commands/ # CLI command implementations
│ │ ├── read.py # Display metadata
│ │ ├── scrub.py # Remove metadata
│ │ └── verify.py # Compare before/after
│ ├── core/ # Format-specific processors
│ │ ├── jpeg_metadata.py # EXIF handling for JPEG
│ │ └── png_metadata.py # Textual chunk + EXIF for PNG
│ ├── services/ # Business logic
│ │ ├── metadata_handler.py # Abstract base class
│ │ ├── image_handler.py # Images
│ │ ├── pdf_handler.py # PDFs
│ │ ├── excel_handler.py # Excel
│ │ ├── powerpoint_handler.py # PowerPoint
│ │ ├── worddoc_handler.py # Word
│ │ ├── metadata_factory.py # File routing
│ │ ├── batch_processor.py # Concurrent processing
│ │ └── report_generator.py # Verification reports
│ ├── utils/
│ └── main.py
└── tests/
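As a rough illustration of the before/after comparison that `verify.py` and `report_generator.py` are described as performing, the sketch below classifies metadata keys from two plain dictionaries. The function name `compare_metadata` and the dictionary-based representation are assumptions. The actual tool renders its comparison with Rich.

```python
def compare_metadata(
    before: dict[str, str], after: dict[str, str]
) -> dict[str, list[str]]:
    """Classify metadata keys by what happened to them during scrubbing."""
    return {
        # Present in the original, absent from the scrubbed file.
        "removed": sorted(k for k in before if k not in after),
        # Still present with the same value: a potential leak.
        "preserved": sorted(k for k in before if k in after and before[k] == after[k]),
        # Still present but with a different value.
        "changed": sorted(k for k in before if k in after and before[k] != after[k]),
        # Newly introduced by the scrubbing step itself.
        "added": sorted(k for k in after if k not in before),
    }


def is_clean(report: dict[str, list[str]]) -> bool:
    """Treat a scrub as verified only if no original key survived unchanged."""
    return not report["preserved"]
```

A report with a non-empty `preserved` list is the case worth highlighting to the user, since those fields survived the scrub.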
# Run tests
uv run pytest tests/ -v
# Linting
uv run ruff check .
# Format
uv run ruff format .
©AngelaMos | CertGames.com | CarterPerez-dev | 2026