Transforming historical PDF archives into meaningful visual insights.
Wayback PDF Diff is a prototype designed to extend the Internet Archive's Wayback Machine capabilities. While the Wayback Machine excels at diffing HTML pages, comparing historical PDF versions often remains a manual, tedious task. This tool automates that process, providing a side-by-side, highlighted diff of any two PDF versions.
- 🌐 Wayback Archive Integration: Directly search and pull historical PDF versions using the Wayback Machine CDX API.
- 📄 Precision Extraction: Uses PyMuPDF (fitz) for robust text extraction while preserving document structure.
- 🔍 Granular Comparison: Combines sentence-level and word-level diffing to catch even the smallest changes.
- 📊 Statistical Dashboard: Real-time breakdown of internal changes (Additions, Deletions, and Modifications).
- 🌗 Responsive Web UI: A modern, clean interface for both archive research and local file comparison.
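As a rough illustration of the archive integration described above, the sketch below builds a CDX query for PDF captures and derives a playback URL for each one. The function names, the chosen filters, and the `limit` default are assumptions made for this example and are not taken from the project's source; only the public CDX endpoint and its documented query parameters are relied on.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_url(pdf_url, limit=50):
    """Build a CDX query that lists successful PDF captures of pdf_url."""
    params = [
        ("url", pdf_url),
        ("output", "json"),
        ("filter", "statuscode:200"),
        ("filter", "mimetype:application/pdf"),
        ("collapse", "digest"),  # skip adjacent captures with identical content
        ("limit", str(limit)),
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_rows(rows):
    """Turn the JSON array-of-arrays response into a list of dicts.

    The first row holds the field names; an empty response yields [].
    """
    if not rows:
        return []
    header, *captures = rows
    return [dict(zip(header, capture)) for capture in captures]


def playback_url(timestamp, original):
    """URL of the archived file; the id_ flag requests the unmodified capture."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


def list_pdf_captures(pdf_url, limit=50):
    """Fetch and parse captures (hypothetical helper; performs a network request)."""
    with urlopen(build_cdx_url(pdf_url, limit), timeout=30) as response:
        body = response.read().decode("utf-8")
    return parse_cdx_rows(json.loads(body)) if body.strip() else []
```

Each dict returned by `parse_cdx_rows` carries the `timestamp` and `original` fields needed by `playback_url`, so two captures can be picked and downloaded for comparison.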
This project is built with a focus on modularity and performance:
- Backend: Python / Flask
- PDF Engine: PyMuPDF - Chosen for its speed and accuracy in extracting encoded text.
- Diff Engine: difflib + custom regex logic for sentence sanitization.
- API Integration: Wayback Machine CDX API for metadata search, and playback for content retrieval.
Tip: The tool normalizes text (whitespace cleaning, line-break reconstruction) so that formatting changes don't clutter the actual content diff.
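The normalization step can be sketched roughly as follows. This is a minimal illustration using only the standard library; the function names and the exact regular expressions are assumptions for the example, not the project's actual rules.

```python
import re


def normalize_text(raw):
    """Clean extracted PDF text so layout noise does not show up as changes."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break: "compar-\nison" -> "comparison".
    # Trade-off: a real compound split at its hyphen is merged as well.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Blank lines separate paragraphs; mark them before flattening line breaks.
    text = re.sub(r"\n\s*\n", "\u0000", text)
    # A single newline inside a paragraph is only a wrapped line.
    text = text.replace("\n", " ")
    # Collapse runs of spaces and tabs.
    text = re.sub(r"[ \t]+", " ", text)
    paragraphs = [p.strip() for p in text.split("\u0000")]
    return "\n\n".join(p for p in paragraphs if p)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]
```

Running both versions of a document through the same normalization before diffing means that a re-wrapped paragraph compares as equal, while an edited sentence still stands out. The splitter is deliberately simple and will mis-split abbreviations such as "e.g. this".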
- Python 3.8 or higher
- pip
- Clone the repository:
  git clone https://github.com/Princeg0210/Wayback-pdf-diff.git
  cd Wayback-pdf-diff
- Install dependencies:
  pip install -r requirements.txt
- Run the application:
  python app.py
- Access the tool: open your browser and navigate to http://127.0.0.1:5001.
- Archive Mode: Enter a URL and browse through historical versions fetched directly from the Internet Archive.
- Local Mode: Upload two PDF files from your machine.
- Analyze: Click "Compare" to generate the diff.
- Insights: Interpret the colors:
  - 🟢 Green: Added Content
  - 🔴 Red: Removed Content
  - 🟡 Yellow: Modified/Changed
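One plausible way to arrive at these three categories is to map difflib opcodes onto them, as in the sketch below. The mapping (insert to added, delete to removed, replace to modified), the function names, and the color strings are assumptions for illustration; the project's own classification logic may differ.

```python
import difflib

COLORS = {"added": "green", "removed": "red", "modified": "yellow"}
KIND_BY_TAG = {"insert": "added", "delete": "removed", "replace": "modified"}


def classify_changes(old_sentences, new_sentences):
    """Label sentence-level differences as added, removed, or modified."""
    matcher = difflib.SequenceMatcher(a=old_sentences, b=new_sentences, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        kind = KIND_BY_TAG[tag]
        changes.append({
            "kind": kind,
            "color": COLORS[kind],
            "old": old_sentences[i1:i2],
            "new": new_sentences[j1:j2],
        })
    return changes


def word_diff(old_sentence, new_sentence):
    """Word-level detail for a modified sentence, as (tag, old_words, new_words)."""
    old_words, new_words = old_sentence.split(), new_sentence.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def summarize(changes):
    """Count changes per category, e.g. for a statistics panel."""
    counts = {kind: 0 for kind in COLORS}
    for change in changes:
        counts[change["kind"]] += 1
    return counts
```

A sentence-level pass finds which sentences differ, and `word_diff` can then be run on each modified pair to highlight only the words that changed.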
- OCR Support: Integrate Tesseract for scanned/image-based PDFs.
- Visual Diff: Image-based (pixel-to-pixel) comparison for layout-heavy documents.
- Batch Processing: Compare multiple versions at once.
- Export Options: Save diff reports as PDF or JSON.
Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Prince Gupta - @Princeg0210
Project Link: https://github.com/Princeg0210/Wayback-pdf-diff



