Open-source document-processing pipeline for the Epstein case files. It downloads, OCRs, extracts entities from, deduplicates, and exports 140,000+ documents from the DOJ releases.
This is the data-processing companion to epsteinexposed.com, the most comprehensive searchable database of the Epstein files.
# Install
pip install epstein-pipeline
# Download a dataset
epstein-pipeline download kaggle
# OCR documents
epstein-pipeline ocr ./raw-pdfs/ --output ./processed/
# Extract entities and link to known persons
epstein-pipeline extract-entities ./processed/ --output ./entities/
# Export for the website
epstein-pipeline export json ./processed/ --output ./export/

    Raw DOJ PDFs ──> OCR ──> Entity Extraction ──> Deduplication ──> Export
                      │              │                   │              │
                Docling (IBM)    spaCy NER           rapidfuzz   JSON/CSV/SQLite
                                 + 1,400+           fuzzy match
                               known persons

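The person-linking stage can be pictured with a minimal sketch. This is not the pipeline's actual code: it uses the standard library's `difflib` as a stand-in for rapidfuzz so that it runs without extra dependencies, and the names, the `link_person` helper, and the 0.85 threshold are all hypothetical.

```python
from __future__ import annotations

from difflib import SequenceMatcher

# Hypothetical placeholder entries; the real known-persons list has 1,400+ names.
KNOWN_PERSONS = ["Jane Example", "John Q. Sample", "Maria Placeholder"]


def normalize(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    kept = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(kept.split())


def link_person(mention: str, known: list[str], threshold: float = 0.85) -> str | None:
    """Return the closest known person, or None if no candidate reaches the threshold."""
    target = normalize(mention)
    best_name, best_score = None, 0.0
    for candidate in known:
        # difflib stands in for a rapidfuzz scorer here (assumption for illustration).
        score = SequenceMatcher(None, target, normalize(candidate)).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    for mention in ["jane exampel", "John Q Sample", "Completely Different"]:
        print(mention, "->", link_person(mention, KNOWN_PERSONS))
```

A threshold trades recall for precision: lowering it catches more OCR-mangled names but produces more of the wrong person matches that contributors are asked to report.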
| Step | Tool | Description |
|---|---|---|
| Download | Built-in | Fetch from DOJ, Kaggle, HuggingFace, Archive.org |
| OCR | Docling (IBM) | Extract text from PDFs with layout understanding |
| Entity Extraction | spaCy | Find person names, organizations, locations |
| Person Linking | rapidfuzz | Match names to 1,400+ known persons |
| Deduplication | rapidfuzz + content hashing | Find duplicate documents across sources |
| Validation | Pydantic | Schema validation, integrity checks |
| Export | Built-in | JSON (website), CSV (research), SQLite (queries) |
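The content-hashing half of the Deduplication step can be sketched as follows. This is an illustrative example rather than the pipeline's implementation: the `content_hash` and `find_duplicates` helpers and the document IDs are hypothetical, and it only catches documents whose text is identical after normalization. Near-duplicates (the part the table attributes to rapidfuzz) are not covered here.

```python
from __future__ import annotations

import hashlib
from collections import defaultdict


def content_hash(text: str) -> str:
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Group document IDs whose normalized text hashes to the same value."""
    groups: dict[str, list[str]] = defaultdict(list)
    for doc_id, text in docs.items():
        groups[content_hash(text)].append(doc_id)
    # Keep only hashes shared by more than one document.
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


if __name__ == "__main__":
    # Hypothetical IDs: the same page obtained from two different sources.
    docs = {
        "doj-001": "Flight log,  page 1",
        "kaggle-17": "flight log, page 1",
        "doj-002": "Deposition transcript",
    }
    print(find_duplicates(docs))
```

Normalizing before hashing means that copies differing only in case or spacing, a common result of OCR'ing the same scan twice, collapse into one group.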
# Base install
pip install epstein-pipeline

# With OCR extras
pip install "epstein-pipeline[ocr]"

# With NLP extras
pip install "epstein-pipeline[nlp]"
python -m spacy download en_core_web_sm

# Everything
pip install "epstein-pipeline[all]"
python -m spacy download en_core_web_sm

# Docker
docker compose run pipeline --help
docker compose run pipeline ocr ./raw-pdfs/ --output ./output/

epstein-pipeline --help # Show all commands
epstein-pipeline download doj --dataset 9 # Download DOJ dataset 9
epstein-pipeline download kaggle # Download Kaggle dataset
epstein-pipeline download huggingface # Download HuggingFace datasets
epstein-pipeline ocr ./pdfs/ -o ./out/ # OCR PDF files
epstein-pipeline extract-entities ./out/ -o ./e/ # Extract entities
epstein-pipeline dedup ./out/ -o report.json # Find duplicates
epstein-pipeline validate ./out/ # Validate data quality
epstein-pipeline export json ./out/ -o ./site/ # Export for website
epstein-pipeline export csv ./out/ -o docs.csv # Export as CSV
epstein-pipeline export sqlite ./out/ -o ep.db # Export as SQLite
epstein-pipeline stats ./out/ # Show statistics

We welcome contributions from everyone! See CONTRIBUTING.md for details.
No coding required:
- Report data quality issues (wrong person matches, duplicates)
- Suggest new data sources
- Review and verify processed data
- Improve documentation
Code contributions:
- Add new data source downloaders
- Improve entity extraction accuracy
- Add export formats
- Fix bugs
See DATA_SOURCES.md for all known public data sources.
See ARCHITECTURE.md for pipeline design details.
- epsteinexposed.com — The live website powered by this pipeline
- Epstein-Files — DOJ file mirrors and torrents
- Epstein-doc-explorer — Email graph explorer
- Epstein-research-data — Community research dataset
MIT License. See LICENSE.