Turn any document dump into a searchable evidence database.
Built by the team behind epstein-data.com — where we turned the 218GB DOJ Epstein file release into a fully searchable, entity-linked, citation-backed research database.
This project is licensed under the PolyForm Noncommercial License 1.0.0.
- Noncommercial use is allowed under the terms in LICENSE.
- Commercial use requires a separate commercial license from the project owner.
- Required attribution notice is in NOTICE.
- Third-party dependency notices are in THIRD_PARTY_NOTICES.md.
pip install -e ".[pymupdf,nlp]"
python -m spacy download en_core_web_sm

# Point at a folder of PDFs, get a searchable database
casestack ingest ./my-documents --name "City Council FOIA"
# Serve it locally
casestack serve
# Check status
casestack status

Copy case.yaml.example to case.yaml and customize. See the example for all options.
- OCR — Extract text from PDFs (Docling or PyMuPDF)
- Entity Extraction — Find people, orgs, dates, money, phone numbers (spaCy NER)
- Deduplication — Identify duplicate documents (content hash + fuzzy matching)
- Export — SQLite database with FTS5 full-text search
- Serve — Datasette web interface with search, filtering, and AI Q&A
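
To make the deduplication and export stages above more concrete, here is a minimal, standard-library sketch of the two ideas: content hashing plus fuzzy matching to drop duplicates, then an SQLite FTS5 table for full-text search. It is not CaseStack's actual code. The function names (`deduplicate`, `export_fts`, `search`), the `documents` table layout, the 0.9 similarity threshold, and the use of `difflib` for fuzzy matching are all assumptions made for illustration, and it assumes your Python's SQLite build includes FTS5.

```python
from __future__ import annotations

import hashlib
import sqlite3
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def content_hash(text: str) -> str:
    """SHA-256 of the normalized text, used to catch exact duplicates."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def deduplicate(docs: dict[str, str], threshold: float = 0.9) -> dict[str, str]:
    """Return {duplicate_id: original_id} (hypothetical helper).

    Exact duplicates are found by hash; near duplicates by a pairwise
    similarity ratio, which is quadratic and only suitable for small sets.
    """
    seen_hashes: dict[str, str] = {}  # content hash -> id of first doc seen
    kept: dict[str, str] = {}         # id -> normalized text of kept docs
    duplicates: dict[str, str] = {}
    for doc_id, text in docs.items():
        digest = content_hash(text)
        if digest in seen_hashes:
            duplicates[doc_id] = seen_hashes[digest]
            continue
        norm = normalize(text)
        match = next(
            (other_id for other_id, other in kept.items()
             if SequenceMatcher(None, norm, other).ratio() >= threshold),
            None,
        )
        if match is not None:
            duplicates[doc_id] = match
            continue
        seen_hashes[digest] = doc_id
        kept[doc_id] = norm
    return duplicates


def export_fts(docs: dict[str, str], db_path: str = ":memory:") -> sqlite3.Connection:
    """Write documents into an FTS5 virtual table (assumed schema)."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE VIRTUAL TABLE documents USING fts5(doc_id UNINDEXED, body)")
    conn.executemany(
        "INSERT INTO documents (doc_id, body) VALUES (?, ?)", list(docs.items())
    )
    conn.commit()
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[str]:
    """Return matching document ids, best match first."""
    rows = conn.execute(
        "SELECT doc_id FROM documents WHERE documents MATCH ? ORDER BY rank",
        (query,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    docs = {
        "a.pdf": "The council approved the budget on March 3.",
        "b.pdf": "The  council approved the budget on March 3.",  # same after normalizing
        "c.pdf": "The council approved the budget on March 4.",   # near duplicate
        "d.pdf": "Zoning variance request for 12 Elm Street.",
    }
    dupes = deduplicate(docs)
    unique = {k: v for k, v in docs.items() if k not in dupes}
    conn = export_fts(unique)
    print(dupes)
    print(search(conn, "budget"))
```

A database produced this way is an ordinary SQLite file, so when written to disk instead of `:memory:` it can be opened with any SQLite client. For large collections, the pairwise comparison would need to be replaced by something that scales better, such as MinHash or shingling.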
Pre-configured case files for known document sets:
- presets/epstein.yaml: DOJ Jeffrey Epstein File Release (218GB, 1.38M PDFs)