AI-powered search tool for the Epstein files, making it easier to find relevant information across thousands of documents, emails, photos, and records released by the DOJ, House Oversight Committee, and other sources.
- Fuzzy, misspelling-tolerant search — searching for "Lesly Goff" finds "Lesley Groff"
- AI document analysis — a local LLM (via Ollama) reads every document, scores it for relevance, extracts entities, and discovers new categories automatically
- Relevance scoring — AI classifies documents by topic (trafficking, blackmail, financial crime, intelligence, etc.) and deprioritises irrelevant content (e.g. music newsletters)
- Named Entity Recognition — extracts people, organisations, locations, and their roles (e.g. "Virginia Giuffre — Victim", "Prince Andrew — Participant")
- Entity relationship graph — 1,000+ connections from the archive (Epstein-Maxwell strength 421, Prince Andrew 50, etc.)
- Email threading & deduplication — groups emails into conversations in chronological order; removes duplicates
- Context linking — shows related documents, timeline neighbours, and shared entities for every document
- Multiple search modes — by name, by date, by category, full-text, or random interesting documents
- Live search fallback — queries the Epstein Document Archive API (207K+ documents) when local database is empty
- Age verification gate — visitors must confirm their age before any content is shown
- epsteininvestigation.org — 207K+ documents via public API (live search, entity graph, flight logs)
- justice.gov/epstein — Official DOJ releases
- jmail.world — Jmail suite (emails, flights, photos, drive)
- Hugging Face: tensonaut/EPSTEIN_FILES_20K — 25,800 OCR'd documents from the Nov 2025 House Oversight Committee release (requires HF auth)
- DocETL Epstein Email Explorer — related project analysing 2,322 emails
- Local PDF files you supply
```sh
# Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Install Ollama (for AI analysis)
# Download from https://ollama.ai, then:
ollama pull llama3.1:8b   # or any model — the app auto-detects the best available

# Start the app
./epstein.sh start
```

The epstein.sh script runs the app as a background daemon:
```sh
./epstein.sh start     # Start in the background, logs to data/epstein.log
./epstein.sh stop      # Stop the daemon (and any orphaned processes)
./epstein.sh restart   # Stop then start
./epstein.sh status    # Show PID and uptime, or "Not running"
./epstein.sh log       # Tail the log file (Ctrl-C to stop watching)
```

The app runs on http://localhost:5555 by default (see Configuration to change it).
For interactive/debug mode (logs to terminal, auto-reloads on code changes):
```sh
FLASK_DEBUG=1 python run.py
```

- Open http://localhost:5555 and click "Yes" on the age verification
- Go to Admin and click Import Archive CSVs — imports 96 entities, 1,000 relationships, and 55 flight records (instant)
- Click Import HF Dataset — bulk-imports document metadata from the archive API
- Click Start AI Worker — begins analysing documents through your local LLM; categories appear on the home page as they're discovered
The background AI worker runs while the web UI is live. It:
- Imports structured data — entities, relationships, flight logs from epsteininvestigation.org
- Analyses documents — sends each through a local LLM (Ollama) which returns a relevance score, categories, entities with roles, and a plain-language summary
- Discovers new categories — every 15 documents, asks the LLM to identify themes emerging across the batch
- Updates the home page — new categories appear in real time under "AI-Discovered"
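The analysis round-trip can be sketched with the standard library alone. This is a minimal illustration, not the app's actual code: the prompt wording, the function names, and the JSON fields are assumptions; only Ollama's real /api/generate endpoint is taken as given.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"

# Hypothetical prompt; the real system prompt lives in the app itself.
PROMPT = (
    "Score this document's relevance to the investigation from 0 to 1, "
    "list its categories and named entities with roles, and summarise it. "
    "Reply with JSON only.\n\nDocument:\n{text}"
)

def parse_analysis(raw: str) -> dict:
    """Extract the JSON object from a model reply that may include extra text."""
    start, end = raw.find("{"), raw.rfind("}")
    return json.loads(raw[start:end + 1])

def analyse(text: str, model: str = "llama3.1:8b") -> dict:
    """One round-trip through Ollama's /api/generate endpoint."""
    req = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(
            {"model": model, "prompt": PROMPT.format(text=text), "stream": False}
        ).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        reply = json.load(resp)["response"]
    return parse_analysis(reply)
```

The worker then stores the returned score, categories, entities, and summary against the document row.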
The system prompt encodes what "relevant" means: trafficking, child exploitation, blackmail, intelligence services, financial crime, corruption. It understands that there is no such thing as "child prostitution" and that a music newsletter is not what the public is interested in.
Hardware requirements:
- Mac M4 Pro (64 GB): runs qwen2.5:32b comfortably (~35s per document)
- Any machine with Ollama: auto-detects the largest available model
- Data centre: set OLLAMA_MODEL=llama3.1:70b for higher-quality analysis
```sh
# Ingest PDFs from a local directory
python ingest.py local /path/to/pdf/folder

# Ingest from DOJ website (downloads PDFs automatically)
python ingest.py doj

# Build email threads after ingestion
python ingest.py threads

# Check stats
python ingest.py stats
```

Or use the Admin page buttons to import from the archive API and Hugging Face.
The app runs in Docker, which works on Ubuntu 16.04 and above.
```sh
# On your server:
git clone <this-repo> /opt/epstein-files
cd /opt/epstein-files
sudo bash scripts/setup.sh
```

This will:
- Install Docker and Docker Compose if not present
- Build the Docker image (Python 3.11 with all ML dependencies)
- Start the application on port 5555
For AI analysis on the server, install Ollama separately and pull a model:
```sh
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.1:70b   # larger model for data centre hardware
```

```sh
# Copy PDFs to the data directory
scp -r ./pdfs/ user@server:/opt/epstein-files/data/pdfs/

# Run ingestion inside the container
docker-compose exec web python ingest.py local /app/data/pdfs
docker-compose exec web python ingest.py threads
```

The system handles misspellings generically — RapidFuzz generates plausible variants for any word (repeated-character collapse, phonetic substitutions, transpositions, deletions). "Lesly Goff" finds "Lesley Groff", "geoffrey epstien" finds "Jeffrey Epstein".
Multi-word queries match adjacent words only: "nick lees" finds "Nick; Lees" but not "Nick is going to the party for Cathy Lees". Quotes are stripped automatically.
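The variant-generation idea can be illustrated with a stdlib-only toy (the app itself uses RapidFuzz; this sketch only shows three of the variant classes mentioned above — repeated-character collapse, deletions, and transpositions):

```python
def variants(word: str) -> set[str]:
    """Toy generator of plausible misspelling variants for one word."""
    w = word.lower()
    out = {w}
    # Collapse runs of repeated characters: "lessley" -> "lesley"
    out.add("".join(c for i, c in enumerate(w) if i == 0 or c != w[i - 1]))
    # Single-character deletions: "groff" -> "goff", "roff", ...
    out.update(w[:i] + w[i + 1:] for i in range(len(w)))
    # Adjacent transpositions: "epstien" -> "epstein"
    out.update(w[:i] + w[i + 1] + w[i] + w[i + 2:] for i in range(len(w) - 1))
    return out
```

Generating variants of the query terms and matching each against the index is what lets "geoffrey epstien" reach "Jeffrey Epstein" without any hand-maintained alias list.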
Every AI-analysed document gets a 0–1 relevance score. A deposition about trafficking scores 0.95; a routine legal letter scores 0.1.
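Downstream, ranking and filtering on that score is straightforward; a minimal sketch (the "relevance" field name and the 0.5 threshold are illustrative assumptions):

```python
def rank_by_relevance(docs: list[dict], threshold: float = 0.5) -> list[dict]:
    """Drop documents below the threshold, return the rest most relevant first."""
    kept = [d for d in docs if d.get("relevance", 0.0) >= threshold]
    return sorted(kept, key=lambda d: d["relevance"], reverse=True)
```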
1,000+ connections imported from the archive, queryable via API:
- Epstein ↔ Maxwell: associate (strength 421)
- Epstein ↔ Prince Andrew: social-associate (strength 50)
- Epstein ↔ Donald Trump: social-associate (strength 46)
- Maxwell ↔ Virginia Giuffre: accused-by (strength 36)
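The graph itself is just weighted, typed edges; a small sketch of how the connections above could be held in memory and queried (full names and helper functions are illustrative, not the app's schema):

```python
from collections import defaultdict

# Edges as listed above: (entity A, entity B, relationship type, strength).
EDGES = [
    ("Jeffrey Epstein", "Ghislaine Maxwell", "associate", 421),
    ("Jeffrey Epstein", "Prince Andrew", "social-associate", 50),
    ("Jeffrey Epstein", "Donald Trump", "social-associate", 46),
    ("Ghislaine Maxwell", "Virginia Giuffre", "accused-by", 36),
]

def build_graph(edges):
    """Undirected adjacency list: entity -> [(other, relationship, strength)]."""
    graph = defaultdict(list)
    for a, b, rel, strength in edges:
        graph[a].append((b, rel, strength))
        graph[b].append((a, rel, strength))
    return graph

def strongest(graph, entity, n=3):
    """Top-n connections for an entity, strongest first."""
    return sorted(graph[entity], key=lambda edge: -edge[2])[:n]
```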
Emails with matching subjects are grouped into chronological threads. Embedded/quoted emails are detected so you see Email A, then Email B (without Email A repeated inside it).
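Subject-based threading amounts to normalising away reply/forward prefixes and sorting each group by date; a minimal sketch under those assumptions (the dict shape is illustrative):

```python
import re
from collections import defaultdict

def normalise_subject(subject: str) -> str:
    """Strip leading Re:/Fwd: prefixes so replies share one thread key."""
    return re.sub(r"^\s*((re|fwd?)\s*:\s*)+", "", subject, flags=re.I).strip().lower()

def thread_emails(emails):
    """Group emails by normalised subject; sort each thread chronologically.
    Each email is a dict with 'subject' and a sortable 'date' key."""
    threads = defaultdict(list)
    for mail in emails:
        threads[normalise_subject(mail["subject"])].append(mail)
    return {key: sorted(group, key=lambda m: m["date"])
            for key, group in threads.items()}
```

Quoted-email detection (so Email A isn't shown again inside Email B) sits on top of this grouping.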
All search functionality is also available via JSON API:
```
GET  /api/search?q=trafficking
GET  /api/search/name?q=Lesley+Groff
GET  /api/search/date?year=2019&month=6
GET  /api/search/category/trafficking
GET  /api/document/123
GET  /api/random
GET  /api/categories
GET  /api/entities?type=PERSON
GET  /api/timeline
GET  /api/stats
GET  /api/data/relationships?entity=Epstein
GET  /api/data/flights?passenger=Gates
GET  /api/worker/status
GET  /api/ai/status
POST /api/worker/start
POST /api/worker/stop
POST /api/ingest {"source": "huggingface"}
POST /api/ingest {"source": "archive_csvs"}
```
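A tiny client helper shows how these endpoints compose; the helper name is an assumption, and the base URL is the default port from Configuration:

```python
from urllib.parse import urlencode

BASE = "http://localhost:5555"

def api_url(path: str, **params) -> str:
    """Build a request URL for one of the JSON API endpoints."""
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE}{path}{query}"

# Fetching results with the stdlib (requires the app to be running):
# import json, urllib.request
# with urllib.request.urlopen(api_url("/api/search", q="trafficking")) as r:
#     results = json.load(r)
```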
All settings are in config.py and can be overridden with environment variables. Set them in your shell, in a .env file (for Docker Compose), or export them before running ./epstein.sh start.
| Variable | Default | Description |
|---|---|---|
| PORT | 5555 | Web server port |
| FLASK_DEBUG | 0 | 1 for debug mode (auto-reload, verbose logs) |
| SECRET_KEY | random | Flask session secret (set in production) |
| DATABASE_URL | sqlite:///data/epstein.db | Database connection string |
| OLLAMA_URL | http://localhost:11434 | Ollama API endpoint |
| OLLAMA_MODEL | (auto-detect) | Force a specific model, e.g. llama3.1:70b |
| ARCHIVE_API_URL | https://www.epsteininvestigation.org/api/v1 | External document archive API |
| DOJ_BASE_URL | https://www.justice.gov/epstein | DOJ document source |
| JMAIL_BASE_URL | https://jmail.world | Jmail archive source |
| RESULTS_PER_PAGE | 25 | Search results per page |
| WORKER_ANALYSIS_DELAY | 2 | Seconds between AI analyses (be kind to Ollama) |
| WORKER_FETCH_DELAY | 1 | Seconds between API fetches |
| WORKER_FETCH_BATCH | 20 | Documents per API fetch |
| CATEGORY_DISCOVERY_BATCH | 15 | Docs analysed before running category discovery |
| BULK_IMPORT_MAX_DOCS | 5000 | Max documents for bulk API import |
| GUNICORN_WORKERS | 4 | Gunicorn worker processes (Docker only) |
| GUNICORN_TIMEOUT | 120 | Gunicorn request timeout in seconds (Docker only) |
Example — run on port 8080 with a specific model:

```sh
PORT=8080 OLLAMA_MODEL=mistral:7b ./epstein.sh start
```

- Backend: Flask + SQLAlchemy + SQLite (with FTS5 for full-text search)
- AI: Ollama (local LLM) for document analysis, category discovery, entity extraction
- NLP: spaCy (NER), RapidFuzz (fuzzy matching), scikit-learn (TF-IDF topic discovery)
- Data: epsteininvestigation.org API, Hugging Face datasets, DOJ PDFs
- Deployment: Docker + Gunicorn, runs on Ubuntu 16.04+
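SQLite's FTS5 extension is what powers the full-text search; a minimal in-memory demonstration (the two-column schema is illustrative, not the app's actual schema):

```python
import sqlite3

# FTS5 virtual tables index their text columns for MATCH queries.
con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
con.execute("INSERT INTO docs VALUES (?, ?)",
            ("Deposition", "witness testimony about trafficking"))
rows = con.execute(
    "SELECT title FROM docs WHERE docs MATCH 'trafficking'").fetchall()
```

FTS5 keeps search inside the same SQLite file as the rest of the data, which is why no separate search server is needed.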
This tool is for research and public accountability purposes. All indexed documents are public records.