Epstein Files Search

An AI-powered search tool for the Epstein files. It makes it easier to find relevant information across thousands of documents, emails, photos, and records released by the DOJ, the House Oversight Committee, and other sources.

What It Does

Fuzzy, misspelling-tolerant search — searching for "Lesly Goff" finds "Lesley Groff"
AI document analysis — a local LLM (via Ollama) reads every document, scores it for relevance, extracts entities, and discovers new categories automatically
Relevance scoring — AI classifies documents by topic (trafficking, blackmail, financial crime, intelligence, etc.) and deprioritises irrelevant content (e.g. music newsletters)
Named Entity Recognition — extracts people, organisations, locations, and their roles (e.g. "Virginia Giuffre — Victim", "Prince Andrew — Participant")
Entity relationship graph — 1,000+ connections from the archive (Epstein-Maxwell strength 421, Prince Andrew 50, etc.)
Email threading & deduplication — groups emails into conversations in chronological order; removes duplicates
Context linking — shows related documents, timeline neighbours, and shared entities for every document
Multiple search modes — by name, by date, by category, full-text, or random interesting documents
Live search fallback — queries the Epstein Document Archive API (207K+ documents) when local database is empty
Age verification gate — as required for this content

Data Sources

epsteininvestigation.org — 207K+ documents via public API (live search, entity graph, flight logs)
justice.gov/epstein — Official DOJ releases
jmail.world — Jmail suite (emails, flights, photos, drive)
Hugging Face: tensonaut/EPSTEIN_FILES_20K — 25,800 OCR'd documents from the Nov 2025 House Oversight Committee release (requires HF auth)
DocETL Epstein Email Explorer — related project analysing 2,322 emails
Local PDF files you supply

Quick Start

# Install dependencies
pip install -r requirements.txt
python -m spacy download en_core_web_sm

# Install Ollama (for AI analysis)
# Download from https://ollama.ai, then:
ollama pull llama3.1:8b    # or any model — the app auto-detects the best available

# Start the app
./epstein.sh start

Managing the App

The epstein.sh script runs the app as a background daemon:

./epstein.sh start     # Start in the background, logs to data/epstein.log
./epstein.sh stop      # Stop the daemon (and any orphaned processes)
./epstein.sh restart   # Stop then start
./epstein.sh status    # Show PID and uptime, or "Not running"
./epstein.sh log       # Tail the log file (Ctrl-C to stop watching)

The app runs on http://localhost:5555 by default (see Configuration to change it).

For interactive/debug mode (logs to terminal, auto-reloads on code changes):

FLASK_DEBUG=1 python run.py

First Run

Open http://localhost:5555 and click "Yes" on the age verification
Go to Admin and click Import Archive CSVs — imports 96 entities, 1,000 relationships, and 55 flight records (instant)
Click Import HF Dataset — bulk-imports document metadata from the archive API
Click Start AI Worker — begins analysing documents through your local LLM; categories appear on the home page as they're discovered

AI Analysis

The background AI worker runs while the web UI is live. It:

Imports structured data — entities, relationships, flight logs from epsteininvestigation.org
Analyses documents — sends each through a local LLM (Ollama) which returns a relevance score, categories, entities with roles, and a plain-language summary
Discovers new categories — every 15 documents, asks the LLM to identify themes emerging across the batch
Updates the home page — new categories appear in real time under "AI-Discovered"

The system prompt defines what "relevant" means: trafficking, child exploitation, blackmail, intelligence services, financial crime, and corruption. It also tells the model that there is no such thing as "child prostitution", and that content such as a music newsletter is not what the public is interested in.
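As a rough illustration of the analysis step, the sketch below builds a request for Ollama's /api/generate endpoint and validates the JSON reply. The prompt wording, the reply field names (relevance, categories, entities, summary), and the helper names are assumptions made for this example; the project's real prompt and schema are not reproduced in this README.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default from the Configuration section

# Hypothetical system prompt; the project's real prompt is not shown here.
SYSTEM_PROMPT = (
    "Score the document for relevance from 0 to 1 and reply with JSON "
    "containing: relevance, categories, entities, summary."
)

def build_payload(model, document_text):
    """Build the JSON body for Ollama's POST /api/generate."""
    return {
        "model": model,
        "system": SYSTEM_PROMPT,
        "prompt": document_text,
        "format": "json",  # ask Ollama to constrain the reply to valid JSON
        "stream": False,   # return one complete reply instead of chunks
    }

def parse_analysis(raw_reply):
    """Parse the model's JSON reply and clamp the score into the 0 to 1 range."""
    data = json.loads(raw_reply)
    score = float(data.get("relevance", 0.0))
    return {
        "relevance": min(1.0, max(0.0, score)),
        "categories": list(data.get("categories", [])),
        "entities": list(data.get("entities", [])),
        "summary": str(data.get("summary", "")),
    }

def analyse(model, document_text):
    """Send one document to a running Ollama server and return the parsed result."""
    body = json.dumps(build_payload(model, document_text)).encode("utf-8")
    request = urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        reply = json.loads(response.read())
    # Ollama returns the generated text in the "response" field.
    return parse_analysis(reply["response"])
```

Clamping the score guards against a model that returns a value outside the expected range. Calling analyse() requires an Ollama server reachable at OLLAMA_URL.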

Hardware requirements:

Mac M4 Pro (64 GB): runs qwen2.5:32b comfortably (~35s per document)
Any machine with Ollama: auto-detects the largest available model
Data centre: set OLLAMA_MODEL=llama3.1:70b for higher-quality analysis
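One way the auto-detection could work is to list the installed models through Ollama's /api/tags endpoint and pick the biggest. The sketch below assumes "largest" means largest on-disk size as reported by Ollama; the function names are hypothetical and the app's real selection rule may differ.

```python
import json
import urllib.request

def pick_largest(models):
    """Return the name of the model with the largest reported size, or None."""
    if not models:
        return None
    return max(models, key=lambda m: m.get("size", 0))["name"]

def detect_model(ollama_url="http://localhost:11434"):
    """Ask a running Ollama server which models are installed and pick one."""
    with urllib.request.urlopen(f"{ollama_url}/api/tags", timeout=10) as response:
        installed = json.loads(response.read()).get("models", [])
    return pick_largest(installed)
```

Setting OLLAMA_MODEL explicitly would bypass a step like this entirely.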

Ingesting Documents

# Ingest PDFs from a local directory
python ingest.py local /path/to/pdf/folder

# Ingest from DOJ website (downloads PDFs automatically)
python ingest.py doj

# Build email threads after ingestion
python ingest.py threads

# Check stats
python ingest.py stats

Or use the Admin page buttons to import from the archive API and Hugging Face.

Deploying to Ubuntu (Docker)

The app runs in Docker, which works on Ubuntu 16.04 and above.

# On your server:
git clone <this-repo> /opt/epstein-files
cd /opt/epstein-files
sudo bash scripts/setup.sh

This will:

Install Docker and Docker Compose if not present
Build the Docker image (Python 3.11 with all ML dependencies)
Start the application on port 5555

For AI analysis on the server, install Ollama separately and pull a model:

curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3.1:70b    # larger model for data centre hardware

Ingestion on the server

# Copy PDFs to the data directory
scp -r ./pdfs/ user@server:/opt/epstein-files/data/pdfs/

# Run ingestion inside the container
docker-compose exec web python ingest.py local /app/data/pdfs
docker-compose exec web python ingest.py threads

Search Features

Fuzzy Name Matching

The system handles misspellings generically — RapidFuzz generates plausible variants for any word (repeated character collapse, phonetic substitutions, transpositions, deletions). "Lesly Goff" finds "Lesley Groff", "geoffrey epstien" finds "Jeffrey Epstein".
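To show the general idea of misspelling-tolerant lookup, here is a minimal sketch that uses the standard library's difflib as a stand-in for RapidFuzz, so it runs without extra dependencies. It only does similarity scoring against a list of known names; it does not reproduce the variant generation described above, and the function name and cutoff are assumptions.

```python
import difflib

def best_name_match(query, known_names, cutoff=0.6):
    """Return the known name most similar to the query, or None if nothing is close."""
    # Compare case-insensitively, then map back to the original spelling.
    lowered = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None

names = ["Lesley Groff", "Jeffrey Epstein", "Virginia Giuffre"]
print(best_name_match("Lesly Goff", names))        # Lesley Groff
print(best_name_match("geoffrey epstien", names))  # Jeffrey Epstein
```

A higher cutoff gives fewer false positives at the cost of missing heavier misspellings.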

Proximity Search

Multi-word queries match adjacent words only: "nick lees" finds "Nick; Lees" but not "Nick is going to the party for Cathy Lees". Quotes are stripped automatically.
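The adjacency rule can be sketched as a comparison of consecutive word tokens. This is an illustrative version with exact token matching only (no fuzzy tolerance), and the function names are hypothetical rather than taken from the codebase.

```python
import re

def tokenize(text):
    """Lower-case the text and keep only alphanumeric word tokens.

    Punctuation, including quotes and semicolons, is discarded.
    """
    return re.findall(r"[a-z0-9]+", text.lower())

def adjacent_match(query, text):
    """Return True if the query words occur in the text as consecutive tokens."""
    wanted = tokenize(query)
    words = tokenize(text)
    if not wanted:
        return False
    span = len(wanted)
    return any(words[i:i + span] == wanted for i in range(len(words) - span + 1))

print(adjacent_match("nick lees", "Nick; Lees"))  # True
print(adjacent_match("nick lees", "Nick is going to the party for Cathy Lees"))  # False
```

Because the tokenizer drops punctuation, "Nick; Lees" counts as two adjacent words, and a quoted query behaves the same as an unquoted one.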

Relevance Scoring

Every AI-analysed document gets a 0–1 relevance score. A deposition about trafficking scores 0.95; a routine legal letter scores 0.1.

Entity Relationship Graph

1,000+ connections imported from the archive, queryable via API:

Epstein ↔ Maxwell: associate (strength 421)
Epstein ↔ Prince Andrew: social-associate (strength 50)
Epstein ↔ Donald Trump: social-associate (strength 46)
Maxwell ↔ Virginia Giuffre: accused-by (strength 36)
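For readers who want to work with the relationship data offline, the sketch below loads the four example connections listed above into an adjacency list and returns an entity's connections ordered by strength. The tuple layout and function names are assumptions for illustration, and edges are treated as undirected, which drops the direction implied by a label such as accused-by.

```python
from collections import defaultdict

# The four example rows from this README: (entity_a, entity_b, relation, strength).
RELATIONSHIPS = [
    ("Epstein", "Maxwell", "associate", 421),
    ("Epstein", "Prince Andrew", "social-associate", 50),
    ("Epstein", "Donald Trump", "social-associate", 46),
    ("Maxwell", "Virginia Giuffre", "accused-by", 36),
]

def build_graph(rows):
    """Index every relationship under both of its endpoints."""
    graph = defaultdict(list)
    for a, b, relation, strength in rows:
        graph[a].append((b, relation, strength))
        graph[b].append((a, relation, strength))
    return graph

def connections(graph, entity):
    """Return an entity's connections, strongest first."""
    return sorted(graph.get(entity, []), key=lambda edge: edge[2], reverse=True)

graph = build_graph(RELATIONSHIPS)
for other, relation, strength in connections(graph, "Epstein"):
    print(other, relation, strength)
```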

Email Threading

Emails with matching subjects are grouped into chronological threads. Embedded/quoted emails are detected so you see Email A, then Email B (without Email A repeated inside it).
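Subject-based threading can be sketched as follows: strip reply and forward prefixes, group by the normalised subject, and sort each group by send time. The dictionary keys (id, subject, sent) are hypothetical, and detection of quoted or embedded emails is not covered by this example.

```python
import re
from collections import defaultdict
from datetime import datetime

# Matches one or more leading "Re:", "Fw:" or "Fwd:" prefixes, in any case.
PREFIX = re.compile(r"^\s*((re|fw|fwd)\s*:\s*)+", re.IGNORECASE)

def normalise_subject(subject):
    """Drop reply/forward prefixes so related emails share one key."""
    return PREFIX.sub("", subject).strip().lower()

def build_threads(emails):
    """Group emails by normalised subject, each thread in chronological order."""
    threads = defaultdict(list)
    for email in emails:
        threads[normalise_subject(email["subject"])].append(email)
    for messages in threads.values():
        messages.sort(key=lambda e: e["sent"])
    return dict(threads)

emails = [
    {"id": 2, "subject": "RE: Dinner", "sent": datetime(2019, 6, 2, 9, 0)},
    {"id": 1, "subject": "Dinner", "sent": datetime(2019, 6, 1, 18, 30)},
    {"id": 3, "subject": "Fwd: Re: dinner", "sent": datetime(2019, 6, 3, 8, 15)},
]
print([e["id"] for e in build_threads(emails)["dinner"]])  # [1, 2, 3]
```

Grouping on subject alone can merge unrelated conversations that happen to share a subject line, so a real implementation may also want to compare participants or dates.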

API

All search functionality is also available via a JSON API:

GET /api/search?q=trafficking
GET /api/search/name?q=Lesley+Groff
GET /api/search/date?year=2019&month=6
GET /api/search/category/trafficking
GET /api/document/123
GET /api/random
GET /api/categories
GET /api/entities?type=PERSON
GET /api/timeline
GET /api/stats
GET /api/data/relationships?entity=Epstein
GET /api/data/flights?passenger=Gates
GET /api/worker/status
GET /api/ai/status
POST /api/worker/start
POST /api/worker/stop
POST /api/ingest  {"source": "huggingface"}
POST /api/ingest  {"source": "archive_csvs"}
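The endpoints above can be called from any HTTP client. Below is a small standard-library sketch for building request URLs and fetching JSON from a locally running instance. The helper names are hypothetical, and the shape of the returned JSON is not documented in this README, so the example does not assume any particular fields.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5555"  # default port from this README

def api_url(path, **params):
    """Build a full API URL, URL-encoding any query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{BASE_URL}{path}" + (f"?{query}" if query else "")

def get_json(path, **params):
    """GET an endpoint from a running instance and decode the JSON body."""
    with urllib.request.urlopen(api_url(path, **params), timeout=30) as response:
        return json.loads(response.read())

print(api_url("/api/search/name", q="Lesley Groff"))
# http://localhost:5555/api/search/name?q=Lesley+Groff
print(api_url("/api/search/date", year=2019, month=6))
# http://localhost:5555/api/search/date?year=2019&month=6
```

With the app running, get_json("/api/stats") would return whatever the stats endpoint reports.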

Configuration

All settings are in config.py and can be overridden with environment variables. Export them in your shell before running ./epstein.sh start, or put them in a .env file (for Docker Compose).

| Variable | Default | Description |
| --- | --- | --- |
| `PORT` | `5555` | Web server port |
| `FLASK_DEBUG` | `0` | `1` for debug mode (auto-reload, verbose logs) |
| `SECRET_KEY` | random | Flask session secret (set in production) |
| `DATABASE_URL` | `sqlite:///data/epstein.db` | Database connection string |
| `OLLAMA_URL` | `http://localhost:11434` | Ollama API endpoint |
| `OLLAMA_MODEL` | (auto-detect) | Force a specific model, e.g. `llama3.1:70b` |
| `ARCHIVE_API_URL` | `https://www.epsteininvestigation.org/api/v1` | External document archive API |
| `DOJ_BASE_URL` | `https://www.justice.gov/epstein` | DOJ document source |
| `JMAIL_BASE_URL` | `https://jmail.world` | Jmail archive source |
| `RESULTS_PER_PAGE` | `25` | Search results per page |
| `WORKER_ANALYSIS_DELAY` | `2` | Seconds between AI analyses (be kind to Ollama) |
| `WORKER_FETCH_DELAY` | `1` | Seconds between API fetches |
| `WORKER_FETCH_BATCH` | `20` | Documents per API fetch |
| `CATEGORY_DISCOVERY_BATCH` | `15` | Docs analysed before running category discovery |
| `BULK_IMPORT_MAX_DOCS` | `5000` | Max documents for bulk API import |
| `GUNICORN_WORKERS` | `4` | Gunicorn worker processes (Docker only) |
| `GUNICORN_TIMEOUT` | `120` | Gunicorn request timeout in seconds (Docker only) |

Example — run on port 8080 with a specific model:

PORT=8080 OLLAMA_MODEL=mistral:7b ./epstein.sh start
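The override pattern behind these settings can be sketched as a lookup that prefers the environment and falls back to a default. This is an assumption about how config.py might be organised, not a copy of it; the load_settings function and the lower-case keys are hypothetical, and only a few of the variables are shown.

```python
import os

# Defaults for a subset of the variables documented above.
DEFAULTS = {
    "PORT": "5555",
    "RESULTS_PER_PAGE": "25",
    "WORKER_ANALYSIS_DELAY": "2",
    "OLLAMA_URL": "http://localhost:11434",
}

def load_settings(environ=None):
    """Read settings from the environment, falling back to the defaults."""
    env = os.environ if environ is None else environ
    raw = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    return {
        "port": int(raw["PORT"]),
        "results_per_page": int(raw["RESULTS_PER_PAGE"]),
        "worker_analysis_delay": float(raw["WORKER_ANALYSIS_DELAY"]),
        "ollama_url": raw["OLLAMA_URL"],
        # An unset or empty OLLAMA_MODEL means "auto-detect".
        "ollama_model": env.get("OLLAMA_MODEL") or None,
    }

settings = load_settings({"PORT": "8080", "OLLAMA_MODEL": "mistral:7b"})
print(settings["port"], settings["ollama_model"])  # 8080 mistral:7b
```

Passing a plain dictionary instead of os.environ keeps the function easy to test without touching the real environment.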

Architecture

Backend: Flask + SQLAlchemy + SQLite (with FTS5 for full-text search)
AI: Ollama (local LLM) for document analysis, category discovery, entity extraction
NLP: spaCy (NER), RapidFuzz (fuzzy matching), scikit-learn (TF-IDF topic discovery)
Data: epsteininvestigation.org API, Hugging Face datasets, DOJ PDFs
Deployment: Docker + Gunicorn, runs on Ubuntu 16.04+
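To illustrate the full-text layer, here is a minimal FTS5 sketch using Python's built-in sqlite3 module and an in-memory database. The table name, column names, and sample rows are invented for the example; the project's real schema goes through SQLAlchemy and is not shown here. It assumes the local SQLite build includes the FTS5 extension, which is common but not guaranteed.

```python
import sqlite3

def build_index(rows):
    """Create an in-memory FTS5 table and load (title, body) rows into it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
    conn.executemany("INSERT INTO docs (title, body) VALUES (?, ?)", rows)
    return conn

def search(conn, query, limit=25):
    """Return matching titles, best match first (FTS5's built-in rank)."""
    sql = "SELECT title FROM docs WHERE docs MATCH ? ORDER BY rank LIMIT ?"
    return [row[0] for row in conn.execute(sql, (query, limit))]

conn = build_index([
    ("Flight log", "passenger manifest for the flight"),
    ("Newsletter", "weekly music newsletter"),
])
print(search(conn, "flight"))  # ['Flight log']
```

The MATCH argument uses FTS5 query syntax, so raw user input containing characters such as quotes or hyphens may need escaping before it is passed in.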

License

This tool is for research and public accountability purposes. All indexed documents are public records.
