Structured data exports from the forensic analysis of the 218GB DOJ Jeffrey Epstein file release (all 12 datasets + House Oversight Estate + FBI Vault: 1,385,916 documents, 2,771,231 pages).
I have stitched these databases together into a searchable visual interface, with an AI assistant, at https://epstein-data.com
You can also install my full Claude-powered Epstein Investigator setup yourself by following the directions for either the desktop install or the CLI version.
The latest release is v5.1: https://github.com/rhowardstone/Epstein-research-data/releases/tag/v5.1
Results repo: Epstein-research — 165+ forensic investigation reports with DOJ source citations.
| File | Records | Description |
|---|---|---|
| `knowledge_graph_entities.json` | 606 | Curated entities: people, shell companies, organizations, properties, aircraft, locations. Each entry includes aliases, metadata (occupation, legal status, mention counts), and entity type. |
| `knowledge_graph_relationships.json` | 2,302 | Relationships between entities with types (traveled_with, associated_with, owned_by, victim_of, etc.), weights, date ranges, and source/target entity names. |
Note on knowledge graph: The knowledge graph was curated during the initial investigation phases and does not include NER (Named Entity Recognition) run against the full OCR corpus. It covers the most frequently-referenced and manually-verified entities. For comprehensive name extraction from the full text, see extracted_entities_filtered.json below or query the full databases directly.
| File | Records | Description |
|---|---|---|
| `persons_registry.json` | 1,614 | Unified person registry merged from 9 sources: epstein-pipeline (1,195), knowledge-graph (285), la-rana-chicana (237), Wikipedia Epstein files list (45), Bondi PEP letter Feb 2026 (19), jmail.world (9), corpus-investigation (2), khanna-massie-2026 (2), and doj-release-2026 (1). Each entry includes name, aliases, category (political/business/academic/staff/financial/legal/media/other), search terms, and source attribution. |
Note: This registry is broader than the knowledge graph — it includes every named individual identified across all investigation phases, congressional disclosures, and cross-referenced sources. Categories reflect the person's primary role relative to the Epstein case, not an accusation. Many entries (e.g., Bondi PEP letter names) appear in the files only in incidental contexts such as news clippings or tips.
| File | Records | Description |
|---|---|---|
| `extracted_entities_filtered.json` | 8,085 | Filtered entity extractions: 3,881 names (appearing in 2+ documents), 2,238 phone numbers, 1,489 amounts, 357 emails, 116 organizations. Each entry includes the EFTA document numbers where it appears. |
| `extracted_names_multi_doc.csv` | 3,881 | Names appearing in multiple EFTA documents, with document counts and sample EFTA references. CSV format for easy browsing. |
Note on quality: The raw extraction table contains 107,422 entities, many of which are OCR artifacts from redacted/degraded documents. The filtered exports remove garbled text and require multi-document co-occurrence for names.
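The multi-document co-occurrence filter described above is straightforward to reproduce. A minimal sketch on toy data (the garble-removal step is not shown; only the 2+ distinct-documents criterion is):

```python
from collections import defaultdict

# Toy stand-in for the raw extraction table: (name, efta_number) pairs.
# The real table holds 107,422 entities, many of them OCR artifacts.
raw = [
    ("Jane Doe", "EFTA00000001"),
    ("Jane Doe", "EFTA00000042"),
    ("xX#gar8led", "EFTA00000007"),  # single-document artifact
]

docs_by_name = defaultdict(set)
for name, efta in raw:
    docs_by_name[name].add(efta)

# Keep only names appearing in 2+ distinct EFTA documents
filtered = {name: sorted(eftas)
            for name, eftas in docs_by_name.items()
            if len(eftas) >= 2}
print(filtered)  # {'Jane Doe': ['EFTA00000001', 'EFTA00000042']}
```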
| File | Records | Description |
|---|---|---|
| `image_catalog.csv.gz` | 38,955 | Complete image catalog (gzipped). Fields: id, image_name, efta_number, page_number, people, text_content, objects, setting, activity, notable, analyzed_at. |
| `image_catalog_notable.json.gz` | 38,864 | Images with people or notable content identified (gzipped JSON). Fields truncated to keep the file a manageable size. |
| File | Records | Description |
|---|---|---|
| `document_summary.csv.gz` | 519,438 | Per-document redaction summary for every EFTA document (gzipped). Fields: efta_number, total_redactions, bad_redactions, proper_redactions, has_recoverable_text, dataset_source. |
| File | Records | Description |
|---|---|---|
| `reconstructed_pages_high_interest.json.gz` | 39,588 | Pages where hidden text was recovered from under redactions (gzipped JSON). Fields include efta_number, page_number, num_fragments, reconstructed_text, interest_score, and names_found. Higher interest scores indicate more substantive recovered content. |
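The gzipped JSON exports can be read directly with the Python standard library. A small sketch, assuming each file holds a single JSON array (adjust if a file turns out to be newline-delimited JSON):

```python
import gzip
import json

def load_gzipped_json(path):
    """Load a gzipped JSON export, assumed to be one JSON array."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)

def top_pages(pages, n=10):
    """Rank recovered pages by interest_score, highest first."""
    return sorted(pages, key=lambda p: p.get("interest_score", 0),
                  reverse=True)[:n]

# Usage against the real export:
# pages = load_gzipped_json("reconstructed_pages_high_interest.json.gz")
# for p in top_pages(pages):
#     print(p["efta_number"], p["page_number"], p["interest_score"])
```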
| File | Description |
|---|---|
| `efta_dataset_mapping.csv` | EFTA number ranges for each of the 12 DOJ datasets, with URL templates. |
| `efta_dataset_mapping.json` | Same mapping in JSON format for programmatic use. |
URL Pattern: `https://www.justice.gov/epstein/files/DataSet%20{N}/EFTA{XXXXXXXX}.pdf`
| Dataset | EFTA Start | EFTA End |
|---|---|---|
| 1 | 00000001 | 00003158 |
| 2 | 00003159 | 00003857 |
| 3 | 00003858 | 00005586 |
| 4 | 00005705 | 00008320 |
| 5 | 00008409 | 00008528 |
| 6 | 00008529 | 00008998 |
| 7 | 00009016 | 00009664 |
| 8 | 00009676 | 00039023 |
| 9 | 00039025 | 01262781 |
| 10 | 01262782 | 02205654 |
| 11 | 02205655 | 02730264 |
| 12 | 02730265 | 02858497 |
Note: EFTA numbers are assigned per page, not per document. A multi-page document consumes consecutive EFTA numbers — e.g., EFTA00008320 (89 pages) covers Bates numbers 00008320–00008408, and Dataset 5 begins at EFTA00008409. There are no gaps between datasets 1–11; every apparent gap is accounted for by multi-page documents at dataset boundaries. Dataset 12 (the post-release expansion) contains internal gaps totaling ~100K unassigned EFTA numbers — these likely represent documents not yet released or reserved number ranges.
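The shipped efta_dataset_mapping files carry this mapping; as an illustration of how resolution works, here is a sketch that hardcodes the dataset starts from the table above. (A resolved URL may still 404: see the removal audit below, and note DS12's internal gaps.)

```python
from bisect import bisect_right

# First EFTA number of each of the 12 DOJ datasets (from the table above)
DATASET_STARTS = [1, 3159, 3858, 5705, 8409, 8529, 9016, 9676,
                  39025, 1262782, 2205655, 2730265]
LAST_EFTA = 2858497  # end of Dataset 12

def efta_to_url(efta: int) -> str:
    """Resolve an EFTA number to its DOJ PDF URL.

    EFTA numbers are per-page, so numbers that fall inside a terminal
    multi-page document resolve to the dataset that document belongs to.
    """
    if not DATASET_STARTS[0] <= efta <= LAST_EFTA:
        raise ValueError(f"EFTA {efta} outside known ranges")
    dataset = bisect_right(DATASET_STARTS, efta)  # 1-based dataset index
    return (f"https://www.justice.gov/epstein/files/"
            f"DataSet%20{dataset}/EFTA{efta:08d}.pdf")

print(efta_to_url(8409))
# https://www.justice.gov/epstein/files/DataSet%205/EFTA00008409.pdf
```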
The doj_audit/ directory documents an audit of the DOJ Epstein Library that identified documents removed or altered after the initial public release.
| File | Records | Description |
|---|---|---|
| `doj_audit/CONFIRMED_REMOVED.csv` | 67,784 | Documents confirmed removed from the DOJ website (returning HTTP 404). Fields: efta, justice_gov_url, dataset, pages, scan_content_length, scan_last_modified. |
| `doj_audit/FLAGGED_documents.csv` | 96,112 | All flagged documents with DOJ URLs and dataset info. |
| `doj_audit/FLAGGED_documents_details.csv` | 102,223 | Flagged documents with detailed metadata including status, category, document type, priority score, confidence, and text preview. |
| `doj_audit/SIZE_MISMATCHES.csv` | 23,989 | Documents where the file size on the DOJ server differs from the originally ingested version, suggesting post-release modification. |
| `doj_audit/sample_verification_results.csv` | 500 | Statistical sample verification of flagged 404s using browser-based checks. |
See the full report: DOJ Document Removal Audit
The alteration_analysis/ directory contains analysis of documents where content was altered between versions of the DOJ release.
| File | Records | Description |
|---|---|---|
| `alteration_analysis/classified_alterations.csv` | 21,803 | Documents with classified alteration types (CONTENT_REDUCTION, EMAILS_REMOVED, etc.), with sensitivity ratings and LLM-generated reasoning. |
| `alteration_analysis/removed_entities_export.csv` | 146,209 | Entities (names, accounts, phone numbers) removed from documents between versions, with corpus hit counts and classification. |
The full alteration database (42,782 files tracked, 212,730 change units) is available as alteration_results.db.gz in the v5.1 release. See the full report: DOJ Document Alteration Forensics
The recovered_corrupted_pdfs/ directory contains text recovered from 5 corrupted PDF documents in the DOJ release through forensic byte-level carving:
EFTA00593870, EFTA00597207, EFTA00645624, EFTA01175426, EFTA01220934
Each subdirectory contains the cleaned extracted text from the corresponding corrupted PDF. See the full report: Corrupted PDF Forensics
Source databases are split across releases:
- v5.0 — Updated full text corpus (DS12 expansion, March 2026)
- v5.1 — Alteration analysis database
- v4.0 — All other databases
| Database | Release | Compressed | Uncompressed | Contents |
|---|---|---|---|---|
| full_text_corpus.db.gz | v5.0 | 2.3GB (split) | 6.3GB | 1,385,916 documents, 2,771,231 pages with full text, FTS5 search index. All 12 EFTA datasets + House Oversight Estate (DS99) + FBI Vault (DS98) + native spreadsheets + recovered EFTAs. Download both .part_aa and .part_ab and concatenate: cat full_text_corpus.db.gz.part_* > full_text_corpus.db.gz |
| concordance_complete.db.gz | v5.1 | 137MB | 696MB | 1,385,519 documents, 2,788,208 pages — concordance cross-reference with email threads, folder inventory, production metadata |
| alteration_results.db.gz | v5.1 | 183MB | 8.2GB | 212,730 change units with diff text, pixel-diff results, LLM classification |
| redaction_analysis_v2.db.gz | v4.0 | 166MB | 971MB | 2.59M redaction records, 849K document summaries, 39K reconstructed pages, 107K extracted entities |
| redaction_analysis_ds10.db.gz | v4.0 | 87MB | 532MB | Dataset 10 deep analysis (EFTA01262782-02205654) |
| image_analysis.db.gz | v4.0 | 64MB | 389MB | 38,955 images with AI-generated descriptions |
| ocr_database.db.gz | v4.0 | 25MB | 68MB | OCR extraction data |
| transcripts.db.gz | v4.0 | 1.7MB | 4.8MB | 1,628 media file entries, 435 with speech content, 189,982 words (faster-whisper large-v3) |
| knowledge_graph.db | v4.0 | 764KB | 764KB | Curated entities and relationships (uncompressed SQLite). Updated entity/relationship JSON files with 606 entities and 2,302 relationships are in the repo directly. |
| communications.db.gz | v4.0 | — | — | Email thread analysis |
| prosecutorial_query_graph.db | v4.0 | 2.5MB | 2.5MB | Subpoena analysis: riders, returns, clause fulfillment, investigative gaps |
Total: ~3.0GB compressed / ~17GB uncompressed
```shell
# Download and decompress the full text corpus (split into 2 parts)
wget https://github.com/rhowardstone/Epstein-research-data/releases/download/v5.0/full_text_corpus.db.gz.part_aa
wget https://github.com/rhowardstone/Epstein-research-data/releases/download/v5.0/full_text_corpus.db.gz.part_ab
cat full_text_corpus.db.gz.part_* > full_text_corpus.db.gz
gunzip full_text_corpus.db.gz

# Search the full text corpus
sqlite3 full_text_corpus.db "SELECT efta_number, page_number, substr(text_content, 1, 200) FROM pages WHERE text_content LIKE '%Leon Black%' LIMIT 10;"
```
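The `LIKE` queries here do full table scans; the full text corpus also ships an FTS5 search index (see the databases table), which is much faster for term searches. A sketch in Python, assuming the index table is named `pages_fts` (an assumption: verify the real table name with `sqlite3 full_text_corpus.db .schema` first):

```python
import sqlite3

def fts_search(db_path, query, limit=10):
    """Run an FTS5 MATCH query instead of a LIKE scan.

    Assumes an FTS5 table named pages_fts over the page text;
    check the actual name with: sqlite3 <db> .schema
    """
    con = sqlite3.connect(db_path)
    try:
        return con.execute(
            "SELECT rowid, snippet(pages_fts, -1, '[', ']', '...', 12) "
            "FROM pages_fts WHERE pages_fts MATCH ? "
            "ORDER BY rank LIMIT ?",
            (query, limit),
        ).fetchall()
    finally:
        con.close()

# e.g. fts_search("full_text_corpus.db", '"Leon Black"')
```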
```shell
# Search redacted content (replace TERM with your search term)
sqlite3 redaction_analysis_v2.db "SELECT efta_number, page_number, substr(hidden_text, 1, 300) FROM redactions WHERE hidden_text LIKE '%TERM%' AND length(hidden_text) > 20 LIMIT 20;"
```

The tools/ directory contains all Python scripts used to build the databases from raw PDFs. Use them to replicate the analysis, extend it with new data, or adapt it for your own pipeline.
All tools auto-detect the data directory (no path editing needed). They check the `EPSTEIN_DATA_DIR` environment variable first, then look relative to the script and the current working directory. If auto-detection fails: `export EPSTEIN_DATA_DIR=/path/to/your/data`
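The detection order can be sketched as follows. This is illustrative only: the marker filename used to recognize a valid data directory is my assumption, not the tools' exact logic.

```python
import os
from pathlib import Path

def find_data_dir(script_path, marker="efta_dataset_mapping.csv"):
    """Locate the data directory in the documented order:
    1) EPSTEIN_DATA_DIR, 2) relative to the script, 3) the CWD.

    `marker` is an assumed sentinel file; the real tools may use a
    different validity check.
    """
    env = os.environ.get("EPSTEIN_DATA_DIR")
    if env:
        return Path(env)
    for base in (Path(script_path).resolve().parent, Path.cwd()):
        if (base / marker).exists():
            return base
    return None
```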
| Tool | Description |
|---|---|
| `tools/ingest_house_estate.py` | Ingests House Oversight Estate documents (Concordance format, OCR with configurable workers) |
| `tools/ingest_spreadsheets.py` | Ingests native XLS/XLSX/CSV files into full_text_corpus.db |
| `tools/transcribe_media.py` | GPU transcription of audio/video using faster-whisper large-v3 |
| `tools/prescreen_media.py` | Pre-screens media files to classify and skip surveillance footage |
| `tools/redaction_detector_v2.py` | Spatial redaction analysis: finds black rectangles, extracts underlying text |
| `tools/build_person_registry.py` | Builds the unified person registry from 6 sources |
| `tools/build_knowledge_graph.py` | Constructs the entity relationship graph |
| `tools/build_native_files_catalog.py` | Generates NATIVE_FILES_CATALOG.csv |
| Tool | Description |
|---|---|
| `tools/person_search.py` | FTS5 cross-reference search with co-occurrence analysis and CSV export |
| `tools/congressional_scorer.py` | Scores documents by redacted-name density for congressional reading room prioritization |
| `tools/generate_gov_reports.py` | Searches the corpus for current government officials |
| `tools/search_judicial.py` | Searches the corpus for federal judges |
| `tools/extract_subpoena_riders.py` | Extracts and catalogs subpoena rider documents |
| Tool | Description |
|---|---|
| `tools/find_missing_efta.py` | Gap detection across EFTA numbering |
| `tools/recover_missing_efta.py` | Recovers missing EFTAs from the DOJ server or by forensic carving |
| `tools/run_post_ingestion_pipeline.sh` | Chains all post-ingestion steps (transcription, registry, catalog) |
The raw PDFs can be obtained from:
| Source | URL | Contents |
|---|---|---|
| DOJ Epstein Library | justice.gov/epstein | Datasets 1-12 (individual PDFs). Bulk downloads removed Feb 6, 2026. |
| Archive.org DS9 | full.tar.bz2 | 103.6 GiB. Largest single dataset. |
| Archive.org DS11 | DataSet 11.zip | 25.6 GiB. 267,651 PDFs. |
| Archive.org DS1-5 | combined-all-epstein-files | First 5 datasets combined. |
| House Oversight | oversight.house.gov | Estate documents, DOJ-provided records, photo releases. |
See COMMUNITY_PLATFORMS.md in the research repo for a full directory of 78+ community tools and mirrors.
For developers building tools on top of this data:
- EFTA numbers are the universal key. Every document in the DOJ release has one.
- The `efta_dataset_mapping` files let you resolve any EFTA number to a DOJ PDF URL.
- Entity `efta_numbers` arrays give you cross-references: "this person appears in these documents."
- Knowledge graph `weight` on relationships indicates strength of connection (higher = more documented).
- Image `image_name` format is `EFTA{number}_p{page}_i{index}_{hash}.png`; parse the EFTA number and page from the filename.
- No inter-dataset gaps (DS1–DS11): EFTA numbers are per-page, so a multi-page terminal document in each dataset consumes the EFTA numbers up to the next dataset's start. DS12 (post-release expansion) has internal gaps.
This is analysis of public government records released under the Epstein Files Transparency Act (Public Law 118-299). The underlying documents are U.S. government works. This structured data is released into the public domain.
Please open an issue if you find any problems! We will respond promptly.