A Retrieval-Augmented Generation (RAG) pipeline for querying the Jeffrey Epstein Files using AI, built on the Epstein Files 20K dataset from Hugging Face.
Live Demo: (coming soon)
Process 2M+ document lines → get accurate, source-cited answers in seconds.
- Semantic chunking (splits by meaning, not character count)
- Hybrid retrieval: vector similarity + keyword matching
- Cross-encoder re-ranking for precision
- Grounded answers with source citations
- BYOK (Bring Your Own Key): users provide their own free Groq API key
- ✅ Grounded Answers (No Hallucinations) – responses are derived only from retrieved source text
- ✅ Semantic Chunking – context-aware splits where meaning shifts
- ✅ Hybrid Search – ChromaDB (vector) + BM25 (keyword)
- ✅ Cross-Encoder Re-ranking – filters results for maximum relevance
- ✅ Source Citations – every answer includes citations to the underlying chunks
- ✅ BYOK (Bring Your Own Key) – no server-side API key required
- ✅ Fast Response – ~1 second end-to-end query time (typical)
- ✅ Interactive Chat UI – Streamlit interface with conversation history
Raw Documents (2.5M lines)
↓
Clean & Reconstruct
↓
Semantic Chunking
↓
Vector Embeddings + BM25 Index
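Semantic chunking splits where the meaning shifts rather than at a fixed character count. A minimal sketch of the idea, assuming sentence-level embeddings from sentence-transformers; reusing the project's BGE embedder for this step and the 0.75 threshold are assumptions, not the exact logic in `ingest/chunk_dataset.py`:

```python
# Minimal semantic-chunking sketch; the embedder reuse and threshold are assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim embeddings

def semantic_chunks(sentences, threshold=0.75):
    """Group consecutive sentences; start a new chunk when similarity drops."""
    embeddings = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = float(np.dot(embeddings[i - 1], embeddings[i]))
        if similarity < threshold:  # meaning shift detected -> close the chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

Lowering the threshold merges more sentences into each chunk; raising it splits more aggressively.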
User Question
↓
Vector Search (ChromaDB) + Keyword Search (BM25)
↓
Reciprocal Rank Fusion → Top 15 Chunks
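Reciprocal Rank Fusion merges the two result lists by rank position alone, so the vector and BM25 scores never have to share a scale. A minimal sketch, assuming each retriever returns a ranked list of chunk ids; the `k=60` constant and function name are illustrative (see `retrieval/hybrid_retriever.py` for the real version):

```python
# Minimal RRF sketch: fuse two ranked id lists; constants are illustrative.
def reciprocal_rank_fusion(vector_ids, bm25_ids, k=60, top_n=15):
    """Fuse two ranked lists of chunk ids; higher fused score = more relevant."""
    scores = {}
    for ranking in (vector_ids, bm25_ids):
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# vector_ids: ids ordered by ChromaDB similarity search
# bm25_ids:   ids ordered by rank-bm25 keyword scores
```

A chunk ranked highly in either list gets a strong fused score, so exact-match hits from BM25 survive even when the vector search misses them.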
Top 15 Chunks + Question
↓
Cross-Encoder Scoring
↓
Top 6 Most Relevant Chunks
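A cross-encoder reads the question and a candidate chunk together in one forward pass, which is slower than the bi-encoder used for retrieval but much better at judging relevance. A minimal sketch with sentence-transformers; the model name is a common public cross-encoder and an assumption here, not necessarily what `retrieval/reranker.py` loads:

```python
# Minimal re-ranking sketch; the model choice is an assumption.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question, chunks, top_k=6):
    """Score each (question, chunk) pair jointly and keep the top_k chunks."""
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```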
Context + Question
↓
LLaMA 3.3 70B (via Groq)
↓
Answer with Source Citations
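Generation is restricted to the retrieved chunks so the model answers from the documents rather than its own priors. A minimal sketch using the official `groq` Python client; the system prompt wording and the `llama-3.3-70b-versatile` model id are illustrative stand-ins for what `core/rag_chain.py` and `api/prompts.py` actually use:

```python
# Minimal generation sketch; prompt text and model id are illustrative.
from groq import Groq

def answer(question, chunks, api_key):
    """Ask the LLM to answer only from the retrieved chunks, citing them as [n]."""
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    client = Groq(api_key=api_key)  # user-supplied BYOK key
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        temperature=0,
        messages=[
            {"role": "system", "content": (
                "Answer using ONLY the provided context. Cite sources as [n]. "
                "If the context does not contain the answer, say so.")},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```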
Typical RAG: vector similarity only – often misses exact names, dates, identifiers, and keyword-heavy queries.
AskJeffrey: vector + BM25 + cross-encoder – captures semantic meaning + exact matches, then precision-filters results before generation.
| Feature | Typical RAG Projects | AskJeffrey |
|---|---|---|
| Chunking | Fixed-size splits | Semantic chunking (meaning-based) |
| Search | Vector only | Hybrid (vector + BM25 keyword) |
| Ranking | No re-ranking | Cross-encoder re-ranking |
| Embeddings | MiniLM (384d) | BGE-base-en-v1.5 (768d) |
| API Key | Hardcoded/server-side | BYOK (user provides their own) |
| Citations | Often missing | Always included |
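To make the Embeddings and Search rows above concrete, here is a minimal sketch of building both indexes, assuming ChromaDB's persistent client and rank-bm25; the collection name, storage path, and whitespace tokenization are illustrative (compare with `ingest/embed_chunks.py`):

```python
# Minimal indexing sketch; names, path, and tokenization are illustrative.
import chromadb
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("BAAI/bge-base-en-v1.5")  # 768-dim embeddings
chunks = ["chunk text ...", "another chunk ..."]          # output of the chunking step

# Vector index (ChromaDB, persisted to disk)
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection("epstein_files")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embedder.encode(chunks, normalize_embeddings=True).tolist(),
)

# Keyword index (BM25 over whitespace tokens)
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])
```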
- Python 3.11+
- A free Groq API key: https://console.groq.com
git clone https://github.com/imdvz/AskJeffrey.git
cd AskJeffrey

python -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt

# Step 1: Download raw data
python ingest/download_dataset.py
# Step 2: Clean and reconstruct documents
python ingest/clean_dataset.py
# Step 3: Semantic chunking
python ingest/chunk_dataset.py
# Step 4: Generate embeddings + BM25 index
python ingest/embed_chunks.py

streamlit run app.py

Open: http://localhost:8501
Paste your Groq API key in the sidebar and start asking questions.
AskJeffrey/
├── ingest/                  # Data processing pipeline
│   ├── download_dataset.py  # Download from Hugging Face
│   ├── clean_dataset.py     # Clean & reconstruct docs
│   ├── chunk_dataset.py     # Semantic chunking
│   └── embed_chunks.py      # Embed & build BM25 index
│
├── retrieval/               # Retrieval logic
│   ├── hybrid_retriever.py  # Vector + BM25 hybrid search
│   └── reranker.py          # Cross-encoder re-ranking
│
├── core/                    # Core RAG chain
│   └── rag_chain.py         # Orchestrates retrieval → LLM
│
├── api/                     # FastAPI backend (optional)
│   ├── main.py              # API routes
│   ├── models.py            # Pydantic models
│   └── prompts.py           # Prompt templates
│
├── app.py                   # Streamlit frontend
├── config.py                # Central configuration
├── requirements.txt         # Python dependencies
└── .env.example             # Environment template
This app does not use a server-side API key. Every user provides their own free Groq API key (session-only handling is sketched below):
- The key is not stored (browser session only)
- The key is never logged
- Closing the tab clears it
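A minimal sketch of that session-only handling with Streamlit, assuming a sidebar password input; the widget label and session key are illustrative (see `app.py` for the actual UI):

```python
# Minimal BYOK sketch: the key lives only in Streamlit session state,
# held in memory for the current session and never written to disk here.
import streamlit as st

api_key = st.sidebar.text_input("Groq API key", type="password")
if api_key:
    st.session_state["groq_api_key"] = api_key  # discarded when the session ends
```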
Licensed under the MIT License – see LICENSE.
- Dataset: Teyler / Epstein Files 20K (Hugging Face) – https://huggingface.co/datasets/teyler/epstein-files-20k
- Embeddings: Sentence Transformers – https://www.sbert.net/
- Vector DB: ChromaDB – https://www.trychroma.com/
- Keyword Search: rank-bm25 – https://github.com/dorianbrown/rank_bm25
- Re-ranking: Cross-Encoders – https://www.sbert.net/docs/cross_encoder/usage/usage.html
- LLM Inference: Groq – https://groq.com/
- Framework: LangChain – https://langchain.com/
- UI: Streamlit – https://streamlit.io/
- Issues: https://github.com/imdvz/AskJeffrey/issues
- Discussions: https://github.com/imdvz/AskJeffrey/discussions
Built for research, transparency, and educational purposes. All data is sourced from public records. Users are responsible for complying with applicable laws and ethical guidelines.