
AskJeffrey

A Retrieval-Augmented Generation (RAG) pipeline for querying the Jeffrey Epstein Files using AI, built on the Epstein Files 20K dataset from Hugging Face.

🔗 Live Demo: (coming soon)


⚡ Quick Demo

Process 2M+ document lines → Get accurate, source-cited answers in seconds

What it does

  • Semantic chunking (splits by meaning, not character count)
  • Hybrid retrieval: vector similarity + keyword matching
  • Cross-encoder re-ranking for precision
  • Grounded answers with source citations
  • BYOK (Bring Your Own Key): users provide their own free Groq API key

🎯 Key Features

  • ✅ Grounded Answers (No Hallucinations) – responses are derived only from retrieved source text
  • ✅ Semantic Chunking – context-aware splits where meaning shifts
  • ✅ Hybrid Search – ChromaDB (vector) + BM25 (keyword)
  • ✅ Cross-Encoder Re-ranking – filters results for maximum relevance
  • ✅ Source Citations – every answer includes citations to the underlying chunks
  • ✅ BYOK (Bring Your Own Key) – no server-side API key required
  • ✅ Fast Response – ~1 second end-to-end query time (typical)
  • ✅ Interactive Chat UI – Streamlit interface with conversation history

πŸ—οΈ How It Works

Four Stages (Simple Pipeline)

Stage 1 – Data Preparation (offline, run once)

Raw Documents (2.5M lines)
        ↓
Clean & Reconstruct
        ↓
Semantic Chunking
        ↓
Vector Embeddings + BM25 Index
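
A minimal sketch of what this offline stage produces, assuming the sentence-transformers, chromadb, and rank-bm25 packages; the chunk texts, paths, and collection name below are placeholders, and the project's real logic lives in ingest/chunk_dataset.py and ingest/embed_chunks.py.

# Sketch only: embed chunks with BGE and build a BM25 keyword index.
from sentence_transformers import SentenceTransformer
from rank_bm25 import BM25Okapi
import chromadb

chunks = [
    "Example chunk one produced by the semantic chunker.",
    "Example chunk two covering a different topic.",
]  # placeholder output of the chunking step

# Dense 768-dimensional vectors from BGE-base-en-v1.5
model = SentenceTransformer("BAAI/bge-base-en-v1.5")
embeddings = model.encode(chunks, normalize_embeddings=True)

# Persist the vectors in ChromaDB
client = chromadb.PersistentClient(path="chroma_db")  # path is illustrative
collection = client.get_or_create_collection("epstein_chunks")
collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=chunks,
    embeddings=embeddings.tolist(),
)

# Build a BM25 index over the same chunks for keyword search
bm25 = BM25Okapi([chunk.lower().split() for chunk in chunks])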

Stage 2 – Hybrid Retrieval

User Question
        ↓
Vector Search (ChromaDB) + Keyword Search (BM25)
        ↓
Reciprocal Rank Fusion → Top 15 Chunks
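
A small sketch of reciprocal rank fusion, assuming each retriever returns chunk ids ordered from most to least relevant; the k=60 constant is the commonly used default, and the function name is illustrative (the project's version is in retrieval/hybrid_retriever.py).

# Sketch only: merge the vector and keyword result lists with RRF.
def reciprocal_rank_fusion(vector_ids, keyword_ids, k=60, top_n=15):
    scores = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, chunk_id in enumerate(ranked):
            # Each list contributes 1 / (k + rank); earlier ranks score higher.
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]

# Example: ids returned by ChromaDB and BM25 for the same question
top_chunks = reciprocal_rank_fusion(["c3", "c7", "c1"], ["c7", "c9", "c3"])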

Stage 3 – Re-ranking

Top 15 Chunks + Question
        ↓
Cross-Encoder Scoring
        ↓
Top 6 Most Relevant Chunks
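
A sketch of the re-ranking step, assuming the sentence-transformers CrossEncoder class and a standard MS MARCO cross-encoder checkpoint; the checkpoint name is an assumption here, and the repo's actual choice lives in retrieval/reranker.py.

# Sketch only: score each (question, chunk) pair and keep the best 6.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(question, chunks, top_k=6):
    # The cross-encoder reads question and chunk together, so it can judge relevance
    # more precisely than the retrieval scores alone.
    scores = reranker.predict([(question, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]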

Stage 4 – Grounded Answer

Context + Question
        ↓
LLaMA 3.3 70B (via Groq)
        ↓
Answer with Source Citations
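
A sketch of the final call, assuming the official groq Python client and the llama-3.3-70b-versatile model id on Groq; the grounding prompt wording is illustrative, and the real templates are in api/prompts.py.

# Sketch only: build a grounded prompt from the re-ranked chunks and ask the LLM.
from groq import Groq

def answer(question, chunks, api_key):
    # Number the chunks so the model can cite them as [1], [2], ...
    context = "\n\n".join(f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks))
    client = Groq(api_key=api_key)
    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",
        messages=[
            {"role": "system", "content": (
                "Answer ONLY from the numbered context below and cite sources "
                "like [1]. If the context is insufficient, say so.\n\n" + context
            )},
            {"role": "user", "content": question},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content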

🧠 Why Hybrid Search + Re-ranking?

Typical RAG: vector similarity only → often misses exact names, dates, identifiers, and keyword-heavy queries.

AskJeffrey: vector + BM25 + cross-encoder → captures both semantic meaning and exact matches, then filters for precision before generation.


✨ What Makes This Different?

Feature     | Typical RAG Projects   | AskJeffrey
------------|------------------------|----------------------------------
Chunking    | Fixed-size splits      | Semantic chunking (meaning-based)
Search      | Vector only            | Hybrid (vector + BM25 keyword)
Ranking     | No re-ranking          | Cross-encoder re-ranking
Embeddings  | MiniLM (384d)          | BGE-base-en-v1.5 (768d)
API Key     | Hardcoded/server-side  | BYOK (user provides their own)
Citations   | Often missing          | Always included
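
The embedding row above is easy to verify yourself, assuming sentence-transformers is installed (both model ids are public Hugging Face checkpoints):

# Compare the dimensionality of the two embedding models from the table.
from sentence_transformers import SentenceTransformer

bge = SentenceTransformer("BAAI/bge-base-en-v1.5")
minilm = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
print(bge.get_sentence_embedding_dimension())     # 768
print(minilm.get_sentence_embedding_dimension())  # 384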

📦 Installation

Requirements

  • Python 3 and pip
  • A free Groq API key (used only at query time)

Setup (5 minutes)

1) Clone the repository

git clone https://github.com/imdvz/AskJeffrey.git
cd AskJeffrey

2) Create and activate a virtual environment

python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

3) Install dependencies

pip install -r requirements.txt

🚀 Getting Started

Run the Data Pipeline (first time only)

# Step 1: Download raw data
python ingest/download_dataset.py

# Step 2: Clean and reconstruct documents
python ingest/clean_dataset.py

# Step 3: Semantic chunking
python ingest/chunk_dataset.py

# Step 4: Generate embeddings + BM25 index
python ingest/embed_chunks.py

Launch the App

streamlit run app.py

Open: http://localhost:8501

Paste your Groq API key in the sidebar and start asking questions.


📚 Project Structure

AskJeffrey/
├── ingest/                   # Data processing pipeline
│   ├── download_dataset.py   # Download from Hugging Face
│   ├── clean_dataset.py      # Clean & reconstruct docs
│   ├── chunk_dataset.py      # Semantic chunking
│   └── embed_chunks.py       # Embed & build BM25 index
│
├── retrieval/                # Retrieval logic
│   ├── hybrid_retriever.py   # Vector + BM25 hybrid search
│   └── reranker.py           # Cross-encoder re-ranking
│
├── core/                     # Core RAG chain
│   └── rag_chain.py          # Orchestrates retrieval → LLM
│
├── api/                      # FastAPI backend (optional)
│   ├── main.py               # API routes
│   ├── models.py             # Pydantic models
│   └── prompts.py            # Prompt templates
│
├── app.py                    # Streamlit frontend
├── config.py                 # Central configuration
├── requirements.txt          # Python dependencies
└── .env.example              # Environment template

πŸ” Bring Your Own Key (BYOK)

This app does not use a server-side API key. Every user provides their own free Groq API key:

  • 🔒 Key is not stored (browser session only)
  • 🚫 Key is not logged
  • 🗑️ Closing the tab clears it

📜 License

Licensed under the MIT License – see LICENSE.


πŸ™ Acknowledgments


📞 Support


⚠️ Disclaimer

Built for research, transparency, and educational purposes. All data is sourced from public records. Users are responsible for complying with applicable laws and ethical guidelines.
