Build a local RAG index from EXA search results. The pipeline targets healthcare businesses and NYC restaurants, crawls site content, chunks it, embeds it, and stores the vectors in a local Chroma DB.
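The chunking step can be sketched as a fixed-size character splitter with overlap (a hypothetical helper for illustration only; the actual chunker lives in main.py and may differ in size and strategy):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks, each overlapping
    the previous one by `overlap` characters."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks.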
- Python 3.12+
- uv package manager
- EXA API key
Create a virtual environment and install dependencies:
```sh
uv venv
uv pip install -e .
```

Set your EXA API key in `.env`:

```
EXA_API_KEY=your_key_here
```

Optional: add `OPENAI_API_KEY` to enable entity resolution for directory/aggregator pages and follow-up searches for official business websites.
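The `.env` file is presumably read by a dotenv-style loader at startup. A minimal sketch of that behavior (illustrative only, assuming plain `KEY=value` lines with `#` comments):

```python
import os

def load_env(path: str = ".env") -> dict[str, str]:
    """Parse a simple .env file: one KEY=value per line,
    blank lines and '#' comments ignored."""
    env: dict[str, str] = {}
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    env[key.strip()] = value.strip()
    return env
```

In practice a library such as python-dotenv handles quoting and export prefixes; this sketch only shows the shape of the lookup.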
Using `just`:
```sh
just scrape --output-dir rag_index --collection exa_rag
```

Or directly:

```sh
.venv/bin/python main.py --output-dir rag_index --collection exa_rag
```

- Crawling is limited by `--max-pages-per-domain` and `--max-total-pages`.
- Business targets can be set with `--target-healthcare` and `--target-nyc-restaurants`.
- Raw crawled pages are written to `rag_pages.json` for inspection.
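Once built, the Chroma collection answers queries by nearest-neighbor search over the stored chunk embeddings. A pure-Python sketch of that retrieval step (this is the idea, not the Chroma API; `top_k` and the toy index are illustrative):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec: list[float],
          index: list[tuple[str, list[float]]],
          k: int = 2) -> list[str]:
    """Return the ids of the k chunks most similar to the query."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]
```

In the real pipeline, Chroma performs this ranking internally via `collection.query`, using the same embedding function that indexed the chunks.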