A RAG (Retrieval-Augmented Generation) agent for querying index methodology documents and historical constituent data. Built with LlamaIndex, LangGraph, ChromaDB, and DuckDB.
- Semantic search over ingested PDF methodology documents (MSCI, S&P, FTSE)
- Natural language SQL queries against quarterly index constituent snapshots
- Methodology comparison — side-by-side diff of two index documents
- Changelog generation — what changed between two versions of a methodology
- MCP server — expose all tools to Claude Desktop via the Model Context Protocol
- RAGAS evaluation — score pipeline quality against a golden Q&A set
Install the dependencies:

```bash
pip install -r requirements.txt
```

Create a .env file:

```
OPENAI_API_KEY=sk-...
# Optional: use Anthropic as the LLM backend
# LLM_PROVIDER=anthropic
# ANTHROPIC_API_KEY=sk-ant-...
# ANTHROPIC_MODEL=claude-sonnet-4-5
# Optional: override the OpenAI model
# OPENAI_MODEL=gpt-4o-mini
```

Ingest the PDF methodology documents into ChromaDB:

```bash
python main.py ingest ./docs
```

Load the constituent data into DuckDB:

```bash
# Generate synthetic data first (if needed)
python data/generate_constituents.py
# Load into DuckDB
python main.py load-data
```

Direct RAG query (semantic search over PDFs):
```bash
python main.py query "What are the eligibility criteria for MSCI World?"
```

Agent query (auto-routes to the best tool):

```bash
python main.py agent "Which sectors are overweight in MSCI EM vs S&P 500?"
python main.py agent "What is Apple's weight in the S&P 500?"
python main.py agent "How does MSCI World differ from FTSE 100 methodology?"
```

Run the RAGAS evaluation against the golden Q&A set:

```bash
python main.py eval --qa data/golden_qa.json
```

Results are stored in eval_results.db (SQLite).
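Since the evaluation results land in a SQLite file, they can be inspected with Python's standard library. The sketch below is only a starting point: this README does not document the schema of eval_results.db, so the code discovers table names at runtime rather than assuming any. The helper names `list_tables` and `preview` are illustrative and not part of this project.

```python
import sqlite3


def list_tables(db_path):
    """Return the sorted names of user tables in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    # Skip SQLite's internal bookkeeping tables.
    return [name for (name,) in rows if not name.startswith("sqlite_")]


def preview(db_path, table, limit=5):
    """Return (column_names, rows) for the first `limit` rows of `table`."""
    # Only accept table names that actually exist, since the name is
    # interpolated into the SQL text below.
    if table not in list_tables(db_path):
        raise ValueError(f"unknown table: {table!r}")
    conn = sqlite3.connect(db_path)
    try:
        cursor = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        columns = [d[0] for d in cursor.description]
        return columns, cursor.fetchall()
    finally:
        conn.close()
```

Calling `list_tables("eval_results.db")` after an eval run should show which tables the harness created; `preview` then prints a few rows of any of them. Note that `sqlite3.connect` creates an empty file if the path does not exist, so run the evaluation first.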
Expose the agent as an MCP server so Claude Desktop can call it as a tool:
```bash
python mcp_server.py
```

Add to your Claude Desktop config (~/Library/Application Support/Claude/claude_desktop_config.json):

```json
{
  "mcpServers": {
    "index-rag-agent": {
      "command": "python",
      "args": ["/path/to/index-rag-agent/mcp_server.py"]
    }
  }
}
```

Project layout:

```
main.py          CLI entry point
pipeline.py      Ingest (PDF → ChromaDB) and direct query
agent.py         LangGraph ReAct agent with 4 tools:
                   • search_methodology: semantic search over PDFs
                   • compare_methodologies: side-by-side diff
                   • summarize_changes: changelog between versions
                   • query_index_data: NL→SQL over DuckDB
mcp_server.py    FastMCP wrapper for Claude Desktop
eval.py          RAGAS evaluation harness
chroma_db/       Persistent ChromaDB vector store
index_data.ddb   DuckDB database (quarterly constituent snapshots)
data/            CSV data and golden Q&A for evaluation
docs/            PDF methodology documents
```
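To make the idea of routing a question to one of the four tools concrete, here is a deliberately simplified stand-in. In the actual agent the LLM picks the tool inside a ReAct loop; the keyword table below is a hypothetical substitute used only to illustrate the mapping from question to tool, and neither `KEYWORD_HINTS` nor `route` exists in agent.py.

```python
# Tool names and descriptions as listed in the project layout.
TOOLS = {
    "search_methodology": "semantic search over PDFs",
    "compare_methodologies": "side-by-side diff",
    "summarize_changes": "changelog between versions",
    "query_index_data": "NL→SQL over DuckDB",
}

# Hypothetical keyword hints, checked in order. The real agent does not
# use keywords; the LLM chooses a tool from the descriptions above.
KEYWORD_HINTS = {
    "compare_methodologies": ("differ", "compare"),
    "summarize_changes": ("changed", "changelog", "version"),
    "query_index_data": ("weight", "sector", "constituent", "market cap"),
}


def route(question):
    """Return the name of the tool a question would be sent to."""
    text = question.lower()
    for tool, hints in KEYWORD_HINTS.items():
        if any(hint in text for hint in hints):
            return tool
    # Anything else falls back to semantic search over the PDFs.
    return "search_methodology"
```

With these hints, the example questions from the usage section split as one would expect: the two weight questions go to `query_index_data`, the "differ" question goes to `compare_methodologies`, and the eligibility question falls through to `search_methodology`.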
The DuckDB database (index_data.ddb) holds quarterly snapshots for four indices:
| Index | Snapshot dates |
|---|---|
| S&P 500 | 2023-12-31, 2024-03-31, 2024-06-30, 2024-09-30, 2024-12-31 |
| MSCI World | same |
| MSCI EM | same |
| FTSE 100 | same |
Columns: date, index_name, constituent, ticker, sector, country, weight_pct, market_cap_usd.
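As an illustration of the kind of SQL the NL→SQL tool might produce for "Which sectors are overweight in MSCI EM vs S&P 500?", the sketch below aggregates `weight_pct` by sector for two indices on one snapshot date. Several things here are assumptions: the table name `constituents` is hypothetical (this README lists the columns but not the table), the in-memory SQLite database is a stand-in for DuckDB so that the example runs with the standard library alone, and the demo rows are made-up placeholders, not real index data.

```python
import sqlite3

# Hypothetical table name; only the column list comes from the README.
TABLE = "constituents"


def sector_weight_gap(conn, index_a, index_b, snapshot_date):
    """Return (sector, weight_a, weight_b, gap) tuples, largest gap first."""
    sql = f"""
        SELECT sector,
               SUM(CASE WHEN index_name = ? THEN weight_pct ELSE 0 END),
               SUM(CASE WHEN index_name = ? THEN weight_pct ELSE 0 END)
        FROM {TABLE}
        WHERE "date" = ? AND index_name IN (?, ?)
        GROUP BY sector
    """
    params = (index_a, index_b, snapshot_date, index_a, index_b)
    rows = conn.execute(sql, params).fetchall()
    result = [(s, a, b, round(a - b, 4)) for s, a, b in rows]
    # Sort in Python so the ordering does not depend on SQL dialect details.
    return sorted(result, key=lambda row: row[3], reverse=True)


def build_demo_db():
    """Create an in-memory table with placeholder rows (not real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        f"""CREATE TABLE {TABLE} (
            "date" TEXT, index_name TEXT, constituent TEXT, ticker TEXT,
            sector TEXT, country TEXT, weight_pct REAL, market_cap_usd REAL)"""
    )
    rows = [
        ("2024-12-31", "MSCI EM", "Company A", "AAA", "Financials", "XX", 6.0, 1.0e11),
        ("2024-12-31", "MSCI EM", "Company B", "BBB", "Technology", "XX", 4.0, 8.0e10),
        ("2024-12-31", "S&P 500", "Company C", "CCC", "Financials", "US", 2.0, 2.0e11),
        ("2024-12-31", "S&P 500", "Company D", "DDD", "Technology", "US", 7.0, 9.0e11),
    ]
    conn.executemany(f"INSERT INTO {TABLE} VALUES (?, ?, ?, ?, ?, ?, ?, ?)", rows)
    return conn
```

Calling `sector_weight_gap(build_demo_db(), "MSCI EM", "S&P 500", "2024-12-31")` ranks sectors by how much heavier they are in the first index. The same SELECT should carry over to DuckDB with little or no change, though that has not been checked against the project's actual schema.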