A modular Graph-based Retrieval-Augmented Generation (RAG) system with separate training and inference pipelines. This repository uses LlamaIndex to demonstrate a simple Graph RAG pipeline.
This system builds knowledge graphs from markdown files and lets you query them in natural language. The training and inference phases are completely separated, allowing you to:
- Train once: Build knowledge graphs from your markdown files
- Query many times: Ask questions using stored graphs without reprocessing
```
src/graph_rag/
├── core/                # Core functionality
│   ├── trainer.py       # Training module for creating knowledge graphs
│   └── inference.py     # Inference module for querying stored graphs
├── config/              # Configuration management
│   └── settings.py      # Environment-based settings
├── storage/             # Storage utilities (future)
└── utils/               # Utility functions (future)

# Root-level CLI scripts
train.py                 # Training CLI - direct implementation
query.py                 # Query CLI - direct implementation
```
- Python 3.9+
- UV package manager
- OpenAI API key (required)
- Neo4j database (optional, for production)
- Clone or navigate to the project directory
- Install dependencies with UV:

```bash
uv sync
```

- Required: set your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY=your_actual_openai_api_key_here
```

- Optional: copy and customize the example environment file:

```bash
cp .env.example .env  # Edit .env with your preferred settings (optional)
```
```bash
# Required
OPENAI_API_KEY=your_openai_api_key_here

# Optional OpenAI settings
OPENAI_MODEL=gpt-4o
OPENAI_TEMPERATURE=0.1
OPENAI_EMBEDDING_MODEL=text-embedding-3-large

# Optional: Neo4j configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_neo4j_password_here

# Document processing
CHUNK_SIZE=256
CHUNK_OVERLAP=50
MAX_TRIPLETS_PER_CHUNK=10

# Storage
GRAPH_STORAGE_DIR=./storage
DEFAULT_GRAPH_NAME=default
```

Training:

```bash
# Train from a document
python train.py sample_document.md --graph-name my_graph

# With custom storage location
python train.py sample_document.md --graph-name my_graph --storage-dir ./my_storage
```

Querying:

```bash
# Interactive mode
python query.py --graph-name my_graph --interactive

# Single query
python query.py --graph-name my_graph --query "What are the main topics?"

# List available graphs
python query.py --list
```

Training CLI:

```bash
python train.py document.md [options]
```
Options:

```
--graph-name, -n    Name for the knowledge graph (default: 'default')
--storage-dir, -s   Storage directory (default: './storage')
--neo4j             Use Neo4j graph store (requires Neo4j setup)
--verbose, -v       Enable detailed logging
--quiet, -q         Disable verbose logging
```

Query CLI:

```bash
python query.py [options]
```
Options:

```
--graph-name, -n    Graph to load (default: 'default')
--storage-dir, -s   Storage directory (default: './storage')
--query, -q         Single query to ask
--interactive, -i   Start interactive session
--list, -l          List available graphs
--stats             Show graph statistics
--verbose, -v       Enable detailed logging
```

Python API:

```python
import os
os.environ["OPENAI_API_KEY"] = "your_api_key_here"

from src.graph_rag import GraphRAGTrainer, GraphRAGInference

# Training
trainer = GraphRAGTrainer()
storage_path = trainer.train_from_file("document.md", "my_graph")

# Inference
inference = GraphRAGInference()
inference.load_knowledge_graph("my_graph")
response = inference.query_simple("What are the main topics?")
```

Working with multiple graphs:

```bash
# Train individual graphs
python train.py doc1.md --graph-name doc1_graph
python train.py doc2.md --graph-name doc2_graph

# Compare responses from different graphs
python query.py --graph-name doc1_graph --query "What is the main concept?"
python query.py --graph-name doc2_graph --query "What is the main concept?"
```

Interactive mode:

```bash
python query.py --graph-name my_graph --interactive

# Then interactively:
# 🔍 Query: What are the key concepts?
# 🔍 Query: How do these concepts relate?
# 🔍 Query: quit
```

Knowledge graphs are stored in the following structure:
```
storage/
├── graph_name_1/
│   ├── docstore.json      # Document storage
│   ├── graph_store.json   # Graph structure
│   ├── index_store.json   # Index mappings
│   └── vector_store.json  # Vector embeddings
└── graph_name_2/
    └── ...
```
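The `--list` command presumably enumerates these per-graph directories. A minimal sketch of that discovery logic, using the file names from the layout above (the function name is illustrative, not the repo's actual API):

```python
from pathlib import Path

def list_graphs(storage_dir: str = "./storage") -> list[str]:
    """Return names of subdirectories that look like persisted knowledge graphs."""
    root = Path(storage_dir)
    if not root.is_dir():
        return []
    # A directory counts as a graph if it contains the docstore file
    # that training persists alongside the other stores.
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and (p / "docstore.json").exists())
```

Each graph name maps one-to-one to a directory, which is why `--graph-name` should be unique per document set.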
- ✅ Modular Architecture: Separate training and inference
- ✅ Persistent Storage: Save and load knowledge graphs
- ✅ Multiple Graph Support: Manage multiple knowledge graphs
- ✅ Interactive Queries: Chat-like interface for questions
- ✅ Detailed Logging: Understand the RAG process
- ✅ Neo4j Support: Scale with graph databases
- ✅ Environment-based Config: Secure configuration management
- ✅ Graph Statistics: Analyze your knowledge graphs
- 🔐 Never commit API keys: Use environment variables only
- 🔐 Use .env.example: Provide template without secrets
- 🔐 Validate configuration: Settings module validates API keys
- 🔐 Secure storage: Knowledge graphs stored locally by default
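The environment-based configuration could load roughly like this, a sketch using the variable names and defaults from the configuration section above (the repo's actual settings.py may differ):

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    # Fall back to the documented default when the variable is unset.
    return os.environ.get(name, default)

@dataclass
class GraphRAGSettings:
    """Illustrative settings object; field names mirror the .env variables."""
    openai_model: str = field(default_factory=lambda: _env("OPENAI_MODEL", "gpt-4o"))
    chunk_size: int = field(default_factory=lambda: int(_env("CHUNK_SIZE", "256")))
    chunk_overlap: int = field(default_factory=lambda: int(_env("CHUNK_OVERLAP", "50")))
    max_triplets_per_chunk: int = field(default_factory=lambda: int(_env("MAX_TRIPLETS_PER_CHUNK", "10")))
    graph_storage_dir: str = field(default_factory=lambda: _env("GRAPH_STORAGE_DIR", "./storage"))
```

Using `default_factory` means the environment is read when the settings object is created, not at import time, so tests and scripts can set variables first.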
- API Key Not Found

  Error: `Required environment variable 'OPENAI_API_KEY' is not set`
  Solution: `export OPENAI_API_KEY=your_actual_key`

- Graph Not Found

  Error: `Knowledge graph 'my_graph' not found`
  Solution: Use `--list` to see available graphs, or train a graph first.

- Invalid API Key Format

  Error: `OPENAI_API_KEY does not appear to be a valid OpenAI API key`
  Solution: Ensure the key starts with 'sk-' or 'sk-proj-'.
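The checks behind these errors can be sketched as follows (illustrative only; the repo's validation in settings.py may be stricter):

```python
import os

def require_env(name: str) -> str:
    """Return a required environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise EnvironmentError(f"Required environment variable '{name}' is not set")
    return value

def looks_like_openai_key(key: str) -> bool:
    """Heuristic format check matching the documented error message."""
    # 'sk-proj-' keys also start with 'sk-', so one prefix test covers both.
    return key.startswith("sk-")
```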
Use the `--verbose` flag for detailed logging:

```bash
python train.py document.md --verbose
python query.py --graph-name my_graph --query "test" --verbose
```

Core dependencies:

- llama-index: Core LlamaIndex functionality
- llama-index-graph-stores-neo4j: Neo4j integration (optional)
- llama-index-embeddings-openai: OpenAI embeddings
- llama-index-llms-openai: OpenAI language models
- networkx: Graph manipulation and analysis
- matplotlib: Graph visualization (optional)
This project emphasizes security and modularity. When contributing:
- Never commit API keys or secrets
- Use environment variables for all configuration
- Follow the modular architecture pattern
- Add proper error handling and validation
This project is for educational and demonstration purposes.