A modular Graph-based Retrieval-Augmented Generation (RAG) system with separate training and inference pipelines. This repository uses LlamaIndex to demonstrate a simple Graph RAG pipeline.
This system builds knowledge graphs from markdown files and lets you query them in natural language. The training and inference phases are completely separated, allowing you to:
- Train once: Build knowledge graphs from your markdown files
- Query many times: Ask questions using stored graphs without reprocessing
```
src/graph_rag/
├── core/                # Core functionality
│   ├── trainer.py       # Training module for creating knowledge graphs
│   └── inference.py     # Inference module for querying stored graphs
├── config/              # Configuration management
│   └── settings.py      # Environment-based settings
├── storage/             # Storage utilities (future)
└── utils/               # Utility functions (future)

# Root-level CLI scripts
train.py                 # Training CLI - direct implementation
query.py                 # Query CLI - direct implementation
```
- Python 3.9+
- UV package manager
- OpenAI API key (required)
- Neo4j database (optional, for production)
- Clone or navigate to the project directory
- Install dependencies with UV:

```bash
uv sync
```

- Required: set your OpenAI API key as an environment variable:

```bash
export OPENAI_API_KEY=your_actual_openai_api_key_here
```

- Optional: copy and customize the example environment file:

```bash
cp .env.example .env  # Edit .env with your preferred settings (optional)
```
```bash
# Required
OPENAI_API_KEY=your_openai_api_key_here

# Optional OpenAI settings
OPENAI_MODEL=gpt-4o
OPENAI_TEMPERATURE=0.1
OPENAI_EMBEDDING_MODEL=text-embedding-3-large

# Optional: Neo4j configuration
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=your_neo4j_password_here

# Document processing
CHUNK_SIZE=256
CHUNK_OVERLAP=50
MAX_TRIPLETS_PER_CHUNK=10

# Storage
GRAPH_STORAGE_DIR=./storage
DEFAULT_GRAPH_NAME=default
```

Training:

```bash
# Train from a document
python train.py sample_document.md --graph-name my_graph

# With custom storage location
python train.py sample_document.md --graph-name my_graph --storage-dir ./my_storage
```

Querying:

```bash
# Interactive mode
python query.py --graph-name my_graph --interactive

# Single query
python query.py --graph-name my_graph --query "What are the main topics?"

# List available graphs
python query.py --list
```

Training CLI:

```bash
python train.py document.md [options]
```
Options:

```
--graph-name, -n    Name for the knowledge graph (default: 'default')
--storage-dir, -s   Storage directory (default: './storage')
--neo4j             Use Neo4j graph store (requires Neo4j setup)
--verbose, -v       Enable detailed logging
--quiet, -q         Disable verbose logging
```

Query CLI:

```bash
python query.py [options]
```
Options:

```
--graph-name, -n    Graph to load (default: 'default')
--storage-dir, -s   Storage directory (default: './storage')
--query, -q         Single query to ask
--interactive, -i   Start interactive session
--list, -l          List available graphs
--stats             Show graph statistics
--verbose, -v       Enable detailed logging
```

Python API:

```python
import os
os.environ["OPENAI_API_KEY"] = "your_api_key_here"

from src.graph_rag import GraphRAGTrainer, GraphRAGInference

# Training
trainer = GraphRAGTrainer()
storage_path = trainer.train_from_file("document.md", "my_graph")

# Inference
inference = GraphRAGInference()
inference.load_knowledge_graph("my_graph")
response = inference.query_simple("What are the main topics?")
```

Working with multiple graphs:

```bash
# Train individual graphs
python train.py doc1.md --graph-name doc1_graph
python train.py doc2.md --graph-name doc2_graph

# Compare responses from different graphs
python query.py --graph-name doc1_graph --query "What is the main concept?"
python query.py --graph-name doc2_graph --query "What is the main concept?"
```

Interactive mode:

```bash
python query.py --graph-name my_graph --interactive

# Then interactively:
# 🔍 Query: What are the key concepts?
# 🔍 Query: How do these concepts relate?
# 🔍 Query: quit
```

Knowledge graphs are stored in the following structure:
```
storage/
├── graph_name_1/
│   ├── docstore.json      # Document storage
│   ├── graph_store.json   # Graph structure
│   ├── index_store.json   # Index mappings
│   └── vector_store.json  # Vector embeddings
└── graph_name_2/
    └── ...
```
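The `--list` command presumably enumerates these per-graph directories. A minimal sketch of that discovery logic, using the file names from the layout above (the function name is illustrative, not the repo's actual API):

```python
from pathlib import Path

def list_graphs(storage_dir: str = "./storage") -> list[str]:
    """Return names of subdirectories that look like persisted knowledge graphs."""
    root = Path(storage_dir)
    if not root.is_dir():
        return []
    # A directory counts as a graph if it contains the docstore file
    # that training persists alongside the other stores.
    return sorted(p.name for p in root.iterdir()
                  if p.is_dir() and (p / "docstore.json").exists())
```

Each graph name maps one-to-one to a directory, which is why `--graph-name` should be unique per document set.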
- ✅ Modular Architecture: Separate training and inference
- ✅ Persistent Storage: Save and load knowledge graphs
- ✅ Multiple Graph Support: Manage multiple knowledge graphs
- ✅ Interactive Queries: Chat-like interface for questions
- ✅ Detailed Logging: Understand the RAG process
- ✅ Neo4j Support: Scale with graph databases
- ✅ Environment-based Config: Secure configuration management
- ✅ Graph Statistics: Analyze your knowledge graphs
- 🔐 Never commit API keys: Use environment variables only
- 🔐 Use .env.example: Provide template without secrets
- 🔐 Validate configuration: Settings module validates API keys
- 🔐 Secure storage: Knowledge graphs stored locally by default
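The environment-based configuration could load roughly like this, a sketch using the variable names and defaults from the configuration section above (the repo's actual settings.py may differ):

```python
import os
from dataclasses import dataclass, field

def _env(name: str, default: str) -> str:
    # Fall back to the documented default when the variable is unset.
    return os.environ.get(name, default)

@dataclass
class GraphRAGSettings:
    """Illustrative settings object; field names mirror the .env variables."""
    openai_model: str = field(default_factory=lambda: _env("OPENAI_MODEL", "gpt-4o"))
    chunk_size: int = field(default_factory=lambda: int(_env("CHUNK_SIZE", "256")))
    chunk_overlap: int = field(default_factory=lambda: int(_env("CHUNK_OVERLAP", "50")))
    max_triplets_per_chunk: int = field(default_factory=lambda: int(_env("MAX_TRIPLETS_PER_CHUNK", "10")))
    graph_storage_dir: str = field(default_factory=lambda: _env("GRAPH_STORAGE_DIR", "./storage"))
```

Using `default_factory` means the environment is read when the settings object is created, not at import time, so tests and scripts can set variables first.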
- API Key Not Found

  Error: `Required environment variable 'OPENAI_API_KEY' is not set`
  Solution: `export OPENAI_API_KEY=your_actual_key`

- Graph Not Found

  Error: `Knowledge graph 'my_graph' not found`
  Solution: Use `--list` to see available graphs, or train a graph first.

- Invalid API Key Format

  Error: `OPENAI_API_KEY does not appear to be a valid OpenAI API key`
  Solution: Ensure the key starts with 'sk-' or 'sk-proj-'.
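The checks behind these errors can be sketched as follows (illustrative only; the repo's validation in settings.py may be stricter):

```python
import os

def require_env(name: str) -> str:
    """Return a required environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise EnvironmentError(f"Required environment variable '{name}' is not set")
    return value

def looks_like_openai_key(key: str) -> bool:
    """Heuristic format check matching the documented error message."""
    # 'sk-proj-' keys also start with 'sk-', so one prefix test covers both.
    return key.startswith("sk-")
```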
Use the `--verbose` flag for detailed logging:

```bash
python train.py document.md --verbose
python query.py --graph-name my_graph --query "test" --verbose
```

Core dependencies:

- llama-index: Core LlamaIndex functionality
- llama-index-graph-stores-neo4j: Neo4j integration (optional)
- llama-index-embeddings-openai: OpenAI embeddings
- llama-index-llms-openai: OpenAI language models
- networkx: Graph manipulation and analysis
- matplotlib: Graph visualization (optional)
This project emphasizes security and modularity. When contributing:
- Never commit API keys or secrets
- Use environment variables for all configuration
- Follow the modular architecture pattern
- Add proper error handling and validation
This project is for educational and demonstration purposes.