GraphQnA is a powerful domain-agnostic question-answering system built on Neo4j's GraphRAG framework. It combines vector search, knowledge graph traversal, and LLM reasoning to provide accurate, context-aware answers to complex questions about your domain knowledge.
- Key Features
- Quick Start
- Architecture Overview
- Installation
- Domain Configuration
- Ingesting Documents
- Asking Questions
- API and Slack Integration
- CLI Reference
- Configuration Options
- Troubleshooting
- Development
- License
- Hybrid Retrieval Orchestration - Automatically selects the optimal retrieval method based on query type
- Domain-Agnostic Design - Works with any knowledge domain through centralized configuration
- Intelligent Knowledge Graph Building - LLM-powered extraction of entities and relationships
- Multiple Retrieval Strategies:
- Vector Retrieval - Semantic similarity search for factual information
- GraphRAG - Enhanced retrieval leveraging graph structure for context
- Knowledge Graph (KG) - Direct Cypher query generation
- Enhanced KG - Schema-guided retrieval for complex questions
- REST API & Slack Bot - Multiple interfaces for easy integration
- Comprehensive Testing - Domain-specific test suites with detailed metrics
- Enterprise-Class CLI - Full command-line tools for all operations
# Clone the repository
git clone https://github.com/veteranbv/graphqna.git
cd graphqna
# Create virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
pip install -r requirements.txt
# Configure your environment
cp .env.example .env
# Edit .env with your Neo4j and OpenAI credentials
# Create domain configuration
cp graphqna/config/domain_config_example.py graphqna/config/domain_config.py
# Edit domain_config.py to customize for your domain
# Ingest documents
./scripts/ingest.sh
# Try the demo
python scripts/hybrid_qa_demo.pyGraphQnA employs a sophisticated hybrid architecture that combines multiple retrieval methods:
-
Vector Retrieval: Uses embedding similarity to find relevant document chunks
- Best for: Factual questions and information retrieval
- Implementation:
VectorRetrieveringraphqna/retrieval/vector.py
-
GraphRAG: Combines vector similarity with graph traversal
- Best for: General questions requiring contextual awareness
- Implementation:
GraphRetrieveringraphqna/retrieval/graph.py
-
Knowledge Graph: Converts questions directly to Cypher queries
- Best for: Simple entity and relationship questions
- Implementation:
KnowledgeGraphRetrieveringraphqna/retrieval/kg.py
-
Enhanced Knowledge Graph: Schema-aware Cypher generation
- Best for: Complex entity and relationship questions
- Implementation:
EnhancedKGRetrieveringraphqna/retrieval/enhanced_kg.py
The hybrid retriever (HybridRetriever in graphqna/retrieval/hybrid_retriever.py) automatically selects the best method for each question:
- Query Classification: Analyzes the question type (factual, procedural, entity, relationship)
- Method Selection:
- Factual questions → GraphRAG (with Vector fallback)
- Procedural questions → GraphRAG (with Vector fallback)
- Entity questions → Enhanced KG (with GraphRAG fallback)
- Relationship questions → Enhanced KG (with KG fallback)
- Fallback Mechanisms: If the primary method fails or returns a generic answer, alternative methods are tried
The ingestion process creates a rich knowledge graph with entities and relationships. Here's a visualization of an example knowledge graph created by GraphQnA, viewed in Neo4j Bloom:
A Neo4j Bloom visualization showing entities (nodes) and relationships extracted from documents. Different node types are color-coded, revealing the rich semantic structure that enhances question answering.
Documents are processed through a multi-stage pipeline:
- Document Loading: Support for various formats (markdown, PDF, etc.)
- Chunking: Text is split into manageable chunks
- Embedding: Chunks are embedded using OpenAI models
- Schema Detection: LLM identifies entity and relationship types
- Knowledge Graph Extraction: Entities and relationships are extracted from text
- Graph Import: Knowledge is imported into Neo4j
- Python 3.9+
- Neo4j Database (Neo4j Aura or self-hosted)
- OpenAI API key (for GPT-4 and embeddings)
-
Create a virtual environment:
python -m venv venv source venv/bin/activate # On Windows: venv\Scripts\activate pip install -r requirements.txt
-
Set up Neo4j:
-
For Neo4j Aura (recommended):
- Create an account at Neo4j Aura
- Create a new database instance
- Save the connection URI, username, and password
-
For self-hosted Neo4j:
- Install Neo4j (version 5.0+)
- Configure authentication
-
-
Configure environment variables:
Create a
.envfile with:NEO4J_URI=neo4j+s://your-instance-id.databases.neo4j.io NEO4J_USERNAME=neo4j NEO4J_PASSWORD=your-password NEO4J_DATABASE=neo4j OPENAI_API_KEY=your-openai-api-key EMBEDDING_DIMENSIONS=3072 # For text-embedding-3-large -
Create domain configuration:
cp graphqna/config/domain_config_example.py graphqna/config/domain_config.py # Edit domain_config.py to customize for your domain -
Verify installation:
python -m graphqna db --check-connection python -m graphqna db --check-index
GraphQnA is designed to be domain-agnostic through a centralized configuration system.
Your domain configuration (domain_config.py) includes:
- Domain Metadata: Name and description
- Entity Definitions: Types of objects in your domain
- Relationship Definitions: How entities relate to each other
- Schema Triplets: Valid entity-relationship-entity combinations
- Response Templates: Consistent messaging
- Fallback Queries: Backup database queries
- LLM Prompts: Domain-specific instructions for various components
# Domain metadata
DOMAIN_NAME = "HealthcareKnowledge"
DOMAIN_DESCRIPTION = "A knowledge graph for healthcare documentation"
# Entity definitions
ENTITY_DEFINITIONS = [
{
"label": "Condition",
"description": "A medical condition or diagnosis",
"properties": [
{"name": "name", "type": "STRING"},
{"name": "description", "type": "STRING"},
{"name": "icd10_code", "type": "STRING"},
],
},
# More entity definitions...
]
# Relationship definitions
RELATION_DEFINITIONS = [
{
"label": "TREATS",
"description": "A treatment that addresses a condition",
"properties": [{"name": "effectiveness", "type": "STRING"}],
},
# More relationship definitions...
]The ingestion process converts documents into a knowledge graph with embeddings, entities, and relationships.
- Raw Data: Place source files in
data/raw/ - Processed Data: Successfully processed files move to
data/processed/ - Logs: System logs are stored in
logs/ - Output: Generated outputs appear in
output/
# Process a single file
python -m graphqna ingest --file path/to/document.md
# Process all files in a directory
python -m graphqna ingest --directory data/raw --pattern "*.md"
# Clear database before ingestion
python -m graphqna ingest --file path/to/document.md --clear
# Move processed files
python -m graphqna ingest --directory data/raw --move-processed
# Skip already processed files
python -m graphqna ingest --directory data/raw --skip-existing./scripts/ingest.shThis handles common ingestion scenarios automatically.
The ingestion pipeline (graphqna/ingest/pipeline.py) orchestrates:
- Document chunking with specified size and overlap
- Embedding generation using OpenAI models
- Schema detection from document content
- Knowledge graph extraction using LLMs
- Importing entities and relationships into Neo4j
- Creating a vector index for similarity search
# Ask a simple question
python -m graphqna query "What is GraphQnA?"
# Specify retrieval method
python -m graphqna query "How do I configure the system?" --method vector
# Show retrieved context
python -m graphqna query "What are the key features?" --context
# Process multiple questions
python -m graphqna query --file path/to/questions.txt
# Save responses to files
python -m graphqna query --file questions.txt --output-dir output/results# Start interactive session
python -m graphqna query --interactive
# Or use the convenience script
./scripts/run_interactive.shIn interactive mode:
- Type questions to get answers
- Type
vector,graphrag,kg,enhanced_kg, orhybridto change methods - Type
contextto toggle context display - Type
exit,quit, orqto exit
- hybrid (default): Automatically selects the best method
- vector: Pure vector similarity search
- graphrag: Neo4j's GraphRAG framework
- kg: Knowledge graph query generation
- enhanced_kg: Schema-guided Cypher generation
Try the demo script to compare all retrieval methods:
python scripts/hybrid_qa_demo.pyThis demo offers three modes:
- Compare all retrieval methods on a single query
- Run example questions for each query type
- Try your own question with the hybrid retriever
GraphQnA includes both a REST API and Slack bot for easy integration.
python scripts/run_api.pyGET /api/health- Check service healthGET /api/info- Get service informationPOST /api/query- Answer a questionPOST /api/ingest- Ingest a document (placeholder)
curl -X POST http://localhost:8000/api/query \
-H "Content-Type: application/json" \
-d '{
"query": "What is GraphQnA?",
"retrieval_method": "hybrid"
}'The Slack bot integration allows teams to ask questions directly in Slack channels or via direct messages.
Screenshot showing a sample interaction with the GraphQnA Slack bot. The bot provides comprehensive answers with source information and feedback buttons.
# Set environment variables
export SLACK_BOT_TOKEN=xoxb-your-bot-token
export SLACK_APP_TOKEN=xapp-your-app-token
export SLACK_SIGNING_SECRET=your-signing-secret
# Start the Slack bot
python scripts/run_slack_bot.py- Direct Messages: Send questions directly to the bot
- Channel Mentions: Mention the bot in a channel
- Monitored Channels: Bot can listen in specific channels
- Feedback Collection: Users can provide feedback on answers
- Threaded Responses: Keeps conversations organized
- Confidence Filtering: Only responds when confident in non-primary channels
For production deployment, GraphQnA includes systemd service files:
Located in deployment/systemd/, these files allow you to run the API and Slack bot as system services:
# Copy service files to systemd directory
sudo cp deployment/systemd/graphqna-api.service /etc/systemd/system/
sudo cp deployment/systemd/graphqna-slackbot.service /etc/systemd/system/
# Enable and start services
sudo systemctl daemon-reload
sudo systemctl enable graphqna-api
sudo systemctl start graphqna-api
sudo systemctl enable graphqna-slackbot
sudo systemctl start graphqna-slackbotThe service files are configured to:
- Run the API and Slack bot as system services
- Automatically restart on failure
- Load environment variables from your
.envfile - Run as a dedicated system user for security
- Output logs to the system journal
GraphQnA provides a comprehensive command-line interface:
# Show help information
python -m graphqna --helpingest- Process documents into the knowledge graphquery- Ask questions using the knowledge graphdb- Manage the Neo4j databasetest- Run test suites to evaluate system performance
# Show database statistics
python -m graphqna db --stats
# Clear the database
python -m graphqna db --clear
# Check database connection
python -m graphqna db --check-connection
# Reset vector index
python -m graphqna db --reset-vector-index --dimensions 3072
# Check vector index configuration
python -m graphqna db --check-index
# Create database backup
python -m graphqna db --backup output/database_backup.cypherGraphQnA includes extensive testing capabilities:
# Run basic test suite
python -m graphqna test
# Run full test suite with all methods
python -m graphqna test --suite full --method all
# Run custom tests from a file
python -m graphqna test --suite custom --file path/to/test_questions.md
# Write test results to a file
python -m graphqna test --output output/test_results.json
# Show detailed information
python -m graphqna test --verboseCreate domain-specific test questions:
cp tests/resources/test_questions_template.md tests/resources/test_questions_domain.mdThen customize these questions to match your domain's terminology and use cases.
GraphQnA can be configured through environment variables or the .env file:
| Option | Description | Default |
|---|---|---|
NEO4J_URI |
Neo4j database connection URI | - |
NEO4J_USERNAME |
Neo4j database username | - |
NEO4J_PASSWORD |
Neo4j database password | - |
NEO4J_DATABASE |
Neo4j database name | neo4j |
OPENAI_API_KEY |
OpenAI API key | - |
LLM_MODEL |
OpenAI model to use | gpt-4o |
LLM_TEMPERATURE |
Temperature for LLM generation | 0.0 |
LLM_MAX_TOKENS |
Maximum tokens for LLM responses | 2000 |
EMBEDDING_MODEL |
Embedding model to use | text-embedding-3-large |
EMBEDDING_DIMENSIONS |
Dimensions of embedding vectors | 3072 |
CHUNK_SIZE |
Text chunk size for processing | 1000 |
CHUNK_OVERLAP |
Overlap between chunks | 200 |
VECTOR_TOP_K |
Number of chunks to retrieve | 5 |
LOG_LEVEL |
Logging level (INFO, DEBUG, etc.) | INFO |
VECTOR_INDEX_NAME |
Name of the vector index in Neo4j | document-chunks |
If you encounter "Index query vector has X dimensions, but indexed vectors have Y dimensions":
-
Check settings:
cat .env | grep DIMENSIONS python -m graphqna db --check-index -
Reset the index:
python -m graphqna db --reset-vector-index --dimensions 3072
-
Re-ingest documents:
python -m graphqna ingest --directory data/raw --clear
If you have trouble connecting to Neo4j:
-
Verify connection:
python -m graphqna db --check-connection
-
Check credentials in
.env -
For Neo4j Aura: Ensure your IP is on the allowlist
If document processing fails:
-
Check the logs in
logs/graphqna.log -
Process a single file to isolate issues:
python -m graphqna ingest --file path/to/document.md
-
Verify vector index:
python -m graphqna db --check-index
/
├── graphqna/ # Core package
│ ├── __init__.py # Package initialization
│ ├── __main__.py # Module entry point
│ ├── api/ # API implementation
│ │ ├── server.py # FastAPI server
│ │ └── slack_bot.py # Slack integration
│ ├── cli/ # Command-line interface
│ │ ├── main.py # CLI entry point
│ │ └── commands/ # Command modules
│ ├── config/ # Configuration management
│ │ ├── settings.py # Settings handler
│ │ └── domain_config*.py # Domain configuration
│ ├── db/ # Database connectivity
│ │ ├── neo4j.py # Neo4j connection
│ │ └── vector_index.py # Vector index management
│ ├── ingest/ # Document ingestion
│ │ ├── chunker.py # Text chunking
│ │ ├── embedder.py # Vector embedding
│ │ ├── kg_builder.py # Knowledge graph builder
│ │ └── pipeline.py # End-to-end pipeline
│ ├── models/ # Data models
│ └── retrieval/ # Retrieval strategies
│ ├── base.py # Base retriever
│ ├── vector.py # Vector retrieval
│ ├── graph.py # GraphRAG retrieval
│ ├── kg.py # Knowledge graph retrieval
│ ├── enhanced_kg.py # Enhanced KG retrieval
│ ├── hybrid_retriever.py # Hybrid orchestrator
│ └── service.py # Unified service
├── scripts/ # Utility scripts
├── data/ # Data files
├── docs/ # Documentation
├── tests/ # Test suite
│ ├── integration/ # Integration tests
│ ├── resources/ # Test resources
│ └── unit/ # Unit tests
└── requirements.txt # Dependencies# Run all tests
pytest
# Run with coverage
pytest --cov=graphqnaLicensed under the MIT License. See the LICENSE file for details.

