A comprehensive Retrieval-Augmented Generation (RAG) system that uses a graph database to store and retrieve knowledge in a structured, interconnected way. The system features web scraping for data collection, document processing with semantic chunking, and a Streamlit interface for easy interaction.
Multiple Data Source Options:
- Web scraping functionality to collect data from any website
- Dedicated scraping module (Wikipedia format supported)
Advanced Text Processing:
- Semantic chunking with configurable size and overlap
- High-quality embeddings using Sentence Transformers
- Context expansion for improved relevance
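The chunking step above can be sketched as follows. This is a minimal character-based sketch using the defaults mentioned later in this README (size 500, overlap 50); the function name and exact splitting strategy are illustrative, not necessarily the project's actual implementation:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly chunk_size characters."""
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks

# Embeddings could then be generated with Sentence Transformers, e.g.:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# vectors = model.encode(chunks)
```

Each chunk shares its last `overlap` characters with the start of the next one, so context is not cut off at chunk boundaries.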
Graph Database Integration:
- Neo4j backend for knowledge storage
- Semantic relationships between text chunks
- Document and chunk hierarchies with metadata
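As a rough illustration of the hierarchy above, documents and chunks could be written to Neo4j like this. The `Document`/`Chunk` labels, the `HAS_CHUNK` relationship type, and the helper name are assumptions for illustration, not the project's confirmed schema:

```python
# Illustrative Cypher; labels and relationship types are assumed, not confirmed.
CREATE_CHUNK = """
MERGE (d:Document {id: $doc_id})
CREATE (c:Chunk {id: $chunk_id, text: $text, embedding: $embedding, index: $index})
CREATE (d)-[:HAS_CHUNK]->(c)
"""

def chunk_params(doc_id, chunks, embeddings):
    """Build one parameter dict per chunk for the CREATE_CHUNK query."""
    return [
        {
            "doc_id": doc_id,
            "chunk_id": f"{doc_id}-{i}",
            "text": text,
            "embedding": list(vec),
            "index": i,
        }
        for i, (text, vec) in enumerate(zip(chunks, embeddings))
    ]

# With the official neo4j driver this would run roughly as:
# from neo4j import GraphDatabase
# driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# with driver.session() as session:
#     for p in chunk_params("doc-1", chunks, vectors):
#         session.run(CREATE_CHUNK, **p)
```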
Streamlined User Interface:
- Interactive Streamlit application
- Ask questions or provide URLs to ingest data
- Visualization of the knowledge graph structure (TBD)
Customizable LLM Integration:
- Configurable to work with any OpenAI model
- Extensible design for other LLM providers
Prerequisites:
- Python 3.10
- Neo4j database (local, Docker, or cloud instance)
- OpenAI API key (or equivalent)
Clone this repository:
```shell
git clone git@github.com:MinaBeirami/nexusRAG.git
cd nexusRAG
```
Install the required dependencies:
```shell
pip install -r requirements.txt
```
Set up environment variables: edit the `.env` file with your API keys and database credentials.
Start the Neo4j database (if using a local instance). Run it via Docker Desktop, or run the following command:
```shell
# If using Docker
docker run --name neo4j -p 7474:7474 -p 7687:7687 -e NEO4J_AUTH=neo4j/password -d neo4j
```
Launch the application:
```shell
streamlit run app.py
```
This starts the Streamlit server; you can access the application at http://localhost:8501.
The system provides several ways to collect data:
- Web Scraping: Enter URLs to scrape content from websites
- Paste Text: Directly paste text content into the application
- (TODO) Upload Files: Upload local documents (PDF, DOCX, TXT, CSV)
- (TODO) Hugging Face Datasets: Select and import datasets from Hugging Face
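For the web-scraping path, the core text extraction can be sketched with only the standard library. The class and function names here are illustrative; the project may well use a dedicated parsing library instead:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text from HTML, skipping script/style contents."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def scrape_text(html: str) -> str:
    """Extract plain text from an HTML document string."""
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.parts)

# In practice the HTML would be fetched first, e.g. with
# urllib.request.urlopen(url).read().decode() or the requests library.
```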
After collecting data, the system will:
- Process documents into semantic chunks
- Generate embeddings for each chunk
- Store chunks and their relationships in the Neo4j graph database
- Create semantic relationships between related chunks
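The last step, linking related chunks, comes down to comparing chunk embeddings. A minimal sketch, where the 0.8 threshold and the `SIMILAR_TO` relationship type are illustrative assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def similar_pairs(embeddings, threshold=0.8):
    """Index pairs of chunks whose embedding similarity meets the threshold."""
    pairs = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine_similarity(embeddings[i], embeddings[j]) >= threshold:
                pairs.append((i, j))
    return pairs

# Each pair could then be written to Neo4j as a relationship (hypothetical type):
# MATCH (a:Chunk {index: $i}), (b:Chunk {index: $j}) CREATE (a)-[:SIMILAR_TO]->(b)
```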
Once the knowledge graph is built, you can:
- Ask questions in natural language
- View the retrieved context used to answer the question
- (TODO) Explore the knowledge graph visually
- Export answers and sources
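Under the hood, answering a question amounts to embedding it, retrieving the most similar chunks, and handing them to the LLM as context. A minimal retrieval sketch (function names are illustrative, not the project's API):

```python
import math

def _cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def retrieve_context(question_vec, chunk_vecs, chunks, top_k=3):
    """Return the top_k chunks most similar to the question embedding."""
    order = sorted(range(len(chunks)),
                   key=lambda i: _cosine(question_vec, chunk_vecs[i]),
                   reverse=True)
    return [chunks[i] for i in order[:top_k]]

# The retrieved chunks would then go into the LLM prompt, roughly:
# from openai import OpenAI
# client = OpenAI()
# reply = client.chat.completions.create(
#     model="gpt-3.5-turbo",
#     messages=[
#         {"role": "system", "content": "Answer using only this context:\n" + "\n".join(context)},
#         {"role": "user", "content": question},
#     ],
# )
```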
The system can be customized through the `src/config/settings.py` file:
- `embedding_model`: change the embedding model (default: `"all-MiniLM-L6-v2"`)
- `chunk_size`: adjust the size of text chunks (default: 500)
- `chunk_overlap`: set the overlap between chunks (default: 50)
- `llm_model`: select the LLM model (default: `"gpt-3.5-turbo"`)
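Taken together, those options could be represented as a small settings object. This is a hypothetical layout mirroring the defaults listed above; the real `src/config/settings.py` may be structured differently:

```python
from dataclasses import dataclass

@dataclass
class Settings:
    # Defaults mirror the values documented above.
    embedding_model: str = "all-MiniLM-L6-v2"
    chunk_size: int = 500
    chunk_overlap: int = 50
    llm_model: str = "gpt-3.5-turbo"
```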
The system follows a modular architecture:
- `data_collector.py`: modules for acquiring data from various sources
- `text_processor.py`: text processing, chunking, and embedding generation
- `engine.py`: core RAG implementation with LLM integration
- `graph_handler.py`: Neo4j database interaction
- `app.py`: Streamlit user interface
Contributions are welcome! Please feel free to submit a Pull Request.