This project is a high-performance, production-grade RAG (Retrieval-Augmented Generation) assistant designed to analyze complex technical documents and academic papers.
It uses a hybrid architecture that combines the speed of local vectorization with the reasoning power of cloud APIs. By pairing the Groq API for inference with optimized local embeddings, it lets users "chat" with their PDF documents at very low latency.
The system follows a strict technical pipeline designed for efficiency and accuracy:
- Data Ingestion: PDF documents are loaded from the `data/raw` directory and split into semantic chunks.
- Efficient Vectorization: Text chunks are converted into vector embeddings using the lightweight and fast `sentence-transformers/all-MiniLM-L6-v2` model. This runs locally on the CPU with minimal resource usage.
- Vector Store (ChromaDB): Embeddings are persisted locally in a vector database for retrieval.
- Retrieval & Context Injection: Relevant context is retrieved based on user queries (Similarity Search).
- LLM Generation: The retrieved context and the user prompt are sent to the Llama-3.3-70b model (via Groq API) to generate an evidence-based answer.
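The retrieval step at the heart of this pipeline can be illustrated with a minimal, self-contained sketch. This is a toy in plain Python with hand-written 3-dimensional vectors; in the real system the embeddings come from `all-MiniLM-L6-v2` and the search is performed by ChromaDB, not by the loop below:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, top_k=2):
    # Rank (vector, text) chunk pairs by similarity to the query vector
    # and return the text of the top_k best matches.
    scored = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[0]), reverse=True)
    return [text for _, text in scored[:top_k]]

# Toy index: in the real pipeline these vectors come from the embedding model.
index = [
    ([1.0, 0.0, 0.0], "U-Net uses an encoder-decoder with skip connections."),
    ([0.0, 1.0, 0.0], "YOLOv1 frames detection as a single regression problem."),
    ([0.9, 0.1, 0.0], "Skip connections preserve spatial detail during upsampling."),
]
print(retrieve([1.0, 0.05, 0.0], index, top_k=2))
```

The retrieved texts are then injected into the prompt sent to the LLM, which is what grounds the answer in the source documents.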
- API & RAG Fusion: Combines the general knowledge of Large Language Models with your private data.
- Optimized Performance: Uses `all-MiniLM-L6-v2` for fast CPU-based embedding generation, keeping the Docker image lightweight.
- One-Click Setup: Includes automated scripts (`.sh` and `.bat`) for instant local environment setup.
- Evidence-Based Answers: The assistant cites the specific source files used to generate the response.
- Context-Aware Memory: Maintains chat history to support follow-up questions and conversational flow.
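The context-aware memory feature can be pictured as a bounded chat history that is replayed into each prompt so the LLM can resolve follow-up references. The sketch below is illustrative only; the actual engine in `src/engine.py` may manage history differently:

```python
from collections import deque

class ChatMemory:
    """Keeps the last `max_turns` (question, answer) pairs for prompt construction."""

    def __init__(self, max_turns=5):
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        # Oldest turns fall off automatically once max_turns is exceeded.
        self.turns.append((question, answer))

    def as_prompt_prefix(self):
        # Serialize history so follow-ups like "who proposed it?" stay resolvable.
        return "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)

memory = ChatMemory(max_turns=2)
memory.add("What is U-Net?", "An encoder-decoder segmentation network.")
memory.add("Who proposed it?", "Ronneberger et al., 2015.")
print(memory.as_prompt_prefix())
```

Bounding the history keeps the prompt within the model's context window while still supporting conversational flow.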
Here the system is demonstrated handling queries in multiple languages across different technical domains.
The model accurately retrieves information to explain core biomedical segmentation architectures like U-Net.

The system is capable of understanding and responding to technical queries in different languages, such as explaining YOLOv1 in Turkish.

Handling specialized queries about evolving architectures like U-Net++, shown in the system's light theme interface.

This project utilizes a comprehensive collection of academic papers, ranging from foundational Deep Learning architectures to state-of-the-art Object Detection (YOLO series) and Segmentation models.
| Paper Title | Topic | Year | Link |
|---|---|---|---|
| Adam: A Method for Stochastic Optimization | Optimization | 2014 | 1412.6980 |
| U-Net: Convolutional Networks for Biomedical Image Segmentation | Biomedical / Seg. | 2015 | 1505.04597 |
| Deep Residual Learning for Image Recognition (ResNet) | Backbone / CV | 2015 | 1512.03385 |
| You Only Look Once: Unified, Real-Time Object Detection (YOLOv1) | Object Detection | 2015 | 1506.02640 |
| Identity Mappings in Deep Residual Networks | Backbone / CV | 2016 | 1603.05027 |
| Wide Residual Networks | Backbone / CV | 2016 | 1605.07146 |
| Aggregated Residual Transformations (ResNeXt) | Backbone / CV | 2016 | 1611.05431 |
| YOLO9000: Better, Faster, Stronger (YOLOv2) | Object Detection | 2016 | 1612.08242 |
| Attention Is All You Need (Transformer) | NLP / Foundation | 2017 | 1706.03762 |
| Squeeze-and-Excitation Networks (SENet) | Backbone / CV | 2017 | 1709.01507 |
| MobileNetV2: Inverted Residuals and Linear Bottlenecks | Mobile / CV | 2018 | 1801.04381 |
| YOLOv3: An Incremental Improvement | Object Detection | 2018 | 1804.02767 |
| EfficientNet: Rethinking Model Scaling for CNNs | Backbone / CV | 2019 | 1905.11946 |
| EfficientDet: Scalable and Efficient Object Detection | Object Detection | 2019 | 1911.09070 |
| YOLOv4: Optimal Speed and Accuracy of Object Detection | Object Detection | 2020 | 2004.10934 |
| Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (RAG) | GenAI / RAG | 2020 | 2005.11401 |
| An Image is Worth 16x16 Words: Transformers for Image Recognition (ViT) | Vision Transformer | 2020 | 2010.11929 |
| EfficientNetV2: Smaller Models and Faster Training | Backbone / CV | 2021 | 2104.00298 |
| LoRA: Low-Rank Adaptation of Large Language Models | LLM / Fine-tuning | 2021 | 2106.09685 |
| YOLOv7: Trainable Bag-of-Freebies | Object Detection | 2022 | 2207.02696 |
| Segment Anything (SAM) | Segmentation | 2023 | 2304.02643 |
| YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information | Object Detection | 2024 | 2402.13616 |
| YOLOv10: Real-Time End-to-End Object Detection | Object Detection | 2024 | 2405.14458 |
| Other Technical Reports | Deep Learning | Misc | arXiv Index |
You need a Groq API Key to run the inference engine.
- Rename `config/config_example.py` to `config/config.py`.
- Paste your API key inside the file:

```python
# config/config.py
GROQ_API_KEY = "gsk_..."
```

This is the cleanest method. You do not need Python installed, only Docker Desktop.
```bash
# 1. Build and start the container
docker compose up --build

# 2. Access the application
# Open your browser and go to: http://localhost:8501
```
If you prefer running locally without Docker, use the provided automation scripts. These scripts automatically handle virtual environment creation, dependency installation, data ingestion, and app launch.
For Linux / macOS Users:
```bash
# Give execution permission (only once)
chmod +x run_linux.sh

# Run the script
./run_linux.sh
```
For Windows Users:
Simply double-click the `run_windows.bat` file.
Note: Ensure you have placed your PDF files in the data/raw/ directory before running the scripts.
```
rag-genai-assistant/
├── config/                 # Configuration files and API Keys
├── data/
│   └── raw/                # Upload your PDF documents here
├── logs/                   # System logs (Ingestion and Chat history)
├── src/
│   ├── data_ingestion.py   # Handles document loading and embedding generation
│   ├── engine.py           # Core logic for the Chat Engine
│   ├── llm_setup.py        # Initialization of Groq API and Embedding models
│   ├── retriever.py        # Custom retrieval logic
│   └── utils/
│       └── logger.py       # Centralized logging configuration
├── vector_db/              # Persistent storage for ChromaDB
├── app.py                  # Streamlit Frontend Application
├── docker-compose.yml      # Docker orchestration file
├── Dockerfile              # Optimized Docker image definition
├── run_linux.sh            # Automated setup script for Linux/Mac
└── run_windows.bat         # Automated setup script for Windows
```
| Component | Technology | Description |
|---|---|---|
| LLM Inference | Meta Llama 3.3 (70B) | Powered by Groq API for ultra-fast generation. |
| Orchestration | LlamaIndex | Framework for connecting LLMs with external data. |
| Vector DB | ChromaDB | Open-source embedding database. |
| Embedding Model | all-MiniLM-L6-v2 | High-speed, CPU-friendly sentence transformer. |
| Frontend | Streamlit | Interactive web interface. |
| Deployment | Docker | Containerization and environment isolation. |
Developer: Ozan Bozyel
Role: Biomedical & Deep Learning Engineer
LinkedIn: Ozan Bozyel
GitHub: BozyelOzan
This project is open-source and intended for educational and research purposes.