NUBot: Retrieval-Augmented Generation (RAG) Chatbot

NUBot is an intelligent chatbot designed to assist students and visitors with queries related to Northeastern University, such as courses, faculty, co-op opportunities, and more. It utilizes a Retrieval-Augmented Generation (RAG) approach to provide instant, accurate responses.
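At a high level, a RAG chatbot embeds the user's question, retrieves the most similar document chunks from a vector index, and feeds them to a language model as context. The sketch below illustrates the idea only; it is not NUBot's actual code, and the embedding model, index path, and function names are assumptions (the project does use a FAISS index and the local mistral model via Ollama, per the steps below).

# Minimal RAG sketch (illustrative only, not NUBot's actual code).
# Assumes sentence-transformers embeddings, a FAISS index over
# university pages, and the official ollama Python client.
import faiss
import ollama
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # hypothetical model choice
index = faiss.read_index("faiss_index/index.faiss")  # hypothetical path
chunks: list[str] = []  # corpus text chunks, aligned with the index rows

def answer(query: str, k: int = 3) -> str:
    # Retrieve the k most similar chunks, then generate with that context.
    _, ids = index.search(embedder.encode([query]), k)
    context = "\n".join(chunks[i] for i in ids[0])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return ollama.generate(model="mistral", prompt=prompt)["response"]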

Prerequisites

Before setting up the project, make sure Python 3 and pip are available; the remaining dependencies and software are installed in the steps below.


Features

  • Instant responses to academic-related queries.
  • Scalable and efficient system for handling high query volumes.
  • Continuous updates via cloud deployment.

Setup Steps

Step 1: Install Ollama Software

  1. Download Ollama software.
  2. Install the software on your system.

Step 2: Set Up the Model

  1. Open a terminal or command prompt.
  2. Run the following command:
    ollama pull mistral
    This will download the model locally.
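Once the pull finishes, you can optionally verify that the model responds by calling Ollama's local REST API (it listens on port 11434 by default). This check is a convenience, not part of the project's setup scripts:

# Optional sanity check: ask the local mistral model for a reply.
import json
import urllib.request

payload = {"model": "mistral", "prompt": "Say hello.", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])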

Step 3: Clone the Repository

  1. Open a terminal.
  2. Clone the Git repository:
    git clone https://github.com/udishadc/NUBot

Step 4: Install Dependencies

  1. Navigate to the cloned repository directory.
  2. If you encounter metadata-generation issues during installation, visit rustup.rs, copy the suggested command, and run it in your terminal; Rust may be required to build dependencies such as sentence-transformers or Airflow.
  3. Run the following command to install all necessary dependencies:
    pip install .

Running the Backend

Step 5: Start the Backend Service

  1. Open a terminal.
  2. Run one of the following commands:
    python -m src.backend.api
    or
    python3 -m src.backend.api
    This starts the backend service using Flask.
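For orientation, a Flask backend of this shape typically exposes a query endpoint along these lines. This is a hypothetical sketch; the route name, payload fields, and port are assumptions, not the project's actual API:

# Hypothetical sketch of a Flask query endpoint (not NUBot's actual API).
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/query", methods=["POST"])
def query():
    question = request.get_json().get("question", "")
    # The real backend would call the RAG pipeline here; stubbed for brevity.
    return jsonify({"answer": f"You asked: {question}"})

if __name__ == "__main__":
    app.run(port=5000)  # Flask's default port; the project's port may differ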

Running the Frontend

Step 6: Start the Frontend Service

  1. Open another terminal.
  2. Run the following command:
    streamlit run src/frontend/app.py
  3. Open a browser and navigate to:
    localhost:8501
    This will launch the frontend interface.
  4. Enter a query and wait for the response.
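A minimal Streamlit app of this kind looks roughly like the sketch below; the backend URL and JSON fields are assumptions carried over from the hypothetical Flask sketch above, not the project's actual frontend:

# Hypothetical sketch of a Streamlit frontend (not NUBot's actual app).
import requests
import streamlit as st

st.title("NUBot")
question = st.text_input("Ask about Northeastern University")
if question:
    # Assumes the backend from Step 5 serves a /query endpoint on port 5000.
    resp = requests.post("http://localhost:5000/query", json={"question": question})
    st.write(resp.json().get("answer", "No answer returned."))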

Running the DAG

Airflow (Not Recommended)

  1. Airflow requires Docker and has a long setup and startup time.
  2. To install and run Airflow:
    • Install Docker.
    • Follow Airflow setup steps.
  3. Airflow can only run the web-scraping DAG; it will not create embeddings for the corpus.

Prefect (Recommended)

Prefect is platform-independent (Windows, macOS, Linux) and lightweight.

Step 7: Start the Prefect Server

  1. Open a terminal.
  2. Run the following command:
    prefect server start
  3. Open a browser and navigate to:
    localhost:4200
    This opens the Prefect UI to monitor the workflows.

Step 8: Run the Workflow

  1. Open another terminal.
  2. Execute the flow file:
    python -m src.prefectWorkflows.scraper_flow
  3. This runs the scraping workflow, which can be monitored in Prefect UI.

If the run succeeds, the flow will show a success state in the Prefect UI.
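For reference, a Prefect flow file like scraper_flow.py follows this general shape; the task names below are illustrative, not the repository's actual ones:

# Hypothetical sketch of a Prefect scraping flow (task names are illustrative).
from prefect import flow, task

@task
def scrape_pages() -> list[str]:
    return ["<html>...</html>"]  # fetch the university pages here

@task
def save_corpus(pages: list[str]) -> None:
    print(f"Saved {len(pages)} pages")

@flow(name="scraper-flow")
def scraper_flow():
    save_corpus(scrape_pages())

if __name__ == "__main__":
    scraper_flow()  # the run appears in the Prefect UI while the server is up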

Step 9 (Optional): Automating Workflow Execution

  1. Prefect flows run manually by default.
  2. After deployment, you can schedule the workflow using Prefect Cloud.

Testing the Flows

Step 10: Running Test Cases

  1. Open a terminal.
  2. Run the following command to test the scraper workflow:
    python -m unittest tests.test_scraper_flow
  3. Run the following command to test the preprocessing of data:
    python -m unittest tests.test_preprocess_data

Modules Used for Testing

  • unittest: A built-in Python testing framework used for writing and running test cases.
  • mock: Part of Python's unittest.mock module, used to simulate dependencies and isolate test cases.
  • MagicMock: A powerful feature of mock that allows the simulation of objects, methods, and their return values during testing.

These tests help ensure the correctness of data preprocessing and workflow execution in the pipeline.
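As an illustration of the pattern (not the repository's actual tests), a test case using MagicMock to stand in for a real dependency looks like this:

# Hypothetical test sketch showing the unittest + MagicMock pattern.
import unittest
from unittest.mock import MagicMock

class TestPreprocessData(unittest.TestCase):
    def test_preprocess_with_mocked_loader(self):
        # MagicMock replaces the real data loader, isolating the test.
        loader = MagicMock()
        loader.load.return_value = ["raw text  with   spaces"]
        cleaned = [" ".join(doc.split()) for doc in loader.load()]
        self.assertEqual(cleaned, ["raw text with spaces"])
        loader.load.assert_called_once()

if __name__ == "__main__":
    unittest.main()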


Airflow Setup

Initial Setup (First-Time or After Changes)

  1. Install and open Docker.
  2. Run the following command to build the project:
    docker compose -f 'docker-compose-airflow.yaml' build
  3. Initialize Airflow:
    docker compose -f 'docker-compose-airflow.yaml' up airflow-init
  4. Start Airflow:
    docker compose -f 'docker-compose-airflow.yaml' up
  5. Open a browser and navigate to localhost:8080.
  6. Locate the DAG "web_scraping", run it, and wait until the status shows Success (dark green color).

Stopping Airflow

To stop Airflow, open a new terminal and run:

docker compose -f 'docker-compose-airflow.yaml' down

Running Airflow from the Second Time Onwards

  1. Start in detached mode:

    docker compose -f 'docker-compose-airflow.yaml' up -d
  2. Or run in the foreground:

    docker compose -f 'docker-compose-airflow.yaml' up

Prefect Workflow Setup

Installation

To install Prefect with all dependencies, run:

pip install -U prefect

Running Prefect Server

Start the Prefect UI server on port 4200 by running:

prefect server start

Once started, access the UI at: http://localhost:4200

Running the Workflow

Run the flow script in src/prefectWorkflows using one of the following commands (run the first from inside that directory, the second from the repository root):

python scraper_flow.py
# OR
python -m src.prefectWorkflows.scraper_flow

After execution, refresh the Prefect UI at http://localhost:4200 to see the running DAG.

Deploying to Prefect Cloud

To deploy and run workflows anywhere, first log in to Prefect Cloud:

prefect cloud login

Then register each flow as a deployment so it can be scheduled and triggered from Prefect Cloud, as shown in the sketch below.
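As one possible approach (assuming Prefect 2.x and that scraper_flow is the flow function's name in the module), a flow can be served as a scheduled deployment like this:

# Hypothetical deployment sketch (flow import path and schedule are assumptions).
from src.prefectWorkflows.scraper_flow import scraper_flow

if __name__ == "__main__":
    # Serves the flow as a deployment; once logged in to Prefect Cloud,
    # runs can be scheduled and triggered from the cloud UI.
    scraper_flow.serve(name="nubot-scraper", cron="0 6 * * *")  # daily at 06:00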


By following these steps, you can efficiently run, test, and manage your Prefect workflows both locally and in the cloud.

Logging and Exception Handling in NUBot

Overview

NUBot utilizes the Python logging module for tracking events and debugging information across different services. Additionally, we have implemented custom exception handling to manage errors efficiently and provide meaningful error messages.

Logging Implementation

We have integrated logging into all major services, ensuring that all critical operations are recorded. The following services include dedicated logging mechanisms:

  • Frontend
  • Backend
  • Data Processing
  • Other Processing Services

Log Location

All logs generated by NUBot can be found at:

NUBot/logs

These logs help in debugging issues, monitoring application performance, and ensuring system reliability.

Logging Usage

We use the logging module to capture important information, including:

  • INFO: General application flow (e.g., API calls, processing steps)
  • WARNING: Potential issues that may require attention
  • ERROR: Errors that prevent certain operations from completing
  • DEBUG: Detailed information useful for debugging purposes

Each service contains a dedicated logging file where all events are recorded systematically.
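A minimal sketch of such a per-service logger, with the file name and format as assumptions rather than the project's exact configuration:

# Hypothetical logging setup writing to the NUBot/logs directory.
import logging
import os

os.makedirs("logs", exist_ok=True)  # ensure the log directory exists
logging.basicConfig(
    filename="logs/backend.log",  # illustrative file name
    level=logging.DEBUG,          # record all four levels listed above
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
)
logger = logging.getLogger("backend")

logger.info("API call received")            # general application flow
logger.warning("Slow response from model")  # potential issue
logger.error("Failed to load FAISS index")  # operation failed
logger.debug("Query embedding computed")    # debugging detail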


Exception Handling

NUBot has a structured exception handling framework to ensure robust error management. Each service includes an exception handling module that captures and processes errors efficiently.

Custom Exception Handling

We have implemented a CustomException class to manage unexpected errors across different components. This ensures that:

  • Errors are logged appropriately
  • Useful error messages are provided
  • The system remains stable and operational
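The class itself is not shown in this README; purely as an illustration, a minimal sketch of what a CustomException taking (error, sys) might look like:

# Hypothetical sketch of a CustomException class (the repo's version may differ).
import sys

class CustomException(Exception):
    def __init__(self, error: Exception, sys_module=sys):
        # Pull the traceback so the message names the failing file and line.
        _, _, tb = sys_module.exc_info()
        detail = f" in {tb.tb_frame.f_code.co_filename}, line {tb.tb_lineno}" if tb else ""
        super().__init__(f"Error{detail}: {error}")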

Example of Exception Logging

If an exception occurs, it is logged along with relevant details:

import logging
import sys

try:
    ...  # some code execution that may fail
except Exception as e:
    logging.error(f"An error occurred: {str(e)}")
    raise CustomException(e, sys)

This structured approach ensures better traceability of issues and easier debugging.


Summary

  • Logging is implemented across all major services (Frontend, Backend, Data Processing, etc.).
  • Logs can be accessed at NUBot/logs.
  • Custom exception handling ensures robust error management.
  • All critical events and errors are recorded systematically.

DVC Remote Setup with Google Cloud Storage (GCS)

Prerequisites

Before setting up DVC with Google Cloud Storage (GCS), ensure you have the following:

  • Google Cloud SDK installed and authenticated.
  • A Google Cloud Storage (GCS) bucket created.
  • DVC installed with the Google Cloud Storage extra (pip install "dvc[gs]"; use pip install "dvc[gdrive]" for Google Drive instead).
  • Service account key (if using authentication via service account).

Step 1: Authenticate with Google Cloud

Option 1: Using gcloud CLI

# Authenticate with Google Cloud
gcloud auth application-default login

Option 2: Using Service Account Key

  1. Generate a service account key from the Google Cloud Console:

    • Go to IAM & Admin > Service Accounts.
    • Select or create a service account with Storage Admin permissions.
    • Generate a JSON key and download it.
  2. Set the environment variable:

export GOOGLE_APPLICATION_CREDENTIALS="/path/to/your/service-account-key.json"

โš ๏ธ Warning: Make sure to run gcloud auth application-default login unless you use a service account or other ways to authenticate (more info).

To use custom authentication or further configure your DVC remote, set any supported config parameter with:

dvc remote modify --local myremote credentialpath 'path/to/project-XXX.json'

Step 2: Initialize DVC

If you haven't initialized DVC in your project, do so with:

dvc init

Step 3: Configure DVC Remote with GCS

  1. Run the following command to set up a GCS bucket as a DVC remote:
    dvc remote add -d gcsremote gs://your-gcs-bucket-name
  2. Verify the configuration by checking .dvc/config (for this project, keep the bucket name as modelembeddings):
[core]
    remote = gcsremote
[remote "gcsremote"]
    url = gs://<your-gcs-bucket-name>

Step 4: Push Data to GCS

Once the remote storage is set up, you can push your dataset or model files to GCS using DVC:

dvc add faiss_index/
git add faiss_index.dvc .gitignore
git commit -m "Track data with DVC"
dvc push

Step 5: Pull Data from GCS

To retrieve the dataset from the GCS remote, use:

dvc pull

Step 6: Run MLflow UI

To monitor experiments and model versions, run MLflow UI in a separate terminal:

mlflow ui
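For context, experiments show up in that UI when runs are logged through the mlflow API; the run, parameter, and metric names below are illustrative:

# Hypothetical sketch of logging a run so it appears in the MLflow UI.
import mlflow

with mlflow.start_run(run_name="embedding-build"):
    mlflow.log_param("embedding_model", "all-MiniLM-L6-v2")  # assumed name
    mlflow.log_metric("num_chunks", 1234)                    # illustrative value
    mlflow.log_artifact("faiss_index/index.faiss")           # version the index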

Troubleshooting

  • Ensure you have Storage Admin access to the GCS bucket.
  • If authentication fails, try re-authenticating using:
    gcloud auth application-default login
  • If using a service account, verify the JSON key path is correct:
    echo $GOOGLE_APPLICATION_CREDENTIALS

You're all set! 🎉 Your DVC is now connected to Google Cloud Storage for versioning your datasets and models.

By leveraging logging and structured exception handling, NUBot maintains stability, traceability, and better error resolution. 🚀
