RAG Chatbot for Pittsburgh/CMU and VNU

This chatbot uses Retrieval-Augmented Generation (RAG) to answer questions about Pittsburgh / Carnegie Mellon University (CMU) and VNU. The system uses:

Together AI API for the Large Language Model (LLM): meta-llama/Llama-3.3-70B-Instruct-Turbo-Free
Hugging Face for the embedding model: intfloat/multilingual-e5-large-instruct
ChromaDB as the vector database
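To make the RAG flow concrete, here is a minimal, dependency-free sketch of its two core steps: retrieving the most similar chunks and assembling a prompt from them. It is not the repository's code. The 2-D vectors, the sample sentences, and the names `cosine`, `retrieve`, and `build_prompt` are all made up for illustration; in the real system the vectors would come from the embedding model and the lookup would go through ChromaDB.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             store: List[Tuple[str, Sequence[float]]],
             k: int = 2) -> List[str]:
    """Return the k chunks whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Put the retrieved chunks in front of the question (hypothetical template)."""
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy 2-D "embeddings" (made up) standing in for the real embedding model.
store = [
    ("CMU is located in Pittsburgh.", [1.0, 0.0]),
    ("VNU is a university in Vietnam.", [0.0, 1.0]),
    ("Pittsburgh has many bridges.", [0.9, 0.1]),
]
top = retrieve([1.0, 0.05], store, k=2)
prompt = build_prompt("Where is CMU?", top)
print(prompt)
```

The resulting prompt is what would be sent to the LLM; the value of `k` plays the role of the `retriever_k` setting in config.yaml.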

Requirements

Python 3.8+
CUDA (optional, to speed up embedding)
Tokens from Together AI and Hugging Face (see "Registering and getting API tokens" below)

Registering and getting API tokens

1. Together AI

Go to Together AI
Sign up for a new account
Open the Dashboard
Create a new API key and save it

2. Hugging Face

Go to Hugging Face
Sign up for a new account
Open Settings > Access Tokens
Create a new token with "read" permission

Running the Chatbot

Set up the environment

Clone the repository:

git clone https://github.com/CTNone/RAG_NLP_build_

Install the dependencies:

pip install -r requirements.txt

Prepare the data

For a quick start:

Download the sample dataset: Google Drive

Or prepare your own data by following the instructions below

Preparing your own data

Data folders normally start with "DATA_" and contain a "segmented" folder, as follows:

data/
├── DATA_CMU_Pit/
│   └── segmented/
│       ├── about_cmu.json
│       ├── campus_life.json
│       └── ...
├── ...
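The layout above can be discovered programmatically. The following is a small sketch, not the repository's code: the function name `find_segmented_dirs` is hypothetical, and its defaults mirror the `data_path` and `data_prefix` values shown in config.yaml. The demo builds a throwaway directory tree so it can run anywhere.

```python
import tempfile
from pathlib import Path
from typing import List


def find_segmented_dirs(data_path: str = "data", prefix: str = "DATA_") -> List[Path]:
    """Return every <data_path>/<prefix>*/segmented directory, sorted by path."""
    root = Path(data_path)
    if not root.is_dir():
        return []
    return sorted(
        child / "segmented"
        for child in root.iterdir()
        if child.is_dir()
        and child.name.startswith(prefix)
        and (child / "segmented").is_dir()
    )


# Demo on a temporary tree: one valid data folder and one folder that is ignored.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "DATA_CMU_Pit" / "segmented").mkdir(parents=True)
    (Path(tmp) / "notes").mkdir()
    found = [p.relative_to(tmp).as_posix() for p in find_segmented_dirs(tmp)]

print(found)
```

Folders without the prefix, or without a "segmented" subfolder, are skipped rather than reported as errors; that choice is an assumption of this sketch.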

Each JSON file in the segmented/ directory must have the following structure:

{
  "url": "https://example.com/source-page",
  "title": "Page title",
  "chunks": [
    "First text passage...",
    "Second text passage...",
    "Third text passage..."
  ]
}
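Before building the database, it can help to check that each file matches this structure. The sketch below is one possible validator, using only the standard library. The function name `validate_page` is hypothetical, and the rule that `chunks` must be non-empty and contain only non-blank strings is an assumption; the README itself only shows the three keys.

```python
import json
from typing import Any, List


def validate_page(page: Any) -> List[str]:
    """Return a list of problems; an empty list means the structure looks valid."""
    if not isinstance(page, dict):
        return ["top-level value must be a JSON object"]
    problems = []
    for key in ("url", "title"):
        if not isinstance(page.get(key), str):
            problems.append(f"'{key}' must be a string")
    chunks = page.get("chunks")
    if not isinstance(chunks, list) or not chunks:
        # Assumption: a page with no chunks is treated as invalid.
        problems.append("'chunks' must be a non-empty list")
    elif not all(isinstance(c, str) and c.strip() for c in chunks):
        problems.append("every chunk must be a non-empty string")
    return problems


good = json.loads(
    '{"url": "https://example.com/source-page", "title": "Page title", '
    '"chunks": ["First text passage..."]}'
)
bad = json.loads('{"url": "https://example.com/source-page", "chunks": []}')

print(validate_page(good))
print(validate_page(bad))
```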

Prepare the parameters

Adjust config.yaml to suit your setup. If you want faster embedding, or a model that better fits your hardware, try one of the following embedding models:

sentence-transformers/all-MiniLM-L6-v2 (22.7M params)
BAAI/bge-small-en-v1.5 (33.4M params)
BAAI/bge-base-en-v1.5 (109M params)
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (118M params)
BAAI/bge-m3
... Or keep the model intfloat/multilingual-e5-large (560M params)

# API Credentials
api:
  huggingface:
    token: "your-huggingface-token" # Token from Hugging Face
    embedding_model: "sentence-transformers/all-MiniLM-L6-v2"
  together:
    token: "your-together-token" # Token from Together AI
    model_id: "meta-llama/Llama-3.3-70B-Instruct-Turbo-Free"

# Database
db_dir: "chroma_db"

# Data Paths
data_path: "data" # Folder containing the required data

data_prefix: "DATA_" # Prefix used to find data folders automatically

# Test Questions and Answers Paths
test_questions_dir: "data/test_questions"
test_answers_dir: "data/test_answers"

# Model Parameters
max_new_tokens: Maximum number of tokens the model may generate in an answer
temperature: Randomness of the answer (0.0-1.0; lower values give more stable output)
top_p: Cumulative probability threshold for choosing the next token (0.0-1.0)
repetition_penalty: Penalty factor for repeated words/phrases (>1.0 to reduce repetition)

# RAG Settings
retriever_k: Number of text passages retrieved for each question

# Prompt Template
template:
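A quick sanity check on the configuration can catch forgotten placeholder tokens and out-of-range parameters before anything is sent to an API. The sketch below works on a plain dictionary, such as the one a YAML parser would return for config.yaml. The function name `check_config` and the example values are hypothetical; the ranges for `temperature` and `top_p` follow the descriptions above, while requiring `max_new_tokens` and `retriever_k` to be positive integers is an assumption.

```python
from typing import Any, Dict, List

# Placeholder strings as they appear in the sample config.yaml.
PLACEHOLDERS = {"your-huggingface-token", "your-together-token"}


def _is_number(value: Any) -> bool:
    """True for int/float, but not for bool (which is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_config(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in an already-parsed config dictionary."""
    problems = []
    api = cfg.get("api") or {}
    for provider in ("huggingface", "together"):
        token = (api.get(provider) or {}).get("token") or ""
        if not token or token in PLACEHOLDERS:
            problems.append(f"api.{provider}.token is not set")
    for key in ("temperature", "top_p"):
        value = cfg.get(key)
        if not _is_number(value) or not 0.0 <= value <= 1.0:
            problems.append(f"{key} must be a number between 0.0 and 1.0")
    for key in ("max_new_tokens", "retriever_k"):
        value = cfg.get(key)
        if not isinstance(value, int) or isinstance(value, bool) or value < 1:
            problems.append(f"{key} must be a positive integer")
    return problems


# Illustrative values only; they are not the project's defaults.
example = {
    "api": {
        "huggingface": {"token": "your-huggingface-token"},
        "together": {"token": "tok-123"},
    },
    "temperature": 0.2,
    "top_p": 1.5,
    "max_new_tokens": 256,
    "retriever_k": 0,
}
problems = check_config(example)
print("\n".join(problems))
```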

Let's run it!

Create the vector database

python create_chroma_db.py
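This README does not show what create_chroma_db.py does internally. As a rough illustration of the preparation step such a script typically needs, the sketch below flattens one segmented JSON page into parallel lists of ids, documents, and metadata, which is the shape a vector store generally expects. The function name `page_to_records`, the id format, and the sample page are all hypothetical; embedding the documents and writing them to ChromaDB are deliberately left out.

```python
from typing import Dict, List, Tuple


def page_to_records(
    page: Dict, source_name: str
) -> Tuple[List[str], List[str], List[Dict[str, str]]]:
    """Turn one page (url, title, chunks) into ids, documents and metadatas."""
    ids: List[str] = []
    documents: List[str] = []
    metadatas: List[Dict[str, str]] = []
    for index, chunk in enumerate(page["chunks"]):
        ids.append(f"{source_name}-{index}")  # hypothetical id scheme
        documents.append(chunk)
        # Keep the source with each chunk so answers can be traced back to a page.
        metadatas.append({"url": page["url"], "title": page["title"]})
    return ids, documents, metadatas


page = {
    "url": "https://example.com/source-page",
    "title": "Page title",
    "chunks": ["First text passage...", "Second text passage..."],
}
ids, documents, metadatas = page_to_records(page, "about_cmu")
print(ids)
```

Storing the URL and title as metadata on every chunk is a design choice of this sketch; it makes it possible to show the source of a retrieved passage later.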

Start the Chatbot application:

python app.py

Open the web interface:

Open a browser and go to the address shown in the terminal (usually http://127.0.0.1:7860)
Start asking questions about Pittsburgh / CMU and VNU!

References

Llama-3.3-70B Instruct
...
