
CSIT5520 NLP Final Project Data Processing

  • HUANG Yanzhen
  • LI Weijie
  • LI Kaiwen

Introduction

We built a RAG pipeline that incorporates FastAPI as the backend server, Dify as the pipeline backbone, and Milvus as the vector database.

This repository is used for data processing and demo. The repository of the FastAPI server can be found at:

https://github.com/hyz-courses/CSIT5520-NLP-Final-Project

Resources

Models

| Usage | Variant | Remarks |
| --- | --- | --- |
| Text Embedding Model | Qwen text-embedding-v4 | d=1024 |
| Question Generation Model | Qwen3.5-plus | |
| Answering Model | Qwen3.5-plus | |

Data

We obtained data from the MSSS WeChat Public Account, most of which is in Simplified Chinese. We processed the data to fit the pipeline (see Parse and Upload).

All the related data are stored in the demo folder. It is a static snapshot of the resources folder for your quick reference.

| Subfolder | Description |
| --- | --- |
| demo/markdowns | Source article data in .md format. |
| demo/jsons | Chunked source article data in .json format, ready to upload. |
| demo/pipelines | .yaml snapshots of two pipelines: the general question-answering pipeline and the retrieval-test pipeline. |
| demo/question | All related files produced during the retrieval test. |

Retrieval Test Result

We performed a retrieval test on our Milvus data. The final results are as follows.

| Question Language | NDCG@1 | NDCG@5 | NDCG@10 | NDCG@20 |
| --- | --- | --- | --- | --- |
| Chinese | 0.46565 | 0.60494 | 0.63176 | 0.64865 |
| English | 0.55206 | 0.68110 | 0.70284 | 0.71637 |

Setup

Please add a .env file to the /resources directory and set the DOCSEARCH_API_KEY variable that the authors give you.

# resources/.env

# To use LLM apis from Alibaba
ALICLOUD_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_API_KEY="your-qwen-api-key"
QWEN_QUESTION_GENERATION_MODEL=qwen-plus

# Communicate with Dify
DIFY_API_URL=https://dify.docsearch.love/v1/workflows/run
DIFY_API_KEY="your-dify-api-key"

# To upload chunk to our server
DOCSEARCH_API_KEY="your-docsearch-api-key"

Install related packages:

pip install -r requirements.txt

Quick Start

Run this command to chat with the pipeline.

streamlit run app.py
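
The actual app.py lives in this repository; below is only a minimal sketch of how such a chat front end could call the Dify workflow endpoint configured in resources/.env. The input variable name query and the output key answer are assumptions, not the real workflow fields.

# Minimal sketch of a chat front end, not the actual app.py.
# Assumes the Dify workflow accepts a "query" input and exposes an "answer" output.
import os

import requests
import streamlit as st
from dotenv import load_dotenv

load_dotenv("resources/.env")

st.title("CSIT5520 RAG Demo")
question = st.text_input("Ask a question about the articles:")

if question:
    resp = requests.post(
        os.environ["DIFY_API_URL"],
        headers={"Authorization": f"Bearer {os.environ['DIFY_API_KEY']}"},
        json={
            "inputs": {"query": question},  # input variable name is an assumption
            "response_mode": "blocking",
            "user": "demo-user",
        },
        timeout=120,
    )
    resp.raise_for_status()
    outputs = resp.json().get("data", {}).get("outputs", {})
    st.write(outputs.get("answer", outputs))  # output key is an assumption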

Parse and Upload

Run:

python parse.py

This will parse all the files in the /markdowns directory and upload all the output .json files.
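
As a rough illustration of this step (not the actual parse.py), the sketch below splits each Markdown file into paragraph chunks, writes them as JSON, and posts them with the DOCSEARCH_API_KEY. The chunking rule, the JSON schema, and the upload endpoint are assumptions.

# Minimal sketch of the parse-and-upload flow, not the actual parse.py.
# Chunking rule, JSON schema, and upload endpoint are assumptions.
import hashlib
import json
import os
from pathlib import Path

import requests
from dotenv import load_dotenv

load_dotenv("resources/.env")
UPLOAD_URL = "https://example.docsearch.love/chunks"  # hypothetical endpoint

Path("jsons").mkdir(exist_ok=True)
for md_file in Path("markdowns").glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    # Naive chunking on blank lines (assumed strategy).
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    records = [
        {
            "source": md_file.name,
            "chunk_hash": hashlib.sha256(c.encode("utf-8")).hexdigest(),
            "content": c,
        }
        for c in chunks
    ]
    out_path = Path("jsons") / f"{md_file.stem}.json"
    out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {os.environ['DOCSEARCH_API_KEY']}"},
        json=records,
        timeout=60,
    ).raise_for_status()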

Retrieval Test

Data process pipelines:

| Stage | File Name | Remarks |
| --- | --- | --- |
| 1 | milvus_storage.jsonl | Original Milvus database storage |
| 2 | milvus_questions.jsonl | Generated questions |
| 3 | milvus_answers.jsonl | Retrieved answer chunks |
| 4 | milvus_summarized.jsonl | Summarized answer hits |
| 5 | milvus_analyzed.jsonl | NDCG for each question |
| 6 | ndcg_mean.json | Final mean NDCG |

Generate Questions (1 → 2)

Given each chunk stored in the Milvus database, we generate 10 questions in both Chinese and English.

Run:

python -m generate_questions
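
A minimal sketch of what this step might do, assuming the OpenAI-compatible DashScope endpoint configured in resources/.env; the prompt wording and the JSONL field names are assumptions.

# Minimal sketch of per-chunk question generation, not the actual script.
# Assumes the OpenAI-compatible endpoint from resources/.env;
# prompt wording, parsing, and JSONL field names are assumptions.
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv("resources/.env")
client = OpenAI(
    api_key=os.environ["QWEN_API_KEY"],
    base_url=os.environ["ALICLOUD_BASE_URL"],
)

def generate_questions(chunk_text: str, language: str, n: int = 10) -> list[str]:
    """Ask the model for n questions that the given chunk answers."""
    resp = client.chat.completions.create(
        model=os.environ["QWEN_QUESTION_GENERATION_MODEL"],
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} {language} questions that the following passage answers. "
                f"Return one question per line.\n\n{chunk_text}"
            ),
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

with open("milvus_storage.jsonl", encoding="utf-8") as f_in, \
        open("milvus_questions.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        chunk = json.loads(line)
        record = {
            "chunk_hash": chunk["chunk_hash"],  # field names are assumptions
            "zh_questions": generate_questions(chunk["content"], "Chinese"),
            "en_questions": generate_questions(chunk["content"], "English"),
        }
        f_out.write(json.dumps(record, ensure_ascii=False) + "\n")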

Ask Questions (2 → 3)

For each chunk, we asked the Milvus database the 20 generated questions (10 per language), retrieving the top 20 answer chunks for each question. Ideally, a chunk should appear as the first answer to its own questions, giving IDCG = 1.

These chunks are uniquely identified by chunk hash instead of chunk ID.

Run:

python -m ask_questions
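
A minimal sketch of the retrieval call, assuming pymilvus's MilvusClient and the pipeline's embedding model (text-embedding-v4); the Milvus URI, collection name, and field names are assumptions.

# Minimal sketch of the retrieval step, not the actual script.
# Milvus URI, collection name, and field names are assumptions.
import os

from dotenv import load_dotenv
from openai import OpenAI
from pymilvus import MilvusClient

load_dotenv("resources/.env")
embedder = OpenAI(api_key=os.environ["QWEN_API_KEY"], base_url=os.environ["ALICLOUD_BASE_URL"])
milvus = MilvusClient(uri="http://localhost:19530")  # hypothetical URI

def top_k_chunks(question: str, k: int = 20) -> list[str]:
    """Embed the question and return the chunk hashes of the top-k hits."""
    vec = embedder.embeddings.create(model="text-embedding-v4", input=question).data[0].embedding
    hits = milvus.search(
        collection_name="msss_articles",  # hypothetical collection name
        data=[vec],
        limit=k,
        output_fields=["chunk_hash"],
    )[0]
    return [hit["entity"]["chunk_hash"] for hit in hits]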

Evaluate (3 → 4)

For each chunk and each of its 20 questions, check whether the source chunk appears among the first 20 retrieved answer chunks, and at exactly which position it ranks ("hit at").

Run:

python -m summarize_results
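
A minimal sketch of the "hit at" computation, assuming each record in milvus_answers.jsonl carries the source chunk hash and the ordered list of retrieved chunk hashes (field names are assumptions).

# Minimal sketch of the "hit at" computation, not the actual script.
# Field names in the JSONL records are assumptions.
import json

def hit_position(source_hash: str, retrieved_hashes: list[str]) -> int | None:
    """Return the 1-based rank at which the source chunk appears, or None if missed."""
    try:
        return retrieved_hashes.index(source_hash) + 1
    except ValueError:
        return None

with open("milvus_answers.jsonl", encoding="utf-8") as f_in, \
        open("milvus_summarized.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        record = json.loads(line)
        record["hit_at"] = hit_position(record["chunk_hash"], record["retrieved_hashes"])
        f_out.write(json.dumps(record, ensure_ascii=False) + "\n")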

Analyze (4 → 5 & 6)

Finally, calculate NDCG at 1, 5, 10, and 20 for each chunk, and calculate the mean NDCG for all chunks.

Run:

python -m analyze_results
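
Since each question has exactly one relevant chunk and IDCG = 1, NDCG@k reduces to 1 / log2(1 + rank) when the hit rank is within k, and 0 otherwise. A minimal sketch follows (it folds stages 5 and 6 into one pass for brevity; the hit_at field name is an assumption).

# Minimal sketch of the NDCG@k computation, not the actual script.
# With one relevant chunk per question and IDCG = 1,
# NDCG@k = 1 / log2(1 + rank) if the hit rank is within k, else 0.
import json
import math

def ndcg_at_k(hit_at: int | None, k: int) -> float:
    if hit_at is None or hit_at > k:
        return 0.0
    return 1.0 / math.log2(1 + hit_at)

cutoffs = (1, 5, 10, 20)
totals = {k: 0.0 for k in cutoffs}
count = 0

with open("milvus_summarized.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # field name "hit_at" is an assumption
        for k in cutoffs:
            totals[k] += ndcg_at_k(record["hit_at"], k)
        count += 1

mean_ndcg = {f"NDCG@{k}": totals[k] / count for k in cutoffs}
with open("ndcg_mean.json", "w", encoding="utf-8") as f:
    json.dump(mean_ndcg, f, indent=2)
print(mean_ndcg)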
