
CSIT5520 NLP Final Project Data Processing

  • HUANG Yanzhen
  • LI Weijie
  • LI Kaiwen

Introduction

We built a RAG pipeline that incorporates FastAPI as the backend server, Dify as the pipeline backbone, and Milvus as the vector database.

This repository is used for data processing and demo. The repository of the FastAPI server can be found at:

https://github.com/hyz-courses/CSIT5520-NLP-Final-Project

Resources

Models

| Usage | Variant | Remarks |
| --- | --- | --- |
| Text Embedding Model | Qwen text-embedding-v4 | d=1024 |
| Question Generation Model | Qwen3.5-plus | |
| Answering Model | Qwen3.5-plus | |

Data

We obtained data from the MSSS WeChat Public Account, most of which is in Simplified Chinese. We processed the data to fit the pipeline (see Parse and Upload).

All the related data are stored in the demo folder. It is a static snapshot of the resources folder for your quick reference.

| Subfolder | Description |
| --- | --- |
| demo/markdowns | Source article data in .md format. |
| demo/jsons | Chunked source article data in .json format, ready to upload. |
| demo/pipelines | .yaml snapshots of two pipelines: the general question-answering pipeline and the retrieval-test pipeline. |
| demo/question | All related files produced during the retrieval test. |

Retrieval Test Result

We performed a retrieval test on our Milvus data. The final results are as follows.

| Question Language | NDCG@1 | NDCG@5 | NDCG@10 | NDCG@20 |
| --- | --- | --- | --- | --- |
| Chinese | 0.46565 | 0.60494 | 0.63176 | 0.64865 |
| English | 0.55206 | 0.68110 | 0.70284 | 0.71637 |

Setup

Please add a .env file to the /resources directory and set the DOCSEARCH_API_KEY variable that the authors give you.

# resources/.env

# To use LLM apis from Alibaba
ALICLOUD_BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
QWEN_API_KEY="your-qwen-api-key"
QWEN_QUESTION_GENERATION_MODEL=qwen-plus

# Communicate with Dify
DIFY_API_URL=https://dify.docsearch.love/v1/workflows/run
DIFY_API_KEY="your-dify-api-key"

# To upload chunk to our server
DOCSEARCH_API_KEY="your-docsearch-api-key"

Install related packages:

pip install -r requirements.txt

Quick Start

Run this command to chat with the pipeline.

streamlit run app.py
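
The actual app.py lives in this repository; below is only a minimal sketch of how such a chat front end could call the Dify workflow endpoint configured in resources/.env. The input variable name query and the output key answer are assumptions, not the real workflow fields.

# Minimal sketch of a chat front end, not the actual app.py.
# Assumes the Dify workflow accepts a "query" input and exposes an "answer" output.
import os

import requests
import streamlit as st
from dotenv import load_dotenv

load_dotenv("resources/.env")

st.title("CSIT5520 RAG Demo")
question = st.text_input("Ask a question about the articles:")

if question:
    resp = requests.post(
        os.environ["DIFY_API_URL"],
        headers={"Authorization": f"Bearer {os.environ['DIFY_API_KEY']}"},
        json={
            "inputs": {"query": question},  # input variable name is an assumption
            "response_mode": "blocking",
            "user": "demo-user",
        },
        timeout=120,
    )
    resp.raise_for_status()
    outputs = resp.json().get("data", {}).get("outputs", {})
    st.write(outputs.get("answer", outputs))  # output key is an assumption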

Parse and Upload

Run:

python parse.py

This will parse all the files in the /markdowns directory and upload all the output .json files.
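
As a rough illustration of this step (not the actual parse.py), the sketch below splits each Markdown file into paragraph chunks, writes them as JSON, and posts them with the DOCSEARCH_API_KEY. The chunking rule, the JSON schema, and the upload endpoint are assumptions.

# Minimal sketch of the parse-and-upload flow, not the actual parse.py.
# Chunking rule, JSON schema, and upload endpoint are assumptions.
import hashlib
import json
import os
from pathlib import Path

import requests
from dotenv import load_dotenv

load_dotenv("resources/.env")
UPLOAD_URL = "https://example.docsearch.love/chunks"  # hypothetical endpoint

Path("jsons").mkdir(exist_ok=True)
for md_file in Path("markdowns").glob("*.md"):
    text = md_file.read_text(encoding="utf-8")
    # Naive chunking on blank lines (assumed strategy).
    chunks = [p.strip() for p in text.split("\n\n") if p.strip()]
    records = [
        {
            "source": md_file.name,
            "chunk_hash": hashlib.sha256(c.encode("utf-8")).hexdigest(),
            "content": c,
        }
        for c in chunks
    ]
    out_path = Path("jsons") / f"{md_file.stem}.json"
    out_path.write_text(json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8")
    requests.post(
        UPLOAD_URL,
        headers={"Authorization": f"Bearer {os.environ['DOCSEARCH_API_KEY']}"},
        json=records,
        timeout=60,
    ).raise_for_status()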

Retrieval Test

Data process pipelines:

| Stage | File Name | Remarks |
| --- | --- | --- |
| 1 | milvus_storage.jsonl | Original Milvus database storage |
| 2 | milvus_questions.jsonl | Generated questions |
| 3 | milvus_answers.jsonl | Retrieved answer chunks |
| 4 | milvus_summarized.jsonl | Summarized answer hits |
| 5 | milvus_analyzed.jsonl | NDCG for each question |
| 6 | ndcg_mean.json | Final mean NDCG |

Generate Questions (1 → 2)

Given each chunk stored in the Milvus database, we generate 10 questions in both Chinese and English.

Run:

python -m generate_questions
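
A minimal sketch of what this step might do, assuming the OpenAI-compatible DashScope endpoint configured in resources/.env; the prompt wording and the JSONL field names are assumptions.

# Minimal sketch of per-chunk question generation, not the actual script.
# Assumes the OpenAI-compatible endpoint from resources/.env;
# prompt wording, parsing, and JSONL field names are assumptions.
import json
import os

from dotenv import load_dotenv
from openai import OpenAI

load_dotenv("resources/.env")
client = OpenAI(
    api_key=os.environ["QWEN_API_KEY"],
    base_url=os.environ["ALICLOUD_BASE_URL"],
)

def generate_questions(chunk_text: str, language: str, n: int = 10) -> list[str]:
    """Ask the model for n questions that the given chunk answers."""
    resp = client.chat.completions.create(
        model=os.environ["QWEN_QUESTION_GENERATION_MODEL"],
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} {language} questions that the following passage answers. "
                f"Return one question per line.\n\n{chunk_text}"
            ),
        }],
    )
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

with open("milvus_storage.jsonl", encoding="utf-8") as f_in, \
        open("milvus_questions.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        chunk = json.loads(line)
        record = {
            "chunk_hash": chunk["chunk_hash"],  # field names are assumptions
            "zh_questions": generate_questions(chunk["content"], "Chinese"),
            "en_questions": generate_questions(chunk["content"], "English"),
        }
        f_out.write(json.dumps(record, ensure_ascii=False) + "\n")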

Ask Questions (2 → 3)

For each chunk, we asked the Milvus database the 20 generated questions (10 per language), retrieving the top 20 answer chunks for each question. Ideally, a chunk should appear as the first answer to its own questions, giving IDCG = 1.

These chunks are uniquely identified by chunk hash instead of chunk ID.

Run:

python -m ask_questions
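
A minimal sketch of the retrieval call, assuming pymilvus's MilvusClient and the pipeline's embedding model (text-embedding-v4); the Milvus URI, collection name, and field names are assumptions.

# Minimal sketch of the retrieval step, not the actual script.
# Milvus URI, collection name, and field names are assumptions.
import os

from dotenv import load_dotenv
from openai import OpenAI
from pymilvus import MilvusClient

load_dotenv("resources/.env")
embedder = OpenAI(api_key=os.environ["QWEN_API_KEY"], base_url=os.environ["ALICLOUD_BASE_URL"])
milvus = MilvusClient(uri="http://localhost:19530")  # hypothetical URI

def top_k_chunks(question: str, k: int = 20) -> list[str]:
    """Embed the question and return the chunk hashes of the top-k hits."""
    vec = embedder.embeddings.create(model="text-embedding-v4", input=question).data[0].embedding
    hits = milvus.search(
        collection_name="msss_articles",  # hypothetical collection name
        data=[vec],
        limit=k,
        output_fields=["chunk_hash"],
    )[0]
    return [hit["entity"]["chunk_hash"] for hit in hits]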

Evaluate (3 → 4)

For each chunk and each of its 20 questions, check whether the source chunk appears among the first 20 retrieved answer chunks, and at exactly which position it ranks ("hit at").

Run:

python -m summarize_results
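
A minimal sketch of the "hit at" computation, assuming each record in milvus_answers.jsonl carries the source chunk hash and the ordered list of retrieved chunk hashes (field names are assumptions).

# Minimal sketch of the "hit at" computation, not the actual script.
# Field names in the JSONL records are assumptions.
import json

def hit_position(source_hash: str, retrieved_hashes: list[str]) -> int | None:
    """Return the 1-based rank at which the source chunk appears, or None if missed."""
    try:
        return retrieved_hashes.index(source_hash) + 1
    except ValueError:
        return None

with open("milvus_answers.jsonl", encoding="utf-8") as f_in, \
        open("milvus_summarized.jsonl", "w", encoding="utf-8") as f_out:
    for line in f_in:
        record = json.loads(line)
        record["hit_at"] = hit_position(record["chunk_hash"], record["retrieved_hashes"])
        f_out.write(json.dumps(record, ensure_ascii=False) + "\n")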

Analyze (4 → 5 & 6)

Finally, calculate NDCG at 1, 5, 10, and 20 for each chunk, and calculate the mean NDCG for all chunks.

Run:

python -m analyze_results
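
Since each question has exactly one relevant chunk and IDCG = 1, NDCG@k reduces to 1 / log2(1 + rank) when the hit rank is within k, and 0 otherwise. A minimal sketch follows (it folds stages 5 and 6 into one pass for brevity; the hit_at field name is an assumption).

# Minimal sketch of the NDCG@k computation, not the actual script.
# With one relevant chunk per question and IDCG = 1,
# NDCG@k = 1 / log2(1 + rank) if the hit rank is within k, else 0.
import json
import math

def ndcg_at_k(hit_at: int | None, k: int) -> float:
    if hit_at is None or hit_at > k:
        return 0.0
    return 1.0 / math.log2(1 + hit_at)

cutoffs = (1, 5, 10, 20)
totals = {k: 0.0 for k in cutoffs}
count = 0

with open("milvus_summarized.jsonl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)  # field name "hit_at" is an assumption
        for k in cutoffs:
            totals[k] += ndcg_at_k(record["hit_at"], k)
        count += 1

mean_ndcg = {f"NDCG@{k}": totals[k] / count for k in cutoffs}
with open("ndcg_mean.json", "w", encoding="utf-8") as f:
    json.dump(mean_ndcg, f, indent=2)
print(mean_ndcg)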
