DIP Project: English-to-Thai Machine Translation System for WIPO Nice Classification Products and Services
This project focuses on English → Thai machine translation (En2Th) with domain-specific emphasis on product and service categories defined by the WIPO Nice Classification system.
The main goal is to evaluate and improve the translation quality for category-specific terminology, ensuring translations are accurate, consistent, and context-aware for intellectual property and business use.
- Benchmark multiple Large Language Models (LLMs) on En2Th and Th2En translation tasks.
- Explore fine-tuning and Retrieval-Augmented Generation (RAG) methods for domain adaptation.
- Analyze how translation quality varies across WIPO Nice product and service categories.
Data Preparation
- Source: WIPO Nice classification dataset (product & service categories).
- Preprocessed into English–Thai parallel pairs.
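
The preprocessing step above could look roughly like the following. This is a minimal sketch, not the project's actual pipeline: the tab-separated layout, the column names `class`, `en`, and `th`, and the helper names `clean` and `load_pairs` are all assumptions made for illustration, and the sample rows are hand-written examples rather than rows from the dataset.

```python
import csv
import io
import unicodedata


def clean(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def load_pairs(tsv_text: str) -> list[dict]:
    """Parse class/en/th rows, dropping incomplete rows and duplicates.

    The column names are hypothetical; adjust them to the real export.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    seen = set()
    pairs = []
    for row in reader:
        en = clean(row.get("en") or "")
        th = clean(row.get("th") or "")
        if not en or not th:
            continue  # skip rows that are missing either side of the pair
        key = (row["class"], en.lower(), th)
        if key in seen:
            continue  # skip duplicates that differ only in case or spacing
        seen.add(key)
        pairs.append({"class": int(row["class"]), "en": en, "th": th})
    return pairs


# Illustrative rows only (not taken from the project data).
sample = (
    "class\ten\tth\n"
    "25\tclothing\tเสื้อผ้า\n"
    "25\t Clothing \tเสื้อผ้า\n"
    "9\tcomputers\t\n"
    "30\tcoffee\tกาแฟ\n"
)
pairs = load_pairs(sample)
for pair in pairs:
    print(pair)
```

Keeping the Nice class number next to each pair makes it possible to group results by category later on.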
Approaches
- Baseline Translation: A transformer-based model (NLLB-200-3.3B) with LoRA fine-tuning.
- Fine-Tuning: Domain-specific fine-tuning on product-category text.
- RAG (Retrieval-Augmented Generation): Incorporating WIPO classification documents as external knowledge for context-aware translations.
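
To illustrate the RAG idea from the list above, here is a minimal sketch of the retrieve-then-prompt flow. It deliberately swaps in a toy bag-of-words cosine similarity in plain Python where a real setup would use dense embeddings and a vector index. The glossary entries, the prompt wording, and the names `retrieve` and `build_prompt` are assumptions for illustration, not the project's implementation.

```python
import math
from collections import Counter

# Hypothetical glossary entries standing in for WIPO classification documents.
GLOSSARY = [
    {"class": 25, "en": "clothing", "th": "เสื้อผ้า"},
    {"class": 30, "en": "coffee", "th": "กาแฟ"},
    {"class": 30, "en": "tea", "th": "ชา"},
]


def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector (a stand-in for a dense embedding)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, k: int = 2) -> list[dict]:
    """Return up to k glossary entries with non-zero similarity to the query."""
    q = bow(query)
    scored = [(cosine(q, bow(entry["en"])), entry) for entry in GLOSSARY]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [entry for _, entry in scored[:k]]


def build_prompt(source: str, k: int = 2) -> str:
    """Prepend retrieved term pairs to the translation request."""
    hints = [
        f'- Class {e["class"]}: "{e["en"]}" -> "{e["th"]}"'
        for e in retrieve(source, k)
    ]
    context = "\n".join(hints) if hints else "(no matching terms)"
    return (
        "Known Nice Classification terms:\n"
        f"{context}\n\n"
        f"Translate into Thai: {source}"
    )


print(build_prompt("coffee and tea"))
```

The resulting prompt string would then be passed to whichever model is being benchmarked; only the retrieval backend needs to change to move from this toy version to an indexed one.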
Tools
- Fine-tuning: unsloth
- RAG: faiss, langchain
- Inference: vllm
- Evaluation: pythainlp, nltk, jiwer
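
The README does not state which metrics are computed with the evaluation libraries, so the following is only a sketch of one plausible choice: character error rate (CER). Thai is written without spaces between words, so word-level metrics need a tokenizer first, whereas CER can be computed on the raw strings. The sketch uses a plain-Python Levenshtein distance so that it runs without extra dependencies; the function names are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance over Unicode code points (not grapheme clusters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(
                min(
                    prev[j] + 1,             # deletion
                    curr[j - 1] + 1,         # insertion
                    prev[j - 1] + (r != h),  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one hypothesis against one reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def corpus_cer(refs: list[str], hyps: list[str]) -> float:
    """Total edits divided by total reference length across a test set."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("references must contain at least one character")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_ref


# Illustrative reference/hypothesis pairs, not project results.
refs = ["กาแฟ", "ชา"]
hyps = ["กาแฟ", "ชาา"]
print(corpus_cer(refs, hyps))
```

Grouping the reference and hypothesis lists by Nice class before calling `corpus_cer` would give one score per category.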