A comprehensive Vietnamese legal benchmark dataset for evaluating Large Language Models (LLMs) on various legal NLP tasks.
This benchmark contains 22 legal tasks organized into 5 main categories, covering key aspects of legal language understanding and generation in Vietnamese. Task 5.3 is divided into 2 subtasks with separate folders. Each task folder contains:
- Dataset file(s): .jsonl format containing questions and ground truth answers
- Prompt file: prompt_X_Y.py defining the evaluation prompt and format, where X.Y is the task ID defined below.
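As a minimal sketch of reading one of these dataset files, the snippet below assumes only that each non-empty line holds one JSON object, which is the usual JSON Lines convention. The helper name load_jsonl is illustrative and not part of the repository, and no field names are assumed.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read a JSON Lines file into a list of Python objects."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as error:
                # Report the offending line so a broken record is easy to find.
                raise ValueError(f"{path}:{line_number}: invalid JSON") from error
    return records
```

For example, load_jsonl("1.1/1_1.jsonl") would return the samples of task 1.1 as a list, one entry per line of the file.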
vlegal-bench/
├── 1.1/                    # Legal Entity Recognition
│   ├── 1_1.jsonl
│   └── prompt_1_1.py
├── 1.2/                    # Legal Topic Classification
├── 1.3/                    # Legal Concept Recall
├── 1.4/                    # Article Recall
├── 1.5/                    # Legal Schema Recall
├── 2.1/                    # Relation Extraction
├── 2.2/                    # Legal Element Recognition
├── 2.3/                    # Legal Graph Structuring
├── 2.4/                    # Judgement Verification
├── 2.5/                    # User Intent Understanding
├── 3.1/                    # Article/Clause Prediction
├── 3.2/                    # Legal Court Decision Prediction
├── 3.3/                    # Multi-hop Graph Reasoning
├── 3.4/                    # Conflict & Consistency Detection
├── 3.5/                    # Penalty / Remedy Estimation
├── 4.1/                    # Legal Document Summarization
├── 4.2/                    # Judicial Reasoning Generation
├── 4.3/                    # Object Legal Opinion Generation
├── 5.1/                    # Bias Detection
├── 5.2/                    # Privacy & Data Protection
├── 5.3/                    # Ethical Consistency Assessment
└── 5.4/                    # Unfair Contract Detection
- 1.1: Legal Entity Recognition
- 1.2: Legal Topic Classification
- 1.3: Legal Concept Recall
- 1.4: Article Recall
- 1.5: Legal Schema Recall
- 2.1: Relation Extraction
- 2.2: Legal Element Recognition
- 2.3: Legal Graph Structuring
- 2.4: Judgement Verification
- 2.5: User Intent Understanding
- 3.1: Article/Clause Prediction
- 3.2: Legal Court Decision Prediction
- 3.3: Multi-hop Graph Reasoning
- 3.4: Conflict & Consistency Detection
- 3.5: Penalty / Remedy Estimation
- 4.1: Legal Document Summarization
- 4.2: Judicial Reasoning Generation
- 4.3: Object Legal Opinion Generation
- 5.1: Bias Detection
- 5.2: Privacy & Data Protection
- 5.3: Ethical Consistency Assessment
- 5.4: Unfair Contract Detection
pip install uv
uv venv .venv
source .venv/bin/activate
uv sync
Create your own .env file according to .env_example.
- Start vLLM Server
# Edit MODEL_NAME in vllm_serving.sh
bash vllm_serving.sh
- Run Inference
# Edit TASK variable in infer.sh (e.g., TASK="1.1")
bash infer.sh
# For tasks with remove_content variant (3.3, 3.4)
USE_REMOVE_CONTENT=true bash infer.sh
# For standard tasks
bash infer.sh
# For tasks with remove_content variant (3.3, 3.4)
USE_REMOVE_CONTENT=true bash infer.sh
Edit the following variables in infer.sh:
- TASK: Task number (e.g., "1.1", "3.3", "4.1")
- MODEL_NAME: Model to use (e.g., "gpt-4o", "gemini-2.5-flash")
- BATCH_SIZE: Number of samples per batch (default: 1)
- MAX_MODEL_LEN: Maximum context length (default: 32768)
- USE_REMOVE_CONTENT: Use content-removed dataset variant (true/false)
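When inference is launched from Python rather than from a shell, the USE_REMOVE_CONTENT override can be passed through the process environment. The sketch below only builds the command and environment; the helper name build_infer_call is hypothetical, and it assumes infer.sh reads USE_REMOVE_CONTENT from the environment in the same way as the shell invocation shown above. The task argument is used only for a sanity check, since TASK itself is edited inside infer.sh.

```python
import os

# Tasks this README lists as having a remove_content variant.
REMOVE_CONTENT_TASKS = {"3.3", "3.4"}


def build_infer_call(task, use_remove_content=False):
    """Return (command, env) for running infer.sh without executing it."""
    if use_remove_content and task not in REMOVE_CONTENT_TASKS:
        raise ValueError(f"task {task} has no remove_content variant")
    env = dict(os.environ)  # copy, so the current process is left untouched
    env["USE_REMOVE_CONTENT"] = "true" if use_remove_content else "false"
    return ["bash", "infer.sh"], env
```

The returned pair can be handed to subprocess.run(command, env=env, check=True) from the repository root.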
The evaluation is automatically performed after inference. Metrics vary by task type:
- Accuracy
- Precision
- Recall
- F1-Score
- BLEU Score
- ROUGE Score
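To make the classification metrics concrete, the sketch below computes accuracy, precision, recall, and F1 for a single positive label using the standard definitions. It is an illustration only, not the repository's Metrics implementation; the function name and the single-label setup are assumptions.

```python
def binary_classification_metrics(gold, predicted, positive_label):
    """Compute accuracy, precision, recall and F1 for one positive label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    true_pos = sum(g == positive_label and p == positive_label for g, p in pairs)
    false_pos = sum(g != positive_label and p == positive_label for g, p in pairs)
    false_neg = sum(g == positive_label and p != positive_label for g, p in pairs)
    correct = sum(g == p for g, p in pairs)

    # Each ratio falls back to 0.0 when its denominator is zero.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Generation tasks would instead be scored with overlap metrics such as BLEU and ROUGE, which this sketch does not cover.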
Results are saved in:
./<task>/<task>_llm_test_results_<model_name>.json
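The path pattern can be expressed as a small helper. This is a sketch with a hypothetical function name; it assumes the file-name part uses the underscore form of the task ID (1_1 for task 1.1), as the 1.1 example in this README does. How model names containing path separators are handled is not specified here, so the helper does not attempt it.

```python
from pathlib import Path


def result_path(task, model_name, root="."):
    """Build the results file path for a task ID such as "1.1"."""
    task_stem = task.replace(".", "_")  # "1.1" -> "1_1"
    file_name = f"{task_stem}_llm_test_results_{model_name}.json"
    return Path(root) / task / file_name
```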
To evaluate existing prediction files:
from src.evaluation import Metrics
metrics = Metrics(result_path="./1.1/1_1_llm_test_results_model_name.json")
results = metrics.eval()
print(results)
When adding new tasks or modifying existing ones:
- Maintain the folder structure X.Y/
- Include both dataset (.jsonl) and prompt (prompt_X_Y.py) files
- Update this README with the task description
- Test with the evaluation pipeline
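The checklist above can be partly automated. The sketch below reports which expected files are missing from a task folder; the function name is hypothetical, and it assumes the dataset is named X_Y.jsonl as in the 1.1 example. Tasks with additional dataset variants or subtask folders (such as 5.3) may need different names.

```python
from pathlib import Path


def missing_task_files(root, task):
    """Return the expected files that are absent from a task folder."""
    stem = task.replace(".", "_")  # "1.1" -> "1_1"
    folder = Path(root) / task
    expected = [
        folder / f"{stem}.jsonl",      # dataset file
        folder / f"prompt_{stem}.py",  # prompt definition
    ]
    return [path for path in expected if not path.is_file()]
```

An empty list means both files are present; running this from the repository root before opening a pull request catches naming mistakes early.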
Please refer to the repository license for usage terms and conditions.
For questions or issues, please open an issue in the repository or contact the maintainers.
@misc{dong2025vlegalbenchcognitivelygroundedbenchmark,
title={VLegal-Bench: Cognitively Grounded Benchmark for Vietnamese Legal Reasoning of Large Language Models},
author={Nguyen Tien Dong and Minh-Anh Nguyen and Thanh Dat Hoang and Nguyen Tuan Ngoc and Dao Xuan Quang Minh and Phan Phi Hai and Nguyen Thi Ngoc Anh and Dang Van Tu and Binh Vu},
year={2025},
eprint={2512.14554},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2512.14554},
}