This repository contains the code and data used in the paper:
Meta-Evaluation of LLM-Based Reference-Free Metrics for RAG
Xinyuan Cheng · LMU Munich
We evaluate the robustness and validity of various reference-free evaluation metrics for Retrieval-Augmented Generation (RAG), including:
- Direct Prompting (DP-Free & DP-Token)
- G-Eval
- GPTScore
- RAGAS
Our evaluation is conducted on the WikiEval benchmark and covers three quality dimensions:
- Faithfulness
- Answer Relevance
- Context Relevance
We assess both pairwise accuracy and correlation with human judgments, and analyze the effects of prompt format and LLM confidence.
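As a rough illustration of this analysis step (the actual code lives in results_analysis/scripts/), the sketch below computes pairwise accuracy and rank correlation with human judgments. The column names (metric_score_preferred, metric_score_rejected, metric_score, human_score) and the choice of Spearman/Kendall correlation are assumptions made for this example, not necessarily what the repository uses.

```python
# Illustrative sketch only -- not the repository's analysis scripts.
# Assumed CSV schema: one row per WikiEval example, with metric scores for the
# human-preferred and rejected answers, plus a graded human_score column.
import pandas as pd
from scipy.stats import kendalltau, spearmanr


def pairwise_accuracy(df: pd.DataFrame) -> float:
    """Fraction of pairs where the metric ranks the human-preferred answer higher."""
    wins = (df["metric_score_preferred"] > df["metric_score_rejected"]).sum()
    return wins / len(df)


def human_correlation(df: pd.DataFrame) -> dict:
    """Rank correlation between metric scores and human judgments."""
    rho, _ = spearmanr(df["metric_score"], df["human_score"])
    tau, _ = kendalltau(df["metric_score"], df["human_score"])
    return {"spearman": rho, "kendall": tau}


if __name__ == "__main__":
    scores = pd.read_csv("../output/example_metric_output.csv")  # hypothetical file
    print("pairwise accuracy:", pairwise_accuracy(scores))
    print("correlation:", human_correlation(scores))
```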
llm-metric-metaeval/
├── experiment/
│   ├── scripts/          # Scripts for running metric evaluations
│   ├── input/            # Input files: QCA data
│   └── output/           # Metric outputs (CSV)
└── results_analysis/
    ├── scripts/          # Scripts for accuracy, correlation, and robustness analysis
    └── output/           # Analysis results (CSV)
You can run individual metric evaluation scripts or reproduce all experiments at once.
Navigate to experiment/scripts/ and run any of the metric scripts. Example:
python ragas_ff.py --input_csv ../input/ff.csv --output_csv ../output/ragas_ff_output.csv
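The metric scripts share the command-line interface shown above: a CSV of question-context-answer (QCA) rows is read via --input_csv and per-row scores are written to --output_csv. The skeleton below is only a hedged sketch of that interface; the QCA column names and the score_faithfulness placeholder are assumptions for illustration, not the actual implementation of ragas_ff.py.

```python
# Hedged skeleton of a metric script's CLI (illustrative, not ragas_ff.py itself).
# The question/context/answer column names and score_faithfulness() are assumptions.
import argparse

import pandas as pd


def score_faithfulness(question: str, context: str, answer: str) -> float:
    """Placeholder: call the LLM-based metric (e.g. RAGAS faithfulness) here."""
    raise NotImplementedError


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Run a reference-free RAG metric over QCA data."
    )
    parser.add_argument("--input_csv", required=True,
                        help="CSV with question/context/answer columns")
    parser.add_argument("--output_csv", required=True,
                        help="Where to write per-row metric scores")
    args = parser.parse_args()

    df = pd.read_csv(args.input_csv)
    df["score"] = [
        score_faithfulness(row.question, row.context, row.answer)
        for row in df.itertuples()
    ]
    df.to_csv(args.output_csv, index=False)


if __name__ == "__main__":
    main()
```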
Alternatively, you can use the provided run_exp.sh script to reproduce all experiments in one go:
bash run_exp.sh
This script sequentially executes all evaluation pipelines and saves their outputs under experiment/output/.
The scripts in experiment/scripts/ follow the naming pattern <metric>_<dimension>[_variants].py, where:
- ff → Faithfulness
- ar → Answer Relevance
- cr → Context Relevance
- prompt → the script covers DP-Free, DP-Token, and G-Eval
- variants → the script also includes prompt variants
For example:
- ragas_cr.py evaluates Context Relevance using RAGAS
- prompt_ar_variants.py evaluates Answer Relevance using DP-Free, DP-Token, and G-Eval, including prompt variants