
Meta-Evaluation of LLM-Based Reference-Free Metrics for RAG

This repository contains the code and data used in the paper:

Meta-Evaluation of LLM-Based Reference-Free Metrics for RAG
Xinyuan Cheng · LMU Munich

🧠 Overview

We evaluate the robustness and validity of various reference-free evaluation metrics for Retrieval-Augmented Generation (RAG), including:

  • Direct Prompting (DP-Free & DP-Token)
  • G-Eval
  • GPTScore
  • RAGAS

Our evaluation is conducted on the WikiEval benchmark and covers three quality dimensions:

  • Faithfulness
  • Answer Relevance
  • Context Relevance

We assess both pairwise accuracy and correlation with human judgments, and analyze the effects of prompt format and LLM confidence.
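
For intuition, below is a minimal sketch (not the repository's analysis code) of how pairwise accuracy and rank correlation against human judgments can be computed from a metric's per-example scores. The file path, the column names metric_score and human_score, and the all-pairs formulation of pairwise accuracy are assumptions for illustration only.

import itertools
import pandas as pd
from scipy.stats import kendalltau, spearmanr

# Hypothetical output file and column names; adapt to the actual CSVs under experiment/output/.
df = pd.read_csv("experiment/output/ragas_ff_output.csv")
metric, human = df["metric_score"], df["human_score"]

# Correlation with human judgments.
rho, _ = spearmanr(metric, human)
tau, _ = kendalltau(metric, human)

# Pairwise accuracy: fraction of example pairs (with distinct human scores) that the
# metric ranks in the same order as the human judgment.
pairs = [(i, j) for i, j in itertools.combinations(range(len(df)), 2) if human[i] != human[j]]
correct = sum((metric[i] - metric[j]) * (human[i] - human[j]) > 0 for i, j in pairs)
print(f"Spearman={rho:.3f}  Kendall tau={tau:.3f}  pairwise accuracy={correct / len(pairs):.3f}")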

πŸ“ Project Structure

llm-metric-metaeval/
├── experiment/
│   ├── scripts/   # Scripts for running metric evaluations
│   ├── input/     # Input files: QCA (question-context-answer) data
│   └── output/    # Metric outputs (CSV)
└── results_analysis/
    ├── scripts/   # Scripts for accuracy, correlation, and robustness analysis
    └── output/    # Analysis results (CSV)

🚀 Run Experiments

You can run individual metric evaluation scripts or reproduce all experiments at once.

▶️ Run a Single Metric

Navigate to experiment/scripts/ and run any of the metric scripts. Example:

 python ragas_ff.py --input_csv ../input/ff.csv --output_csv ../output/ragas_ff_output.csv 

▶️ Reproduce All Experiments

Alternatively, you can use the provided run_exp.sh script to reproduce all experiments in one go:

 bash run_exp.sh 

This script sequentially executes all evaluation pipelines and saves their outputs under experiment/output/.

📌 Naming Conventions

The script and output files follow this naming pattern:

  • ff – Faithfulness
  • ar – Answer Relevance
  • cr – Context Relevance
  • prompt – the prompting-based metrics (DP-Free, DP-Token, and G-Eval)
  • variants – scripts that additionally include prompt variants

For example:

  • ragas_cr.py evaluates context relevance using RAGAS.
  • prompt_ar_variants.py evaluates answer relevance using DP-Free, DP-Token, and G-Eval, including prompt variants.
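
As a quick illustration of how this convention can be used downstream, the following sketch (not part of the repository) lists output files grouped by metric and dimension; it assumes the outputs are CSVs under experiment/output/, as produced by the steps above.

from pathlib import Path

DIMENSIONS = {"ff": "Faithfulness", "ar": "Answer Relevance", "cr": "Context Relevance"}

for csv_path in sorted(Path("experiment/output").glob("*.csv")):
    parts = csv_path.stem.split("_")  # e.g. "ragas_ff_output" -> ["ragas", "ff", "output"]
    dimension = next((DIMENSIONS[p] for p in parts if p in DIMENSIONS), "unknown")
    print(f"{csv_path.name}: metric={parts[0]}, dimension={dimension}")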
