
TRIVIA+ Dataset

The first RAG-based hallucination detection benchmark that combines human-verified labels, long context (up to 94K chars — 7-33x longer than existing benchmarks), sentence-level annotations, and controlled label noise — satisfying all seven desiderata for hallucination detection evaluation.

Dataset Overview

| Split | Count |
|-------|-------|
| Train | 2,263 |
| Valid | 316 |
| Test  | 645 |
| **Total** | **3,224** |

Data Sources

The dataset aggregates examples from multiple QA benchmarks:

| Source | Count | Description |
|--------|-------|-------------|
| drop | 1,339 (41.5%) | Discrete Reasoning Over Paragraphs |
| msmarco / ms_marco | 763 (23.7%) | Microsoft Machine Reading Comprehension |
| nq | 674 (20.9%) | Natural Questions |
| trivia | 309 (9.6%) | Trivia Question Answering |
| covid | 139 (4.3%) | COVID-19 scientific literature QA |

Note: The source column contains both msmarco (521) and ms_marco (242) as variants of the same origin dataset.
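Since msmarco and ms_marco refer to the same origin, you may want to collapse them before computing per-source statistics. A minimal sketch (using a small hypothetical frame in place of the real parquet file):

```python
import pandas as pd

# Hypothetical stand-in rows; in practice, load the parquet file instead.
df = pd.DataFrame({"source": ["msmarco", "ms_marco", "nq"]})

# Collapse the two spelling variants into a single canonical value.
df["source"] = df["source"].replace({"ms_marco": "msmarco"})

print(df["source"].value_counts())
```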

LLM Response Sources

Responses were generated by three LLMs:

| Model | Count | Description |
|-------|-------|-------------|
| mixtral_8x7b | 1,686 (52.3%) | Mixtral 8x7B |
| claude | 1,006 (31.2%) | Claude (SOTA LLM) |
| gemma | 532 (16.5%) | Gemma 7B |

Human Annotation

Each sample was annotated at the sentence level by multiple annotators (up to 6 per sample) through a rigorous multi-stage pipeline:

  1. Two annotators label each sample independently
  2. On disagreement, two additional annotators provide labels
  3. If still no clear majority, two more labels are gathered
  4. Labels are aggregated via majority vote with strictest-label tiebreaking

Annotators were trained over two rounds with author audits. Low-performing annotators were removed using the Dawid-Skene model. Each sentence receives one of four labels: Supported, Contradicted, Not Mentioned, or Supplementary.
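The aggregation step above (majority vote with strictest-label tiebreaking) can be sketched as follows. The severity ordering here is an assumption for illustration; the paper's exact ranking of the four labels may differ.

```python
from collections import Counter

# Assumed severity ordering, strictest first (an illustrative assumption,
# not necessarily the ordering used by the TRIVIA+ authors).
SEVERITY = ["Contradicted", "Not Mentioned", "Supplementary", "Supported"]

def aggregate(votes):
    """Majority vote over annotator labels; ties broken by the strictest label."""
    counts = Counter(votes)
    top = max(counts.values())
    tied = [label for label, count in counts.items() if count == top]
    # Among tied labels, pick the one ranked earliest (strictest) in SEVERITY.
    return min(tied, key=SEVERITY.index)

print(aggregate(["Supported", "Supported", "Contradicted"]))  # clear majority
print(aggregate(["Supported", "Contradicted"]))               # tie -> strictest wins
```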

Figure: TRIVIA+ multi-vote annotation pipeline, with escalating review stages and Dawid-Skene quality filtering.

File

Triviaplus_all_withnoise_cleaned_20260415.parquet — Cleaned dataset with all annotations.

See DATA_DETAILS.md for complete column descriptions, label aggregation logic, and label distributions.

Loading the Dataset

import pandas as pd

# Load the dataset
df = pd.read_parquet("Triviaplus_all_withnoise_cleaned_20260415.parquet")

# Filter by split
train = df[df['split'] == 'train']
valid = df[df['split'] == 'valid']
test = df[df['split'] == 'test']

# Access sentence-level labels
for idx, row in df.head(3).iterrows():
    print(f"Question: {row['question'][:50]}...")
    print(f"Answer: {row['answer'][:50]}...")
    print(f"Sentences: {row['answer_sentence_list']}")
    print(f"Labels: {row['sentence_level_majority_vote']}")
    print(f"Response label: {row['response_level_label_binary']}")
    print()

Verification

Run the label consistency check:

python verify_label_consistency.py Triviaplus_all_withnoise_cleaned_20260415.parquet
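The verify_label_consistency.py script itself is not shown here; one plausible check it could perform is confirming that every row carries exactly one label per answer sentence. A minimal sketch using the column names from the loading example, on hypothetical stand-in data:

```python
import pandas as pd

# Hypothetical stand-in rows; in practice, pass the parquet path on the CLI.
df = pd.DataFrame({
    "answer_sentence_list": [["s1", "s2"], ["s1"]],
    "sentence_level_majority_vote": [["Supported", "Contradicted"], ["Supported"]],
})

# Every row should have as many sentence labels as answer sentences.
mismatched = df[df.apply(
    lambda r: len(r["answer_sentence_list"]) != len(r["sentence_level_majority_vote"]),
    axis=1,
)]
print(f"{len(mismatched)} rows with sentence/label length mismatch")
```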

License

MIT

About

Long-Context Hallucination Detection Benchmark: [ACL main] Rethinking Evaluation for LLM Hallucination Detection: A Desiderata, A New RAG-based Benchmark, New Insights
