The first RAG-based hallucination detection benchmark that combines human-verified labels, long context (up to 94K chars — 7-33x longer than existing benchmarks), sentence-level annotations, and controlled label noise — satisfying all seven desiderata for hallucination detection evaluation.
| Split | Count |
|---|---|
| Train | 2,263 |
| Valid | 316 |
| Test | 645 |
| Total | 3,224 |
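The split sizes above can be sanity-checked directly from the file. A minimal sketch, assuming the parquet has a `split` column with values `train` / `valid` / `test` (consistent with the usage example later in this README):

```python
import pandas as pd

def check_split_counts(df: pd.DataFrame) -> dict:
    # Count rows per split and confirm they cover the whole frame.
    counts = df['split'].value_counts().to_dict()
    assert sum(counts.values()) == len(df)
    return counts

# Tiny stand-in frame for illustration; on the real file,
# expect {'train': 2263, 'valid': 316, 'test': 645}.
demo = pd.DataFrame({'split': ['train', 'train', 'valid', 'test']})
print(check_split_counts(demo))
```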
The dataset aggregates examples from multiple QA benchmarks:
| Source | Count | Description |
|---|---|---|
| drop | 1,339 (41.5%) | Discrete Reasoning Over Paragraphs |
| msmarco / ms_marco | 763 (23.7%) | Microsoft Machine Reading Comprehension |
| nq | 674 (20.9%) | Natural Questions |
| trivia | 309 (9.6%) | Trivia Question Answering |
| covid | 139 (4.3%) | COVID-19 scientific literature QA |
Note: The source column lists MS MARCO under two spellings, msmarco (521 rows) and ms_marco (242 rows), which refer to the same origin dataset.
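Before any per-source analysis, the two spellings should be merged. A minimal sketch, assuming the column is named `source` (an assumption about the schema):

```python
import pandas as pd

def normalize_sources(df: pd.DataFrame) -> pd.DataFrame:
    # Map the ms_marco variant onto msmarco so both spellings
    # of the same origin dataset count as one source.
    out = df.copy()
    out['source'] = out['source'].replace({'ms_marco': 'msmarco'})
    return out

demo = pd.DataFrame({'source': ['msmarco', 'ms_marco', 'nq']})
print(normalize_sources(demo)['source'].tolist())
```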
Responses were generated by three LLMs:
| Model | Count | Description |
|---|---|---|
| mixtral_8x7b | 1,686 (52.3%) | Mixtral 8x7B |
| claude | 1,006 (31.2%) | Claude (Anthropic) |
| gemma | 532 (16.5%) | Gemma 7B |
Each sample was annotated at the sentence level by multiple annotators (up to 6 per sample) through a rigorous multi-stage pipeline:
- Two annotators label each sample independently
- On disagreement, two additional annotators provide labels
- If still no clear majority, two more labels are gathered
- Labels are aggregated via majority vote with strictest-label tiebreaking
Annotators were trained over two rounds with author audits. Low-performing annotators were removed using the Dawid-Skene model. Each sentence receives one of four labels: Supported, Contradicted, Not Mentioned, or Supplementary.
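The aggregation rule above can be sketched in a few lines. The severity ordering used for the tiebreak below (Contradicted strictest, then Not Mentioned, Supplementary, Supported) is an assumption for illustration, not taken from the dataset documentation:

```python
from collections import Counter

# Assumed strictness ordering, strictest first (hypothetical).
SEVERITY = ['Contradicted', 'Not Mentioned', 'Supplementary', 'Supported']

def aggregate(labels: list[str]) -> str:
    # Majority vote over annotator labels; on a tie, pick the
    # strictest label per the assumed ordering above.
    counts = Counter(labels)
    top = max(counts.values())
    tied = [lab for lab, c in counts.items() if c == top]
    return min(tied, key=SEVERITY.index)

print(aggregate(['Supported', 'Supported', 'Contradicted']))  # clear majority
print(aggregate(['Supported', 'Contradicted']))               # tie, strictest wins
```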
*Figure: Multi-vote annotation pipeline with escalating review stages and Dawid-Skene quality filtering.*
`Triviaplus_all_withnoise_cleaned_20260415.parquet` — cleaned dataset with all annotations.
See DATA_DETAILS.md for complete column descriptions, label aggregation logic, and label distributions.
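One relationship worth checking is that the response-level binary label agrees with the sentence-level labels. A minimal sketch, assuming a response is flagged (1) if and only if any of its sentences is labeled Contradicted — an assumption about the labeling scheme, not confirmed by the dataset docs:

```python
def derive_response_label(sentence_labels: list[str]) -> int:
    # Assumed rule: the response counts as hallucinated (1) iff
    # at least one sentence contradicts the context.
    return int(any(lab == 'Contradicted' for lab in sentence_labels))

def is_consistent(sentence_labels: list[str], response_label: int) -> bool:
    # A row is consistent when the stored response-level label
    # matches the label derived from its sentence labels.
    return derive_response_label(sentence_labels) == response_label

print(is_consistent(['Supported', 'Contradicted'], 1))
print(is_consistent(['Supported', 'Supported'], 0))
```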
```python
import pandas as pd

# Load the dataset
df = pd.read_parquet("Triviaplus_all_withnoise_cleaned_20260415.parquet")

# Filter by split
train = df[df['split'] == 'train']
valid = df[df['split'] == 'valid']
test = df[df['split'] == 'test']

# Access sentence-level labels
for idx, row in df.head(3).iterrows():
    print(f"Question: {row['question'][:50]}...")
    print(f"Answer: {row['answer'][:50]}...")
    print(f"Sentences: {row['answer_sentence_list']}")
    print(f"Labels: {row['sentence_level_majority_vote']}")
    print(f"Response label: {row['response_level_label_binary']}")
    print()
```

Run the label consistency check:

```shell
python verify_label_consistency.py Triviaplus_all_withnoise_cleaned_20260415.parquet
```

MIT