The first RAG-based hallucination detection benchmark that combines human-verified labels, long context (up to 94K chars — 7-33x longer than existing benchmarks), sentence-level annotations, and controlled label noise — satisfying all seven desiderata for hallucination detection evaluation.
| Split | Count |
|---|---|
| Train | 2,263 |
| Valid | 316 |
| Test | 645 |
| Total | 3,224 |
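The split sizes above can be sanity-checked directly from the file. A minimal sketch, assuming the parquet has a `split` column with values `train` / `valid` / `test` (consistent with the usage example later in this README):

```python
import pandas as pd

def check_split_counts(df: pd.DataFrame) -> dict:
    # Count rows per split and confirm they cover the whole frame.
    counts = df['split'].value_counts().to_dict()
    assert sum(counts.values()) == len(df)
    return counts

# Tiny stand-in frame for illustration; on the real file,
# expect {'train': 2263, 'valid': 316, 'test': 645}.
demo = pd.DataFrame({'split': ['train', 'train', 'valid', 'test']})
print(check_split_counts(demo))
```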
The dataset aggregates examples from multiple QA benchmarks:
| Source | Count | Description |
|---|---|---|
| drop | 1,339 (41.5%) | Discrete Reasoning Over Paragraphs |
| msmarco / ms_marco | 763 (23.7%) | Microsoft Machine Reading Comprehension |
| nq | 674 (20.9%) | Natural Questions |
| trivia | 309 (9.6%) | Trivia Question Answering |
| covid | 139 (4.3%) | COVID-19 scientific literature QA |
Note: The source column lists MS MARCO under two spellings, msmarco (521 rows) and ms_marco (242 rows), which refer to the same origin dataset.
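Before any per-source analysis, the two spellings should be merged. A minimal sketch, assuming the column is named `source` (an assumption about the schema):

```python
import pandas as pd

def normalize_sources(df: pd.DataFrame) -> pd.DataFrame:
    # Map the ms_marco variant onto msmarco so both spellings
    # of the same origin dataset count as one source.
    out = df.copy()
    out['source'] = out['source'].replace({'ms_marco': 'msmarco'})
    return out

demo = pd.DataFrame({'source': ['msmarco', 'ms_marco', 'nq']})
print(normalize_sources(demo)['source'].tolist())
```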
Responses were generated by three LLMs:
| Model | Count | Description |
|---|---|---|
| mixtral_8x7b | 1,686 (52.3%) | Mixtral 8x7B |
| claude | 1,006 (31.2%) | Claude (Anthropic) |
| gemma | 532 (16.5%) | Gemma 7B |
Each sample was annotated at the sentence level by multiple annotators (up to 6 per sample) through a rigorous multi-stage pipeline:
- Two annotators label each sample independently
- On disagreement, two additional annotators provide labels
- If still no clear majority, two more labels are gathered
- Labels are aggregated via majority vote with strictest-label tiebreaking
Annotators were trained over two rounds with author audits. Low-performing annotators were removed using the Dawid-Skene model. Each sentence receives one of four labels: Supported, Contradicted, Not Mentioned, or Supplementary.
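The aggregation rule above can be sketched in a few lines. The severity ordering used for the tiebreak below (Contradicted strictest, then Not Mentioned, Supplementary, Supported) is an assumption for illustration, not taken from the dataset documentation:

```python
from collections import Counter

# Assumed strictness ordering, strictest first (hypothetical).
SEVERITY = ['Contradicted', 'Not Mentioned', 'Supplementary', 'Supported']

def aggregate(labels: list[str]) -> str:
    # Majority vote over annotator labels; on a tie, pick the
    # strictest label per the assumed ordering above.
    counts = Counter(labels)
    top = max(counts.values())
    tied = [lab for lab, c in counts.items() if c == top]
    return min(tied, key=SEVERITY.index)

print(aggregate(['Supported', 'Supported', 'Contradicted']))  # clear majority
print(aggregate(['Supported', 'Contradicted']))               # tie, strictest wins
```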
*Figure: Multi-vote annotation pipeline with escalating review stages and Dawid-Skene quality filtering.*
`Triviaplus_all_withnoise_cleaned_20260415.parquet` — cleaned dataset with all annotations.
See DATA_DETAILS.md for complete column descriptions, label aggregation logic, and label distributions.
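One relationship worth checking is that the response-level binary label agrees with the sentence-level labels. A minimal sketch, assuming a response is flagged (1) if and only if any of its sentences is labeled Contradicted — an assumption about the labeling scheme, not confirmed by the dataset docs:

```python
def derive_response_label(sentence_labels: list[str]) -> int:
    # Assumed rule: the response counts as hallucinated (1) iff
    # at least one sentence contradicts the context.
    return int(any(lab == 'Contradicted' for lab in sentence_labels))

def is_consistent(sentence_labels: list[str], response_label: int) -> bool:
    # A row is consistent when the stored response-level label
    # matches the label derived from its sentence labels.
    return derive_response_label(sentence_labels) == response_label

print(is_consistent(['Supported', 'Contradicted'], 1))
print(is_consistent(['Supported', 'Supported'], 0))
```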
```python
import pandas as pd

# Load the dataset
df = pd.read_parquet("Triviaplus_all_withnoise_cleaned_20260415.parquet")

# Filter by split
train = df[df['split'] == 'train']
valid = df[df['split'] == 'valid']
test = df[df['split'] == 'test']

# Access sentence-level labels
for idx, row in df.head(3).iterrows():
    print(f"Question: {row['question'][:50]}...")
    print(f"Answer: {row['answer'][:50]}...")
    print(f"Sentences: {row['answer_sentence_list']}")
    print(f"Labels: {row['sentence_level_majority_vote']}")
    print(f"Response label: {row['response_level_label_binary']}")
    print()
```

Run the label consistency check:

```shell
python verify_label_consistency.py Triviaplus_all_withnoise_cleaned_20260415.parquet
```

MIT