HotelCheckSpan is the first span-level faithfulness dataset for the hotel domain. Each example pairs one or more hotel accommodation descriptions with a generated summary. Human annotators labeled the summaries at the span level, marking the precise text spans that contain errors and classifying each span into one of three error types:
- Incorrect – factually wrong information
- Misleading – technically true, but open to misinterpretation
- Not Checkable – cannot be supported or refuted by the input description
In addition to human annotations, the dataset includes comparable span-level judgments from multiple LLMs, enabling direct human–model comparisons.
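One simple way to quantify such human–model comparisons is character-level span overlap. The sketch below is an illustration only (not the metric used in the accompanying paper): it converts each annotator's spans into a set of character indices and computes an F1 score between them. The example span offsets are hypothetical.

```python
def span_chars(spans):
    """Set of character indices covered by a list of {start, end} spans."""
    return {i for s in spans for i in range(s["start"], s["end"])}

def char_f1(human_spans, llm_spans):
    """Character-level F1 between two span lists ('end' offsets are exclusive)."""
    h, m = span_chars(human_spans), span_chars(llm_spans)
    if not h and not m:
        return 1.0  # both annotators marked no errors
    overlap = len(h & m)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(m), overlap / len(h)
    return 2 * precision * recall / (precision + recall)

# Hypothetical spans: the LLM marks a longer span than the human did.
human = [{"start": 9, "end": 24}]
llm = [{"start": 9, "end": 30}]
print(round(char_f1(human, llm), 3))  # → 0.833
```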
**Access:** To prevent test-set contamination and mindless scraping, the human annotation files (`hotelcheckspan.json`, `hotelcheckspan_extra.json`) are distributed exclusively through HuggingFace with gated access. The LLM annotation file (`hotelcheckspan_llm.json`) is available directly in this repository. See *Accessing the Data* below.
| Split | Examples | Annotation records | Spans |
|---|---|---|---|
| `hotelcheckspan.json` (human) | 496 | 1 482 | 794 |
| `hotelcheckspan_llm.json` (LLM) | 496 | 1 984 | 750 |
| `hotelcheckspan_extra.json` (supplementary) | 7 | 180 | 178 |
The main dataset was annotated in two stages (Stage 1: all 496 examples, Stage 2: 245 of the shorter-description examples) by crowd workers recruited on Prolific. Each example received at least two independent annotations.
Human span label distribution (main file): Not Checkable 46 %, Misleading 40 %, Incorrect 14 %.
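Given the annotation format described below, a distribution like this can be recomputed directly from the JSON. The snippet is a sketch over a tiny inline example following the dataset schema (the example values are hypothetical, not taken from the dataset):

```python
import json
from collections import Counter

# A minimal two-annotation example in the dataset's schema (hypothetical values).
examples = json.loads("""
[
  {
    "example_id": "ex-001",
    "summary": "Enjoy local gems like the rooftop pool.",
    "annotations": [
      {"annotator_id": "a1", "no_errors": false,
       "spans": [{"start": 6, "end": 16, "text": "local gems", "label": "Not Checkable"}]},
      {"annotator_id": "a2", "no_errors": false,
       "spans": [{"start": 21, "end": 38, "text": "the rooftop pool", "label": "Incorrect"}]}
    ]
  }
]
""")

# Count span labels across all annotation records.
label_counts = Counter(
    span["label"]
    for ex in examples
    for ann in ex["annotations"]
    for span in ann["spans"]
)
total = sum(label_counts.values())
for label, count in label_counts.most_common():
    print(f"{label}: {count} ({count / total:.0%})")
```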
LLMs included: Gemma-3, GPT-4o, and o3-mini.
```
hotelcheckspan/
├── data/
│   ├── hotelcheckspan.json        # human annotations – HuggingFace (gated) only
│   ├── hotelcheckspan_llm.json    # LLM annotations – public (this repo)
│   └── hotelcheckspan_extra.json  # supplementary – HuggingFace (gated) only
├── hotelcheckspan/                # Python package (dataset loader + metrics)
├── scripts/
│   └── convert_dataset.py         # builds the JSON files from raw exports
├── tests/                         # pytest test suite
└── pyproject.toml
```
All three JSON files share the same format: a list of example objects.
```json
[
  {
    "example_id": "...",
    "source_dataset": "...",
    "source_idx": 0,
    "description": "...",
    "summary": "...",
    "annotations": [
      {
        "annotator_id": "...",
        "campaign": "highlights-a",
        "group": "A",
        "stage": 1,
        "subgroup": null,
        "model": null,
        "prompt_variant": null,
        "overall_score": null,
        "no_errors": false,
        "spans": [
          {
            "start": 9,
            "end": 24,
            "text": "local gems like",
            "label": "Not Checkable"
          }
        ]
      }
    ]
  }
]
```

Fields:
| Field | Description |
|---|---|
| `example_id` | Unique identifier for the example |
| `source_dataset` | Name of the original hotel description source |
| `source_idx` | Index within the source dataset |
| `description` | Concatenated hotel description(s) used as input |
| `summary` | Generated summary to be evaluated |
| `annotations` | List of annotation records (one per annotator/model run) |
| `annotator_id` | Anonymised Prolific ID or model name |
| `campaign` | Annotation campaign identifier (e.g. `highlights-a`, `llm-gemma3`) |
| `group` | Annotator group assignment (A or B) |
| `stage` | Annotation stage (1 or 2) |
| `model` | LLM model name (`null` for human annotations) |
| `prompt_variant` | Prompt variant identifier (`null` for human annotations) |
| `overall_score` | Example-level label from LLM output, where applicable |
| `no_errors` | Whether the annotator marked no errors present |
| `spans` | List of annotated error spans |
| `spans[].start` | Character offset of span start in `summary` |
| `spans[].end` | Character offset of span end (exclusive) in `summary` |
| `spans[].text` | Span text (substring of `summary`) |
| `spans[].label` | Error type: Incorrect, Misleading, or Not Checkable |
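Because `start`/`end` are character offsets into `summary` with `end` exclusive, every span should satisfy `summary[start:end] == text`. A small consistency check along these lines (a sketch against the schema above; the example data is hypothetical):

```python
def check_span_offsets(example: dict) -> list[str]:
    """Return one error message per span whose offsets don't match its text."""
    problems = []
    summary = example["summary"]
    for ann in example["annotations"]:
        for span in ann["spans"]:
            extracted = summary[span["start"]:span["end"]]
            if extracted != span["text"]:
                problems.append(
                    f"{example['example_id']}: expected {span['text']!r}, "
                    f"got {extracted!r} at [{span['start']}:{span['end']}]"
                )
    return problems

example = {
    "example_id": "ex-001",
    "summary": "Stay at local gems like this boutique hotel.",
    "annotations": [
        {"spans": [{"start": 8, "end": 23, "text": "local gems like",
                    "label": "Not Checkable"}]}
    ],
}
print(check_span_offsets(example))  # → [] (all offsets line up)
```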
LLM annotations (hotelcheckspan_llm.json) are included in this repository and can be loaded directly:
```python
import json

with open("data/hotelcheckspan_llm.json", encoding="utf-8") as f:
    examples = json.load(f)
```

Human annotations (`hotelcheckspan.json`, `hotelcheckspan_extra.json`) are available on HuggingFace with gated access to prevent test-set contamination. Request access at:
Once approved, download and place the files under `data/` before running the code.
```python
import json

with open("data/hotelcheckspan.json", encoding="utf-8") as f:
    examples = json.load(f)

for ex in examples:
    print(ex["example_id"], ex["summary"])
    for ann in ex["annotations"]:
        for span in ann["spans"]:
            print(f"  [{span['label']}] {span['text']!r}")
```

The repository also provides a Python package with dataset loading utilities and metric implementations:

```python
from hotelcheckspan import load_dataset

examples = load_dataset()  # loads hotelcheckspan.json
```

The JSON files are built from raw Factgenie annotation exports stored in `highlights/outputs/processed/` (not included in this repository).
```bash
pip install -e ".[dev]"
python scripts/convert_dataset.py
pytest tests/
```

The annotation campaign was conducted in two stages, preceded by three pilot rounds. Stage 1 recruited 62 crowd workers on Prolific to double-annotate all 496 examples under two group conditions (A/B) differing only in the phrasing of the "no errors" checkbox. Stage 2 collected two additional annotations for 245 of the shorter-description examples using a stricter qualification filter; 28 of 40 candidates qualified after a five-example scoring task.
Annotators were required to mark minimal spans that, if removed or replaced, would correct the error while preserving grammaticality. Stylistic differences and subjective expressions were explicitly excluded. Annotation time was automatically tracked and implausibly fast submissions were manually reviewed.
This dataset accompanies the following paper (to appear at LREC-COLING 2026):
```bibtex
@inproceedings{schmidtova-etal-2026-hotelcheckspan,
  title = {{HotelCheckSpan}: A Benchmark Dataset for {LLM} Faithfulness},
  author = {Schmidtová, Patrícia and Dušek, Ondřej and Mahamood, Saad},
  booktitle = {Proceedings of the 2026 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2026)},
  year = {2026},
}
```

**Note:** Human span annotations are distributed via HuggingFace with gated access to prevent test-set contamination. The base data, LLM annotations, and code are publicly released at https://github.com/patuchen/hotelcheckspan.
See the `LICENSE` file if present, or contact the authors.