
HotelCheckSpan

HotelCheckSpan is the first span-level faithfulness dataset for the hotel domain. Each example pairs one or more hotel accommodation descriptions with a generated summary. Human annotators labeled the summaries at the span level, marking the precise text spans that contain errors and classifying each span into one of three error types:

  • Incorrect – factually wrong information
  • Misleading – technically true, but open to misinterpretation
  • Not Checkable – cannot be supported or refuted by the input description

In addition to human annotations, the dataset includes comparable span-level judgments from multiple LLMs, enabling direct human–model comparisons.

Access: To prevent test-set contamination and mindless scraping, the human annotation files (hotelcheckspan.json, hotelcheckspan_extra.json) are distributed exclusively through HuggingFace with gated access. The LLM annotation file (hotelcheckspan_llm.json) is available directly in this repository. See Accessing the Data below.

Dataset Statistics

Split                                       Examples   Annotation records   Spans
hotelcheckspan.json (human)                 496        1,482                794
hotelcheckspan_llm.json (LLM)               496        1,984                750
hotelcheckspan_extra.json (supplementary)   7          180                  178

The main dataset was annotated in two stages (Stage 1: all 496 examples, Stage 2: 245 of the shorter-description examples) by crowd workers recruited on Prolific. Each example received at least two independent annotations.

Human span label distribution (main file): Not Checkable 46 %, Misleading 40 %, Incorrect 14 %.

LLM annotators included: Gemma-3, GPT-4o, and o3-mini.

Repository Structure

hotelcheckspan/
├── data/
│   ├── hotelcheckspan.json          # human annotations – HuggingFace (gated) only
│   ├── hotelcheckspan_llm.json      # LLM annotations – public (this repo)
│   └── hotelcheckspan_extra.json    # supplementary – HuggingFace (gated) only
├── hotelcheckspan/                  # Python package (dataset loader + metrics)
├── scripts/
│   └── convert_dataset.py           # builds the JSON files from raw exports
├── tests/                           # pytest test suite
└── pyproject.toml

Data Format

All three JSON files share the same format: a list of example objects.

[
  {
    "example_id": "...",
    "source_dataset": "...",
    "source_idx": 0,
    "description": "...",
    "summary": "...",
    "annotations": [
      {
        "annotator_id": "...",
        "campaign": "highlights-a",
        "group": "A",
        "stage": 1,
        "subgroup": null,
        "model": null,
        "prompt_variant": null,
        "overall_score": null,
        "no_errors": false,
        "spans": [
          {
            "start": 9,
            "end": 24,
            "text": "local gems like",
            "label": "Not Checkable"
          }
        ]
      }
    ]
  }
]

Fields:

Field            Description
example_id       Unique identifier for the example
source_dataset   Name of the original hotel description source
source_idx       Index within the source dataset
description      Concatenated hotel description(s) used as input
summary          Generated summary to be evaluated
annotations      List of annotation records (one per annotator/model run)
annotator_id     Anonymised Prolific ID or model name
campaign         Annotation campaign identifier (e.g. highlights-a, llm-gemma3)
group            Annotator group assignment (A or B)
stage            Annotation stage (1 or 2)
model            LLM model name (null for human annotations)
prompt_variant   Prompt variant identifier (null for human annotations)
overall_score    Example-level label from LLM output where applicable
no_errors        Whether the annotator marked no errors present
spans            List of annotated error spans
spans[].start    Character offset of span start in summary
spans[].end      Character offset of span end (exclusive) in summary
spans[].text     Span text (substring of summary)
spans[].label    Error type: Incorrect, Misleading, or Not Checkable
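
Because spans[].end is exclusive, each span can be checked against the summary it annotates. A minimal consistency check, shown here on the public LLM file (the same works for the gated files once downloaded):

import json

with open("data/hotelcheckspan_llm.json", encoding="utf-8") as f:
    examples = json.load(f)

for ex in examples:
    for ann in ex["annotations"]:
        for span in ann["spans"]:
            # end is exclusive, so this slice must reproduce the stored span text
            assert ex["summary"][span["start"]:span["end"]] == span["text"]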

Accessing the Data

LLM annotations (hotelcheckspan_llm.json) are included in this repository and can be loaded directly:

import json

with open("data/hotelcheckspan_llm.json", encoding="utf-8") as f:
    examples = json.load(f)

Human annotations (hotelcheckspan.json, hotelcheckspan_extra.json) are available on HuggingFace with gated access to prevent test-set contamination. Request access at:

🤗 patuchen/hotelcheckspan on HuggingFace

Once approved, download and place the files under data/ before running the code.
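
With huggingface_hub installed, the gated files can also be fetched programmatically after approval. A sketch, assuming the JSON files sit at the root of the HuggingFace dataset repository and that you are authenticated (e.g. via huggingface-cli login):

from huggingface_hub import hf_hub_download

# Fetch both gated files into data/ (requires approved access and a valid token)
for filename in ["hotelcheckspan.json", "hotelcheckspan_extra.json"]:
    hf_hub_download(
        repo_id="patuchen/hotelcheckspan",
        repo_type="dataset",
        filename=filename,
        local_dir="data",
    )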

Loading the Data

import json

with open("data/hotelcheckspan.json", encoding="utf-8") as f:
    examples = json.load(f)

for ex in examples:
    print(ex["example_id"], ex["summary"])
    for ann in ex["annotations"]:
        for span in ann["spans"]:
            print(f"  [{span['label']}] {span['text']!r}")

The repository also provides a Python package with dataset loading utilities and metric implementations:

from hotelcheckspan import load_dataset

examples = load_dataset()  # loads hotelcheckspan.json
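
As a quick sanity check, the span label distribution reported under Dataset Statistics can be recomputed from the loaded examples:

from collections import Counter

# Count error labels across all annotations and spans
label_counts = Counter(
    span["label"]
    for ex in examples
    for ann in ex["annotations"]
    for span in ann["spans"]
)
print(label_counts)  # expected ordering: Not Checkable > Misleading > Incorrect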

Reproducing the Dataset Files

The JSON files are built from raw Factgenie annotation exports stored in highlights/outputs/processed/ (not included in this repository).

pip install -e ".[dev]"
python scripts/convert_dataset.py
pytest tests/

Annotation Design

The annotation campaign was conducted in two stages, preceded by three pilot rounds. Stage 1 recruited 62 crowd workers on Prolific to double-annotate all 496 examples under two group conditions (A/B) differing only in the phrasing of the "no errors" checkbox. Stage 2 collected two additional annotations for 245 of the shorter-description examples using a stricter qualification filter; 28 of 40 candidates qualified after a five-example scoring task.

Annotators were required to mark minimal spans that, if removed or replaced, would correct the error while preserving grammaticality. Stylistic differences and subjective expressions were explicitly excluded. Annotation time was tracked automatically, and implausibly fast submissions were manually reviewed.

Citation

This dataset accompanies the following paper (to appear at LREC-COLING 2026):

@inproceedings{schmidtova-etal-2026-hotelcheckspan,
  title     = {{HotelCheckSpan}: A Benchmark Dataset for {LLM} Faithfulness},
  author    = {Schmidtová, Patrícia and Dušek, Ondřej and Mahamood, Saad},
  booktitle = {Proceedings of the 2026 Joint International Conference on Computational Linguistics,
               Language Resources and Evaluation (LREC-COLING 2026)},
  year      = {2026},
}

Note: Human span annotations are distributed via HuggingFace with gated access to prevent test-set contamination. The base data, LLM annotations, and code are publicly released at https://github.com/patuchen/hotelcheckspan.

License

See the LICENSE file, or contact the authors.
