
HotelCheckSpan

HotelCheckSpan is the first span-level faithfulness dataset for the hotel domain. Each example pairs one or more hotel accommodation descriptions with a generated summary. Human annotators labeled the summaries at the span level, marking the precise text spans that contain errors and classifying each span into one of three error types:

  • Incorrect – factually wrong information
  • Misleading – technically true, but open to misinterpretation
  • Not Checkable – cannot be supported or refuted by the input description

In addition to human annotations, the dataset includes comparable span-level judgments from multiple LLMs, enabling direct human–model comparisons.

Access: To prevent test-set contamination and mindless scraping, the human annotation files (hotelcheckspan.json, hotelcheckspan_extra.json) are distributed exclusively through HuggingFace with gated access. The LLM annotation file (hotelcheckspan_llm.json) is available directly in this repository. See Accessing the Data below.

Dataset Statistics

Split                                       Examples   Annotation records   Spans
hotelcheckspan.json (human)                 496        1,482                794
hotelcheckspan_llm.json (LLM)               496        1,984                750
hotelcheckspan_extra.json (supplementary)   7          180                  178

The main dataset was annotated in two stages (Stage 1: all 496 examples, Stage 2: 245 of the shorter-description examples) by crowd workers recruited on Prolific. Each example received at least two independent annotations.

Human span label distribution (main file): Not Checkable 46 %, Misleading 40 %, Incorrect 14 %.

LLM annotators included: Gemma-3, GPT-4o, and o3-mini.

Repository Structure

hotelcheckspan/
├── data/
│   ├── hotelcheckspan.json          # human annotations – HuggingFace (gated) only
│   ├── hotelcheckspan_llm.json      # LLM annotations – public (this repo)
│   └── hotelcheckspan_extra.json    # supplementary – HuggingFace (gated) only
├── hotelcheckspan/                  # Python package (dataset loader + metrics)
├── scripts/
│   └── convert_dataset.py           # builds the JSON files from raw exports
├── tests/                           # pytest test suite
└── pyproject.toml

Data Format

All three JSON files share the same format: a list of example objects.

[
  {
    "example_id": "...",
    "source_dataset": "...",
    "source_idx": 0,
    "description": "...",
    "summary": "...",
    "annotations": [
      {
        "annotator_id": "...",
        "campaign": "highlights-a",
        "group": "A",
        "stage": 1,
        "subgroup": null,
        "model": null,
        "prompt_variant": null,
        "overall_score": null,
        "no_errors": false,
        "spans": [
          {
            "start": 9,
            "end": 24,
            "text": "local gems like",
            "label": "Not Checkable"
          }
        ]
      }
    ]
  }
]

Fields:

Field            Description
example_id       Unique identifier for the example
source_dataset   Name of the original hotel description source
source_idx       Index within the source dataset
description      Concatenated hotel description(s) used as input
summary          Generated summary to be evaluated
annotations      List of annotation records (one per annotator/model run)
annotator_id     Anonymised Prolific ID or model name
campaign         Annotation campaign identifier (e.g. highlights-a, llm-gemma3)
group            Annotator group assignment (A or B)
stage            Annotation stage (1 or 2)
model            LLM model name (null for human annotations)
prompt_variant   Prompt variant identifier (null for human annotations)
overall_score    Example-level label from LLM output where applicable
no_errors        Whether the annotator marked no errors present
spans            List of annotated error spans
spans[].start    Character offset of span start in summary
spans[].end      Character offset of span end (exclusive) in summary
spans[].text     Span text (substring of summary)
spans[].label    Error type: Incorrect, Misleading, or Not Checkable
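
Because spans[].end is exclusive, each span can be checked against the summary it annotates. A minimal consistency check, shown here on the public LLM file (the same works for the gated files once downloaded):

import json

with open("data/hotelcheckspan_llm.json", encoding="utf-8") as f:
    examples = json.load(f)

for ex in examples:
    for ann in ex["annotations"]:
        for span in ann["spans"]:
            # end is exclusive, so this slice must reproduce the stored span text
            assert ex["summary"][span["start"]:span["end"]] == span["text"]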

Accessing the Data

LLM annotations (hotelcheckspan_llm.json) are included in this repository and can be loaded directly:

import json

with open("data/hotelcheckspan_llm.json", encoding="utf-8") as f:
    examples = json.load(f)

Human annotations (hotelcheckspan.json, hotelcheckspan_extra.json) are available on HuggingFace with gated access to prevent test-set contamination. Request access at:

🤗 patuchen/hotelcheckspan on HuggingFace

Once approved, download and place the files under data/ before running the code.
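
With huggingface_hub installed, the gated files can also be fetched programmatically after approval. A sketch, assuming the JSON files sit at the root of the HuggingFace dataset repository and that you are authenticated (e.g. via huggingface-cli login):

from huggingface_hub import hf_hub_download

# Fetch both gated files into data/ (requires approved access and a valid token)
for filename in ["hotelcheckspan.json", "hotelcheckspan_extra.json"]:
    hf_hub_download(
        repo_id="patuchen/hotelcheckspan",
        repo_type="dataset",
        filename=filename,
        local_dir="data",
    )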

Loading the Data

import json

with open("data/hotelcheckspan.json", encoding="utf-8") as f:
    examples = json.load(f)

for ex in examples:
    print(ex["example_id"], ex["summary"])
    for ann in ex["annotations"]:
        for span in ann["spans"]:
            print(f"  [{span['label']}] {span['text']!r}")

The repository also provides a Python package with dataset loading utilities and metric implementations:

from hotelcheckspan import load_dataset

examples = load_dataset()  # loads hotelcheckspan.json
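
As a quick sanity check, the span label distribution reported under Dataset Statistics can be recomputed from the loaded examples:

from collections import Counter

# Count error labels across all annotations and spans
label_counts = Counter(
    span["label"]
    for ex in examples
    for ann in ex["annotations"]
    for span in ann["spans"]
)
print(label_counts)  # expected ordering: Not Checkable > Misleading > Incorrect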

Reproducing the Dataset Files

The JSON files are built from raw Factgenie annotation exports stored in highlights/outputs/processed/ (not included in this repository).

pip install -e ".[dev]"
python scripts/convert_dataset.py
pytest tests/

Annotation Design

The annotation campaign was conducted in two stages, preceded by three pilot rounds. Stage 1 recruited 62 crowd workers on Prolific to double-annotate all 496 examples under two group conditions (A/B) differing only in the phrasing of the "no errors" checkbox. Stage 2 collected two additional annotations for 245 of the shorter-description examples using a stricter qualification filter; 28 of 40 candidates qualified after a five-example scoring task.

Annotators were required to mark minimal spans that, if removed or replaced, would correct the error while preserving grammaticality. Stylistic differences and subjective expressions were explicitly excluded. Annotation time was tracked automatically, and implausibly fast submissions were manually reviewed.

Citation

This dataset accompanies the following paper (to appear at LREC-COLING 2026):

@inproceedings{schmidtova-etal-2026-hotelcheckspan,
  title     = {{HotelCheckSpan}: A Benchmark Dataset for {LLM} Faithfulness},
  author    = {Schmidtová, Patrícia and Dušek, Ondřej and Mahamood, Saad},
  booktitle = {Proceedings of the 2026 Joint International Conference on Computational Linguistics,
               Language Resources and Evaluation (LREC-COLING 2026)},
  year      = {2026},
}

Note: Human span annotations are distributed via HuggingFace with gated access to prevent test-set contamination. The base data, LLM annotations, and code are publicly released at https://github.com/patuchen/hotelcheckspan.

License

See the LICENSE file, or contact the authors.
