
Epsilon617/seahorse

 
 

Seahorse dataset

Seahorse is a dataset for multilingual, multifaceted summarization evaluation. It contains 96K summaries with human ratings along 6 quality dimensions: comprehensibility, repetition, grammar, attribution, main ideas, and conciseness, covering 6 languages, 9 systems and 4 datasets.

More details can be found in the paper, which can be cited as follows:

@misc{clark2023seahorse,
      title={SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation}, 
      author={Elizabeth Clark and Shruti Rijhwani and Sebastian Gehrmann and Joshua Maynez and Roee Aharoni and Vitaly Nikolaev and Thibault Sellam and Aditya Siddhant and Dipanjan Das and Ankur P. Parikh},
      year={2023},
      eprint={2305.13194},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

The Seahorse dataset is released under the CC-BY 4.0 license.

You can download the dataset here: https://storage.googleapis.com/seahorse-public/seahorse_data.zip

Dataset description

The dataset is split into 3 .tsv files: the train, validation, and test sets.

Each file contains the following information:

  • gem_id: The ID of the article that was used to generate the summary (see Retrieving articles from GEM for more details)
  • worker_lang: The language ID (de, es-ES, en-US, ru, tr, vi)
  • summary: The generated summary
  • model: The source of the summary (either reference or the name of the summarization model)
  • question1 to question6: Six columns with annotator ratings, one for each of the 6 quality dimensions (comprehensibility, repetition, grammar, attribution, main idea(s), and conciseness). If question1 = No, there are no ratings for the remaining questions.

Here is an example entry:

xlsum_english-validation-6416	en-US	Schools in England, Wales and Scotland are being urged to bring back overseas exchange trips.	t5_base	Yes	Yes	Yes	Yes	Yes	Yes
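
As a minimal sketch of how a row like the one above could be parsed, the snippet below reads tab-separated text with Python's standard csv module and maps the ratings to booleans. The helper name parse_seahorse_tsv is hypothetical. Whether the released files start with a header row is not stated here, so it is exposed as a parameter. The sketch also assumes that missing ratings appear as empty fields, and it maps any value other than Yes or No to None.

```python
import csv
import io

QUESTION_COLUMNS = [f"question{i}" for i in range(1, 7)]
COLUMNS = ["gem_id", "worker_lang", "summary", "model"] + QUESTION_COLUMNS


def parse_seahorse_tsv(text, has_header=True):
    """Parse Seahorse-style TSV text into a list of dicts (hypothetical helper)."""
    # QUOTE_NONE keeps quotation marks inside summaries as ordinary characters.
    reader = csv.reader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [row for row in reader if row]
    if has_header:
        rows = rows[1:]

    rating_map = {"Yes": True, "No": False}
    records = []
    for row in rows:
        # Pad short rows so that absent ratings become empty strings.
        row = row + [""] * (len(COLUMNS) - len(row))
        record = dict(zip(COLUMNS, row))
        for question in QUESTION_COLUMNS:
            # Anything other than Yes/No (including an empty field) becomes None.
            record[question] = rating_map.get(record[question].strip())
        records.append(record)
    return records


example = (
    "xlsum_english-validation-6416\ten-US\t"
    "Schools in England, Wales and Scotland are being urged to bring back "
    "overseas exchange trips.\tt5_base\tYes\tYes\tYes\tYes\tYes\tYes"
)
records = parse_seahorse_tsv(example, has_header=False)
print(records[0]["model"], records[0]["question1"])
```

Keeping unanswered questions as None rather than False preserves the difference between a negative rating and a question that was never asked because question1 was No.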

There is also a directory called duplicates, which contains the items that received multiple annotations. Note that this data should NOT be used for training metrics, as there may be overlap between the train/dev/test sets.

Retrieving articles from GEM

If you would like to access the articles that the Seahorse summaries are based on, you will need to retrieve them using their GEM ids.

The xsum, mlsum, and xlsum articles can all be retrieved through GEM on HuggingFace. The gem_id column points to the article in the GEM datasets.
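
The lookup by gem_id can be sketched as follows. This is an illustration only: split_gem_id and index_by_gem_id are hypothetical helpers, the ID pattern is inferred solely from the example entry above (other datasets may use a different layout), and the article records are assumed to be dicts that expose a gem_id field. The toy records stand in for articles loaded from GEM.

```python
def split_gem_id(gem_id):
    """Split an ID shaped like '<dataset>_<language>-<split>-<index>'.

    The pattern is an assumption based on a single example ID.
    """
    prefix, split, index = gem_id.rsplit("-", 2)
    dataset, _, language = prefix.partition("_")
    return {"dataset": dataset, "language": language, "split": split, "index": index}


def index_by_gem_id(articles):
    """Build a gem_id -> article mapping for constant-time lookups."""
    return {article["gem_id"]: article for article in articles}


# Toy records standing in for articles loaded from a GEM dataset.
articles = [
    {"gem_id": "xlsum_english-validation-6416", "text": "(article text)"},
    {"gem_id": "xlsum_english-validation-6417", "text": "(another article)"},
]
lookup = index_by_gem_id(articles)

parts = split_gem_id("xlsum_english-validation-6416")
print(parts["dataset"], parts["language"], parts["split"])
print(lookup["xlsum_english-validation-6416"]["text"])
```

Parsing the ID first tells you which dataset, language, and split to load, so only the relevant articles need to be indexed.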

The wikilingua article ids come from a previous version of the GEM dataset and should be retrieved using TensorFlow datasets. Here's an example of how to load the English wikilingua dataset into a dataframe:

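import tensorflow_datasets as tfds

# Language configuration and original split of the wikilingua data to load.
lang = 'english_en'
orig_split = 'validation'

# Load the split through TFDS, then convert it to a pandas DataFrame.
ds, info = tfds.load(f'huggingface:gem/wiki_lingua_{lang}', split=orig_split, with_info=True)
hfdf = tfds.as_dataframe(ds, info)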

Leaderboard

We are maintaining a leaderboard with official results on our test set.

Please do not incorporate any part of the Seahorse validation set into your training data; use it only for validation and hyperparameter tuning, as development sets are typically used.

We report results on two metrics: Pearson correlation ($\rho$) and area under the ROC curve (roc).

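| Model | Link | Q1 $\rho$ | Q1 roc | Q2 $\rho$ | Q2 roc | Q3 $\rho$ | Q3 roc | Q4 $\rho$ | Q4 roc | Q5 $\rho$ | Q5 roc | Q6 $\rho$ | Q6 roc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| mT5-seahorse | [Clark et al. 2023] | 0.52 | 0.90 | 0.86 | 0.98 | 0.45 | 0.84 | 0.59 | 0.85 | 0.50 | 0.80 | 0.52 | 0.81 |
| mT5-XNLI | [Honovich et al. 2022, Conneau et al. 2018] | - | - | - | - | - | - | 0.43 | 0.78 | - | - | - | - |
| ROUGE-L | [Lin et al. 2004] | 0.04 | 0.54 | 0.06 | 0.54 | -0.03 | 0.43 | 0.13 | 0.55 | 0.03 | 0.54 | 0.02 | 0.54 |
| Majority Class | - | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 | - | 0.5 |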
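
The two metrics reported above can be computed along the following lines. This is a dependency-free sketch, not the official evaluation script: the function names are hypothetical, and encoding the human ratings as 1 for Yes and 0 for No is an assumption. It also shows why a constant predictor such as the majority-class baseline has no Pearson correlation (its predictions have zero variance) while its area under the ROC curve is 0.5.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined for constant predictions
    return cov / math.sqrt(var_x * var_y)


def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC.")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: human ratings (Yes=1, No=0, an assumed encoding) and metric scores.
human = [1, 1, 0, 1, 0]
metric_scores = [0.9, 0.8, 0.3, 0.6, 0.7]
print(pearson(human, metric_scores), roc_auc(human, metric_scores))

# A constant predictor, as in a majority-class baseline.
constant = [1.0] * len(human)
print(pearson(human, constant), roc_auc(human, constant))
```

The pairwise AUC is quadratic in the number of examples, which is fine for a sanity check; a rank-based implementation or an established library would be preferable at test-set scale.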

Leaderboard Submission

If you want to submit to the leaderboard, please send an email to the contact email below with your results.

Contact

Please email eaclark@google.com if you have any questions about the dataset.
