SPIRAL (Speech Information Retrieval And Lookup) is a dataset designed to test speech language models' ability to process long spoken inputs.
SPIRAL consists of spoken lectures and conversations with corresponding transcripts, questions, and metadata. The dataset is specifically crafted to evaluate models' comprehension and information retrieval capabilities from extended audio content.
The WAV files can be downloaded from this URL. The dataset is organized as follows:
```
spiral/
├── wavs/         # Directory containing the main audio files
├── data.jsonl    # Main dataset annotations
└── data_h.jsonl  # Hard subset of the dataset
```
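The annotation files are plain JSON Lines, so they can be read with the Python standard library alone. A minimal loading sketch, assuming the dataset has been extracted to a local `spiral/` directory laid out as above:

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, one object per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

root = Path("spiral")                          # adjust to wherever the dataset lives
data = load_jsonl(root / "data.jsonl")         # full dataset
data_hard = load_jsonl(root / "data_h.jsonl")  # hard subset

print(f"{len(data)} entries, {len(data_hard)} in the hard subset")
```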
Each entry in the JSONL files contains:
- Transcript: speaker-attributed text segments, with multiple utterances per speaker and speech disfluencies (e.g., "uh") preserved
- Question: the question text, multiple-choice options (A-D), and the correct answer
- Metadata: main topic, subtopic, transcript type, and a unique identifier
- Audio information: references to audio files, key sentence timestamps, speaker prompts used, and audio file paths
In summary, each entry provides the following (a quick schema-inspection sketch follows the list):
- Complete transcript with speaker attribution
- A key sentence selected from the transcript
- A test question related to the content
- Topic metadata
- Audio file references and paths
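The exact JSON key names are not listed here, so rather than guessing them it is easiest to inspect one decoded entry and match its fields to the descriptions above. This sketch reuses `data` from the loading example earlier:

```python
# Print every top-level key of the first entry with a short value preview,
# so the real field names can be mapped to the descriptions above.
entry = data[0]
for key, value in entry.items():
    preview = str(value).replace("\n", " ")
    print(f"{key}: {preview[:80]}{'...' if len(preview) > 80 else ''}")
```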
This dataset can be used for:
- Training and evaluating Speech LLMs in the context of long-form audio
- Testing information retrieval from long-form audio
- Evaluating question-answering capabilities (see the scoring sketch below)
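As a rough illustration of the question-answering use case, the sketch below scores predicted choice letters against the labeled answers. The `predict_answer` stub and the `"answer"` key name are assumptions for illustration, not the documented schema or any particular model interface; substitute the real field name and your own speech LLM:

```python
def predict_answer(entry):
    """Placeholder for a speech LLM: return a choice letter such as 'A'."""
    # e.g., pass the referenced WAV file plus the question and options to your model
    return "A"

def accuracy(entries):
    """Fraction of entries whose predicted letter matches the labeled answer."""
    correct = sum(1 for e in entries if predict_answer(e) == e.get("answer"))
    return correct / len(entries) if entries else 0.0

print(f"Accuracy on data.jsonl:   {accuracy(data):.3f}")
print(f"Accuracy on data_h.jsonl: {accuracy(data_hard):.3f}")
```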
If you use this dataset, please cite it as follows:
```bibtex
@inproceedings{lin2025speechprune,
  title     = {SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval},
  author    = {Lin, Yueqian and Fu, Yuzhe and Zhang, Jingyang and Liu, Yudong and Zhang, Jianyi and Sun, Jingwei and Li, Hai and Chen, Yiran},
  booktitle = {2025 IEEE International Conference on Multimedia and Expo (ICME)},
  year      = {2025}
}
```