SPIRAL (Speech Information Retrieval And Lookup) is a dataset designed to test speech language models' ability to process long spoken inputs.
SPIRAL consists of spoken lectures and conversations with corresponding transcripts, questions, and metadata. The dataset is specifically crafted to evaluate models' comprehension and information retrieval capabilities from extended audio content.
The WAV files can be downloaded from this URL. The dataset is organized as follows:
```
spiral/
├── wavs/         # Directory containing the main audio files
├── data.jsonl    # Main dataset annotations
└── data_h.jsonl  # Hard subset of the dataset
```
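The annotation files are plain JSON Lines, so they can be read with the Python standard library alone. A minimal loading sketch, assuming the dataset has been extracted to a local `spiral/` directory laid out as above:

```python
import json
from pathlib import Path

def load_jsonl(path):
    """Read a JSON Lines file into a list of dicts, one object per line."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

root = Path("spiral")                          # adjust to wherever the dataset lives
data = load_jsonl(root / "data.jsonl")         # full dataset
data_hard = load_jsonl(root / "data_h.jsonl")  # hard subset

print(f"{len(data)} entries, {len(data_hard)} in the hard subset")
```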
Each entry in the JSONL files contains:
- Transcript: speaker-attributed text segments, with multiple utterances per speaker and speech disfluencies (e.g., "uh") preserved
- Question: the question text, multiple-choice options (A-D), and the correct answer
- Metadata: main topic, subtopic, transcript type, and a unique identifier
- Audio information: references to audio files, key sentence timestamps, speaker prompts used, and audio file paths
In summary, each entry provides the following (a quick schema-inspection sketch follows the list):
- Complete transcript with speaker attribution
- A key sentence selected from the transcript
- A test question related to the content
- Topic metadata
- Audio file references and paths
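The exact JSON key names are not listed here, so rather than guessing them it is easiest to inspect one decoded entry and match its fields to the descriptions above. This sketch reuses `data` from the loading example earlier:

```python
# Print every top-level key of the first entry with a short value preview,
# so the real field names can be mapped to the descriptions above.
entry = data[0]
for key, value in entry.items():
    preview = str(value).replace("\n", " ")
    print(f"{key}: {preview[:80]}{'...' if len(preview) > 80 else ''}")
```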
This dataset can be used for:
- Training and evaluating Speech LLMs in the context of long-form audio
- Testing information retrieval from long-form audio
- Evaluating question-answering capabilities (see the scoring sketch below)
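As a rough illustration of the question-answering use case, the sketch below scores predicted choice letters against the labeled answers. The `predict_answer` stub and the `"answer"` key name are assumptions for illustration, not the documented schema or any particular model interface; substitute the real field name and your own speech LLM:

```python
def predict_answer(entry):
    """Placeholder for a speech LLM: return a choice letter such as 'A'."""
    # e.g., pass the referenced WAV file plus the question and options to your model
    return "A"

def accuracy(entries):
    """Fraction of entries whose predicted letter matches the labeled answer."""
    correct = sum(1 for e in entries if predict_answer(e) == e.get("answer"))
    return correct / len(entries) if entries else 0.0

print(f"Accuracy on data.jsonl:   {accuracy(data):.3f}")
print(f"Accuracy on data_h.jsonl: {accuracy(data_hard):.3f}")
```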
If you use this dataset, please cite it as follows:
```bibtex
@inproceedings{lin2025speechprune,
  title     = {SpeechPrune: Context-aware Token Pruning for Speech Information Retrieval},
  author    = {Lin, Yueqian and Fu, Yuzhe and Zhang, Jingyang and Liu, Yudong and Zhang, Jianyi and Sun, Jingwei and Li, Hai and Chen, Yiran},
  booktitle = {2025 IEEE International Conference on Multimedia and Expo (ICME)},
  year      = {2025}
}
```