This tutorial explains the preprocessing steps applied to EEG data in the studies "ZuCo: A Simultaneous EEG and Eye-Tracking Resource for Natural Sentence Reading" and "ZuCo 2.0: A dataset of physiological recordings during natural reading and annotation".
- ZuCo v.1 Dataset
- ZuCo v.2 Dataset
- Eye-tracking Preprocessing and Feature Extraction
- EEG Acquisition
- EEG Preprocessing and Feature Extraction
- Dataset Links
- Citation
ZuCo v.1 is a dataset combining EEG and eye-tracking recordings from subjects reading natural sentences. Eye-tracking makes it possible to mark the exact boundaries of each word as a subject reads a sentence, which in turn allows precise extraction of the corresponding EEG signals for every word.
- Subjects: 12 healthy adult native speakers
- Study Design: Schematic overview of the three tasks in the study design (Source)
- Reading Materials: The reading materials contain sentences from movie reviews from the Stanford Sentiment Treebank and biographical sentences about notable people from the Wikipedia relation extraction corpus.
- Sentences from the Stanford Sentiment Treebank (task: sentiment reading (SR)): 123 neutral, 137 negative, and 140 positive sentences. Total sentences: 400
- Sentences from the Wikipedia relation extraction dataset for the Normal Reading (NR) task: 300
- Sentences from the Wikipedia relation extraction dataset for the task-specific relation task (TSR): 407
- Procedure: The sentences were presented to the subjects in a naturalistic reading scenario, where the complete sentence was given on the screen and the subjects read each sentence at their own speed.
ZuCo v.2 is an extended dataset of ZuCo v.1 with more sentences and more subjects.
- Subjects: 18 healthy adult native speakers
- Tasks:
- Normal reading (NR): Participants read the sentences naturally, without any specific task other than comprehension
- Task-specific reading paradigm: Participants determine whether a certain relation type occurred in the sentence
- Descriptive Statistics: Descriptive Statistics of the Reading Materials (Source)
- Dataset Overlap: There is an overlap between ZuCo v.1 and ZuCo v.2. 100 normal reading and 85 task-specific sentences recorded for this dataset were already recorded in version 1.
- Procedure: Same as ZuCo v.1 - naturalistic reading scenario with complete sentences presented on screen.
The EyeLink 1000 tracker processes eye-position data, identifying saccades, fixations, and blinks.
- Fixation: A fixation occurs when the eyes remain relatively still at a specific location. In the dataset, fixations are defined as the time periods without saccades.
- Saccades: A saccade is a rapid eye movement from one point of fixation to another.
- Gaze Duration (GD): The sum of all fixations on the current word in the first-pass reading before the eyes move out of the word
- Total Reading Time (TRT): The sum of all fixation durations on the current word, including regressions
- First Fixation Duration (FFD): The duration of the first fixation on the current word
- Single Fixation Duration (SFD): The duration of the first and only fixation on the current word. SFD only applies to words that are never refixated; if a word has multiple fixations, it does not have an SFD
- Go-past Time (GPT): The sum of all fixation durations from the first fixation on the current word until the eyes move past it to the right, including any regressions to earlier words
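The first-pass logic behind these measures can be sketched in a few lines of Python. The `(word_index, duration_ms)` fixation format and the function name are assumptions for this sketch, not the actual ZuCo extraction code:

```python
# Illustrative computation of the word-level reading measures above from a
# fixation sequence. Each fixation is a (word_index, duration_ms) pair,
# listed in the order the fixations occurred.

def reading_measures(fixations, word_idx):
    """Compute FFD, GD, SFD, TRT, and GPT for one word."""
    durs = [d for w, d in fixations if w == word_idx]
    if not durs:
        return None  # the word was skipped entirely

    # First-pass fixations: consecutive fixations on the word before the
    # eyes first leave it
    first_pass = []
    for w, d in fixations:
        if w == word_idx:
            first_pass.append(d)
        elif first_pass:
            break

    # Go-past Time: from the first fixation on the word until the eyes
    # move past it to the right, including regressions to earlier words
    gpt, started = 0, False
    for w, d in fixations:
        started = started or w == word_idx
        if started:
            if w > word_idx:
                break
            gpt += d

    return {
        "FFD": first_pass[0] if first_pass else None,  # First Fixation Duration
        "GD": sum(first_pass),                         # Gaze Duration
        "SFD": durs[0] if len(durs) == 1 else None,    # Single Fixation Duration
        "TRT": sum(durs),                              # Total Reading Time
        "GPT": gpt,                                    # Go-past Time
    }

# Fixations on words 0, 1, 1, 2, then a regression back to word 1
fixations = [(0, 180), (1, 210), (1, 190), (2, 250), (1, 120)]
print(reading_measures(fixations, 1))
# {'FFD': 210, 'GD': 400, 'SFD': None, 'TRT': 520, 'GPT': 400}
```

Note how the regression back to word 1 counts toward TRT but not toward GD or GPT, and how the extra fixation means word 1 has no SFD.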
- System: 128-channel EEG Geodesic Hydrocel system (Electrical Geodesics, Eugene, Oregon)
- Sampling rate: The data was recorded at a sampling rate of 500 Hz with a bandpass of 0.1 to 100 Hz
- Recording Reference: All EEG channels were measured relative to the voltage at the Cz electrode (top center of the scalp)
- 105 EEG channels: Used for scalp recordings
- 9 EOG channels: Used to measure electrical activity generated by eye movements for artifact removal
- Discarded channels: The rest of the channels lying mainly on the neck and face were discarded before data analysis
- Bad electrode identification and replacement: An electrode was considered bad if:
  - Its recorded signal correlated less than 0.85 with an estimate derived from the remaining channels
  - Its line noise, relative to its signal, was more than 4 standard deviations above that of the other channels
  - It had a flatline longer than 5 seconds
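The three criteria can be sketched directly in NumPy/SciPy. This is an illustrative re-implementation under simplifying assumptions, not the pipeline used in the studies: a plain channel mean stands in for the more robust channel estimate used by automated bad-channel detectors, and the 45-55 Hz band is taken as the line-noise region.

```python
import numpy as np
from scipy.signal import welch

def flag_bad_channels(eeg, fs, corr_thresh=0.85, noise_sd=4.0, flatline_s=5.0):
    """Flag bad channels in an (n_channels, n_samples) EEG array."""
    n_ch = eeg.shape[0]
    bad = set()

    # Criterion 1: low correlation with an estimate from the remaining channels
    for ch in range(n_ch):
        estimate = np.delete(eeg, ch, axis=0).mean(axis=0)  # simple proxy estimate
        if np.corrcoef(eeg[ch], estimate)[0, 1] < corr_thresh:
            bad.add(ch)

    # Criterion 2: line-noise-to-signal ratio more than 4 SD above the others
    freqs, psd = welch(eeg, fs=fs, nperseg=min(1024, eeg.shape[1]))
    line = (freqs >= 45) & (freqs <= 55)
    ratio = psd[:, line].sum(axis=1) / (psd.sum(axis=1) + 1e-12)
    z = (ratio - ratio.mean()) / (ratio.std() + 1e-12)
    bad |= set(np.where(z > noise_sd)[0].tolist())

    # Criterion 3: a flatline longer than 5 seconds
    for ch in range(n_ch):
        run = longest = 0
        for is_flat in np.abs(np.diff(eeg[ch])) < 1e-10:
            run = run + 1 if is_flat else 0
            longest = max(longest, run)
        if longest / fs > flatline_s:
            bad.add(ch)

    return sorted(bad)
```

For example, a channel that records a constant value for a whole 10-second segment is caught by the flatline criterion.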
- Filtering: EEG data were high-pass filtered at 0.5 Hz and notch filtered (49-51 Hz) with a Hamming-windowed sinc finite impulse response zero-phase filter
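The same kind of filter can be sketched with SciPy's `firwin` (which windows a sinc kernel, Hamming by default) and applied forwards and backwards with `filtfilt` for zero phase. The filter length below is an assumption for this sketch; toolboxes such as EEGLAB derive it from the transition bandwidth.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

FS = 500.0    # ZuCo EEG sampling rate in Hz
NTAPS = 2001  # filter length (assumption; long enough for a narrow transition band)

# Hamming-windowed sinc FIR filters: 0.5 Hz high-pass and 49-51 Hz notch
highpass = firwin(NTAPS, 0.5, pass_zero=False, window="hamming", fs=FS)
notch = firwin(NTAPS, [49.0, 51.0], pass_zero=True, window="hamming", fs=FS)

def filter_eeg(signal):
    """Apply both filters forwards and backwards (zero-phase)."""
    out = filtfilt(highpass, 1.0, signal, axis=-1)
    return filtfilt(notch, 1.0, out, axis=-1)
```

Applying the filter in both directions cancels the FIR group delay, so the filtered samples stay aligned with the eye-tracking timestamps.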
- Artifact removal: Eye artifacts were removed by linearly regressing the EOG channels from the scalp EEG channels
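A minimal version of this regression step, assuming arrays of shape `(channels, samples)` (the function name is illustrative):

```python
import numpy as np

def regress_out_eog(eeg, eog):
    """Remove ocular activity from EEG by least-squares regression.

    eeg: (n_eeg_channels, n_samples); eog: (n_eog_channels, n_samples).
    """
    # Solve eeg.T ~= eog.T @ B for the EOG-to-EEG propagation weights B,
    # then subtract the predicted ocular contribution from every channel.
    B, *_ = np.linalg.lstsq(eog.T, eeg.T, rcond=None)
    return eeg - (eog.T @ B).T
```

Because the weights are estimated per channel, electrodes near the eyes (which pick up more ocular signal) get a larger correction than posterior electrodes.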
- Automatic artifact rejection: The Multiple Artifact Rejection Algorithm (MARA) was used for automatic rejection of artifacts
- Electrode interpolation: Bad electrodes were interpolated using spherical spline interpolation
- Final quality check: After automatic scanning, noisy channels were selected by visual inspection and interpolated
Oscillatory power in different frequency bands refers to the magnitude of rhythmic neural activity within specific frequency ranges of brain signals. Neural oscillations are repetitive patterns of neural activity measurable across frequency bands. Each band is associated with a different cognitive or physiological state.
- Theta 1 (4-6 Hz) and Theta 2 (6.5-8 Hz): Linked to creativity, intuition, daydreaming, and fantasizing; also regarded as a repository for memories, emotions, and sensations
- Alpha 1 (8.5-10 Hz) and Alpha 2 (10.5-13 Hz): Linked to attention, mental imagery, and perception
- Beta 1 (13.5-18 Hz) and Beta 2 (18.5-30 Hz): Linked to cognitive-task engagement
- Gamma 1 (30.5-40 Hz) and Gamma 2 (40-49.5 Hz): Linked to higher cognitive functions, such as attention, memory encoding, sensory perception, and emotion integration
Oscillatory power measures were computed by band-pass filtering the continuous EEG signals over the full task period for each of the eight frequency bands listed above, resulting in one band-limited time series per band.
A Hilbert transform was then applied to each of these band-limited time series. The Hilbert transform preserves the temporal information of each band's amplitude; this temporal resolution is important because the EEG features must be aligned with the time windows defined by the eye-tracking fixations.
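The band-filter-then-Hilbert step can be sketched as follows. The band edges follow the list above; the Butterworth filter and the helper names are assumptions for this sketch, not the released preprocessing code:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

# Frequency bands as listed above (Hz)
BANDS = {
    "theta1": (4.0, 6.0),   "theta2": (6.5, 8.0),
    "alpha1": (8.5, 10.0),  "alpha2": (10.5, 13.0),
    "beta1": (13.5, 18.0),  "beta2": (18.5, 30.0),
    "gamma1": (30.5, 40.0), "gamma2": (40.0, 49.5),
}

def band_envelopes(signal, fs):
    """Band-pass filter the signal, then take the Hilbert amplitude envelope."""
    envs = {}
    for name, (lo, hi) in BANDS.items():
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        filtered = sosfiltfilt(sos, signal)       # zero-phase band-pass
        envs[name] = np.abs(hilbert(filtered))    # instantaneous amplitude over time
    return envs

def fixation_power(envelope, fs, onset_s, offset_s):
    """Mean envelope amplitude inside one eye-tracking fixation window."""
    i, j = int(onset_s * fs), int(offset_s * fs)
    return envelope[i:j].mean()
```

Because the envelope is a full-resolution time series, slicing it with fixation onsets and offsets yields one power value per word and band, which is exactly the alignment the text describes.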
- Download ZuCo v1.0 'Matlab files' for 'task1-SR', 'task2-NR', and 'task3-TSR' from https://osf.io/q3zws/files/ under the 'OSF Storage' root, unzip them, and move all `.mat` files to `~/datasets/ZuCo/task1-SR/Matlab_files`, `~/datasets/ZuCo/task2-NR/Matlab_files`, and `~/datasets/ZuCo/task3-TSR/Matlab_files`, respectively.
- Download ZuCo v2.0 'Matlab files' for 'task1-NR' from https://osf.io/2urht/files/ under the 'OSF Storage' root, unzip them, and move all `.mat` files to `~/datasets/ZuCo/task2-NR-2.0/Matlab_files`.
- The Jupyter notebook `construct_dataset_v1.ipynb` provides a detailed explanation of the data being loaded.
- To automatically load data from ZuCo v1 and ZuCo v2, simply run the scripts `load_data_v1.py` and `load_data_v2.py`. The main arguments to specify are:
  - `data_dir`: Path to the ZuCo dataset directory (local location).
  - `save_data_dir`: Location where the extracted EEG features from both datasets will be saved.
The Python scripts were created based on this GitHub repository.
- ZuCo v.1: Nature Scientific Data
- ZuCo v.2: arXiv preprint
If you use this dataset in your research, please cite:
Hollenstein, N., Rotsztejn, J., Troendle, M., Pedroni, A., Zhang, C., & Langer, N. (2018).
ZuCo, a simultaneous EEG and eye-tracking resource for natural sentence reading.
Scientific Data, 5, 180291.
Hollenstein, N., de la Torre, M., Langer, N., & Zhang, C. (2019).
ZuCo 2.0: A dataset of physiological recordings during natural reading and annotation.
arXiv preprint arXiv:1912.00903.
Please refer to the original dataset publications for licensing information.
For questions about the dataset, please refer to the contact information provided in the original publications.

