The ParlaTO corpus is part of the larger KIParla collection, which can be freely queried through the NoSketch Engine interface.
The ParlaTO corpus was funded by the CRT Foundation ("ParlaTO - Corpus del Parlato di Torino" project).
It consists of about 50 hours of interactions collected in Turin and its province through semi-structured interviews. The interviews, conducted between 2018 and 2020, involved 88 speakers with different origins, ages, education levels, and types of occupation, and addressed personal life experiences in the city (study, work, leisure activities, retirement, memories of the past, etc.).
The transcriptions have been anonymized.
Overall, the module is made up of 68 conversations and includes 100 speakers.
This repository contains:
- metadata for both speakers and conversations, in the `metadata` subfolder (see metadata section below)
- descriptions of the set of transcription conventions used for this module (Transcription conventions)
For each conversation you will find:
- `.eaf` file in `eaf/` folder: time-aligned Jefferson-style transcriptions (open with ELAN).
- `.txt` file in `linear-jefferson/` folder: linearized Jefferson-style transcription.
- `.txt` file in `linear-orthographic/` folder: linearized transcription retaining only orthographic words.
- `.tsv` file in `tsv/` folder: verticalized version of the transcription, with Jefferson-style information decoupled from the text as features. See verticalized-content for more information.
Linear files in linear-jefferson/ and linear-orthographic/ contain one Transcription Unit (TU) per line. Each line has two columns: the first is the speaker code, and the second is the transcription. TUs are sorted by their start time.
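As a rough illustration of the two-column layout described above, the following sketch reads a linear file into `(speaker, transcription)` pairs. It assumes the two columns are separated by a tab, which is not stated explicitly here; the speaker codes and text in the sample are made up.

```python
import io
from typing import Iterable, Iterator, Tuple


def read_linear(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (speaker_code, transcription) pairs, one per Transcription Unit."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines, if any
        # Assumption: the two columns are tab-separated.
        speaker, _, text = line.partition("\t")
        yield speaker, text


# Hypothetical sample content; real files live in linear-jefferson/ or linear-orthographic/.
sample = io.StringIO("AB123\tbuongiorno\nCD456\tciao come va\n")
units = list(read_linear(sample))
```

With a real file, the same function can be fed an open file handle, for example `open(path, encoding="utf-8")`.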
Each participant and each conversation is associated with a set of metadata, which can be found in the `metadata/participants.tsv` and `metadata/conversations.tsv` files.
Metadata is to be interpreted as follows:
- Participants metadata:
  - `code`: unique anonymized 5-character identifier for each participant. Unknown, occasional participants in conversations are associated with a special `???` code.
  - `gender`: either `M` for masculine or `F` for feminine
  - `age-range`: 5-year range including the participant's age.
  - `birth-region`: Italian region (see footnote 1) where the participant was born. If outside Italy, the label `estero` is used.
  - `occupation`: occupation of the participant, according to ISTAT categories. For more information see occupation label
  - `study-level`: highest completed level of education (see footnote 2)
  - Additionally, `metadata/participants.tsv` also contains a `conversations` column that summarizes the conversations in which the participant appears.
- Conversations metadata:
  - `code`: unique identifier for the conversation
  - `type`: type of interaction; for this module all conversations are `semistructured-interview`
  - `duration`: duration of the conversation, expressed in `hh:mm:ss` format
  - `participants-number`: number of participants in the conversation
  - `languages`: languages spoken in the conversation, can be either `italian` or `dialect`, or both.
  - `participants-relationship`: relation, either symmetric or asymmetric
  - `moderator`: presence of a moderator
  - `topic`: fixed for this module
  - `year`: year of collection
  - `collection-point`: two-letter code of the collection area: `TO` for Turin for this module.
  - Additionally, `metadata/conversations.tsv` also contains a `participants` field that recaps the codes of the participants in that conversation
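The metadata tables can be read with any TSV reader. The sketch below uses Python's standard `csv` module and converts the `duration` field (`hh:mm:ss`) to seconds. It assumes the files have a header row whose column names match the field names listed above; the conversation codes and values in the sample are invented for illustration.

```python
import csv
import io


def duration_seconds(hhmmss: str) -> int:
    """Convert an hh:mm:ss string to a number of seconds."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def load_tsv(handle):
    """Read a tab-separated file with a header row into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Hypothetical excerpt; a real call would open metadata/conversations.tsv instead.
sample = io.StringIO(
    "code\tduration\tyear\n"
    "CONV1\t00:45:30\t2019\n"
    "CONV2\t01:02:03\t2020\n"
)
rows = load_tsv(sample)
total = sum(duration_seconds(row["duration"]) for row in rows)
```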
Conversations are also available in a vertical, pseudo-tokenized version in tsv/.
Tokenization is obtained by validating the Jefferson transcription using custom tools and splitting on token boundaries: whitespaces, prosodic links (=), and apostrophes used for elision in Italian orthography. Each transcription-derived token is then documented on one row.
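The splitting rule can be approximated with a short regular expression. This is only an illustrative sketch, not the custom tool used to build the corpus: it handles plain text with straight apostrophes, keeps the elision apostrophe attached to the preceding token (as in `l'`), and ignores all other Jefferson symbols. The function name and the sample unit are hypothetical.

```python
import re

# A token is a run of characters that are not whitespace, "=" or "'",
# optionally followed by one apostrophe that stays attached to it.
TOKEN_RE = re.compile(r"[^\s=']+'?")


def pseudo_tokenize(unit: str):
    """Split a transcription unit into (token, features) pairs."""
    tokens = []
    for match in TOKEN_RE.finditer(unit):
        feats = []
        following = unit[match.end():match.end() + 1]
        if following == "=":
            feats.append("ProsodicLink=Yes")  # token is linked to the next one
        elif following and not following.isspace():
            feats.append("SpaceAfter=No")  # e.g. elided article before a noun
        tokens.append((match.group(), feats))
    return tokens


tokens = pseudo_tokenize("l'anno scorso=era")
```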
Each token is represented as 13 columns, as follows:
- `token_id`: unique token identifier within the conversation
- `speaker`: speaker `code` as it can be found in `metadata/participants.tsv`
- `tu_id`: progressive identifier assigned to transcription units
- `span`: portion of the original Jefferson transcription containing the token
- `form`: orthographic form of the token. This differs from the `span` as special symbols are stripped out and represented as `jefferson_feats`. Moreover, short pauses (`(.)` in the transcription) are represented as `[PAUSE]` and unintelligible tokens (sequences of `x` in the transcription) are represented as `x`
- `type`: one of
  - `linguistic`: everything that is considered to be a content linguistic token
  - `nonverbalbehavior`: used for transcribed nonverbal behaviors, such as laughing or sighing
  - `shortpause`: identifies pauses
  - `unknown`: identifies unintelligible spans in the transcription
  - `error`: a residual class marking cases where the transcription is not well formed according to the Jefferson format. The token is therefore not analyzed, and the transcription will be corrected in future releases.
- `variation`: encodes whether the transcription unit includes code mixing with dialects. In this case, all tokens in the unit have `some` as a value, otherwise `none` is used.
- `jefferson_feats`: collects a list of word-level features derived from the transcription in Jefferson format. More specifically:
  - `SpaceAfter=No`: no whitespace between this token and the next (e.g., `l'` in `l'anno`)
  - `ProsodicLink=Yes`: a prosodic link (`=`) to the following token
  - `Intonation`: can assume values `Falling`, `Rising` or `WeaklyRising` and translates the word-final punctuation sign in Jefferson transcriptions (i.e., `.`, `?` and `,` respectively)
  - `Interrupted=Yes`: words interrupted in speech, transcribed with a final `~`
  - `Truncated=Yes`: truncated forms (e.g., `anda'` for `andare`, common in some Italian varieties)
  - `Volume`: can assume values `High` or `Low` and translates Jefferson's uppercase and `°` respectively
- `align`: alignment features for the first and last token of each TU, through `AlignBegin` and `AlignEnd` features expressed in milliseconds
- `prolongations`: positions of sound prolongations (colons `:`) within the word, encoded as a comma-separated list of `<char_id>x<count>` pairs
  - `char_id` is the zero-based index of the character in the token's orthographic form.
  - `count` is the number of consecutive colons immediately following that character in the original span
  - Example: for span = `ese::mpio:`, its orthographic form is `esempio` and the prolongations field would assume value `2x2,6x1` (the 3rd letter `e` has 2 colons; the 7th letter `o` has 1 colon).
- `pace`: marks whether the token participates in a fast- or slow-paced span within the word.
  - Format: `Fast=<char_id_start>-<char_id_end>` or `Slow=<char_id_start>-<char_id_end>`
  - Indices are zero-based, inclusive, and refer to character positions in `form`
- `guesses`: character span(s) transcribed as uncertain (i.e., in round brackets in the Jefferson transcription).
  - Format: `<char_id_start>-<char_id_end>` (zero-based, inclusive, over `form`)
- `overlaps`: comma-separated list of character spans participating in simultaneous speech, with an overlap group identifier.
  - Format: `<char_id_start>-<char_id_end>(<overlap_id>)`, where indices are zero-based indices over `form` (the start index is inclusive; in the example below the end index is exclusive) and `overlap_id` is the progressive number of the overlapping group within the TU
  - Example: the span `e[se]mp[i` would be encoded as `1-3(2),5-6(3)`, meaning that characters from position 1 (inclusive) to 3 (exclusive) participate in overlap group 2, while the last character (with id 5) participates in the third overlapping group of the transcription unit. When the overlap id could not be determined, a `?` is used.
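The positional encodings used by `prolongations` and `overlaps` can be decoded with a few lines of code. The sketch below is a minimal illustration based only on the formats and examples given above; the function names are hypothetical, an empty string is assumed to mean "no value", and overlap indices are returned exactly as written, without deciding whether the end index is inclusive or exclusive.

```python
import re
from typing import List, Tuple


def parse_prolongations(value: str) -> List[Tuple[int, int]]:
    """'2x2,6x1' -> [(2, 2), (6, 1)], i.e. (char_id, count) pairs."""
    if not value:  # assumption: empty string means no prolongations
        return []
    return [tuple(int(n) for n in item.split("x")) for item in value.split(",")]


def apply_prolongations(form: str, value: str) -> str:
    """Re-insert colons after the indexed characters of the orthographic form."""
    counts = dict(parse_prolongations(value))
    return "".join(ch + ":" * counts.get(i, 0) for i, ch in enumerate(form))


# start-end(id), where the id is a number or "?" when it could not be determined
OVERLAP_RE = re.compile(r"(\d+)-(\d+)\((\d+|\?)\)")


def parse_overlaps(value: str) -> List[Tuple[int, int, str]]:
    """'1-3(2),5-6(3)' -> [(1, 3, '2'), (5, 6, '3')]; the id is kept as a string."""
    return [(int(a), int(b), gid) for a, b, gid in OVERLAP_RE.findall(value)]


restored = apply_prolongations("esempio", "2x2,6x1")
groups = parse_overlaps("1-3(2),5-6(3)")
```

Here `restored` reproduces the span `ese::mpio:` from the prolongations example, and `groups` holds the two overlap spans from the overlaps example.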
Due to GDPR restrictions, pseudo-anonymized audio files (MP3) are available under a restricted-access license. To request access, please contact the corpus coordinators through the KIParla website and follow the provided procedure.
To cite this module please include
Cerruti, M., & Ballarè, S. (2020). ParlaTO: corpus del parlato di Torino. Bollettino dell’Atlante Linguistico Italiano (BALI), 44, 171–196
in your references
@article{Cerruti_ParlaTO_corpus_del_2020,
author = {Cerruti, Massimo and Ballarè, Silvia},
journal = {Bollettino dell’Atlante Linguistico Italiano (BALI)},
pages = {171--196},
title = {{ParlaTO: corpus del parlato di Torino}},
volume = {44},
year = {2020}
}

If you use the ParlaTO module in your research, please also reference this repository (commit/tag) in your data statement or appendix.
- 2025-10-07 v1.0.0
  - First release
- 2025-11-28 v1.1.0
  - Major fix: wrong speaker attribution in linear-jefferson and linear-orthographic
  - Minor fix: empty turns in linear-orthographic were removed
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Footnotes
1. `abruzzo`, `basilicata`, `calabria`, `campania`, `emilia-romagna`, `friuli-venezia-giulia`, `lazio`, `liguria`, `lombardia`, `marche`, `molise`, `piemonte`, `puglia`, `sardegna`, `sicilia`, `toscana`, `trentino-alto-adige`, `umbria`, `valle-d-aosta`, `veneto`
2. `elementary-school`, `liceo-diploma`, `middle-school`, `phd`, `technical-vocational-diploma`, `university-degree`, `university-degree-ongoing`