Methodological issues for spontaneous speech corpora compilation
The case of C-ORAL-BRASIL
Spontaneous speech corpus compilation has grown steadily over the past 20 years. This growth is due mainly to technological advances that allow highly accurate recording in vivo, new insights from empirically based linguistic theory, concern for the documentation of threatened languages, and the high relevance of findings to speech recognition applications. This paper discusses methodologies associated with spontaneous speech corpus compilation that shed light on aspects relevant to understanding linguistic phenomena specific to spoken language. The compilation process of C-ORAL-BRASIL I, a corpus of informal spontaneous Brazilian Portuguese speech, is used, among other examples, as the basis for the discussion.