This paper discusses key issues in the compilation of spoken language corpora in a computer-mediated communication (CMC) environment, using data from the Corpus of Academic Spoken English (CASE), a corpus of Skype conversations currently being compiled at Saarland University, Germany, in cooperation with European and US partners. Based on first findings, Skype is presented as a suitable tool for collecting informal spoken data. In addition, new recommendations concerning data compilation and transcription are put forward to supplement existing best practice as presented in Wynne (2005). We recommend preserving multimodal features during anonymisation and adding annotation elements as early as the transcription stage, in particular CMC-related discourse features and English as a Lingua Franca (ELF) features (e.g. non-standard language and code-switching), together with prosodic, paralinguistic, and non-verbal annotation. Additionally, we propose a layered corpus design in order to allow researchers to focus on specific annotation features.
(2013) Spoken Corpus Linguistics. From Monomodal to Multimodal. London: Routledge.
(1988) Variation across Speech and Writing. Cambridge: Cambridge University Press.
(2015) Negotiating Conversation Starts in the Corpus of Academic Spoken English (Unpublished MA thesis). Universität des Saarlandes, Saarbrücken, Germany.
ECAMM – Call Recorder for Mac
(2013) [Computer software]. Retrieved from [URL] (last accessed March 2016).
CASE – Corpus of Academic Spoken English
(Forthcoming) S. Diemer, M.-L. Brunner, C. Collet & S. Schmidt. Saarbrücken: Saarland University (Coordination) / Sofia: St Kliment Ohridski University / Forlì: University of Bologna-Forlì / Santiago: University of Santiago de Compostela / Helsinki: Helsinki University & Hanken School of Economics / Birmingham: Birmingham City University / Växjö: Linnaeus University / Louvain-la-Neuve: Université catholique de Louvain / Lyon: Université Lumière Lyon 2 / Boise: Boise State University. Retrieved from [URL] (last accessed March 2016).
(1994–2016) [Computer software]. Retrieved from [URL] (last accessed March 2016).
Conrad, S., & Mauranen, A.
(2003) The corpus of English as lingua franca in academic settings. TESOL Quarterly, 37(3), 513–527.
Dressler, R.A., & Kreuz, R.J.
(2000) Transcribing oral discourse: A survey and a model system. Discourse Processes, 29(1), 25–36.
(1993) Principles and contrasting systems of discourse transcription. In J.A. Edwards & M.D. Lampert (Eds.), Talking Data: Transcription and Coding in Discourse Research (pp. 3–32). Hillsdale: Lawrence Erlbaum Associates.
ELFA – The Corpus of English as a Lingua Franca in Academic Settings
(2008) A. Mauranen (Director). Retrieved from [URL] (last accessed February 2015).
(1996) The discursive accomplishment of normality: On ‘lingua franca’ English and conversation analysis. Journal of Pragmatics, 26(2), 237–259.
(2014) CASE XML Conversion Tool [Computer software]. Retrieved from [URL] (last accessed November 2015).
(1993) Topic introduction in English conversation. Transactions of the Philological Society, 91(2), 181–214.
Gibbon, D., Moore, R., & Winski, R.
(1998) Handbook of Standards and Resources for Spoken Language Systems 1: Spoken Language Systems and Corpus Design. Berlin, Germany: Mouton de Gruyter.
(2003) Laughter in Interaction. Cambridge: Cambridge University Press.
(1996) Phraseology in English Academic Writing: Some Implications for Language Learning and Dictionary Making. Tübingen: Niemeyer.
ICE Corpus annotation guidelines
(2009) Retrieved from [URL] (last accessed March 2016).
IFA Dialog Video Corpus
(2008) Retrieved from [URL] (last accessed March 2016).
Jefferson, G., Sacks, H., & Schegloff, E.A.
(1987) Notes on laughter in the pursuit of intimacy. In G. Button & J.R.E. Lee (Eds.), Talk and Social Organisation (pp. 152–205). Clevedon: Multilingual Matters.
Jenkins, J., Modiano, M., & Seidlhofer, B.
(2001) Euro-English. English Today, 17(4), 13–19.
Leech, G., Myers, G., & Thomas, J.
(Eds.) (1995) Spoken English on Computer. Harlow: Longman.
(2005) Adding linguistic annotation. In M. Wynne (Ed.), Developing Linguistic Corpora: A Guide to Good Practice (pp. 17–29). Oxford: Oxbow Books.
(Ed.) (2003) The Politics of English as a World Language. Amsterdam: Rodopi.
(1996) Englisch als Medium der interkulturellen Kommunikation. Untersuchungen zum non-native-/non-native Speaker-Diskurs. Frankfurt am Main: Peter Lang.
(2002) ICE mark-up manual for spoken texts. Retrieved from [URL] (last accessed 31 March 2016).
Põldvere, Nele, Johan Frid, Victoria Johansson & Carita Paradis
2021. Challenges of releasing audio material for spoken data: The case of the London-Lund Corpus 2. Research in Corpus Linguistics 9:1, pp. 35 ff.
Põldvere, Nele, Victoria Johansson & Carita Paradis
2021. On the London–Lund Corpus 2: design, challenges and innovations. English Language and Linguistics 25:3, pp. 459 ff.
Rühlemann, Christoph & Alexander Ptak
2023. Reaching beneath the tip of the iceberg: A guide to the Freiburg Multimodal Interaction Corpus. Open Linguistics 9:1
Steen, Francis F., Anders Hougaard, Jungseock Joo, Inés Olza, Cristóbal Pagán Cánovas, Anna Pleshakova, Soumya Ray, Peter Uhrig, Javier Valenzuela, Jacek Woźny & Mark Turner
2018. Toward an infrastructure for data-driven multimodal communication research. Linguistics Vanguard 4:1
[no author supplied]
2022. QUEST: Guidelines and Specifications for the Assessment of Audiovisual, Annotated Language Data [Working Papers in Corpus Linguistics and Digital Technologies: Analyses and Methodology, 8].
[no author supplied]
2022. List of Example Stand-alone Corpus Description Articles. In Designing and Evaluating Language Corpora, pp. 224 ff.
This list is based on CrossRef data as of 26 November 2023. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers.
Any errors therein should be reported to them.