This paper describes the construction of deeply annotated spoken dialogue corpora. To ensure maximum flexibility (in the degree of normalization, the types and formats of annotations, the possibilities for modifying and extending the corpus, and the use of the corpus for research questions not originally anticipated), we propose a multi-layer standoff architecture. We also take a closer look at the interoperability of tools and formats compatible with such an architecture. Free access to the corpus data through corpus queries, visualizations, and downloads, including documentation, metadata, and the original recordings, enables transparency, verifiability, and reproducibility of every step of interpretation throughout corpus construction and of any research findings obtained from this data.