Multi-word discourse markers and their corpus-driven identification
The case of MWDM extraction from the reference corpus of spoken Slovene
With expanding evidence on the formulaic nature of human communication, there is a growing need to extend discourse marker research to functionally analogous multi-word expressions (MWEs). In contrast to the common qualitative approaches to discourse marker identification in corpora, this paper presents a corpus-driven, semi-automatic approach to the identification of multi-word discourse markers (MWDMs) in the reference corpus of spoken Slovene. Using eight statistical measures, we identified 173 structurally fixed discourse-marking MWEs, distinguished by a high number of tokens, a large proportion of grammatical words and semantic heterogeneity. This is a significantly longer list than would have been obtained by manual inspection of smaller corpus samples. Although frequency-based methods produced satisfactory results, the best precision in MWDM identification was achieved using the t-score association measure, while the overall poor performance of the mutual information measure suggests its inadequacy for the extraction of MWDMs and other MWEs with similar lexical and distributional features.
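The contrast between the t-score and mutual information can be illustrated with a minimal Python sketch. It assumes the standard textbook bigram formulations, t = (O − E) / √O and MI = log2(O / E), with expected frequency E = f(x)·f(y) / N. The paper's own formulas for longer n-grams are not reproduced here, and the function names, candidate labels and counts below are invented purely for illustration.

```python
import math


def expected(f_x, f_y, n):
    """Expected co-occurrence frequency of x and y under independence."""
    return f_x * f_y / n


def t_score(o, f_x, f_y, n):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (o - expected(f_x, f_y, n)) / math.sqrt(o)


def mutual_information(o, f_x, f_y, n):
    """Pointwise mutual information: log2(observed / expected)."""
    return math.log2(o / expected(f_x, f_y, n))


# Hypothetical corpus size and counts (NOT taken from the GOS corpus).
N = 1_000_000
candidates = {
    # label: (observed bigram frequency, frequency of word 1, frequency of word 2)
    "frequent grammatical pair": (2000, 20000, 15000),
    "rare lexical pair": (10, 12, 15),
}

# Map each candidate to its (t-score, MI) pair.
scores = {
    label: (t_score(o, fx, fy, N), mutual_information(o, fx, fy, N))
    for label, (o, fx, fy) in candidates.items()
}

for label, (t, mi) in scores.items():
    print(f"{label}: t-score = {t:.2f}, MI = {mi:.2f}")
```

With these invented counts, the t-score ranks the frequent pair of grammatical words higher, whereas mutual information favours the rare pair. This matches the general tendency of MI to reward low-frequency combinations, which may help explain why it performs poorly for expressions composed largely of frequent grammatical words.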
Article outline
- 1. Introduction
- 2. Multi-word discourse markers
- 2.1 Related research on discourse-marking multi-word expressions
- 2.2 Multi-word discourse markers in this study
- 3. Statistical methods for MWE identification in corpora
- 4. Aims, data and methodology
- 4.1 The GOS corpus
- 4.2 N-gram extraction
- 4.3 N-gram ranking
- 4.3.1 Selected frequency-based measures
- 4.3.2 Selected association-based measures
- 4.3.3 Comparability of selected measures
- 4.4 MWDM identification
- 4.5 MWDM classification
- 5. Results
- 5.1 Features of identified MWDMs
- 5.2 Comparison of statistical methods
- 5.3 Comparison of statistical and manual methods
- 6. Discussion and conclusions
- Acknowledgements
- Notes
- References