Association measures for collocation extraction
Automatic evaluation on a large-scale corpus
In this study, we propose a new evaluation scheme to assess the strengths and limitations of collocation extraction measures and explore type-sensitive methods for extracting collocations. We introduce the pooling strategy widely used in Information Retrieval and automate the evaluation process using online dictionaries. Sixteen well-known metrics are evaluated for effectiveness and then compared in terms of their distributional and linguistic characteristics. The results show that Group A methods (e.g. z-score, Dice, PMI) are more effective at extracting low-frequency collocations at relatively small extraction scales. In contrast, Group B methods (e.g. t-test, LMI, LLR) perform better at finding high-frequency collocations, and most of them outperform Group A methods as the extraction scale increases. Moreover, Group A favors NN collocations, while Group B identifies collocations with a wide range of syntactic structures. This study offers suggestions for research on hybrid extraction methods as well as for language educators and dictionary compilers.
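As a rough illustration of why the two groups of measures can behave differently, the sketch below scores a single word pair with six of the measures named above. It uses common textbook definitions based on the observed co-occurrence count and the count expected under independence; these may differ from the exact variants used in the study. The function name `association_scores` and the example counts are hypothetical.

```python
import math

def association_scores(o11, f1, f2, n):
    """Score one word pair (illustrative sketch, not the study's code).

    o11: co-occurrence count of the pair (must be > 0)
    f1, f2: marginal frequencies of the two words
    n: total number of pair tokens in the sample
    """
    # Observed cells of the 2x2 contingency table.
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    # Expected cells if the two words were independent.
    e11 = f1 * f2 / n
    expected = [
        e11,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Log-likelihood ratio; cells with zero observations contribute nothing.
    llr = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return {
        # Measures the abstract places in Group A.
        "pmi": math.log2(o11 / e11),
        "z_score": (o11 - e11) / math.sqrt(e11),
        "dice": 2 * o11 / (f1 + f2),
        # Measures the abstract places in Group B.
        "t_score": (o11 - e11) / math.sqrt(o11),
        "lmi": o11 * math.log2(o11 / e11),
        "llr": llr,
    }

# Hypothetical counts: a rare but exclusive pair and a frequent, looser pair.
rare = association_scores(o11=5, f1=6, f2=7, n=1_000_000)
frequent = association_scores(o11=2000, f1=20_000, f2=30_000, n=1_000_000)

for name in rare:
    print(f"{name:8s} rare={rare[name]:12.3f} frequent={frequent[name]:12.3f}")
```

With these made-up counts, PMI, z-score, and Dice rank the rare pair higher, while t-score, LMI, and LLR rank the frequent pair higher, which matches the frequency preferences reported above.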
Article outline
- 1. Introduction
- 2. Approaches to collocation extraction
- 3. Methodology
- 3.1 Collocation extraction and validation
- 3.2 Evaluation of the extraction methods
- 4. Results
- 4.1 General performance
- 4.2 Intersection of the metrics
- 4.3 Frequency distribution
- 4.4 Syntactic structure
- 5. Discussion
- 6. Conclusions
- Notes
- References