A flexible framework for collocation retrieval and translation from parallel and comparable corpora
This paper outlines a methodology and a system for collocation retrieval and translation from parallel and comparable
corpora, developed with translators and language learners in mind. The approach is based on a phraseology framework,
applies statistical techniques, and employs open-source tools and online resources. Collocation retrieval and
translation have proved successful for English and Spanish, and the approach can be easily adapted to other languages.
The evaluation results are promising, and future goals are proposed. Furthermore, conclusions are drawn on the nature
of comparable corpora and how they can be better exploited to suit the particular needs of target users.
Article outline
- 1. Introduction
- 2. Phraseology
- 2.1 Typologies of collocations
- 2.2 Transfer rules
- 3. Related work
- 3.1 Collocation retrieval
- 3.2 Parallel corpora
- 3.3 Comparable corpora
- 4. System
- 4.1 Candidate selection module
- 4.2 Candidate filtering module
- 4.3 Dictionary look-up module
- 4.4 Parallel corpora module
- 4.5 Comparable corpora module
- 5. Evaluation
- 5.1 Experimental setup
- 5.2 Experimental results
- 5.3 Discussion and future work
- Acknowledgements
- Notes
- References