Aligning verb + noun collocations to improve a French-Romanian FSMT system

Todirascu, Amalia; Navlea, Mirabela

doi:10.1075/cilt.341.04tod

Part of

Multiword Units in Machine Translation and Translation Technology
Edited by Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan
[Current Issues in Linguistic Theory 341] 2018
pp. 81–100

We present several methods for integrating Verb + Noun collocations, based on linguistic information, with the aim of improving the results of a French-Romanian factored statistical machine translation (FSMT) system. The system uses lemmatised, tagged and sentence-aligned legal parallel corpora. Verb + Noun collocations are frequent word associations, sometimes discontinuous, related by syntactic links and with a non-compositional sense (Gledhill, 2007). Our first strategy extracts collocations from monolingual corpora, using a hybrid method which combines morphosyntactic properties and frequency criteria. The second method applies a bilingual collocation dictionary to identify collocations. Both methods transform collocations into single tokens before alignment. The third method applies a specific alignment algorithm for collocations. We evaluate the influence of these collocation alignment methods on the results of the lexical alignment and of the FSMT system.
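The single-token transformation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `merge_collocations`, the coarse tags `V`, `N` and `D`, the `max_gap` window and the choice to fuse the whole span (including intervening words) are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Hypothetical token layout: (word form, lemma, coarse POS tag).
Token = Tuple[str, str, str]


def merge_collocations(
    sentence: List[Token],
    collocations: Set[Tuple[str, str]],
    max_gap: int = 3,
) -> List[str]:
    """Fuse known verb + noun pairs into single underscore-joined tokens.

    `collocations` holds (verb lemma, noun lemma) pairs. A noun is searched
    for in the `max_gap` tokens following each verb, so a discontinuous
    collocation such as verb + determiner + noun is also caught.
    """
    merged: List[str] = []
    consumed: Set[int] = set()  # positions already absorbed into a fused token
    for i, (form, lemma, pos) in enumerate(sentence):
        if i in consumed:
            continue
        if pos == "V":
            window_end = min(i + 1 + max_gap, len(sentence))
            for j in range(i + 1, window_end):
                _, noun_lemma, noun_pos = sentence[j]
                if noun_pos == "N" and (lemma, noun_lemma) in collocations:
                    # Assumption: the whole span, gap words included, becomes one token.
                    merged.append("_".join(tok[0] for tok in sentence[i : j + 1]))
                    consumed.update(range(i, j + 1))
                    break
            else:
                merged.append(form)  # verb without a listed noun partner
        else:
            merged.append(form)
    return merged


if __name__ == "__main__":
    sentence = [
        ("La", "le", "D"),
        ("Commission", "commission", "N"),
        ("prend", "prendre", "V"),
        ("une", "un", "D"),
        ("décision", "décision", "N"),
    ]
    print(merge_collocations(sentence, {("prendre", "décision")}))
```

Matching on lemmas rather than word forms is what lets one dictionary entry cover every inflected variant of the verb, which is why the sketch assumes lemmatised and tagged input, as the abstract describes for the corpora.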

Keywords: MWE, FSMT, hybrid collocation identification, lexical alignment, MWE-aware MT systems, collocation dictionary

Article outline

1. Context and motivation
2. Handling MWEs for MT
3. Collocation definition
4. Translation problems
5. The Architecture of the FSMT system and verb + noun collocation integration
6. Preprocessing Verb + Noun collocations
7. The MWE dictionary
8. The collocation alignment algorithm
9. Experiments
- 9.1 MWEs and the lexical alignment system
- 9.2 MWEs and FSMT system
- 9.3 MWE identification before aligning
10. Conclusions and future work
Notes
References

Published online: 20 July 2018

https://doi.org/10.1075/cilt.341.04tod

References

Avramidis, E., & Koehn, P. (2008). Enriching Morphologically Poor Languages for Statistical Machine Translation. In Proceedings of ACL-08: HLT (pp.763–770). Columbus (USA). Stroudsburg (USA, PA): Association for Computational Linguistics.

Baldwin, T., & Kim, S. N. (2010). Multiword expressions. In N. Indurkhya, & F. J. Damerau (Eds.), Handbook of Natural Language Processing (pp.267–292), Second edition. Boca Raton (USA, FL): CRC Press, Taylor and Francis Group.

Bertoldi, N., Haddow, B., & Fouet, J.-B. (2009). Improved Minimum Error Rate Training in Moses. Prague Bulletin of Mathematical Linguistics (PBML), 91, 7–16.

Birch, A., Osborne, M., & Koehn, P. (2007). CCG Supertags in factored Statistical Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation (pp.9–16). Prague (Czech Republic). Stroudsburg (USA, PA): Association for Computational Linguistics.

Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012). Identifying bilingual multi-word expressions for statistical machine translation. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (pp.674–679). Istanbul, Turkey: ELRA.

Cap, F., Fraser, A., Weller, M., & Cahill, A. (2014). How to Produce Unseen Teddy Bears: Improved Morphological Processing of Compounds in SMT. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (pp.579–587). Gothenburg, Sweden.

Ceauşu, A., & Tufiş, D. (2011). Addressing SMT Data Sparseness when Translating into Morphologically-Rich Languages. In B. Sharp, M. Zock, M. Carl, & A. Lykke Jakobsen (Eds.), Proceedings of the 8th international NLPCS workshop: Human-machine interaction in translation (pp.57–68). Copenhagen Business School (Denmark). Copenhagen (Denmark): Samfundslitteratur.

de Gispert, A., Gupta, D., Popović, M., Lambert, P., Mariño, J., Federico, M., Ney, H., & Banchs, R. (2006). Improving Statistical Word Alignments with Morpho-syntactic Transformations. In Proceedings of 5th International Conference on Natural Language Processing, FinTAL’06 (pp.368–379).

Deksne, D., Skadiņš, R., & Skadiņa, I. (2008). Dictionary of Multiword Expressions for Translation into Highly Inflected Languages. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the Sixth International Conference on Language Resources and Evaluation (pp.1401–1405). Marrakech, Morocco: ELRA.

Dunning, T. (1993). Accurate Methods for the Statistics of Surprise and Coincidence. Computational Linguistics, 19(1), 61–74.

Erjavec, T. (2004). MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (pp.1535–1538). Paris: ELRA.

Gledhill, C. (2007). La portée : seul dénominateur commun dans les constructions verbo-nominales. In Frath, P., Pauchard, J., & Gledhill, C. (Eds.), Actes du 1er colloque, Res per nomen, pour une linguistique de la dénomination, de la référence et de l’usage (pp.113–125), Université de Reims-Champagne-Ardenne.

Gledhill, C., & Todiraşcu, A. (2008). Collocations en contexte : extraction et analyse contrastive. Texte et corpus, 3, Actes des Journées de la linguistique de Corpus 2007 (pp.137–148).

Hausmann, F. J. (2004). Was sind eigentlich Kollokationen? In K. Steyer (Ed.), Wortverbindungen - mehr oder weniger fest (pp.309–334). Institut für Deutsche Sprache Jahrbuch.

Ide, N., & Véronis, J. (1994). Multext (multilingual tools and corpora). In Proceedings of the 15th CoLing (pp.90–96). Kyoto (Japan).

Ion, R. (2007). Metode de dezambiguizare semantică automată. Aplicaţii pentru limbile engleză şi română [Automatic semantic disambiguation methods. Applications for English and Romanian]. Ph.D. thesis. Bucharest (Romania): Romanian Academy.

Koehn, P., & Hoang, H. (2007). Factored translation models. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp.868–876). Prague (Czech Republic).

Koehn, P., Hoang, H., Birch, A., Callison-Burch, Ch., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, Ch., Zens, R., Dyer, Ch., Bojar, O., Constantin, A., & Herbst, E. (2007). Moses: Open source toolkit for statistical machine translation. In Proceedings of the ACL 2007 Demo and Poster Sessions (pp.177–180). Prague.

Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical Phrase-Based Translation. In Proceedings of Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology (pp.48–54). Edmonton (Canada). Stroudsburg (USA, PA): Association for Computational Linguistics.

Kordoni, V., & Simova, I. (2014). Multiword Expressions in Machine Translation. In Proceedings of the International Conference on Language Resources and Evaluation (pp.1208–1211). Reykjavik, Iceland: ELRA.

Lambert, P., & Banchs, R. (2006). Grouping multi-word expressions according to Part-Of-Speech in statistical machine translation. In Proceedings of the EACL Workshop on Multi-word expressions in a multilingual context (pp.9–16). Trento, Italy.

Melamed, D. I. (1997). Automatic Discovery of Non-Compositional Compounds in Parallel Data. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing (pp.97–108). Providence, RI, USA.

Melamed, D. I. (1998). Manual annotation of translational equivalence: The Blinker project. Cognitive Science Technical Report. University of Pennsylvania.

Navlea, M. (2014). La traduction automatique statistique factorisée : une application à la paire de langues français - roumain. Doctoral thesis, Université de Strasbourg, Strasbourg.

Och, F. J., & Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1), 19–51.

Okita, T., Guerra, A. M., Graham, Y., & Way, A. (2010). Multi-Word Expression-Sensitive Word Alignment. In Proceedings of the 4th International Workshop on Cross Lingual Information Access at COLING 2010 (pp.26–34). Beijing.

Pal, S., Naskar, S. K., & Bandyopadhyay, S. (2013). MWE Alignment in Phrase Based Statistical Machine Translation. In K. Sima’an, M. L. Forcada, D. Grasmick, H. Depraetere, & A. Way (Eds.), Proceedings of the XIV Machine Translation Summit (pp.61–68).

Papineni, K., Roukos, S., Ward, T., & Zhu, W. J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) (pp.311–318). Philadelphia (USA, PA). Stroudsburg (USA, PA): Association for Computational Linguistics.

Ramisch, C., Besacier, L., & Kobzar, A. (2013). How hard is it to automatically translate phrasal verbs from English to French? In J. Monti, R. Mitkov, G. Corpas Pastor, & V. Seretan (Eds.), Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technology (pp.53–61). Nice (France).

Rapp, R., & Sharoff, S. (2014). Extracting Multiword Translations from Aligned Comparable Documents. In Proceedings of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra) (pp.87–95). Gothenburg, Sweden.

Ren, Z., Lü, Y., Cao, J., Liu, Q., & Huang, Y. (2009). Improving Statistical Machine Translation Using Domain Bilingual Multiword Expressions. In Proceedings of the 2009 Workshop on Multiword Expressions, ACL-IJCNLP2009 (pp.47–54).

Salton, G., Ross, R., & Kelleher, J. (2014). Evaluation of a Substitution Method for Idiom Transformation in Statistical Machine Translation. In Proceedings of the 10th Workshop on Multiword Expressions (MWE) (pp.38–42), EACL 2014. Göteborg, Sweden: Association for Computational Linguistics.

Schottmüller, N., & Nivre, J. (2014). Issues in Translating Verb-Particle Constructions from German to English. In Proceedings of the 10th Workshop on Multiword Expressions (MWE) (pp.124–131), EACL 2014. Göteborg, Sweden: Association for Computational Linguistics.

Steinberger, R., Eisele, A., Klocek, S., Pilos, S., & Schlüter, P. (2012). DGT-TM: A Freely Available Translation Memory in 22 Languages. In Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC’2012) (pp.454–459). Istanbul (Turkey): ELRA.

Steinberger, R., Pouliquen, B., Widiger, A., Ignat, C., Erjavec, T., Tufiş, D., & Varga, D. (2006). The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006) (pp.2142–2147). Genoa (Italy). Paris (France): ELRA.

Stolcke, A. (2002). SRILM - An Extensible Language Modeling Toolkit. In Proceedings of the International Conference on Spoken Language Processing (pp.901–904). Denver (USA, Colorado).

Tan, L., & Pal, S. (2014). Manawi: Using Multi-Word Expressions and Named Entities to Improve Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation (pp.201–206). Baltimore, Maryland, USA.

Tiedemann, J. (1999). Word alignment - step by step. In Proceedings of the 12th Nordic Conference on Computational Linguistics (pp.216–227). University of Trondheim, Norway.

Todiraşcu, A., Gledhill, C., & Stefanescu, D. (2009). Extracting Collocations in Contexts. In Z. Vetulani, & H. Uszkoreit (Eds.), Responding to Information Society Challenges: New Advances in Human Language Technologies, LNAI 5603 (pp.336–349). Berlin Heidelberg: Springer-Verlag.

Todiraşcu, A., Heid, U., Stefanescu, D., Tufiş, D., Gledhill, C., Weller, M., & Rousselot, F. (2008). Vers un dictionnaire de collocations multilingue. Cahiers de Linguistique, 33(1), 171–185.

Todiraşcu, A., Ion, R., Navlea, M., & Longo, L. (2011). French text preprocessing with TTL. In Proceedings of the Romanian Academy, Series A, 12(2), 151–158.

Todiraşcu, A., Pado, S., Krisch, J., Kisselew, M., & Heid, U. (2012). French and German Corpora for Audience-based Text Type Classification. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12) (pp.1591–1597). Istanbul, Turkey: ELRA.

Tufiş, D., Ion, R., & Dumitrescu, Ș. (2013). Wikipedia as an SMT Training Corpus. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013) (pp.702–709). Hissar (Bulgaria).

Venkatapathy, S., & Joshi, A. (2006). Using Information about Multi-word Expressions for the Word-Alignment Task. In Proceedings of the Workshop on Multiword Expressions: Identifying and Exploiting Underlying Properties (pp.20–28). Sydney.

Wehrli, E., Seretan, V., Nerima, L., & Russo, L. (2009). Collocations in a Rule-Based MT System: A Case Study Evaluation of Their Translation Adequacy. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation (EAMT) (pp.128–135). Barcelona: EAMT.

Wu, H., Wang, H., & Zong, C. (2008). Domain adaptation for statistical machine translation with domain dictionary and monolingual corpora. In Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008) (pp.993–1000). Manchester.
