On identification of bilingual lexical bundles for translation purposes: The case of an English-Polish comparable corpus of patient information leaflets

Grabowski, Łukasz

doi:10.1075/cilt.341.09gra

Part of

Multiword Units in Machine Translation and Translation Technology
Edited by Ruslan Mitkov, Johanna Monti, Gloria Corpas Pastor and Violeta Seretan
[Current Issues in Linguistic Theory 341] 2018
pp. 181–200

Grounded in phraseology and corpus linguistics, this paper aims to explore the use of bilingual lexical bundles to improve the degree of naturalness and textual fit of translated texts. More specifically, this study attempts to identify lexical bundles, that is, recurrent sequences of 3–7 words with similar discursive functions in a purpose-designed comparable corpus of English and Polish patient information leaflets, with 100 text samples in each language. Because of cross-linguistic differences, we additionally apply a number of formal criteria in order to filter out the bundles in each subcorpus. The results show that bilingual lexical bundles with overlapping discourse functions in texts and extracted from comparable corpora hold unexplored potential for machine translation, computer-assisted translation and bilingual lexicography.
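To make the notion of a lexical bundle concrete, the following minimal sketch counts recurrent word sequences of 3–7 words in a small set of texts and keeps those that reach a frequency and a text-range threshold. It is an illustration only, not the procedure used in the study: the function name `extract_bundles`, the whitespace tokenisation, the threshold values and the three sample sentences are all assumptions made for this example, and the formal filtering criteria mentioned in the abstract are not reproduced here.

```python
from collections import Counter, defaultdict


def extract_bundles(texts, min_n=3, max_n=7, min_freq=2, min_texts=2):
    """Return {bundle: (frequency, number_of_texts)} for recurrent n-grams.

    min_freq and min_texts are illustrative thresholds (assumptions),
    not the cut-off values applied in the study.
    """
    freq = Counter()                 # total occurrences of each n-gram
    text_range = defaultdict(set)    # ids of the texts each n-gram occurs in
    for text_id, text in enumerate(texts):
        tokens = text.lower().split()  # naive whitespace tokenisation
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                ngram = tuple(tokens[i:i + n])
                freq[ngram] += 1
                text_range[ngram].add(text_id)
    # Keep only sequences that are frequent enough and spread across texts.
    return {
        " ".join(ngram): (count, len(text_range[ngram]))
        for ngram, count in freq.items()
        if count >= min_freq and len(text_range[ngram]) >= min_texts
    }


# Hypothetical sample sentences in the style of patient information leaflets.
leaflets = [
    "if you have any further questions ask your doctor or pharmacist",
    "ask your doctor or pharmacist before taking this medicine",
    "keep this leaflet you may need to read it again",
]
bundles = extract_bundles(leaflets)
for bundle, (count, n_texts) in sorted(bundles.items()):
    print(f"{bundle}: frequency={count}, texts={n_texts}")
```

Running the same extraction separately on the English and the Polish subcorpus would yield two monolingual bundle lists; pairing bundles across languages by discourse function is an analytical step that this sketch does not attempt.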

Keywords: lexical bundles, comparable corpora, translation quality, translation universals, patient information leaflets

Article outline

1. Introduction
2. Background and related work
3. Research material and methodology
4. Results
5. Discussion and conclusions
Notes
References
Appendix

Published online: 20 July 2018

https://doi.org/10.1075/cilt.341.09gra
