Working with parallel corpora
Usefulness and usability
Although parallel corpora are vital for cross-linguistic and natural language processing (NLP) research, most have been designed for a single purpose, which may unnecessarily restrict their usefulness and usability. I argue that existing parallel corpora become far more useful when the data they yield are combined with data from comparable and/or monolingual corpora. Usability criteria, such as the choice of processing tools and adherence to international standards, also affect corpus usefulness. This chapter proposes courses of action to improve the recycling and reprocessing of available resources. It also presents a corpus-based post-editing and quality assessment application as an illustration of the many uses parallel corpora may serve.
Article outline
- 1. Introduction
- 2. Concepts
- 3. Resources
- 4. Uses of parallel corpora
- 5. Needs analysis
- 6. Parallel corpora: Building or using
- 7. Applications
- 8. Useful strategies
- 9. Conclusions
- Acknowledgment
- Notes
- References