Building EPTIC
A many-sided, multi-purpose corpus of EU parliament proceedings
This chapter describes the steps involved in the construction of EPTIC, an intermodal corpus of European Parliament speeches. Despite its limited size, this corpus has features that justify its labour-intensive building process, in particular its multiple alignments. The text-to-text alignments allow users to compare interpretations and translations of source speeches and their written-up reports, while the text-to-video alignments allow them to access the multimedia components from concordance lines. To illustrate the potential of EPTIC, a case study of English loan words in original, translated and interpreted Italian and French is presented. Results suggest that borrowing is more likely to occur in translated Italian than in any of the other corpus components.
Article outline
- 1. Introduction: Why another corpus of European Parliament speeches?
- 2. What EPTIC looks like
- 2.1 One corpus, fourteen subcorpora
- 2.2 Practical details: Size and availability
- 3. Building EPTIC
- 3.1 Selecting and obtaining raw corpus materials
- 3.2 Transcribing the oral data
- 3.3 Adding metadata
- 3.4 Performing text-to-text alignment
- 3.5 Performing text-to-video alignment
- 3.6 POS-tagging, lemmatization and indexing
- 4. An example: English loan words in Italian and French
- 5. Conclusion: Teaming up
- Acknowledgement
- Notes
- References