Publications

Publication details [#22325]

Macken, Lieve, Orphée De Clercq and Hans Paulussen. 2011. Dutch parallel corpus: a balanced copyright-cleared parallel corpus. In Ballard, Michel and Carmen Pineira-Tresmontant, eds. Les corpus en linguistique et en traductologie [Linguistic and translation corpora]. Arras: Artois Presses Université. pp. 374–390.

Publication type

Article in journal/book

Publication language

English

Keywords

corpus linguistics | corpus=corpora | Internet=World Wide Web=www=website=web site

Source language

Dutch | English | French

Abstract

This paper presents the Dutch Parallel Corpus, a high-quality parallel corpus for Dutch, French and English consisting of more than ten million words. The corpus contains five different text types and is balanced with respect to text type and translation direction. All texts included in the corpus have been cleared from copyright. We discuss the importance of parallel corpora in various research domains and contrast the Dutch Parallel Corpus with existing parallel corpora. The Dutch Parallel Corpus distinguishes itself from other parallel corpora by having a balanced composition and by its availability to the wide research community, thanks to its copyright clearance. All texts in the corpus are sentence-aligned and further enriched with basic linguistic annotations (lemmas and word class information). Approximately 25,000 words of the Dutch-English part have been manually aligned at the sub-sentential level. Rich metadata facilitates the navigability of the corpus and enables users to select the texts that satisfy their needs. The entire corpus is released as full texts in XML format and is also available via a web interface, which supports basic and complex search queries and presents the results as parallel concordances. The corpus will be distributed by the Flemish-Dutch Human Language Technology Agency (TST-Centrale).

Source: Abstract in journal