Creating a test corpus for term extractors through term annotation
In this paper, we describe a methodology for creating a test corpus for the evaluation of term extractors. This methodology relies on term annotation: terms in a corpus on automotive engineering are selected based on specific criteria pertaining to the terminological setting as well as to the linguistic and formal properties of terms and term variations. The test corpus accounts for the variety of ways in which terms are realized in running text, and provides a means of automatically evaluating the relevance of term candidate lists produced by term extractors. Thanks to the XML annotation scheme used, the corpus can be customized, e.g. by filtering out some of the annotated terms based on term type, term variation type, or frequency. In this paper, we focus on the methodological aspects of this work.
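The filtering described above can be sketched as follows. This is a minimal illustration only: the tag names, attributes (`type`, `freq`), and sample sentence are hypothetical assumptions, not the actual annotation scheme used for the corpus.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation scheme: tag and attribute names are illustrative
# assumptions, not the scheme actually used in the paper.
SAMPLE = """<corpus>
  <s>The <term id="t1" type="simple" freq="12">crankshaft</term> drives the
  <term id="t2" type="complex" freq="3">timing belt tensioner</term>.</s>
</corpus>"""

def filter_terms(xml_text, term_type=None, min_freq=0):
    """Return the annotated terms matching a type and frequency threshold."""
    root = ET.fromstring(xml_text)
    selected = []
    for t in root.iter("term"):
        if term_type and t.get("type") != term_type:
            continue  # keep only the requested term type
        if int(t.get("freq", "0")) < min_freq:
            continue  # drop terms below the frequency threshold
        selected.append("".join(t.itertext()).strip())
    return selected
```

For example, `filter_terms(SAMPLE, term_type="simple")` keeps only single-word terms, while `filter_terms(SAMPLE, min_freq=5)` discards low-frequency annotations before evaluation.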