A filter for syntactically incomparable parallel sentences

Kroon, Martin; Barbiers, Sjef; Odijk, Jan; van der Pas, Stéphanie

doi:10.1075/avt.00029.kro

Article published in:

Linguistics in the Netherlands 2019
Edited by Janine Berns and Elena Tribushinina
[Linguistics in the Netherlands 36] 2019
pp. 147–161

Part II: Selected papers presented at the Dutch Annual Linguistics Day of 2019

Martin Kroon | Leiden University Centre for Linguistics

Sjef Barbiers | Leiden University Centre for Linguistics

Jan Odijk | Universiteit Utrecht, UIL-OTS

Stéphanie van der Pas | Mathematical Institute, Leiden University

Massive automatic comparison of languages in parallel corpora will greatly speed up and enhance comparative syntactic research. Automatically extracting and mining syntactic differences from parallel corpora requires a pre-processing step that filters out sentence pairs that cannot be compared syntactically, for example because they involve “free” translations. In this paper we explore four possible filters: the Damerau-Levenshtein distance between POS-tags, the sentence-length ratio, the graph-edit distance between dependency parses, and a combination of the three in a logistic regression model. Results suggest that the dependency-parse filter is the most stable throughout language pairs, while the combination filter achieves the best results.

Keywords: filter, parallel corpus, syntactic comparability, dependency parses
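
As an illustration of two of the filters named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the POS-tag filter compares the two tag sequences with the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, normalised by the longer sequence, and that the sentence-length ratio is the shorter length divided by the longer. The function names and the thresholds `max_norm_dist` and `min_ratio` are hypothetical choices for this example; the abstract does not state them.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Insertions, deletions, substitutions and swaps of two adjacent
    tags each cost 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Transposition of two adjacent tags.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def length_ratio(a: Sequence, b: Sequence) -> float:
    """Shorter length over longer length (assumed definition); 1.0 = equal."""
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def is_comparable(src_tags: Sequence[str], tgt_tags: Sequence[str],
                  max_norm_dist: float = 0.3, min_ratio: float = 0.7) -> bool:
    """Keep a sentence pair only if both hypothetical thresholds are met."""
    longest = max(len(src_tags), len(tgt_tags), 1)  # avoid division by zero
    norm_dist = osa_distance(src_tags, tgt_tags) / longest
    return norm_dist <= max_norm_dist and length_ratio(src_tags, tgt_tags) >= min_ratio


# English "the black cat sleeps" vs. French "le chat noir dort":
# the tag sequences differ only by one adjacent swap (ADJ/NOUN).
english = ["DET", "ADJ", "NOUN", "VERB"]
french = ["DET", "NOUN", "ADJ", "VERB"]
keep_pair = is_comparable(english, french)
```

In this sketch a pair whose tag sequences differ by a single adjective-noun swap passes both checks, while a pair with very different lengths is rejected by the ratio check alone. In practice the thresholds would have to be tuned per language pair rather than fixed by hand.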

Article outline

1. Introduction
2. Syntactic comparability
3. Data
4. Filters
- 4.1 Levenshtein distance on POS-tags
- 4.2 Sentence-length ratio
- 4.3 Graph edit distance on dependency trees
- 4.4 Combination filter
- 4.5 Automatically setting a threshold
5. Evaluation of the filters
6. Results
7. Discussion
8. Conclusion
Acknowledgements
Notes
References

Available under the Creative Commons Attribution-NonCommercial (CC BY-NC) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 5 November 2019

References

Abu-Aisheh, Zeina, Romain Raveaux, Jean-Yves Ramel & Patrick Martineau. 2015. “An exact graph edit distance algorithm for solving pattern recognition problems”. 4th International Conference on Pattern Recognition Applications and Methods, January 2015, Lisbon, Portugal. doi:10.5220/0005209202710278. hal-01168816.

Abzianidze, Lasha, Johannes Bjerva, Kilian Evang, Hessel Haagsma, Rik van Noord, Pierre Ludmann & Johan Bos. 2017. “The Parallel Meaning Bank: Towards a multilingual corpus of translations annotated with compositional meaning representations”. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 242–247.

Barbiers, Sjef. 2009. “Locus and limits of syntactic microvariation”. Lingua 119 (11): 1607–1623.

Bard, Gregory V. 2007. “Spelling-error tolerant, order-independent pass-phrases via the Damerau-Levenshtein string-edit distance metric”. Proceedings of the Fifth Australasian Symposium on ACSW Frontiers: Volume 68, 117–124. Australian Computer Society, Inc.

Cohen, Jacob. 1960. “A coefficient of agreement for nominal scales”. Educational and Psychological Measurement 20 (1): 37–46.

Fleiss, J. L. & Jacob Cohen. 1973. “The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability”. Educational and Psychological Measurement 33 (3): 613–619.

Hagberg, Aric, Daniel Schult & Pieter Swart. 2008. “Exploring network structure, dynamics, and function using NetworkX”. Proceedings of the 7th Python in Science Conference (SciPy2008) ed. by G. Varoquaux, T. Vaught & J. Millman, 11–15. Pasadena, CA, USA.

Klis, van der, Martijn, Bert Le Bruyn & Henriëtte de Swart. 2017. “Mapping the perfect via translation mining”. Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, 497–502.

Koehn, Philipp. 2005. “Europarl: A parallel corpus for statistical machine translation”. MT Summit: Volume 5, 79–86.

Levenshtein, Vladimir I. 1966. “Binary codes capable of correcting deletions, insertions, and reversals”. Soviet Physics Doklady 10 (8): 707–710.

Nivre, Joakim, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald et al. 2016. “Universal Dependencies v1: A multilingual treebank collection”. LREC 2016, 1659–1666.

Straka, Milan & Jana Straková. 2017. “Tokenizing, POS-tagging, lemmatizing and parsing UD 2.0 with UDPipe”. Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, 88–99. Vancouver, Canada: Association for Computational Linguistics.

Wiersma, Wybo, John Nerbonne & Timo Lauttamus. 2011. “Automatically extracting typical syntactic differences from corpora”. Literary and Linguistic Computing 26 (1): 107–124.

Youden, William J. 1950. “Index for rating diagnostic tests”. Cancer 3 (1): 32–35.