Part II: Selected papers presented at the Dutch Annual Linguistics Day
of 2019
A filter for syntactically incomparable parallel sentences
Massive automatic comparison of languages in parallel corpora
will greatly speed up and enhance comparative syntactic research. Automatically
extracting and mining syntactic differences from parallel corpora requires a
pre-processing step that filters out sentence pairs that cannot be compared
syntactically, for example because they involve “free” translations. In this
paper we explore four possible filters: the Damerau-Levenshtein distance between
POS-tags, the sentence-length ratio, the graph-edit distance between dependency
parses, and a combination of the three in a logistic regression model. Results
suggest that the dependency-parse filter is the most stable across language
pairs, while the combination filter achieves the best results.
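
As a rough illustration of the two simplest filters named above, the following Python sketch computes an edit distance over POS-tag sequences and a sentence-length ratio. It is a minimal sketch under stated assumptions, not the implementation used in the paper: it uses the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, it normalizes by the longer sequence, and the function names and example tags are hypothetical.

```python
from typing import Sequence


def osa_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Operates on sequences of POS tags rather than characters, so one edit
    is the insertion, deletion, substitution, or adjacent swap of a tag.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i tags of a
    for j in range(cols):
        d[0][j] = j  # insert all j tags of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition, e.g. NOUN VERB vs. VERB NOUN.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def normalized_pos_distance(a: Sequence[str], b: Sequence[str]) -> float:
    """Edit distance divided by the longer length (normalization is an assumption)."""
    longest = max(len(a), len(b))
    return osa_distance(a, b) / longest if longest else 0.0


def length_ratio(a: Sequence[str], b: Sequence[str]) -> float:
    """Shorter length over longer length; 1.0 means equal length (an assumed definition)."""
    longest = max(len(a), len(b))
    return min(len(a), len(b)) / longest if longest else 1.0


# Illustrative tag sequences, not taken from the paper's data.
source_tags = ["DET", "NOUN", "VERB"]
target_tags = ["DET", "VERB", "NOUN"]
print(osa_distance(source_tags, target_tags))   # 1 (one adjacent swap)
print(length_ratio(source_tags, target_tags))   # 1.0
```

A sentence pair would then be kept or discarded by comparing such scores against a threshold; how that threshold is chosen is not specified in the abstract.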
Article outline
- 1. Introduction
- 2. Syntactic comparability
- 3. Data
- 4. Filters
- 4.1 Levenshtein distance on POS-tags
- 4.2 Sentence-length ratio
- 4.3 Graph edit distance on dependency trees
- 4.4 Combination filter
- 4.5 Automatically setting a threshold
- 5. Evaluation of the filters
- 6. Results
- 7. Discussion
- 8. Conclusion
- Acknowledgements
- Notes
- References