Edited by Gisle Andersen
[Studies in Corpus Linguistics 49] 2012
► pp. 51–66
The paper describes the improvement of the rule-based Constraint Grammar (CG) Oslo-Bergen Tagger (OBT) by the addition of a statistical module. It is in the nature of CG taggers to leave some words ambiguous between different readings, due to a lack of coverage by the linguistics-based rules. Such ambiguities are often a problem for applications that use the tagger, among them the Norwegian Newspaper Corpus. Our statistical module not only removes part of speech (PoS) and morphological ambiguities, but also disambiguates lemmas. We show how this new system, referred to as OBT+stat, in a straightforward manner combines the strengths of the linguistic knowledge-based CG approach with data-driven methods. The result is a high-performing, fully disambiguating PoS/morphological tagger and lemmatizer with very satisfactory evaluation results.
This list is based on CrossRef data as of 17 march 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.