The problem of data sparsity: Using electronic corpora to study language variation

Moisl, Hermann

doi:10.1075/silv.5.14moi

Part of

Language Variation – European perspectives II: Selected papers from the 4th International Conference on Language Variation in Europe (ICLaVE 4), Nicosia, June 2007
Edited by Stavroula Tsiplakou, Marilena Karyolemou and Pavlos Pavlou
[Studies in Language Variation 5] 2009
pp. 169–178

Hermann Moisl | University of Newcastle

As more and larger digital electronic corpora of natural language text appear, effective linguistic analysis of them will increasingly be tractable only with the computational interpretative methods developed by the statistical, information retrieval, and related communities. To use such methods effectively, however, the issues that arise in abstracting data from corpora have to be understood. This paper addresses one issue that has a fundamental bearing on the validity of analytical results based on such data: sparsity. The discussion is in three main parts. The first shows how a particular class of computational methods, exploratory multivariate analysis, can be used in language variation research; the second explains why data sparsity can be a problem in such analysis; and the third outlines a solution.

Published online: 19 November 2009