Assessing word commonness
Adding dispersion to frequency
The article investigates the two main corpus indicators of word commonness, frequency and dispersion, through a
cross-validation analysis of frequency and four dispersion measures (‘Range’, ‘Chi-squared’, ‘Deviation of Proportions’ and
‘Juilland’s D’). The approach estimates how well each of these measures predicts the distribution of corpus
items in an extracted language sample. Based on a dataset of 273 Norwegian compounds, the results show that Deviation
of Proportions in particular is a robust measure of dispersion that can be used in conjunction with frequency to substantiate assertions of word
commonness based on corpus data. In addition, dispersion measures reflect not only what sort of distribution the frequency
statistic is generated from, but also how reliably the frequency estimate from the corpus sample represents
frequency in the language variety that the corpus is sampled from.
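To make the contrast between frequency and dispersion concrete, the following minimal Python sketch computes two of the named measures for a single word. It assumes the commonly used definitions: relative range as the share of corpus parts that contain the word, and Deviation of Proportions (DP) as half the summed absolute difference between each part's share of the word's occurrences and that part's share of the corpus. The function names and the toy counts are illustrative assumptions, not the article's own implementation or data.

```python
def relative_range(counts):
    """Share of corpus parts in which the word occurs at least once."""
    return sum(1 for c in counts if c > 0) / len(counts)


def deviation_of_proportions(counts, part_sizes):
    """DP under the assumed definition: 0.5 * sum(|observed_i - expected_i|).

    observed_i: part i's share of the word's total occurrences.
    expected_i: part i's share of the total corpus size.
    Values near 0 indicate an even spread; values near 1 indicate clumping.
    """
    if len(counts) != len(part_sizes):
        raise ValueError("counts and part_sizes must have the same length")
    total_count = sum(counts)
    total_size = sum(part_sizes)
    if total_count == 0 or total_size == 0:
        raise ValueError("the word and the corpus must both be non-empty")
    observed = [c / total_count for c in counts]
    expected = [s / total_size for s in part_sizes]
    return 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))


# Hypothetical toy data: two words with the same frequency (20)
# in a corpus of four equally sized parts.
sizes = [100, 100, 100, 100]
even_word = [5, 5, 5, 5]       # spread across all parts
clumped_word = [20, 0, 0, 0]   # confined to one part

for label, counts in [("even", even_word), ("clumped", clumped_word)]:
    print(label, relative_range(counts), deviation_of_proportions(counts, sizes))
```

Both toy words have the same frequency, yet their range and DP values differ sharply. This is the sense in which dispersion adds information that frequency alone does not carry.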
Article outline
- 1. Introduction
- 2. Dictionaries and distributions
- 2.1 Dictionaries and core vocabulary
- 2.2 The case of compounds in Norwegian
- 2.3 Disadvantages of frequency, advantages of dispersion
- 2.4 Comparing measures
- 2.4.1 Frequency
- 2.4.2 Relative range
- 2.4.3 Chi squared (χ²)
- 2.4.4 Deviation of proportions (DP)
- 2.4.5 Juilland’s D (D_uneq)
- 2.4.6 Which one(s) to choose?
- 2.5 What are distributions in corpora supposed to tell us?
- 3. Methodology
- 3.1 Cross-validation and correlation analysis
- 4. Results and analysis
- 4.1 Frequency
- 4.2 Range
- 4.3 Chi squared
- 4.4 Deviation of proportions
- 4.5 Juilland’s D
- 5. Summary and concluding discussion
- Notes
- References