Article published in: International Journal of Corpus Linguistics: Online-First Articles
Down-sampling from hierarchically structured corpus data
Resource constraints often force researchers to downsize the list of tokens returned by a corpus query. This paper sketches a methodology for down-sampling and offers a survey of current practices. We build on earlier work and extend the evaluation of down-sampling designs to settings where tokens are clustered by text file and lexeme. Our case study deals with third-person present-tense verb inflection in Early Modern English and focuses on five predictors: year, gender, genre, frequency, and phonological context. We evaluate two strategies for selecting 2,000 (out of 11,645) tokens: simple down-sampling, where each hit has the same selection probability; and structured down-sampling, where this probability is inversely proportional to the author- and verb-specific token count. We form 500 subsamples using each scheme and compare regression results to a reference model fit to the full set of cases. We observe that structured down-sampling shows better performance on several evaluation criteria.
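The two selection schemes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not say how the author-specific and verb-specific token counts are combined, so the sketch assumes a weight of 1 / (author count × verb count). The function names, the dictionary keys `author` and `verb`, and the toy data are hypothetical. Weighted selection without replacement is done here with Efraimidis-Spirakis keys, a technique swapped in for illustration.

```python
import math
import random
from collections import Counter


def simple_downsample(tokens, k, seed=0):
    """Simple down-sampling: each hit has the same selection probability."""
    rng = random.Random(seed)
    return rng.sample(tokens, k)


def structured_downsample(tokens, k, seed=0):
    """Structured down-sampling: weighted selection without replacement.

    ASSUMPTION: weight = 1 / (author token count * verb token count).
    The paper's exact weighting may differ.
    """
    rng = random.Random(seed)
    author_n = Counter(t["author"] for t in tokens)
    verb_n = Counter(t["verb"] for t in tokens)

    keyed = []
    for i, t in enumerate(tokens):
        inv_weight = author_n[t["author"]] * verb_n[t["verb"]]
        u = 1.0 - rng.random()  # uniform on (0, 1], so log(u) is defined
        # Efraimidis-Spirakis key u**(1/w), computed on the log scale
        # to avoid underflow when the weight is very small.
        keyed.append((math.log(u) * inv_weight, i))

    # The k largest keys form the weighted sample.
    keyed.sort(reverse=True)
    return [tokens[i] for _, i in keyed[:k]]


# Hypothetical toy data: one prolific author using one frequent verb,
# plus 20 authors who each contribute a single token of a distinct verb.
tokens = [{"author": "A", "verb": "have"} for _ in range(1000)]
tokens += [{"author": f"B{j}", "verb": f"v{j}"} for j in range(20)]

simple = simple_downsample(tokens, k=10)
structured = structured_downsample(tokens, k=10)

n_authors_simple = len({t["author"] for t in simple})
n_authors_structured = len({t["author"] for t in structured})
print(n_authors_simple, n_authors_structured)
```

In this toy setting, the simple sample is dominated by the prolific author, whereas the structured sample tends to spread across many more authors and verbs. Repeating each scheme over many seeds would mimic, in outline, the 500 subsamples per scheme mentioned above.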
Keywords: down-sampling, thinning, methodology, data structure, study design
Article outline
- 1. Introduction
- 2. Down-sampling in corpus-based work
  - 2.1 Down-sampling designs: Design features and terminology
  - 2.2 A survey of down-sampling designs in corpus-based work
  - 2.3 Previous methodological work
- 3. Methodology
  - 3.1 Case study and corpus data
    - 3.1.1 Third-person verb inflection in Early Modern English
    - 3.1.2 Data preparation
    - 3.1.3 Data structure
  - 3.2 Evaluation method
    - 3.2.1 Implementation of down-sampling designs
    - 3.2.2 The reference model
    - 3.2.3 Evaluation of down-sampling designs
- 4. Results
- 5. Summary and outlook
- Acknowledgements
- Notes
- References
Published online: 25 March 2024
https://doi.org/10.1075/ijcl.23079.son
References (19)
Baayen, R. H. (2008). Analyzing Linguistic Data: A Practical Introduction to Statistics Using R. Cambridge University Press.
BNC Consortium. (2007). British National Corpus (version 3, BNC XML ed.). [URL]
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and Other Stories. Cambridge University Press.
Gries, S. T., & Hilpert, M. (2010). Modeling diachronic change in the third person singular: A multifactorial, verb- and author-specific exploratory approach. English Language and Linguistics, 14(3), 293–320.
Jenset, G. B., & McGillivray, B. (2017). Quantitative Historical Linguistics: A Corpus Framework. Oxford University Press.
Kroch, A., Santorini, B., & Delfs, L. (2004). The Penn-Helsinki Parsed Corpus of Early Modern English (PPCEME). [URL]
Kytö, M. (1993). Third-person singular verb inflection in early British and American English. Language Variation and Change, 5(2), 113–139.
Nevalainen, T., & Raumolin-Brunberg, H. (2003). Historical Sociolinguistics: Language Change in Tudor and Stuart England. Pearson Education.
Rothman, K. J., Greenland, S., & Lash, T. L. (2008). Case-control studies. In K. J. Rothman, S. Greenland, & T. L. Lash (Eds.), Modern Epidemiology (3rd ed., pp. 111–127). Lippincott Williams & Wilkins.
Singer, J. D. (1991). Types of factors and their structural layouts. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Fundamentals of Exploratory Analysis of Variance (pp. 50–71). Wiley.
Smith, N., & Waters, C. (2019). Variation and change in a specialized register: A comparison of random and sociolinguistic sampling outcomes in Desert Island Discs. International Journal of Corpus Linguistics, 24(2), 169–201.
Sönning, L. (2023). Data from Jenset & McGillivray (2017), adapted for “Down-sampling from hierarchically structured corpus data”. DataverseNO, V1.
Sönning, L., & Krug, M. (2022). Comparing study designs and down-sampling strategies in corpus analysis: The importance of speaker metadata in the BNCs of 1994 and 2014. In O. Schützler & J. Schlüter (Eds.), Data and Methods in Corpus Linguistics: Comparative Approaches (pp. 127–160). Cambridge University Press.
Vaden, K. I., Halpin, H. R., & Hickok, G. S. (2009). Irvine Phonotactic Online Dictionary (Version 2.0). [Data file]. [URL]