Not all linguistic variation is equally predictable: Evidence from two case studies with large language models

Morger, Felix; Berdicevskis, Aleksandrs

doi:10.1075/ijcl.24065.mor

Article In: International Journal of Corpus Linguistics: Online-First Articles

Felix Morger | University of Gothenburg

Aleksandrs Berdicevskis | University of Gothenburg

This content is being prepared for publication; it may be subject to changes.

Abstract

We compare the extent to which the choice of a variant can be predicted from language-internal factors for two linguistic variables: the English dative alternation and the omission of the infinitival marker att in a Swedish future-tense construction. Previous research has shown that for the dative alternation, near-ceiling performance can be achieved by fitting a regression model with manually selected predictors. Similar attempts for att-omission have been unsuccessful. To test whether the two variables differ in predictability or whether optimal methods have simply not yet been found for att-omission, we apply a large language model (LLM) to the same task. For the dative alternation, the LLM and the regression model perform equally well. For att-omission, the LLM outperforms regression but still performs worse than it does for the dative alternation, suggesting that att-omission is inherently less predictable. We argue that LLMs can be useful for estimating how much the choice of a variant depends on language-internal factors.

Keywords: language variation, English, Swedish, large language models, prediction
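
The prediction task summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which this page does not show. It assumes that each item is a pair of competing variants, that some scoring function assigns each sentence a number (a language-model score in practice), and that predictions are evaluated with balanced accuracy. The function `toy_score`, the example sentences, and the labels "a" and "b" are hypothetical stand-ins chosen only so that the code runs without a model.

```python
from typing import Callable, List, Sequence


def predict_variant(variant_a: str, variant_b: str,
                    score: Callable[[str], float]) -> str:
    """Return "a" if variant_a scores at least as high as variant_b, else "b"."""
    return "a" if score(variant_a) >= score(variant_b) else "b"


def balanced_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Mean of per-class recall, which is not inflated by a skewed class distribution."""
    recalls: List[float] = []
    for label in sorted(set(gold)):
        indices = [i for i, g in enumerate(gold) if g == label]
        hits = sum(1 for i in indices if predicted[i] == label)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)


def toy_score(sentence: str) -> float:
    """Hypothetical stand-in for a model score: fewer tokens means a higher score."""
    return -float(len(sentence.split()))


# Hypothetical items: (variant a, variant b, attested variant).
pairs = [
    ("She gave him the book", "She gave the book to him", "a"),
    ("They showed us the house", "They showed the house to us", "a"),
    ("I told her the news", "I told the news to her", "a"),
    ("He sent the editor of the journal a letter",
     "He sent a letter to the editor of the journal", "b"),
]

gold = [attested for _, _, attested in pairs]
predicted = [predict_variant(a, b, toy_score) for a, b, _ in pairs]

# Plain accuracy rewards always guessing the frequent variant.
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
print("accuracy:", accuracy)
print("balanced accuracy:", balanced_accuracy(gold, predicted))
```

Because the toy scorer always prefers the shorter variant, it predicts "a" for every item. Plain accuracy is 0.75 on this skewed sample, while balanced accuracy is 0.5, the value expected of a predictor that ignores context. Replacing `toy_score` with a real model score would leave the rest of the sketch unchanged.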

Article outline

1. Introduction
2. Linguistic variation and large language models
- 2.1 Predicting variation
- 2.2 Predicting the dative alternation and att-omission
- 2.3 Large language models
3. Methodology
- 3.1 Metrics and baseline
- 3.2 BERT: Basic properties
- 3.3 The task of predicting the choice of a variant
  - 3.3.1 BERT without fine-tuning: Sentence scoring
  - 3.3.2 BERT with fine-tuning: Binary idiomaticity task
- 3.4 Technical details
4. The dative alternation study
- 4.1 The dataset
- 4.2 Predicting with logistic regression
- 4.3 Predicting with BERT
- 4.4 Interim conclusions
5. The att-omission study
- 5.1 The datasets
- 5.2 Predicting with logistic regression
- 5.3 Predicting with BERT
- 5.4 Additional features: Time and inter-individual variation
  - 5.4.1 Time
  - 5.4.2 Inter-individual variation
6. Discussion
7. Conclusions
Acknowledgments
Notes
References

References (41)

Agrawal, M., Peterson, J. C., & Griffiths, T. L. (2020). Scaling up psychology via scientific regret minimization. Proceedings of the National Academy of Sciences, 117(16), 8825–8835.

Arnold, J. E., Losongco, A., Wasow, T., & Ginstrom, R. (2000). Heaviness vs. newness: The effects of structural complexity and discourse status on constituent ordering. Language, 76(1), 28–55.

Baayen, R. H. (2011). Corpus linguistics and naive discriminative learning. Revista Brasileira de Linguística Aplicada, 11(2), 295–328.

Baayen, R. H. (2024). The wompom. Corpus Linguistics and Linguistic Theory, 20(3), 615–648.

Bates, D., Mächler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48.

Berdicevskis, A., Bouma, G., Kurtz, R., Morger, F., Öhman, J., Adesam, Y., Borin, L., Dannélls, D., Forsberg, M., Isbister, T., Lindahl, A., Malmsten, M., Rekathati, F., Sahlgren, M., Volodina, E., Börjeson, L., Hengchen, S., & Tahmasebi, N. (2023). Superlim: A Swedish language understanding evaluation benchmark. In H. Bouamor, J. Pino, & K. Bali (Eds.), Proceedings of the 2023 conference on empirical methods in Natural Language Processing (pp. 8137–8153). Association for Computational Linguistics.

Berdicevskis, A., Coussé, E., Koplenig, A., & Adesam, Y. (2024). To drop or not to drop? Predicting the omission of the infinitival marker in a Swedish future construction. Corpus Linguistics and Linguistic Theory, 20(1), 219–261.

Bresnan, J., Cueni, A., Nikitina, T., & Baayen, R. H. (2007). Predicting the dative alternation. In G. Bouma, I. Krämer, & J. Zwarts (Eds.), Cognitive foundations of interpretation (pp. 69–94). KNAW.

Bresnan, J., & Ford, M. (2010). Predicting syntax: Processing dative constructions in American and Australian varieties of English. Language, 86(1), 168–213. [URL]

Bresnan, J., Rosenbach, A., Szmrecsanyi, B., Tagliamonte, S., & Todd, S. (2017). Syntactic alternations data: Datives and genitives in four varieties of English. Stanford Digital Repository. [URL].

Brodersen, K. H., Ong, C. S., Stephan, K. E., & Buhmann, J. M. (2010). The balanced accuracy and its posterior distribution. In J. E. Guerrero (Ed.), 2010 20th international conference on pattern recognition (pp. 3121–3124). Institute of Electrical and Electronics Engineers.

Chambaz, A., & Desagulier, G. (2016). Predicting is not explaining: Targeted learning of the dative alternation. Journal of Causal Inference, 4(1), 1–30.

Daelemans, W., Zavrel, J., Van der Sloot, K., & Van den Bosch, A. (2010). TiMBL: Tilburg memory based learner reference guide. Version 6.3 (Technical Report No. ILK 10–01). Computational Linguistics Tilburg University. [URL].

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In J. Burstein, C. Doran, & T. Solorio (Eds.), Proceedings of the 2019 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, Volume 1 (Long and Short Papers) (pp. 4171–4186).

Dyer, W., Torres, C., Scontras, G., & Futrell, R. (2023). Evaluating a century of progress on the cognitive science of adjective ordering. Transactions of the Association for Computational Linguistics, 11, 1185–1200.

Eckert, P. (2012). Three waves of variation study: The emergence of meaning in the study of sociolinguistic variation. Annual Review of Anthropology, 41, 87–100.

Engel, A., Grafmiller, J., Rosseel, L., Szmrecsanyi, B., & Van de Velde, F. (2021). How register-specific is probabilistic grammatical knowledge?: A programmatic sketch and a case study on the dative alternation with give. In E. Seoane & D. Biber (Eds.), Corpus-based approaches to register variation (pp. 51–84). John Benjamins.

Engel, A., & Szmrecsanyi, B. (2022). Variable grammars are variable across registers: Future temporal reference in English. Language Variation and Change, 34(3), 355–378.

Gries, S. T. (2003). Towards a corpus-based identification of prototypical instances of constructions. Annual Review of Cognitive Linguistics, 1(1), 1–27.

Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89.

Hofman, J. M., Watts, D. J., Athey, S., Garip, F., Griffiths, T. L., Kleinberg, J., Margetts, H., Mullainathan, S., Salganik, M. J., Vazire, S., Vespignani, A., & Yarkoni, T. (2021). Integrating explanation and prediction in computational social science. Nature, 595(7866), 181–188.

Hovy, D., & Yang, D. (2021). The importance of modeling social factors of language: Theory and practice. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (pp. 588–602). Association for Computational Linguistics.

Hu, J., Mahowald, K., Lupyan, G., Ivanova, A., & Levy, R. (2024). Language models align with human judgments on key grammatical constructions. Proceedings of the National Academy of Sciences, 121(36), e2400917121.

Jenset, G. B., McGillivray, B., & Rundell, M. (2018). The dative alternation revisited: Fresh insights from contemporary British spoken data. In V. Brezina, R. Love, & K. Aijmer (Eds.), Corpus approaches to contemporary British speech (pp. 185–208). Routledge.

Mahowald, K., Ivanova, A. A., Blank, I. A., Kanwisher, N., Tenenbaum, J. B., & Fedorenko, E. (2024). Dissociating language and thought in large language models. Trends in Cognitive Sciences, 28(6), 517–540.

Malmsten, M., Börjeson, L., & Haffenden, C. (2020). Playing with words at the National Library of Sweden — Making a Swedish BERT. arXiv, arXiv:2007.01658.

Nguyen, D., Rosseel, L., & Grieve, J. (2021). On learning and representing social meaning in NLP: A sociolinguistic perspective. In K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, & Y. Zhou (Eds.), Proceedings of the 2021 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies (pp. 603–612). Association for Computational Linguistics.

Peters, M. E., Neumann, M., Zettlemoyer, L., & Yih, W. -T. (2018). Dissecting contextual word embeddings: Architecture and representation. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Proceedings of the 2018 conference on empirical methods in Natural Language Processing (pp. 1499–1509). Association for Computational Linguistics.

Pijpops, D., Franco, K., Speelman, D., & Van de Velde, F. (2024). Introduction: What are alternations and how should we study them? Linguistics Vanguard, 10(1), 1–7.

R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. [URL].

Rudinger, R., Naradowsky, J., Leonard, B., & Van Durme, B. (2018). Gender bias in coreference resolution. In M. Walker, H. Ji, & A. Stent (Eds.), Proceedings of the 2018 conference of the North American Chapter of the Association for Computational Linguistics: Human language technologies, Volume 2 (Short Papers) (pp. 8–14). Association for Computational Linguistics.

Salazar, J., Liang, D., Nguyen, T. Q., & Kirchhoff, K. (2020). Masked language model scoring. In D. Jurafsky, J. Chai, N. Schluter, & J. Tetreault (Eds.), Proceedings of the 58th annual meeting of the Association for Computational Linguistics (pp. 2699–2712). Association for Computational Linguistics.

Sundararajan, M., Taly, A., & Yan, Q. (2017). Axiomatic attribution for deep networks. Proceedings of the 34th International Conference on Machine Learning, 70, 3319–3328. https://proceedings.mlr.press/v70/sundararajan17a.html.

Szmrecsanyi, B., Grafmiller, J., Bresnan, J., Rosenbach, A., Tagliamonte, S., & Todd, S. (2017). Spoken syntax in a comparative perspective: The dative and genitive alternation in varieties of English. Glossa: A Journal of General Linguistics, 2(1), 86.

Tagliamonte, S. A., Durham, M., & Smith, J. (2014). Grammaticalization at an early stage: Future be going to in conservative British dialects. English Language and Linguistics, 18(1), 75–108.

Tenney, I., Das, D., & Pavlick, E. (2019). BERT rediscovers the classical NLP pipeline. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th annual meeting of the Association for Computational Linguistics (pp. 4593–4601). Association for Computational Linguistics.

Theijssen, D., Ten Bosch, L., Boves, L., Cranen, B., & Van Halteren, H. (2013). Choosing alternatives: Using Bayesian networks and memory-based learning to study the dative alternation. Corpus Linguistics and Linguistic Theory, 9(2), 227–262.

Van den Bosch, A., & Bresnan, J. (2015). Modeling dative alternations of individual children. In R. Berwick, A. Korhonen, A. Lenci, T. Poibeau, & A. Villavicencio (Eds.), Proceedings of the sixth workshop on cognitive aspects of computational language learning (pp. 103–112). Association for Computational Linguistics.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, Volume 30 (pp. 5998–6008). NeurIPS. [URL].

Williams, R. S. (1994). A statistical analysis of English double object alternation. Issues in Applied Linguistics, 5(1), 37–58.

Wolk, C., Bresnan, J., Rosenbach, A., & Szmrecsanyi, B. (2013). Dative and genitive variability in Late Modern English: Exploring cross-constructional variation and change. Diachronica, 30(3), 382–419.