Article in: International Journal of Corpus Linguistics: Online-First Articles
Not all linguistic variation is equally predictable
Evidence from two case studies with large language models
This content is being prepared for publication; it may be subject to changes.
Abstract
We compare the extent to which the choice of a variant can be predicted from language-internal factors for two
linguistic variables: the English dative alternation and the omission of the infinitival marker att in a Swedish
future-tense construction. Previous research has shown that for the dative alternation, near-ceiling performance can be achieved
by fitting a regression model with manually selected predictors. Similar attempts for att-omission have been
unsuccessful. To test whether the two variables differ in predictability or whether optimal methods have not been found for
att-omission, we apply a large language model (LLM) to the same task. For the dative alternation, the LLM and
regression perform equally well. For att-omission, the LLM outperforms regression but still performs worse than for
the dative alternation, suggesting that att-omission is inherently less predictable. We argue that LLMs can
be useful for estimating how much the choice of a variant depends on language-internal factors.
Keywords: language variation, English, Swedish, large language models, prediction
Article outline
- 1. Introduction
- 2. Linguistic variation and large language models
- 2.1 Predicting variation
- 2.2 Predicting the dative alternation and att-omission
- 2.3 Large language models
- 3. Methodology
- 3.1 Metrics and baseline
- 3.2 BERT: Basic properties
- 3.3 The task of predicting the choice of a variant
- 3.3.1 BERT without fine-tuning: Sentence scoring
- 3.3.2 BERT with fine-tuning: Binary idiomaticity task
- 3.4 Technical details
- 4. The dative alternation study
- 4.1 The dataset
- 4.2 Predicting with logistic regression
- 4.3 Predicting with BERT
- 4.4 Interim conclusions
- 5. The att-omission study
- 5.1 The datasets
- 5.2 Predicting with logistic regression
- 5.3 Predicting with BERT
- 5.4 Additional features: Time and inter-individual variation
- 5.4.1 Time
- 5.4.2 Inter-individual variation
- 6. Discussion
- 7. Conclusions
- Acknowledgments
- Notes
- References