Applying computing innovations to bilingual corpus analysis
Diana Carter | University of British Columbia | Centre for Research on Bilingualism, Bangor University
Mirjam Broersma | Centre for Language Studies, Radboud University | Max Planck Institute for Psycholinguistics
Kevin Donnelly | Centre for Research on Bilingualism, Bangor University
Current innovations in corpus analysis make it possible to extract and analyze large amounts of monolingual and bilingual data in minutes, rather than the many hours previously needed to analyze a much smaller amount of data by hand. In this chapter, we review innovative techniques in bilingual corpus building and analysis, including automated glossing, which allows the extraction of data that can then be analyzed statistically using mixed-effects models. We discuss the application of these and other techniques and provide examples from three bilingual corpora. We end by suggesting how researchers may benefit from the increasingly powerful computing capability now available.
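As a rough illustration of the extraction step summarized above, the following minimal sketch turns a sequence of language-tagged words into one row per word with a binary codeswitch indicator, grouped by speaker. This is the general tabular shape that mixed-effects models with speaker as a grouping factor typically expect. The `Token` structure, the language tags `cym` and `eng`, the helper names, and the sample words are all assumptions invented for illustration; they are not the output format of any particular autoglosser, and the data are not drawn from the corpora discussed in the chapter.

```python
from collections import Counter
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical language-tagged word (not a real autoglosser format)."""
    speaker: str  # assumed speaker ID
    word: str     # surface form
    lang: str     # assumed language tag, e.g. "cym" or "eng"


def switch_rows(tokens):
    """Return one row per word, skipping each speaker's first word.

    Each row records whether the word's language tag differs from the
    tag of the same speaker's previous word (switch = 1) or not (0).
    """
    last_lang = {}
    rows = []
    for tok in tokens:
        prev = last_lang.get(tok.speaker)
        if prev is not None:
            rows.append({
                "speaker": tok.speaker,
                "word": tok.word,
                "switch": int(tok.lang != prev),
            })
        last_lang[tok.speaker] = tok.lang
    return rows


def switch_counts(rows):
    """Count codeswitches per speaker from the rows above."""
    counts = Counter()
    for row in rows:
        counts[row["speaker"]] += row["switch"]
    return counts


# Invented toy data, for illustration only.
tokens = [
    Token("A", "mae", "cym"),
    Token("A", "hi", "cym"),
    Token("A", "yn", "cym"),
    Token("A", "nice", "eng"),
    Token("B", "yes", "eng"),
    Token("B", "iawn", "cym"),
]

rows = switch_rows(tokens)
counts = switch_counts(rows)
for speaker in sorted(counts):
    print(speaker, counts[speaker])
```

Keeping one row per word, rather than aggregating early, preserves the word-level and speaker-level information that a later statistical model would need; the per-speaker counts are only a quick sanity check on the extracted rows.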
Article outline
- 1. Introduction
- 2. Triggered codeswitching
- 3. The Miami, Patagonia, and Siarad corpora
- 3.1 Participants
- 3.2 Transcription
- 4. Automatic glossing
- 5. Data preparation
- 6. Data analysis and results
- 6.1 Data analysis
- 6.2 Results
- 7. Tips and tricks for processing corpus data
- 8. Conclusions
- Acknowledgements
- Notes
- References