Applying computing innovations to bilingual corpus analysis
Diana Carter | University of British Columbia | Centre for Research on Bilingualism, Bangor University
Mirjam Broersma | Centre for Language Studies, Radboud University | Max Planck Institute for Psycholinguistics
Kevin Donnelly | Centre for Research on Bilingualism, Bangor University
With current innovations in corpus analysis, it is now possible to extract and analyze large amounts of monolingual and bilingual data in minutes, as opposed to the many hours previously needed to manually analyze a much smaller amount of data. In this chapter, we review innovative techniques in bilingual corpus building and analysis, including automated glossing, which allows data to be extracted and then analyzed statistically using mixed-effects models. We discuss the application of these techniques, among others, and provide examples from three bilingual corpora. We end by suggesting how researchers may benefit from the increasingly powerful computing capability that is now available.
Article outline
- 1. Introduction
- 2. Triggered codeswitching
- 3. The Miami, Patagonia, and Siarad corpora
- 3.1 Participants
- 3.2 Transcription
- 4. Automatic glossing
- 5. Data preparation
- 6. Data analysis and results
- 6.1 Data analysis
- 6.2 Results
- 7. Tips and tricks for processing corpus data
- 8. Conclusions
- Acknowledgements
- Notes
- References