Special issue articles
The challenges and benefits of annotating oral bilingual corpora
The Spanish in Texas Corpus Project
This article describes efforts to collect, process, and automatically annotate a corpus of Spanish as spoken in Texas. It sets out the protocols used to develop the corpus and the procedures for automatic annotation, illustrating the common pitfalls of language identification in bilingual corpora and potential methods for circumventing them. The benefits of a comparative corpus approach to contact varieties are illustrated by a case study of a putative verbal calque from the Spanish in Texas data. The case study shows that the relative frequency of the verb is much higher in Texas than in its source Mexican variety and that the verb selects different complements in Texas than it does in other varieties. The article concludes with a discussion of how computational tools might be fruitfully exploited to resolve long-standing debates about language variation in contact settings.
Article outline
- 1. Introduction
- 2. The Spanish in Texas Corpus
- 2.1 Protocols for developing the Corpus
- 2.2 Procedures for annotation
- 3. The benefits of a corpus approach to contact phenomena
- 4. Case Study: Is an innovation contact induced or internally motivated?
- 4.1 Detection of the potentially innovative uses of the verb
- 4.2 Possibilities and limitations of a computational approach to calques
- 5. Discussion and conclusion
- Notes
- References
Cited by (2)
Alvero, AJ & Rebecca Pattichis. 2024. Multilingualism and mismatching: Spanish language usage in college admissions essays. Poetics 105. pp. 101903 ff.
Parra, María Luisa & Ellen J Serafini. 2021. “Bienvenidxs todes”: el lenguaje inclusivo desde una perspectiva crítica para las clases de español. Journal of Spanish Language Teaching 8:2. pp. 143 ff.
This list is based on CrossRef data as of 25 October 2024 and may not be complete. The sources presented here were supplied by the respective publishers; any errors should be reported to them.