This report outlines the development of a new corpus, which was created by refining and modifying the largest open-access L2 English learner database – the EFCAMDAT. The extensive data-curation process, which can inform the development and use of other corpora, included procedures such as converting the database from XML to a tabular format, and removing problematic markup tags and non-English texts. The final dataset contains two corresponding samples, written by similar learners in response to different prompts, which represents a unique research opportunity when it comes to analyzing task effects and conducting replication studies. Overall, the resulting corpus contains ~406,000 texts in the first sample and ~317,000 texts in the second sample, written by learners representing diverse L1s and a large range of L2 proficiency levels.
Alexopoulou, T., Michel, M., Murakami, A., & Meurers, D. (2017). Task effects on linguistic complexity and accuracy: A large-scale learner corpus analysis employing natural language processing techniques. Language Learning, 67(S1), 180–208.
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–56). Cambridge: Cambridge University Press.
Feinerer, I., & Hornik, K. (2018). tm: Text Mining Package. Retrieved from [URL]
Geertzen, J., Alexopoulou, T., Baker, R., Hendriks, H., Jiang, S., & Korhonen, A. (2013). The EF Cambridge Open Language Database (EFCAMDAT). User Manual Part I: Written Production. Retrieved from [URL]
Geertzen, J., Alexopoulou, T., & Korhonen, A. (2014). Automatic linguistic annotation of large scale L2 databases: The EF-Cambridge Open Language Database (EFCamDat). In R. T. Millar, K. I. Martin, C. M. Eddington, A. Henery, N. M. Miguel, & A. Tseng (Eds.), Selected proceedings of the 2012 Second Language Research Forum (pp. 240–254). Somerville, MA: Cascadilla Proceedings Project.
Grün, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30.
Huang, Y., Geertzen, J., Baker, R., Korhonen, A., & Alexopoulou, T. (2017). The EF Cambridge Open Language Database (EFCAMDAT): Information for users (pp. 1–18). Retrieved from [URL]
Huang, Y., Murakami, A., Alexopoulou, T., & Korhonen, A. (2018). Dependency parsing of learner English. International Journal of Corpus Linguistics, 23(1), 28–54.
Kaliyaperumal, S. K., Kuppusamy, M., Arumugam, S., Kannan, K. S., Manoj, K., & Arumugam, S. (2015). Labeling methods for identifying outliers. International Journal of Statistics and Systems, 10(2), 231–238.
Lang, D. T.. (2020). XML: Tools for parsing and generating XML within R and S-Plus. Retrieved from [URL]
McEnery, T., Brezina, V., Gablasova, D., & Banerjee, J. (2019). Corpus linguistics, learner corpora, and SLA: Employing technology to analyze language use. Annual Review of Applied Linguistics, 391, 74–92.
Murakami, A. (2013). Individual variation and the role of L1 in the L2 development of English grammatical morphemes: Insights from learner corpora (Unpublished doctoral dissertation). Cambridge University.
Murakami, A. (2016). Modeling systematicity and individuality in nonlinear second language development: The case of English grammatical morphemes. Language Learning, 66(4), 834–871.
Ooms, J. (2018). cld2: Google’s compact language detector 2 (Version 1.2). R package. Retrieved from [URL]
Shatz, I. (2019). How native language and L2 proficiency affect EFL learners’ capitalisation abilities: A large-scale corpus study. Corpora, 14(2), 173–202.
Van der Loo, M. P. J. (2014). The stringdist package for approximate string matching. The R Journal, 6(1), 111–122. Retrieved from [URL]
Wickham, H., François, R., Henry, L., Müller, K., & RStudio. (2019). dplyr: A grammar of data manipulation. Retrieved from [URL]
Wickham, H., & RStudio. (2019). stringr: Simple, consistent wrappers for common string operations. Retrieved from [URL]
Cited by (9)
Cited by nine other publications
Alzahrani, Alaa & Lawrence Jun Zhang
2024. Utility of Kolmogorov complexity measures: Analysis of L2 groups and L1 backgrounds. PLOS ONE 19:4 ► pp. e0301806 ff.
Derkach, Kateryna & Theodora Alexopoulou
2024. Definite and indefinite article accuracy in learner English: A multifactorial analysis. Studies in Second Language Acquisition 46:3 ► pp. 710 ff.
Eguchi, Masaki & Kristopher Kyle
2024. Building custom NLP tools to annotate discourse-functional features for second language writing research: A tutorial. Research Methods in Applied Linguistics 3:3 ► pp. 100153 ff.
2024. The potential influence of cross-linguistic lexical similarity on lexical diversity in L2 English writing. Corpora 19:2 ► pp. 131 ff.
Du, Xiangtao, Muhammad Afzaal & Hind Al Fadda
2022. Collocation Use in EFL Learners’ Writing Across Multiple Language Proficiencies: A Corpus-Driven Study. Frontiers in Psychology 13
Hnatkowska, Bogumila & Damian Wawrzyniak
2022. Proficiency Level Classification of Foreign Language Learners Using Machine Learning Algorithms and Multilingual Models. In Computational Collective Intelligence [Lecture Notes in Computer Science, 13501], ► pp. 261 ff.
Murakami, Akira & Nick C. Ellis
2022. Effects of Availability, Contingency, and Formulaicity on the Accuracy of English Grammatical Morphemes in Second Language Writing. Language Learning 72:4 ► pp. 899 ff.
This list is based on CrossRef data as of 6 january 2025. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers.
Any errors therein should be reported to them.