This report introduces the University of Pittsburgh English Language Institute Corpus (PELIC;
Juffs et al., 2020), a publicly available 4.2-million-word learner corpus of
written texts. Collected over seven years in the University of Pittsburgh’s Intensive English Program, these texts were produced
by more than 1,100 students with diverse linguistic backgrounds and proficiency levels. Unlike most learner corpora, which are
cross-sectional, PELIC is longitudinal, offering greater opportunities for tracking development in a natural classroom setting.
This potential is illustrated in an overview of the research conducted to date with these data. The report also provides a
description of PELIC’s creation and contents, including how the texts have been managed to facilitate natural language processing.
Overall, the corpus contributes to the field of learner corpus research by adding to the pool of freely and publicly available
learner corpora, supplemented by a useful set of Python tools and tutorials for accessing these data.
Bird, S., Loper, E., & Klein, E. (2009). Natural language processing with Python. O’Reilly Media.
Blanchard, D., Tetreault, J., Higgins, D., Cahill, A., & Chodorow, M. (2014). ETS Corpus of Non-Native Written English LDC2014T06. Linguistic Data Consortium.
Callies, M. (2015). Learner corpus methodology. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 35–56). Cambridge University Press.
Centre for English Corpus Linguistics. (2021a). Longitudinal Database of Learner English (LONGDALE). Université catholique de Louvain. [URL]
Centre for English Corpus Linguistics. (2021b). Learner corpora around the world. Université catholique de Louvain. [URL]
Davies, M. (2008–). The Corpus of Contemporary American English (COCA): 560 million words, 1990-present. [URL]
Dunlap, S. (2012). Orthographic quality in English as a second language (Unpublished doctoral dissertation). University of Pittsburgh.
Etaiwi, W., & Naymat, G. (2017). The impact of applying different preprocessing steps on review spam detection. Procedia Computer Science, 113(1), 273–279.
Gablasova, D., Brezina, V., & McEnery, T. (2017). Exploring learner language through corpora: Comparing and interpreting corpus frequency information. Language Learning, 67(1), 130–154.
Gilquin, G. (2015). From design to collection of learner corpora. In S. Granger, G. Gilquin, & F. Meunier (Eds.), The Cambridge handbook of learner corpus research (pp. 9–34). Cambridge University Press.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). The International Corpus of Learner English. Version 3. Presses universitaires de Louvain. [URL]
Honnibal, M. (2013). A good part-of-speech tagger in about 200 lines of Python. Explosion. [URL]
Juffs, A. (2020). Aspects of language development in an intensive English program. Routledge.
Juffs, A., & Han, N.-R. (2019, March 12). Combining formal and usage-based theories with data science techniques in measuring the development of syntactic complexity in written production. Paper presented at the International Conference of the American Association of Applied Linguistics, Atlanta, GA.
Juffs, A., Han, N.-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set].
Leńko-Szymańska, A. (2019). Defining and assessing lexical proficiency. Routledge.
Marcus, M. P., Santorini, B., Marcinkiewicz, M. A., & Taylor, A. (1999). Treebank-3 LDC99T42 [Web Download]. Linguistic Data Consortium. [URL]
Meunier, F. (2016). Introduction to the LONGDALE Project. In E. Castello, K. Ackerley, & F. Coccetta (Eds.), Studies in learner corpus linguistics. Research and applications for foreign language teaching and assessment (pp. 123–126). Peter Lang.
Naismith, B., Han, N.-R., Juffs, A., Hill, B. L., & Zheng, D. (2018). Accurate measurement of lexical sophistication with reference to ESL learner data. In K. E. Boyer & M. Yudelson (Eds.), Proceedings of the 11th International Conference on Educational Data Mining (pp. 259–265).
Naismith, B., & Juffs, A. (2021). Finding the sweet spot: Learners’ productive knowledge of mid-frequency lexical items. Language Teaching Research.
Nation, I. S. P. (2013). Learning vocabulary in another language (2nd ed.). Cambridge University Press.
Tidball, F., & Treffers-Daller, J. (2008). Analysing lexical richness in French learner language: What frequency lists and teacher judgements can tell us about basic and advanced words. Journal of French Language Studies, 18(3), 299–313.
van Rooy, B., & Schäfer, L. (2009). The effect of learner errors on POS tag errors during automatic POS tagging. Southern African Linguistics and Applied Language Studies, 20(4), 325–335.
Vercellotti, M. L. (2017). The development of complexity, accuracy and fluency in second language performance. Applied Linguistics, 38(1), 90–111.
Vercellotti, M. L., Juffs, A., & Naismith, B. (2021). Multiword sequences in L2 English language learners’ speech: The relationship between trigrams and lexical variety across development. System, 98(1).