Chapter 4
PoS-tagging a Spanish oral learner corpus
Criteria, procedure, and a sample analysis
This chapter explains the methodology followed to part-of-speech (PoS) tag the Spanish oral learner corpus CORELE (Corpus Oral de Español como Lengua Extranjera; Campillos Llanos 2014). The data consist of forty interviews with learners at lower intermediate level from more than nine mother tongue (L1) backgrounds, and four interviews with native speakers (control group). The annotation was performed with the GRAMPAL tagger (Moreno & Guirao 2006). The learner corpus amounts to 52,759 lexical units (LUs) and the native corpus to 8,643 LUs. The corpus search interface is available online and allows the user to explore learners’ interlanguage by searching the data by word form, lemma, L1, and/or proficiency level. I present a sample study of learners’ production of articles, following the Contrastive Interlanguage Analysis approach (Granger 1996).
Article outline
- 1. Introduction
- 2. A brief overview of previous work
- 2.1 Part of Speech tagging learner corpora
- 2.2 Studies on articles in learner Spanish
- 3. Methodology
- 3.1 Corpus data
- 3.2 Part-of-Speech (PoS) tagging
- 3.3 Count of lexical units
- 3.4 The corpus interface
- 4. A sample analysis of learners’ production of Spanish articles
- 5. Discussion
- 6. Conclusions
- Acknowledgments
- Notes
- References
References
Aarts, J. & Granger, S.
1998 Tag sequences in learner corpora: A key to interlanguage grammar and discourse. In
Learner English on Computer,
S. Granger (ed.), 132–141. London: Addison Wesley Longman.
Bickerton, D.
1981 Roots of Language. Ann Arbor MI: Karoma Press.
Bley-Vroman, R.
1983 The comparative fallacy in interlanguage studies: The case of systematicity.
Language Learning 33: 1–17.
Brucart, J.M.
2012 La adquisición del artículo: Flujo informativo y cohesión discursiva. Presentation held at the XI Encuentro de Profesores de ELE, Barcelona, 21 December (2012).
[URL] (14 April 2015).
Campillos Llanos, L.
2012a Designing a search interface for a Spanish learner spoken corpus: the end-user’s evaluation. In
Proc. of LREC 2012, 23–25 May 2012, Istanbul (Turkey),
N. Calzolari,
K. Choukri,
T. Declerck,
M. Uğur Doğan,
B. Maegaard,
J. Mariani,
J. Odijk &
S. Piperidis (eds), 241–248. Paris: ELRA.
Campillos Llanos, L.
2012b La expresión oral en español como lengua extranjera: interlengua y análisis de errores basado en corpus. Unpublished PhD dissertation, Universidad Autónoma de Madrid.
Campillos Llanos, L.
2014 A Spanish learner oral corpus for computer aided error analysis.
Corpora 9 (2): 207–238.
Corder, P.
1971 Idiosyncratic dialects and error analysis.
International Review of Applied Linguistics in Language Teaching 9 (2): 147–160.
Council of Europe
2001 Common European Framework of Reference for Languages: Learning, teaching, assessment. Cambridge: CUP.
Dagneaux, E., Denness, S. & Granger, S.
1998 Computer-aided error analysis.
System 26 (2): 163–174.
Díaz-Negrillo, A., Meurers, D., Valera, S. & Wunsch, H.
2010 Towards interlanguage POS annotation for effective learner corpora in SLA and FLT.
Language Forum 36 (1–2): 139–154.
Special Issue on
Corpus Linguistics for Teaching and Learning. In Honour of John Sinclair,
M. Moreno Jaén &
C. Pérez Basanta (eds).
Díaz-Negrillo, A. & Thompson, P.
2013 Learner corpora: Looking towards the future. In
Automatic Treatment and Analysis of Learner Corpus Data [Studies in Corpus Linguistics 59],
A. Díaz-Negrillo,
N. Ballier &
P. Thompson (eds), 9–30. Amsterdam: John Benjamins.
Dickinson, M. & Ragheb, M.
2009 Dependency annotation for learner corpora. In
Proc. of the 8th International Workshop on Treebanks and Linguistic Theories (TLT-8),
M. Passarotti,
A. Przepiórkowski,
S. Raynaud &
F. Van Eynde (eds), 59–70.
Dickinson, M. & Ragheb, M.
2011 Dependency Annotation of Coordination for Learner Language. In
Proc. of the International Conference on Dependency Linguistics (Depling 2011), 5–7 September 2011, Barcelona (Spain), 135–144.
Díez-Bedmar, M.B. & Papp, S.
2008 The use of the English article system by Chinese and Spanish learners. In
Linking up Contrastive and Learner Corpus Research,
G. Gilquin,
S. Papp &
M.B. Díez-Bedmar (eds), 147–175. Amsterdam: Rodopi.
Díez-Bedmar, M.B. & Pérez Paredes, P.
Dryer, M.S.
2013a Definite Articles. In
The World Atlas of Language Structures Online,
M.S. Dryer &
M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology.
[URL]
Dryer, M.S.
2013b Indefinite articles. In
The World Atlas of Language Structures Online,
M.S. Dryer &
M. Haspelmath (eds). Leipzig: Max Planck Institute for Evolutionary Anthropology.
[URL]
Fernández, S.
1990 Análisis de errores e interlengua en el aprendizaje del español como lengua extranjera. PhD dissertation, Universidad Complutense. Published as:
Interlengua y análisis de errores en el aprendizaje del español como lengua extranjera (1997) Madrid: Edelsa.
Fitzpatrick, E. & Seegmiller, M.S.
2004 The Montclair Electronic Language Database Project. In
Applied Corpus Linguistics: A Multidimensional Perspective,
U. Connor &
T.A. Upton (eds), 223–237. Amsterdam: Rodopi.
Gaillat, T.
2013 This and that in native and learner English: From typology of use to tagset characterisation. In
Twenty Years of Learner Corpus Research: Looking Back, Moving Ahead [Corpora and Language in Use 1],
S. Granger,
G. Gilquin &
F. Meunier (eds), 167–177. Louvain-la-Neuve: Presses universitaires de Louvain.
Gaillat, T., Sébillot, P. & Ballier, N.
2014 Automated classification of unexpected uses of this and that in a learner corpus of English. In
Recent Advances in Corpus Linguistics: Developing and Exploiting Corpora,
L. Vandelanotte,
K. Davidse,
C. Gentens &
D. Kimps (eds), 309–324. Amsterdam: Rodopi.
García-Mayo, P. & Hawkins, R.
Godenzzi, J.C.
1995 The Spanish language in contact with Quechua and Aymara: The use of the article. In
Spanish in Four Continents: Studies in Language Contact and Bilingualism,
C. Silva-Corvalán (ed.), 101–116. Washington DC: Georgetown University Press.
Goitia, L.
2007 Un estudio del uso del artículo definido por parte de estudiantes estadounidenses de español como lengua extranjera mediante un inventario de frases correctas e incorrectas.
Interlingüística 17: 409–418.
Goldsmith, J.
2007 Probability for linguists.
Mathématiques et Sciences Humaines. Mathematics and Social Sciences 180: 73–98.
Granger, S.
1996 From CA to CIA and back. An integrated approach to computerized bilingual and learner corpora. In
Languages in Contrast,
K. Aijmer,
B. Altenberg &
M. Johansson (eds), 37–51. Lund: Lund University Press.
Granger, S. & Rayson, P.
1998 Automatic profiling of learner texts. In
Learner English on Computer,
S. Granger (ed.), 119–131. London: Addison Wesley Longman.
Granger, S.
2004 Computer learner corpus research: Current status and future prospects. In
Applied Corpus Linguistics. A Multidimensional Perspective,
U. Connor &
T.A. Upton (eds), 123–145. Amsterdam: Rodopi.
Granger, S., Kraif, O., Ponton, C., Antoniadis, G. & Zampa, V.
2007 Integrating learner corpora and natural language processing: A crucial step towards reconciling technological sophistication and pedagogical effectiveness.
ReCALL 19: 252–268.
Granger, S., Dagneaux, E., Meunier, F. & Paquot, M.
2009 The International Corpus of Learner English, Version 2. Handbook and CD-ROM. Louvain-la-Neuve: Presses universitaires de Louvain.
Hasselgård, H. & Johansson, S.
Hawkins, J.A.
1978 Definiteness and Indefiniteness: A Study in Reference and Grammaticality Prediction. London: Croom Helm.
Hawkins, J.A.
1991 On (in)definite articles: Implicatures and (un)grammaticality prediction.
Journal of Linguistics 27: 405–442.
Hirschmann, H., Lüdeling, A., Rehbein, I., Reznicek, M. & Zeldes, A.
2013 Underuse of syntactic categories in Falko. A case study on modification. In
Twenty Years of Learner Corpus Research: Looking Back, Moving Ahead,
S. Granger,
G. Gilquin &
F. Meunier (eds), 223–234. Louvain-la-Neuve: Presses universitaires de Louvain.
Huebner, T.
1983 A Longitudinal Analysis of the Acquisition of English. Ann Arbor MI: Karoma.
Ionin, T.
2003 Article Semantics in Second Language Acquisition. PhD dissertation, MIT.
Jarvis, S.
2000 Methodological rigor in the study of transfer: Identifying L1 influence in the interlanguage lexicon.
Language Learning 50 (2): 245–309.
Jie, S.
2012 El artículo en la enseñanza de ELE. Estudiantes de origen chino. PhD dissertation, Universidad de Barcelona.
Krivanek, J. & Meurers, D.
2011 Comparing rule-based and data-driven dependency parsing of learner language. In
Proc. of the International Conference on Dependency Linguistics (Depling 2011), 5–7 September 2011, Barcelona (Spain), 310–317.
Laca, B.
1999 Presencia y ausencia de determinante. In
Gramática descriptiva de la lengua Española, Vol. 1,
I. Bosque &
V. Demonte (eds), 891–928. Madrid: Espasa Calpe, S.A.
Leonetti, M.
1999 El artículo. In
Gramática descriptiva de la lengua Española, Vol. 1,
I. Bosque &
V. Demonte (eds), 787–890. Madrid: Espasa Calpe, S.A.
Li, C.N. & Thompson, S.A.
1990 Chinese. In
The World’s Major Languages, Chapter 41,
B. Comrie (ed.), 811–833. Oxford: OUP.
Lin, T-J.
2005 La adquisición y el uso del artículo por alumnos chinos. PhD dissertation, Universidad de Alcalá.
Lu, H.-C.
1997 El uso del artículo en español: Errores e implicaciones pedagógicas. In
Actas del VIII Congreso Internacional de ASELE,
K. Alonso,
F.M. Fernández &
M.G. Bürmann (eds), 519–525. Alcalá de Henares: Universidad de Alcalá.
Lu, H.-C. & Hsueh, L. L.
2012 Estudio del uso del artículo a partir de un corpus paralelo de aprendices, CPATEI.
Revista de Lingüística y Lenguas Aplicadas 7: 193–202.
Lüdeling, A., Zeldes, A., Reznicek, M., Rehbein, I. & Hirschmann, H.
2010 Syntactic misuse, overuse and underuse: A study of a parsed learner corpus and its target hypothesis. In
Proc. of the 9th International Workshop on Treebanks and Linguistic Theories, 3–4 December 2010, University of Tartu (Estonia),
M. Dickinson,
K. Müürisep &
M. Passarotti (eds), 1–4.
MacWhinney, B.
2000 The CHILDES Project: Tools for Analyzing Talk, 3rd edn. Mahwah NJ: Lawrence Erlbaum Associates.
McEnery, T., Xiao, R. & Tono, Y.
2006 Corpus-based Language Studies. An Advanced Resource Book. London: Routledge.
Mendikoetxea, A.
2013 Corpus-based research in second language Spanish. In
The Handbook of Spanish Second Language Acquisition,
K.L. Geeslin (ed.), 11–29. Hoboken NJ: Wiley-Blackwell.
Meurers, D.
2015 Learner corpora and natural language processing. In
The Cambridge Handbook of Learner Corpus Research,
S. Granger,
G. Gilquin &
F. Meunier (eds). Cambridge: CUP.
Milton, J. & Tsang, E.
1993 A corpus-based study of logical connectors in EFL students’ writing: Directions for future research. In
Studies in Lexis. Proc. of a Seminar on Lexis Organized by the Language Centre of the HKUST, 6–7 July 1992, Hong Kong,
R. Pemberton &
E. Tsang (eds), 215–246. Hong Kong: Language Centre, HKUST.
Mitchell, R., Domínguez, L., Arche, M.J., Myles, F. & Marsden, E.
2008 SPLLOC: A new database for Spanish second language acquisition research. In
EUROSLA Yearbook 8,
L. Roberts,
F. Myles &
A. David (eds), 287–304. Amsterdam: John Benjamins.
de Mönnink, I.
2000 Parsing a learner corpus? In
Corpus Linguistics and Linguistic Theory,
C. Mair &
M. Hundt (eds), 81–90. Amsterdam: Rodopi.
Moreno, A. & Guirao, J. M.
Morimoto, Y.
2011 El artículo en español. Madrid: Castalia.
Myles, F.
2005 Interlanguage corpora and second language acquisition research.
Second Language Research 21 (4): 373–391.
Ott, N. & Ziai, R.
2010 Evaluating dependency parsing performance on German learner language. In
Proc. of the 9th International Workshop on Treebanks and Linguistic Theories, 3–4 December 2010, University of Tartu (Estonia),
M. Dickinson,
K. Müürisep &
M. Passarotti (eds), 175–186.
Pęzik, P.
2012 Towards the PELCRA Learner English Corpus. In
Corpus Data across Languages and Disciplines [Lodz Studies in Language 28],
P. Pęzik (ed.), 33–42. Frankfurt: Peter Lang.
Ragheb, M. & Dickinson, M.
2011 Avoiding the comparative fallacy in the annotation of learner corpora. In
Selected Proceedings of the 2010 Second Language Research Forum,
G. Granena,
J. Koeth,
S. Lee-Ellis,
A. Lukyanchenko,
G. Prieto Botana &
E. Rhoades (eds), 114–124. Somerville MA: Cascadilla Proceedings Project.
Ramírez-Mayberry, M.
1998 The acquisition of the Spanish definite articles by English-speaking learners of Spanish.
Texas Papers on Foreign Language Education 3 (5): 1–57.
Rastelli, S.
2006 ISA 0.9 – Written Italian of Americans: Syntactic and semantic tagging of verbs in a learner corpus.
Studi Italiani di Linguistica Teorica e Applicata (SILTA) 1: 73–100.
Rastelli, S.
2009 Learner corpora without error tagging.
Linguistik Online 38 (2).
[URL]
Reznicek, M., Lüdeling, A. & Hirschmann, H.
van Rooy, B. & Schäfer, L.
2002 The effect of learner errors on POS tag errors during automatic POS tagging.
Southern African Linguistics and Applied Language Studies 20 (4): 325–335.
Rosen, A., Hana, J., Štindlová, B., Škodová, S. & Feldman, A.
2014 Evaluating and automating the annotation of a learner corpus.
Language Resources and Evaluation 48: 65–92.
Rosén, V. & De Smedt, K.
2010 Syntactic annotation of learner corpora. In
Systematisk, variert, men ikke tilfeldig,
H. Johansen,
A. Golden,
J.E. Hagen &
A.-K. Helland (eds), 120–132. Oslo: Novus forlag.
Said-Mohand, A.
2007 La adquisición del artículo definido: Evidencia oral y escrita.
RedELE 10: 1–15.
Santos, I.
1991 La enseñanza de segundas lenguas. Análisis de errores en la expresión escrita de estudiantes de español cuya lengua nativa es el serbo-croata. PhD dissertation. Madrid, Universidad Complutense.
Scott, M.
2012 WordSmith Tools. Liverpool: Lexical Analysis Software.
Seco, M., Andrés, O. & Ramos, G.
1999 Diccionario del español actual. Madrid: Aguilar.
Snape, N.
2009 Exploring Mandarin Chinese speakers’ L2 article use. In
Representational Deficits in SLA: Studies in Honor of Roger Hawkins [Language Acquisition and Language Disorders 47],
N. Snape,
Y-K. I. Leung &
M. S. Smith (eds), 27–52. Amsterdam: John Benjamins.
Tarrés, I.
2002 El uso del artículo por estudiantes polacos de ELE. MA dissertation, Universidad de Barcelona.
[URL]
Tenfjord, K., Meurer, P. & Hofland, K.
2006 The ASK corpus: A language learner corpus of Norwegian as a second language. In
Proc. of the 5th International Language Resources and Evaluation Conference, 22–28 May, Genova (Italy), 1821–1824.
Thouësny, S.
2011 Increasing the reliability of a part-of-speech tagging tool for use with learner language. In
Proc. from Pre-conference (AALL’09) Workshop on Automatic Analysis of Learner Language, Arizona State University, Tempe, AZ.
Tono, Y.
2000 A corpus-based analysis of interlanguage development: analysing POS tag sequences of EFL learner corpora. In
Practical Applications in Language Corpora,
B. Lewandowska-Tomaszczyk &
P.J. Melia (eds), 323–343. Frankfurt: Peter Lang.
Tono, Y.
2002 The Role of Learner Corpora in SLA Research and Foreign Language Teaching: The Multiple Comparison Approach. PhD dissertation, University of Lancaster.
Valverde, M.P. & Ohtani, A.
2014 Annotating article errors in Spanish learner texts: design and evaluation of an annotation scheme. In Proc. of the 28th Pacific Asia Conference on Language, Information and Computation (PACLIC), 12–14 December 2014, Phuket (Thailand),
W. Aroonmanakun,
T. Supnithi &
P. Boonkwan (eds), 234–243.
Vázquez, G.
1991 Análisis de errores y aprendizaje de español/lengua extranjera [Studia Romanica et Linguistica 25]. Frankfurt: Peter Lang.
Zeldes, A., Ritz, J., Lüdeling, A. & Chiarcos, C.
2009 ANNIS: A search tool for multi-layer annotated corpora. In Proc. of the 5th International Corpus Linguistics Conference 2009, 20–23 July 2009, Liverpool (United Kingdom),
M. Mahlberg,
V. González-Díaz &
C. Smith (eds). Liverpool: University of Liverpool.
Cited by
Cited by 1 other publication
Spina, Stefania, Irene Fioravanti, Luciana Forti & Fabio Zanda
2023.
The CELI corpus: Design and linguistic annotation of a new online learner corpus.
Second Language Research
This list is based on CrossRef data as of 21 March 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers.
Any errors therein should be reported to them.