Automatic analysis of caregiver input and child production
Insight into corpus-based research on child language development in Korean
The present study explores the applicability of Natural Language Processing (NLP) techniques to the analysis of child
corpora in Korean. We use caregiver input and child production data from the CHILDES database, currently the largest
open-access collection of Korean child corpus data, and apply NLP techniques to the data in two ways: automatic Part-of-Speech
tagging by adapting a machine learning algorithm, and (semi-)automatic extraction of constructional patterns expressing a
transitive event (active transitive and suffixal passive). As the first empirical report on NLP-assisted analysis of Korean
child corpora, this study is expected to reveal the advantages and drawbacks of such analysis, thereby opening the way for
further corpus-based research on child language development in Korean. The findings also have implications for research
practice in corpus-based developmental studies on Korean, in particular for ensuring the reproducibility of procedures and
results, which has often been lacking in previous corpus-based research on child language development in Korean.
Article outline
- 1. Introduction
- 2. Research on child corpora in Korean
- 3. Towards automatic processing of child corpora: POS tagging
- 3.1 Issues with POS tagging in Korean
- 3.2 Developing a POS tagger for Korean child corpora
- 3.2.1 Pre-processing
- 3.2.2 Machine learning algorithm for POS tagging: Perceptron
- 3.2.3 Model performance
- 3.3 Results and discussion
- 4. Towards automatic processing of child corpora: Construction identification
- 4.1 Challenges in automatic processing of active transitives and suffixal passives in Korean
- 4.2 Construction identification: Caregiver input
- 4.3 Construction identification: Child production
- 4.4 Model performance
- 4.5 Results and discussion
- 4.5.1 Accuracy of pattern-finder
- 4.5.2 Use of active transitives and suffixal passives: caregiver input
- 4.5.2.1 By-construction use
- 4.5.2.2 By-marker use
- 4.5.2.3 Summary of findings: Caregiver input
- 4.5.3 Use of active transitives and suffixal passives: Child production
- 5. Conclusion: Implications on automatic processing of Korean child corpora for developmental research on Korean
- Notes
- Abbreviations
- References