Constructing a corpus-informed list of Arabic formulaic sequences (ArFSs) for language pedagogy and technology

Alghamdi, Ayman; Atwell, Eric

doi:10.1075/ijcl.16088.alg

Article published in:

International Journal of Corpus Linguistics
Vol. 24:2 (2019), pp. 202–228

Ayman Alghamdi | Umm Al-Qura University

Eric Atwell | University of Leeds

This study aims to construct a corpus-informed list of Arabic Formulaic Sequences (ArFSs) for use in language pedagogy (LP) and Natural Language Processing (NLP) applications. A hybrid mixed-methods model was adopted for extracting ArFSs from a corpus. The model combined automatic and manual extraction methods and was based on well-established quantitative and qualitative criteria that are relevant to both LP and NLP. The pedagogical implications of the list are examined in order to facilitate the inclusion of ArFSs in the learning and teaching of Arabic, particularly for non-native speakers. The computational implications concern the role of ArFSs as a novel language resource for improving various Arabic NLP tasks.

Keywords: lexical resources, Arabic formulaic sequences, multi-word expressions, language pedagogy, mixed methods

Article outline

1. Introduction
2. Formulaic Sequences in language pedagogy and technology
- 2.1 Corpus-informed pedagogical formulaic sequences
- 2.2 Arabic computational MWEs research
3. Methodology: A hybrid model for FSs extraction
- 3.1 Issues of frequency, extent and identification
- 3.2 The corpus source of the language data
- 3.3 The selection criteria
- 3.4 Stages of constructing the FSs list
  - 3.4.1 Statistical phase
  - 3.4.2 Qualitative phase
  - 3.4.3 Linguistic analysis and classification phase
4. Results and discussion
5. Conclusions
Acknowledgements
Note
References

Published online: 5 August 2019

