Identifying Bengali Multiword Expressions using semantic clustering
Tanmoy Chakraborty | Department of Computer Science & Engineering, Indian Institute of Technology, Kharagpur, India — 721302
Dipankar Das | Department of Computer Science & Engineering, National Institute of Technology, Meghalaya, India
Sivaji Bandyopadhyay | Department of Computer Science & Engineering, Jadavpur University, Kolkata, India
One of the key issues in both natural language understanding and generation is the appropriate processing of Multiword Expressions (MWEs). MWEs pose a major problem for precise language processing because of their idiosyncratic nature and their diversity in lexical, syntactic and semantic properties. The semantics of an MWE cannot be derived by simply combining the semantics of its constituents. Semantic clustering is therefore often viewed as an instrument for extracting MWEs, especially for resource-constrained languages like Bengali. The present semantic clustering approach locates clusters of synonymous noun tokens present in the document. These clusters in turn help to measure the similarity between the constituent words of a candidate phrase using a vector space model and to judge whether the phrase qualifies as an MWE. In this experiment, we apply the semantic clustering approach to noun-noun bigram MWEs, though it can be extended to other types of MWEs. In parallel, well-known statistical models, namely Point-wise Mutual Information (PMI), Log Likelihood Ratio (LLR) and the Significance function, are also employed to extract MWEs from the Bengali corpus. The comparative evaluation shows that the semantic clustering approach outperforms all of the competing statistical models. As a byproduct of this experiment, we have started developing a standard lexicon in Bengali that serves as a productive Bengali linguistic thesaurus.
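As an illustration of the kind of statistical baseline named above, the following is a minimal sketch of PMI scoring for adjacent word pairs. It is not the authors' implementation: the function name `bigram_pmi`, the maximum-likelihood probability estimates, the base-2 logarithm and the toy English tokens are all assumptions made for this example.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score every adjacent token pair with point-wise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated by simple relative frequency (an assumption of this sketch).
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical toy input; a real run would use a tokenized Bengali corpus.
tokens = ["a", "b", "a", "b", "c", "d"]
scores = bigram_pmi(tokens)
ranked = sorted(scores, key=scores.get, reverse=True)
```

Pairs with higher PMI co-occur more often than their individual frequencies would predict, so a ranked list like `ranked` can serve as a candidate list for MWE extraction. PMI is known to favour rare pairs, which is one reason it is usually compared against measures such as LLR.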