Associative lexical cohesion as a factor in text complexity
In this paper we apply associative lexical cohesion to the analysis of text complexity, as determined by expert-assigned US school grade levels. Lexical cohesion in a text is represented as a distribution of pairwise positive normalized mutual information values. Our quantitative measure of lexical cohesion is Lexical Tightness (LT), computed as the average of these values for a text. It represents the degree to which a text tends to use words that are highly inter-associated in the language. LT is inversely correlated with grade level and adds significantly to the amount of explained variance when grade level is estimated with a readability formula. In general, simpler texts are more lexically cohesive and complex texts are less cohesive. We further demonstrate that lexical tightness is robust to the level of segmentation: we compute it for a whole text and also within segmental units of a text. While texts are more cohesive at the sentence level than at the paragraph or whole-text levels, the same systematic variation of lexical tightness with grade level is observed at all levels of segmentation. Measuring text cohesion at various levels also uncovers a specific genre effect: informational texts are significantly more cohesive than literary texts, across all grade levels.
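The definition of LT above can be illustrated with a minimal sketch. It assumes that "positive normalized mutual information" means normalized pointwise mutual information, NPMI(x, y) = PMI(x, y) / -log p(x, y), with negative values clipped to zero, and that the average is taken over all pairs of distinct word types in the text. The function names, the toy probabilities, and the choice of types rather than tokens are illustrative assumptions, not details taken from the paper.

```python
import math
from itertools import combinations


def pnmi(p_xy, p_x, p_y):
    """Positive normalized PMI: NPMI with negative values clipped to 0.

    Assumed form: NPMI = log(p_xy / (p_x * p_y)) / -log(p_xy).
    """
    if p_xy <= 0.0:
        return 0.0  # words never co-occur: treated as no association
    if p_xy >= 1.0:
        return 1.0  # limit case: the words always co-occur
    pmi = math.log(p_xy / (p_x * p_y))
    return max(0.0, pmi / -math.log(p_xy))


def lexical_tightness(words, p_word, p_pair):
    """Average PNMI over all pairs of distinct word types in `words`.

    p_word maps a word to its (hypothetical) corpus probability;
    p_pair maps frozenset({w1, w2}) to a co-occurrence probability.
    """
    types = sorted(set(words))
    pairs = list(combinations(types, 2))
    if not pairs:
        return 0.0
    total = sum(
        pnmi(p_pair.get(frozenset(pair), 0.0), p_word[pair[0]], p_word[pair[1]])
        for pair in pairs
    )
    return total / len(pairs)


# Toy probabilities, invented for illustration only.
p_word = {"dog": 0.1, "bark": 0.1, "tax": 0.1}
p_pair = {frozenset(("dog", "bark")): 0.05}

lt_tight = lexical_tightness(["dog", "bark"], p_word, p_pair)
lt_loose = lexical_tightness(["dog", "bark", "tax"], p_word, p_pair)
print(lt_tight, lt_loose)
```

In this toy setting the two-word text scores higher than the three-word text, because adding an unassociated word introduces pairs with zero association and lowers the average.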