Mining style and stance for sociocultural insight: Public policy research applications of DocuScope’s linguistic taxonomy

Marcellino, William

doi:10.1075/scl.109.07mar

Part of

Corpora and Rhetorically Informed Text Analysis: The diverse applications of DocuScope
Edited by David West Brown and Danielle Zawodny Wetzel
[Studies in Corpus Linguistics 109] 2023
pp. 148–166

William Marcellino | RAND Corporation

Computer scientists working in natural language processing (NLP) have focused on the lexical level of language: word counts, ratios, distance, and context. This focus is well suited to semantic tasks and to syntactic analyses. Corpus linguists, on the other hand, have taken a broader view that also accounts for the lexicogrammatical level of language, and their approach is therefore well suited to pragmatic tasks. DocuScope, with its linguistic taxonomy at the lexicogrammatical level, is thus a unique and complementary tool for the data-driven analysis of large collections of text, addressing the stance and style choices pervasive in linguistic behavior. This chapter looks at how DocuScope’s taxonomy has informed work on a range of public policy problems at the RAND Corporation. One section examines how the DocuScope taxonomy has been used as a statistical tool to find patterns in text corpora, scaling up human qualitative analysis into a mixed-methods text analysis approach; one example is the analysis of open-text responses in a large survey of U.S. special forces operators. A second section shows how the DocuScope taxonomy has improved machine learning efforts in both accuracy and interpretability, for example in detecting and understanding conspiracy theory discourse on social media. The chapter ultimately argues that humanistic knowledge is a valuable and necessary complement to technical advances in data-centric disciplines like NLP.
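As a rough illustration of what a dictionary-based, lexicogrammatical approach might look like in practice, the sketch below tags multi-word patterns with stance categories and reports category rates per 100 tokens. Everything in it is an assumption made for illustration: the dictionary entries, the category names, the `stance_profile` function, and the longest-match rule are hypothetical and are not taken from DocuScope, RAND-Lex, or the chapter itself.

```python
import re
from collections import Counter

# Hypothetical toy dictionary: NOT the real DocuScope taxonomy or its entries.
# Keys are word sequences (so multi-word patterns can be matched as units),
# values are illustrative stance category labels.
TOY_STANCE_DICT = {
    ("i", "think"): "Uncertainty",
    ("might",): "Uncertainty",
    ("i", "am", "sure"): "Confidence",
    ("clearly",): "Confidence",
    ("we",): "FirstPersonPlural",
}
MAX_PATTERN_LEN = max(len(pattern) for pattern in TOY_STANCE_DICT)


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def stance_profile(text):
    """Return the rate of each stance category per 100 tokens.

    At every position the longest matching pattern wins, so that
    "i am sure" is tagged as one unit rather than word by word.
    """
    tokens = tokenize(text)
    counts = Counter()
    i = 0
    while i < len(tokens):
        # Try the longest candidate pattern first, then shorter ones.
        for n in range(min(MAX_PATTERN_LEN, len(tokens) - i), 0, -1):
            category = TOY_STANCE_DICT.get(tuple(tokens[i:i + n]))
            if category is not None:
                counts[category] += 1
                i += n  # skip past the matched pattern
                break
        else:
            i += 1  # no pattern starts here; move on one token
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {cat: 100.0 * c / total for cat, c in counts.items()}


profile = stance_profile("I think we might win. Clearly we will.")
print(profile)
```

Per-document rates of this kind could then be compared across corpora with standard statistics or passed to a classifier as interpretable features, which is the general pattern the abstract describes.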

Keywords: machine learning, natural language processing, public policy, linguistic stance, text analysis

Article outline

1. Introduction
2. Overview of DocuScope’s usage at RAND
- 2.1 The RAND-Lex instantiation of the DocuScope dictionaries: Quantifying stance
  - 2.1.1 Machine + human reading: Scaling up qualitative analysis
  - 2.1.2 Quantitative representations of stance for machine learning
3. Example applications of the DocuScope dictionaries in public policy research
- 3.1 Scaling up human reading: Analyzing attitudes in survey responses and measuring changes in news presentation
  - 3.1.1 Analyzing attitudes in survey responses from special operations members
  - 3.1.2 Measuring style at scale: Has U.S. news reporting become more subjective over time?
- 3.2 Improving machine reading through linguistic stance
  - 3.2.1 Election interference: Understanding Russian trolls and U.S. partisanship
  - 3.2.2 Stance across language: Understanding the Arabic Bin Laden archive
  - 3.2.3 Hybrid modeling: Improving machine learning performance and insight with the DocuScope dictionaries
  - 3.2.4 Stance’s value is document-length dependent
  - 3.2.5 Modeling with stance: Improved interpretability
4. Filling in NLP gaps through humanistic theory
Notes
References

Published online: 29 June 2023

https://doi.org/10.1075/scl.109.07mar
