Changes in society and language
Charting poverty
This study addresses how societal and linguistic changes can be
detected in large historical corpora, in particular EEBO and CLMET3.0,
taking poverty and the industrial revolution as a case study.
The results, based on a range of state-of-the-art statistical approaches (such as
kernel density estimation), show how poverty, the industrial revolution, and urbanization
are connected, for instance through associations among war, religion, family,
poverty, and suffering. The study also discusses the importance of data size and
cleanness, the temptations of distant reading, and the need to validate the
discovered patterns through the interaction of close and distant reading.
Article outline
- 1. Introduction
- 2. Data and pre-processing
- 2.1 The EEBO Collection as sampler corpus
- 2.2 The CLMET3.0 corpus
- 2.3 The pre-processing step of spelling normalization
- 3. Methods
- 3.1 Data-based and data-driven approaches
- 3.2 Document classification
- 3.3 Topic modelling
- 3.4 Conceptual maps
- 4. Results and discussion
- 4.1 Dictionary-based approach
- 4.2 Topic modelling
- 4.2.1 EEBO early vs. EEBO late
- 4.2.2 Adding CLMET3.0 and increasing the number of topics
- 4.3 Conceptual maps
- 5. Conclusions
- Notes
- References
References
Corpora and software
CLMET3.0 = Corpus of Late Modern English Texts, version 3.0. De Smet, Hendrik, Diller, Hans-Jürgen & Tyrkkö, Jukka (comps). [URL]
EEBO = Early English Books Online. Davies, Mark
Mallet = Machine Learning for LanguagE Toolkit
VARD2 = Baron, Alistair & Rayson, Paul 2008 VARD 2: A tool for dealing with spelling variation in historical corpora. Proceedings of the Postgraduate Conference in Corpus Linguistics, Aston University, Birmingham, UK, 22 May 2008.