Early newspapers as data for corpus linguistics (and Digital Humanities)
Issues in using the British Library Newspapers database as a corpus
The availability of large digital archives has great
potential for corpus linguistic research, but their use is not without
problems. These problems can often be traced to fundamentally different
ideas of what might constitute “good data” in Digital Humanities and in
corpus linguistics, leading to different expectations regarding how the data
is made available to researchers. This chapter discusses the specific
challenges involved in using the British Library Newspapers
database for corpus linguistics and considers potential solutions for them.
It is argued that taking full advantage of the database requires a flexible
approach that enables critical reflection on the digital materials and on
how they have been collected, processed, and made available.
Article outline
- 1. Introduction
- 2. Digital text analysis in the humanities
- 2.1 Digital Humanities
- 2.2 Corpus linguistics
- 2.3 Towards a useful synergy
- 3. Historical newspaper prose and the British Library Newspapers database
- 3.1 Problems with available search tools
- 3.2 Sampling, balance, and representativeness
- 3.3 Registers and subregisters
- 3.4 Optical Character Recognition (OCR)
- 4. Discussion
- Notes
- References