Chapter 10
Challenges and strategies for beginners to solve research questions with DH methodologies on a corpus of
multilingual Philippine periodicals
A frequently cited problem in Digital Humanities (DH) is the poor fit between Humanities
research questions and DH methodologies. This chapter is therefore conceived as a meta-chapter that
explains the problems encountered, and the strategies adopted, when exploring the multilingual repository of Philippine periodicals built
within the project “Strengthening Digital Research at the UP System” in order to research the evolution of the image
of China in these periodicals. The two main challenges in analysing the periodicals have been
(1) problematic OCR output and (2) research across multilingual publications. The chapter surveys literature and research
projects that have approached similar questions and challenges in comparable corpora, and suggests some tools to
address them.
Article outline
- The PhilPeriodicals project
- The research question
- Approaches to studying a country’s representation in the periodical press
- First difficulty: How to prepare a set of plurilingual texts?
- The problem with OCR
- Translation
- Tools
- What would researchers in the humanities need from a periodicals repository in the 21st century?
- Notes
- References