Chapter 10. Challenges and strategies for beginners to solve research questions with DH methodologies on a corpus of multilingual Philippine periodicals

Ortuño Casanova, Rocío

doi:10.1075/btl.155.10ort

Part of

Literary Translation in Periodicals: Methodological challenges for a transnational approach
Edited by Laura Fólica, Diana Roig-Sanz and Stefania Caristia
[Benjamins Translation Library 155] 2020
pp. 247–272

Rocío Ortuño Casanova | University of Antwerp

A frequently mentioned problem in Digital Humanities (DH) is the difficult fit between Humanities research questions and DH methodologies. This chapter is therefore conceived as a meta-chapter that explains the problems encountered, and the strategies adopted, when exploring the multilingual repository of Philippine periodicals built within the project “Strengthening Digital Research at the UP System” in order to study how the image of China evolved in these periodicals. Two main challenges arose in analysing the periodicals: (1) problematic OCR and (2) research across multilingual publications. The chapter surveys literature and research projects that have approached similar questions and challenges in comparable corpora, and suggests some tools for addressing them.

Keywords: Philippine rare periodicals, multilingual text analysis, representation, low-resource languages, OCR, online repository, challenges in digital humanities

Article outline

The PhilPeriodicals project
The research question
Approaches to studying a country’s representation in the periodical press
First difficulty: How to prepare a set of plurilingual texts?
- The problem with OCR
- Translation
Tools
What would researchers in the humanities need from a periodicals repository in the 21st century?
Notes
References

Available under the Creative Commons Attribution-NonCommercial-NoDerivatives (CC BY-NC-ND) 4.0 license.

For any use beyond this license, please contact the publisher.

Published online: 10 December 2020

https://doi.org/10.1075/btl.155.10ort

Cited by

Cited by 1 other publication

Roig-Sanz, Diana & Laura Fólica

2021. Big translation history. Translation Spaces 10:2, pp. 231 ff.

This list is based on CrossRef data as of 28 March 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.