The Web has recently been used as a corpus for linguistic investigations, often with the help of a commercial search engine. We discuss some potential problems with collecting data from commercial search engines and with using the Web as a corpus. We outline an alternative strategy for data collection, using a personal Web crawler. As a case study, the university Web sites of three nations (Australia, New Zealand and the UK) were crawled. The most frequent words were broadly consistent with non-Web written English, but with some academic-related words amongst the top 50 most frequent. It was also evident that the university Web sites contained a significant amount of non-English text, and that academic Web English seems to be more future-oriented than British National Corpus written English.