Edited by Irene Doval and M. Teresa Sánchez Nieto
[Studies in Corpus Linguistics 90] 2019
pp. 281–298
This chapter focuses on the normalization of abbreviations and shorthand forms used in French text messages. These forms are difficult to normalize because most of them cannot be resolved by typical spell checkers or dictionary lookups. We first aligned normalized and non-normalized French text messages to build a parallel corpus. We then applied two popular approaches to text normalization: multilingual word embeddings and character-based machine translation. We compare the results and assess how effectively each model normalizes deletions, substitutions, repetitions, swaps, and insertions made to canonical forms. This is the first paper that uses Multivec and the Belgian SMS corpus collected under the SMS4Science Project. Because the machine learning approach is unsupervised, the system is highly flexible and easily adaptable, and it provides a domain-independent method of text normalization.
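To make the edit categories named above concrete, the following is a minimal illustrative sketch, not the chapter's method: it labels the character-level differences between a canonical form and an SMS form using Python's standard `difflib`. The function name `edit_ops`, the label strings, and the sample word pairs are assumptions introduced here for illustration. Swaps are not detected separately and would surface as other operations.

```python
from difflib import SequenceMatcher


def edit_ops(canonical: str, observed: str) -> list[tuple[str, str, str]]:
    """Label character-level edits turning `canonical` into `observed`.

    Returns (label, canonical_span, observed_span) tuples. The labels are
    hypothetical names chosen for this sketch, not the chapter's tag set.
    """
    matcher = SequenceMatcher(None, canonical, observed, autojunk=False)
    ops = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        removed, added = canonical[i1:i2], observed[j1:j2]
        if tag == "delete":
            label = "deletion"
        elif tag == "replace":
            label = "substitution"
        elif j1 > 0 and set(added) == {observed[j1 - 1]}:
            # Inserted characters only repeat the preceding character.
            label = "repetition"
        else:
            label = "insertion"
        ops.append((label, removed, added))
    return ops


if __name__ == "__main__":
    # Illustrative pairs (canonical form, SMS form); not drawn from the corpus.
    for canonical, observed in [("bonjour", "bjr"), ("oui", "ouiii"), ("quoi", "koi")]:
        print(canonical, "->", observed, edit_ops(canonical, observed))
```

A rule-based labeller like this only describes the differences in an already aligned pair; it does not perform normalization, which in the chapter is handled by the embedding-based and translation-based models.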