Automatic construction of a dictionary of variant forms of Chinese characters
Many Chinese characters have more than one written form, owing to the complex ways in which characters were created and the long
evolution of the writing system. Most existing Chinese dictionaries list these variant forms but do not explain systematically why a
specific character is a variant of another; they cite only a few older key sources, many of which are themselves
dictionaries of variant forms. In this article we present a new theory and practice for determining whether a Chinese
character is a variant of another, and show how a dictionary of variant characters can be derived automatically from a corpus of
ancient Chinese texts totaling 2.3 billion characters using artificial intelligence techniques. Of the more than 74,000
variant character groups identified, more than 20,000 are new instances found by our algorithm. We have compiled
all the instances into a dictionary called the Dictionary of Chinese Variant Words (異體字詞典, Yiti Zi Cidian). The key insight of our theory
is to find synonymous words that are written with variant characters. The dictionary has been online for several years, and anyone can
freely access and edit it, much as they would Wikipedia.
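
The key insight can be illustrated with a minimal sketch. It is not the article's algorithm, only one possible reading of "synonymous words with variant characters": two words of equal length that differ in exactly one character and occur in similar contexts suggest that the two differing characters may be variants. The function name `candidate_variant_pairs`, the `min_shared` threshold, and the toy context sets are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_variant_pairs(word_contexts, min_shared=2):
    """Return {(char_a, char_b): {(word_a, word_b), ...}} for candidate variants.

    word_contexts maps each word to a set of context tokens (hypothetical
    input; a real system would derive contexts from the corpus).
    """
    # Bucket words by the text left after masking one character position,
    # so words in a bucket have equal length and differ only at that position.
    buckets = defaultdict(list)
    for word in word_contexts:
        if len(word) < 2:
            continue  # a single character carries no word-level evidence
        for i in range(len(word)):
            buckets[(word[:i], word[i + 1:])].append(word)

    evidence = defaultdict(set)
    for (prefix, _suffix), words in buckets.items():
        i = len(prefix)  # the masked position
        for a, b in combinations(sorted(words), 2):
            # Treat shared context tokens as a crude stand-in for synonymy.
            if len(word_contexts[a] & word_contexts[b]) >= min_shared:
                evidence[tuple(sorted((a[i], b[i])))].add((a, b))
    return dict(evidence)


# Toy data (invented for illustration, not taken from the corpus).
toy = {
    "群臣": {"王", "朝", "議"},
    "羣臣": {"王", "朝", "諫"},
    "君臣": {"義", "道"},
    "山峰": {"高", "雲", "登"},
    "山峯": {"高", "雲", "險"},
}
result = candidate_variant_pairs(toy)
```

On this toy input, 群/羣 and 峰/峯 are proposed as candidates, while 君 is not paired with 群, because 君臣 shares no context with 群臣. Requiring word-level evidence rather than comparing isolated characters is the design choice the sketch is meant to show.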
Article outline
- 1. Introduction
- 2. Variant words and text understanding
- 3. Automatic recognition of variant characters based on variant words
- 3.1 Searching for candidate variant characters
- 3.2 Candidate variant character filtering
- 3.3 Identification of variant words and characters
- 4. Automatic generation of a dictionary of Chinese variant words
- 5. Determine the pronunciation for uncommon characters
- 6. Related work
- 7. Conclusions
- Acknowledgement
- Notes
- References