| Chinese word segmentation is an important and basic task in natural language processing (NLP). However, there is still a long way to go in handling classical literature, because NLP has always focused mainly on contemporary Chinese. The core task in building information resources for classical literature is segmentation: only with segmented text can digital techniques really be applied to classical philology. In this paper, we explore a variety of methods for ancient Chinese word segmentation, taking into account the characteristics of the Chinese of that period and using the classic work "History of the Han Dynasty" as an example. Our core tasks are as follows. First, we draw up a guideline for Chinese word segmentation and discuss some confusing strings. With reference to existing segmentation guidelines for contemporary Chinese, we make a guideline for our own task. Organized by part of speech, it contains 14 principles in total, with specific explanations for each part of speech. Unlike existing guidelines, at the end of this part we collect the confusing strings for which standards are hard to set and classify them by contrast and analogy, according to their grammatical structure. Taking one type of these strings as an example, we explore how to handle them on the basis of statistical data. Second, we collect three kinds of wordlist through multiple channels. The first kind consists of existing wordlists, such as wordlists of personal names, place names, and pre-Qin vocabulary. The second kind is obtained by statistical calculation of mutual information over the text; after many experiments, we find that the most appropriate threshold is 7.5. The last kind comes from annotations: we align the original text with its annotations using string matching algorithms, extract the word information, and generate the wordlist. 
Unlike previous methods, we add a post-processing step that uses the words in the vocabulary to improve the precision of the wordlist. This procedure is simple and fast. We then compare the segmentation precision of these different wordlists. The results show that, among the single wordlists, the one obtained from annotations is the best, with an F-score of 83.3%. When the wordlists of annotations, personal names, and place names are combined, the experimental result is optimal, with an F-score above 85%. We therefore consider the words from annotations together with the lists of personal and place names to be the best combination for wordlist-based segmentation. The third task is segmentation based on CRF with the help of linguistic features. In the experiments, we choose different linguistic features to aid segmentation, such as the category of every character in "History of the Han Dynasty" and each character's sound, rhyme, and tone in ancient and medieval Chinese. We first add the ancient pronunciation of each word as a linguistic feature. The results show that templates based on "1W+2" are generally better, and the template "1W+2+C1’5’" is the best. With the best template, the F-score reaches 94.4%. Finally, after segmentation, we briefly describe and analyze the words in "History of the Han Dynasty" from multiple perspectives. The results show that the proportion of monosyllabic words is only 24.24%, but they occur far more frequently than polysyllabic words, which means that the words used in the book are mainly monosyllabic. We also verify that words developed from monosyllabic to polysyllabic forms. The statistics on high-frequency words also support our earlier hypothesis about words and characters. The statistics on four-word lexical bundles provide some evidence for the source of certain idioms. We also extract some representative words that reflect the general character of the era in which the book was produced. |
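The mutual-information wordlist described above can be illustrated with a minimal sketch. It assumes pointwise mutual information over adjacent character pairs with a base-2 logarithm; the abstract does not state the exact formula, the logarithm base, or the string length considered, so the function name `pmi_bigrams`, the `min_count` filter, and the way the 7.5 threshold is applied here are illustrative assumptions rather than the authors' procedure.

```python
import math
from collections import Counter


def pmi_bigrams(text, threshold=7.5, min_count=2):
    """Return adjacent character pairs whose pointwise mutual information
    (base-2 log, an assumption) is at least `threshold`.

    `min_count` is a hypothetical filter that drops pairs seen too rarely
    for the estimate to be meaningful.
    """
    # Ignore whitespace so line breaks do not create spurious pairs.
    chars = [c for c in text if not c.isspace()]
    n_chars = len(chars)
    if n_chars < 2:
        return {}

    unigrams = Counter(chars)
    bigrams = Counter(zip(chars, chars[1:]))
    n_bigrams = n_chars - 1

    candidates = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bigrams
        p_a = unigrams[a] / n_chars
        p_b = unigrams[b] / n_chars
        pmi = math.log2(p_ab / (p_a * p_b))
        if pmi >= threshold:
            candidates[a + b] = pmi
    return candidates
```

Calling `pmi_bigrams(corpus_text)` on the full text would return a dictionary of candidate two-character words with their scores; lowering the threshold admits more candidates at the cost of precision.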
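To make the wordlist-based comparison concrete, the following sketch segments text against a wordlist and scores the result with a word-level F-score. The abstract does not name the matching strategy, so forward maximum matching is used here as an assumed stand-in, and the names `forward_max_match`, `segmentation_f_score`, and the `max_len` parameter are hypothetical.

```python
def forward_max_match(text, wordlist, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    string found in `wordlist`, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in wordlist:
                words.append(piece)
                i += size
                break
    return words


def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_f_score(gold_words, predicted_words):
    """Word-level F-score: a predicted word is correct only if both of its
    boundaries match a word in the gold segmentation."""
    gold = to_spans(gold_words)
    pred = to_spans(predicted_words)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision = correct / len(pred)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)
```

Merging several wordlists, for example the annotation-derived list with the lists of personal and place names, amounts to passing the union of the sets as `wordlist` and comparing the resulting F-scores.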