
Research On Automation Of Sentence Segmentation, Punctuation And Word Segmentation Of Agricultural Ancient Books

Posted on: 2010-04-29; Degree: Doctor; Type: Dissertation
Country: China; Candidate: J N Huang; Full Text: PDF
GTID: 1223360305986627; Subject: History of science and technology
Abstract/Summary:
Chinese ancient books are major achievements of a civilization built up over thousands of years, and they embody the values, modes of thinking, imagination and creativity of the Chinese nation. They bear witness to a civilization that has continued unbroken through a long history. These books are also a treasure of human civilization, and the Chinese people have a duty to collate, protect and make use of them.

China has a long tradition of collating its ancient books. From the Six Classics edited by Confucius and the Seven Summaries collated by Liu Xiang and his son, to the Imperial Collection of Four (Si Ku Quan Shu) and the Integration of Classic Books (Gu Jin Tu Shu Ji Cheng) compiled by Qing Dynasty scholars, large-scale collation projects have been carried out repeatedly and have profoundly influenced later generations. Since 1949 the collation of Chinese ancient books has achieved results that have attracted worldwide attention, and more than 140 volumes of agricultural ancient books have been collated and published. Even so, the collation and development of agricultural ancient books remain insufficient: the collated books account for 15% of the total, and more agricultural ancient books urgently need to be collated and published.

The study of sentence segmentation and punctuation of Chinese ancient books began no later than the Eastern Han. The work has continued ever since, advancing faster or more slowly as political and economic circumstances changed. Every book included in the Ming Dynasty's Yong Le Encyclopedia was punctuated, whereas the books included in the Qing Dynasty's Imperial Collection of Four (Si Ku Quan Shu) were not. It is surprising that two encyclopedic works, both compiled under government direction, followed such different approaches.
With the rise of modern-style punctuation after the Republic of China was founded in 1912, the sentence segmentation and punctuation of Chinese ancient books have continued to attract scholars' attention. The government has supported this work since the founding of the People's Republic of China, and the number of ancient books with modern-style punctuation has gradually increased. In 1989 China issued A Standard on Word Segmentation for Modern Chinese Text in Information Processing, which covers modern text only; no corresponding standard for ancient Chinese text has yet been drafted, a gap that deserves serious attention.

In view of this situation, this dissertation takes agricultural ancient books as its research object, studies the history and current state of punctuation, word segmentation and indexing, emphasizes the application of computer techniques to these tasks, and designs prototype systems for punctuation, word segmentation and indexing. The main contents are as follows:

1) Based on pattern matching and sentence construction analysis, it constructs an algorithm for sentence segmentation and punctuation of agricultural ancient books and designs a prototype punctuation system for them.

From statistics and analysis of 20,000,000 characters of Chinese ancient works, the dissertation summarizes 11 common methods for punctuation. It proposes that sentences first be segmented by syntactic marker words (such as empty words, conjunctions, modal words and synonym indicator words). Antonyms, cited-book indicators, time sequences, quantifiers, pleonasms and verb-object structures are then used for further segmentation and punctuation. In addition, comparative sentences provide an auxiliary means of judging complex sentences and punctuating clauses.
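The initial segmentation by marker words can be illustrated with a minimal sketch. The two marker sets below are a small hypothetical sample of common classical particles, not the dissertation's rule tables, and the function name `segment_by_markers` is an assumption introduced for illustration. The sketch breaks the text after a sentence-final particle and before a sentence-initial word.

```python
from typing import List

# Hypothetical marker sets for illustration only; the dissertation's
# actual rule tables are not reproduced here.
SENTENCE_FINAL = set("也矣焉哉乎耳")   # particles that often close a sentence
SENTENCE_INITIAL = set("夫蓋故凡")     # words that often open a sentence


def segment_by_markers(text: str) -> List[str]:
    """Split unpunctuated text after final particles and before initial words."""
    segments: List[str] = []
    current = ""
    for ch in text:
        # A sentence-initial word closes whatever has accumulated so far.
        if ch in SENTENCE_INITIAL and current:
            segments.append(current)
            current = ""
        current += ch
        # A sentence-final particle closes the segment that contains it.
        if ch in SENTENCE_FINAL:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


# Made-up test strings, not quotations from any agricultural text.
print(segment_by_markers("五穀熟也夫農者勤矣"))  # ['五穀熟也', '夫農者勤矣']
print(segment_by_markers("耕田故種穀"))          # ['耕田', '故種穀']
```

Such particles also occur inside sentences, which is one reason marker words alone give only a first-pass segmentation and further cues are needed.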
As a final step in punctuation, agricultural terms and a punctuation stoplist are applied to improve the readability of the text once punctuation marks have been added.

Following these methods and rules, the experiment sets up two knowledge bases, a primary models table and a stoplist models table, built by both manual and automatic means. Together these two knowledge bases carry out the punctuation function. To date, 1,166 rules have been entered in the primary models table and 184 rules in the stoplist models table.

Using these rules, we test the system on agricultural ancient books. In the experiments, the average precision of sentence segmentation and of punctuation reaches 60.5% and 40.5% respectively.

2) Using information processing techniques such as dictionary-based word segmentation and N-gram word segmentation, it constructs an algorithm for word segmentation of agricultural ancient books and designs a prototype word segmentation system for them.

No dictionary for word segmentation of ancient books has been available so far, so one must be built. Because building such a dictionary takes a long time, the system adopts a combined method of dictionary-based and N-gram word segmentation, which is an ideal method of word segmentation for ancient books.

Accordingly, the experiment sets up two clusters of segmentation dictionaries comprising more than 10 databases. The primary dictionary cluster includes tables of personal names, place names, book titles, official titles, products and so on; the stoplist dictionary cluster includes tables of idioms, reign titles, empty words, quantifiers, time expressions and so on.
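A combined dictionary and N-gram approach might look like the following sketch. The abstract does not say which matching strategy the prototype uses, so forward maximum matching is an assumption here, as are the function names `max_match` and `ngram_candidates`, the parameters `max_len`, `n` and `min_count`, and the toy dictionary. Dictionary hits play the role of known words, while frequent character N-grams are proposed as candidate new words.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Forward maximum matching; characters with no dictionary hit are emitted singly."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def ngram_candidates(texts: Iterable[str], n: int = 2, min_count: int = 2) -> Dict[str, int]:
    """Count character n-grams and keep those seen at least min_count times."""
    counts = Counter(t[i:i + n] for t in texts for i in range(len(t) - n + 1))
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy dictionary and made-up strings, for illustration only.
toy_dictionary = {"水稻", "小麥"}
print(max_match("種水稻與小麥", toy_dictionary))   # ['種', '水稻', '與', '小麥']
print(ngram_candidates(["種荔枝", "食荔枝"]))       # {'荔枝': 2}
```

Raw N-gram counts contain many fragments that are not words, which is where noise-reduction filters over the candidate list would come in.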
The segmentation dictionaries currently hold a vocabulary of 200,000 entries, which meets the needs of word segmentation for ancient books.

The experiment adopts the combined method of dictionary-based and N-gram word segmentation, and applies noise-reduction measures such as substring comparison, neighbor comparison, high-frequency words and low-frequency phrases. Taking 12 agricultural ancient books and 379 Local Chronicle of Guangdong: Products as two test corpora, the experiment tests word segmentation of agricultural ancient books. From the corpus of 12 agricultural ancient books, it recognizes 1,164 old words, which account for 31% of the total vocabulary, and 2,530 new words, which make up 69%. From the corpus of 379 Local Chronicle of Guangdong: Products, it recognizes 6,314 old words, which account for 8% of the total vocabulary, and 754,380 new words, which make up 92%. Words with a term frequency above 10 number 8,044, or 10% of all words; words with a term frequency above 20 number 3,760, or 5% of the total vocabulary.

Analysis of the segmentation results for 379 Local Chronicle of Guangdong: Products shows that, for term-frequency levels (ranks) in the range (2000, 8000), the product of rank and frequency is a constant of 23 million. This indicates that Zipf's law also holds for ancient Chinese text.

Implementing computer-based sentence segmentation and punctuation, word segmentation and indexing of agricultural ancient texts, and developing a corresponding system, is the outcome of studying agricultural history with intelligence science, Chinese information processing and related techniques.
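The Zipf's law observation amounts to checking that rank multiplied by frequency stays roughly constant over a band of ranks. The sketch below shows one way such a check could be run on a list of segmented tokens; the function name `rank_frequency_products` and the toy token list are assumptions for illustration and do not reproduce the dissertation's data.

```python
from collections import Counter
from typing import List, Tuple


def rank_frequency_products(tokens: List[str]) -> List[Tuple[int, str, int, int]]:
    """Return (rank, word, frequency, rank * frequency), most frequent word first."""
    counts = Counter(tokens)
    # Sort by descending frequency; ties are broken by the word itself.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]


# Toy data chosen so that rank * frequency is the same for every word.
toy_tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency_products(toy_tokens):
    print(row)
# (1, 'a', 6, 6)
# (2, 'b', 3, 6)
# (3, 'c', 2, 6)
```

On a real corpus the product would be inspected only over the rank band of interest, since Zipf's law typically fits the middle of the distribution better than its extremes.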
As preliminary research, this work is still immature and needs further study.

1) The primary rules table for sentence segmentation and punctuation currently contains only a little more than 1,100 rules. The number of rules is limited, and each rule awaits optimization. Furthermore, the segmentation and punctuation methods rely mainly on pattern recognition; that is, they depend primarily on phrase usage and sentence characteristics, which is still limiting. This is mainly because no mature corpus of agricultural ancient books is available at present, and in particular no corpus annotated with vocabulary attributes, which makes valid sentence construction analysis difficult in this experiment. As a vocabulary attribute corpus of agricultural ancient books is built, the word segmentation system will gradually strengthen its sentence analysis of ancient books. With that corpus and the rule tables for sentence segmentation and punctuation, sentence segmentation and punctuation can achieve better results.

2) With the combined method of dictionary-based and N-gram word segmentation, the recall ratio is still low. The ratio of old words found by dictionary-based segmentation is 31% in the agricultural ancient books but 10% in Local Chronicle of Guangdong: Products. This evidently results from the uneven distribution of dictionary entries across subject areas.
Consequently, the next step is to continue studying how to optimize the segmentation dictionary.

This project has been supported by the National Social Science Fund and the Humanities and Social Sciences Fund of the Ministry of Education. However, because the topic is very broad and time was limited, a thorough study was difficult, and these problems await solution in the future.
Keywords/Search Tags: Agricultural Ancient Books, Arrangement of Ancient Books, Sentence Segmentation, Punctuation, Word Segmentation and Indexing, Intellectual Information Processing