
A New Method Of Chinese Out-of-Vocabulary Identification And Dictionary Design

Posted on: 2012-03-14
Degree: Master
Type: Thesis
Country: China
Candidate: S S Wei
Full Text: PDF
GTID: 2178330335456665
Subject: Computer software and theory
Abstract/Summary:
Chinese word segmentation is a basic task in Chinese information processing, and the identification of Out-of-Vocabulary words is one of its bottlenecks. To address the low efficiency of Out-of-Vocabulary recognition, and based on an analysis of the existing difficulties and related algorithms, this paper proposes a new method of Out-of-Vocabulary identification and dictionary design. The paper is organized in the following parts.

First, the Out-of-Vocabulary recognition strategy. After analyzing the current background and the features of Out-of-Vocabulary words, the paper proposes a new method for identifying Chinese Out-of-Vocabulary words based on online BBS text. BBS web pages are first downloaded with spiders and processed to obtain a clean corpus. The corpus is then segmented, which yields runs of contiguous fragments. After personal-name recognition is applied to these fragments, word frequencies are counted to identify high-frequency co-occurring words. Finally, candidate Out-of-Vocabulary words, selected by a newly built function MP (a combination of Mutual Information and Partial Information), are added to a temporary dictionary. Once a word's frequency reaches a set threshold, it is moved into the core dictionary so that it can be recognized directly in the next segmentation pass.

Second, the design of the dictionary based on the reverse maximum matching algorithm. Before string matching, the text to be segmented is scanned to find the maximum word length. An improved reverse maximum matching algorithm then segments the text according to that length, in the order of the first character, the last character, and the remaining characters. To reduce the burden on the core segmentation dictionary, the segmentation dictionary is divided into a core dictionary, a temporary dictionary, and a name dictionary, each of which is improved separately.
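The reverse maximum matching step mentioned above can be illustrated with a minimal sketch. This is the standard textbook algorithm, not the improved variant developed in the thesis, whose details the abstract does not give. The function name `reverse_maximum_match`, the flat `set` dictionary, and the sample words are assumptions made for illustration.

```python
def reverse_maximum_match(text, dictionary, max_len=None):
    """Segment text right to left, taking the longest dictionary word at each step."""
    if max_len is None:
        # Longest word in the dictionary bounds the match window.
        max_len = max((len(w) for w in dictionary), default=1)
    tokens = []
    end = len(text)
    while end > 0:
        # Try the longest candidate ending at `end` first, then shrink it.
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            # A single character is always accepted so the scan keeps moving;
            # such leftovers are where unlisted words would show up.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                end -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical toy dictionary, for illustration only.
words = {"研究", "研究生", "生命", "起源"}
print(reverse_maximum_match("研究生命的起源", words))
```

Scanning from the right lets this example avoid the greedy forward match on "研究生", which is one commonly cited reason for preferring the reverse direction in Chinese segmentation.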
The core dictionary stores most words and is used specifically for matching and segmentation. Its structure is adapted to the improved reverse maximum matching algorithm: entries are indexed by the first character, and the last character is stored as the key. Matching against the core dictionary uses the improved reverse maximum matching algorithm, which is more efficient. The temporary dictionary is not mainly used for lookup; it stores unlisted words found during segmentation, counts their frequency, and passes new words on to the core dictionary. The name dictionary stores personal names in order to solve the name recognition problem.

Third, the Out-of-Vocabulary recognition strategy and the segmentation dictionary are combined to build a Chinese word segmentation system. The system is realized through dynamic corpus creation, Out-of-Vocabulary recognition, Out-of-Vocabulary entry, and system integration. System initialization, performance testing, and comparison with other segmentation systems show that the design is feasible for identifying Out-of-Vocabulary words, and that both new-word recall and segmentation accuracy are improved.
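The interaction between the three dictionaries can be sketched as follows. This is only an illustration of the promotion idea described in the abstract: the class name `TieredDictionary`, the method `observe_candidate`, the flat `set` storage, and the `promote_at` threshold are all assumptions, and the thesis's actual index structure and MP scoring function are not reproduced here.

```python
from collections import Counter


class TieredDictionary:
    """Core, temporary, and name dictionaries with frequency-based promotion."""

    def __init__(self, core_words, names=(), promote_at=3):
        self.core = set(core_words)    # used for matching during segmentation
        self.names = set(names)        # personal names
        self.temporary = Counter()     # candidate new words and their counts
        self.promote_at = promote_at   # hypothetical promotion threshold

    def observe_candidate(self, word):
        """Record one occurrence of a candidate; return True if it was promoted."""
        if word in self.core or word in self.names:
            return False  # already known, nothing to track
        self.temporary[word] += 1
        if self.temporary[word] >= self.promote_at:
            # Frequent enough: move the word into the core dictionary so the
            # next segmentation pass can match it directly.
            self.core.add(word)
            del self.temporary[word]
            return True
        return False


d = TieredDictionary({"研究"}, promote_at=2)
d.observe_candidate("给力")         # first sighting, stays in temporary
print(d.observe_candidate("给力"))  # second sighting reaches the threshold
```

Keeping candidates out of the core dictionary until they recur is what keeps the core dictionary small, which matches the stated goal of reducing its burden.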
Keywords/Search Tags:Out-of-Vocabulary, Chinese word segmentation, Word frequency statistic, Core dictionary