A Study And Realization Of Double-Array Based Segmentation Dictionary

Posted on: 2007-05-16 | Degree: Master | Type: Thesis
Country: China | Candidate: P Jiang | Full Text: PDF
GTID: 2178360182460972 | Subject: Computer application technology
Abstract/Summary:
The dictionary mechanism is one of the basic components of a Chinese word segmentation system, and its performance significantly affects segmentation speed. Many applications need high-speed segmentation, such as text retrieval on the Internet, post-processing for Chinese character and speech recognition, and preprocessing for text-to-speech. It is therefore important to construct an efficient segmentation dictionary.
Several dictionary mechanisms are currently used in information processing: binary-seek-by-word, the TRIE indexing tree, and binary-seek-by-character. The last two have higher query efficiency. All three rely on sorted linear tables to speed up lookup, but their data structures are complex and their query efficiency remains poor. This thesis analyzes the advantages and shortcomings of these mechanisms. To satisfy the special queries required in Chinese segmentation, we design and implement a segmentation dictionary based on the double-array and analyze its performance. Finally, we compare the double-array with several other dictionary mechanisms. Experiments show that the double-array has higher query efficiency than the PAT tree.
The thesis then presents the data storage model of the segmentation dictionary and analyzes its strengths and weaknesses in depth. The main characteristic of the model is that it divides the data into two kinds of information of different lengths, which greatly reduces the number of text read and write operations and speeds up segmentation. The thesis also makes a preliminary attempt at the problem of unregistered words. Using the dynamic nature of the PAT tree together with the strengths of a statistical model, it searches large-scale text for strings whose frequency exceeds a certain threshold, identifies some of the unregistered words in this way, and thus partially solves the unregistered word problem.
The PAT tree and double-array algorithms have different strengths and can meet different needs. They can also be combined to address two of the more difficult problems for a dictionary: query speed and dynamic updating.
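To make the double-array idea concrete, the following is a minimal sketch and not the implementation from the thesis. It assumes the standard double-array layout, in which a transition from state `s` on character code `c` goes to `t = base[s] + c` and is valid only when `check[t] == s`. The class name `DoubleArrayTrie`, the helper `segment`, the character coding, and the use of forward maximum matching are all illustrative assumptions.

```python
from collections import deque

FREE, ROOT = -1, -2  # check[] sentinels: unused slot / slot held by the root


class DoubleArrayTrie:
    """Static double-array trie (illustrative sketch, not the thesis code)."""

    def __init__(self, words):
        words = list(words)
        # Code 0 is reserved for the end-of-word marker; characters get 1..n.
        chars = sorted({ch for w in words for ch in w})
        self.code = {ch: i + 1 for i, ch in enumerate(chars)}
        # Build an ordinary nested-dict trie first, keyed by character code.
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(self.code[ch], {})
            node[0] = {}  # terminal marker child
        # Pack the trie into the two parallel arrays, breadth first.
        self.base, self.check = [0], [ROOT]
        queue = deque([(0, root)])
        while queue:
            state, children = queue.popleft()
            if not children:
                continue  # terminal marker: no outgoing transitions
            codes = sorted(children)
            b = self._find_base(codes)
            self.base[state] = b
            for c in codes:
                self.check[b + c] = state
            for c in codes:
                queue.append((b + c, children[c]))

    def _find_base(self, codes):
        """Smallest b >= 1 such that every slot b + c is still free."""
        b = 1
        while True:
            need = b + max(codes) + 1 - len(self.check)
            if need > 0:  # grow both arrays so b + c is always in range
                self.base.extend([0] * need)
                self.check.extend([FREE] * need)
            if all(self.check[b + c] == FREE for c in codes):
                return b
            b += 1

    def _step(self, state, c):
        """Follow one transition; return the next state or -1 if absent."""
        t = self.base[state] + c
        if t < len(self.check) and self.check[t] == state:
            return t
        return -1

    def contains(self, word):
        state = 0
        for ch in word:
            c = self.code.get(ch)
            if c is None:
                return False
            state = self._step(state, c)
            if state < 0:
                return False
        return self._step(state, 0) >= 0

    def longest_match(self, text, start=0):
        """Length of the longest dictionary word starting at text[start]."""
        state, best = 0, 0
        for i in range(start, len(text)):
            c = self.code.get(text[i])
            if c is None:
                break
            state = self._step(state, c)
            if state < 0:
                break
            if self._step(state, 0) >= 0:
                best = i - start + 1
        return best


def segment(trie, text):
    """Forward maximum matching; unmatched characters become single tokens."""
    out, i = [], 0
    while i < len(text):
        n = trie.longest_match(text, i) or 1
        out.append(text[i:i + n])
        i += n
    return out
```

Each lookup step costs one addition and one array comparison, which is the property that makes this layout attractive for high-speed segmentation. Building the arrays is slower because every node needs a search for a free `base` value, which is one reason a more dynamic structure such as the PAT tree can complement it.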
Keywords/Search Tags: Segmentation Dictionary, Double-Array, PAT, Dictionary Mechanism