Research Of A New Dictionary In The Search Engine

Posted on: 2011-07-08
Degree: Master
Type: Thesis
Country: China
Candidate: R Cai
Full Text: PDF
GTID: 2178330332979819
Subject: Computer software and theory
Abstract/Summary:
A search engine is a system that collects specific information from the Internet using a given strategy and program, organizes and processes it, presents the resulting information, and provides search services to users. As an important area of Chinese information processing, the search engine plays an increasingly important role in practice, and Chinese word segmentation is a key technology for search engines.

In natural language, the word is the smallest meaningful unit that can be used independently. Unlike Western languages, Chinese has no explicit delimiter between one word and the next, so Chinese word segmentation is a foundation of Chinese information processing. It is widely used in text information retrieval, automatic classification, search engines, automatic summarization, speech synthesis, automatic proofreading, machine translation (MT), and so on. The segmentation method directly affects both the real-time performance and the accuracy of Chinese information processing.

Existing Chinese word segmentation algorithms fall into three groups: methods based on string matching, methods based on understanding, and methods based on statistics. Segmentation based on string matching is also called mechanical word segmentation or dictionary-based segmentation, and maximum matching is the most commonly used method of this kind. This algorithm depends on its dictionary, so the structure and search algorithm of the dictionary matter a great deal to segmentation.

Three kinds of dictionary mechanism exist: the dictionary based on whole-word binary search, the dictionary based on the TRIE tree, and the dictionary based on verbatim (character-by-character) binary search. The dictionary based on whole-word binary search has a simple data structure, takes up little space, and is easy to maintain; however, because whole-word binary search needs many trial matches to reach a final result, its efficiency is low. The dictionary based on the TRIE tree has a complex data structure, wastes more space, and is hard to maintain; however, because it uses verbatim search, its efficiency is high. The dictionary based on verbatim binary search adopts the more efficient verbatim search method, but the method is still fundamentally imperfect.

Based on an analysis of these three mechanisms, this thesis puts forward a new mechanism, the layered binary word method, which improves efficiency while keeping complexity at a controlled level, and so strikes a balance between efficiency and complexity. The thesis first describes the common segmentation algorithms, then introduces the new dictionary mechanism, which improves the speed of maximum matching. Finally, it compares the two dictionary mechanisms and gives an experimental analysis.
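The trade-off between the existing dictionary mechanisms can be illustrated with a minimal Python sketch of forward maximum matching. It is not the layered binary word mechanism proposed in the thesis, whose details the abstract does not give. The class and function names (`SortedWordDictionary`, `TrieDictionary`, `segment_by_trial_match`, `segment_by_trie`) and the tiny word list are hypothetical, chosen only for illustration.

```python
from bisect import bisect_left


class SortedWordDictionary:
    """Whole-word dictionary: a sorted list queried by binary search (assumed layout)."""

    def __init__(self, words):
        self._words = sorted(set(words))
        self.max_len = max((len(w) for w in self._words), default=1)

    def __contains__(self, word):
        i = bisect_left(self._words, word)
        return i < len(self._words) and self._words[i] == word


class TrieDictionary:
    """Trie dictionary: nested dicts walked one character at a time (assumed layout)."""

    def __init__(self, words):
        self._root = {}
        for word in words:
            node = self._root
            for ch in word:
                node = node.setdefault(ch, {})
            node[""] = True  # end-of-word marker; "" is never a character key

    def longest_match(self, text, pos):
        """Length of the longest dictionary word starting at pos (1 if none)."""
        node, best = self._root, 1
        for i in range(pos, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "" in node:
                best = i - pos + 1
        return best


def segment_by_trial_match(text, dictionary):
    """Forward maximum matching: try the longest candidate first, then shrink."""
    tokens, pos = [], 0
    while pos < len(text):
        longest = min(dictionary.max_len, len(text) - pos)
        for size in range(longest, 0, -1):
            candidate = text[pos:pos + size]
            # Each trial costs one full binary search; a single character
            # is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


def segment_by_trie(text, dictionary):
    """Forward maximum matching with one character-by-character walk per token."""
    tokens, pos = [], 0
    while pos < len(text):
        size = dictionary.longest_match(text, pos)
        tokens.append(text[pos:pos + size])
        pos += size
    return tokens


words = ["研究", "研究生", "生命", "起源", "的"]
text = "研究生命的起源"
print(segment_by_trial_match(text, SortedWordDictionary(words)))
print(segment_by_trie(text, TrieDictionary(words)))
```

Both calls print `['研究生', '命', '的', '起源']`. The sorted-list version keeps a flat, compact structure but may run several binary searches per token, while the trie version finds each token in a single walk at the cost of a node per character. The result also shows the greedy nature of maximum matching: it picks "研究生" even though "研究 / 生命" is the intended reading.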
Keywords/Search Tags: search engines, related technologies, Chinese word segmentation, verbatim binary search