| In recent decades, the efforts of researchers at home and abroad have produced a number of commonly used Chinese word segmentation algorithms. The main ones are mechanical (lexicon-based) word segmentation, understanding-based word segmentation, and statistics-based word segmentation. Each of these algorithms has its own advantages and limitations. Building on an analysis of these algorithms, I designed a Chinese word segmentation algorithm based on a dictionary and Bayes' theorem. I built a dictionary that contains commonly used words and other feature words, and that can be updated from the text of a corpus. To meet the algorithm's need for fast lookup, the dictionary is stored in a hash table combined with linked lists. Bayes' theorem is applied to compute the probability of each candidate segmentation from the word probabilities recorded in the lexicon. A bigram model is used to resolve ambiguity. The algorithm thus combines the advantages of lexicon-based and statistics-based Chinese word segmentation. Extensive testing shows that the algorithm handles ambiguity and unknown words better than other algorithms, and that it can meet the basic needs of a Chinese-language information system. |
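
As an illustration of how a lexicon and Bayes' theorem can be combined, the following minimal sketch picks the most probable segmentation of a sentence. It is not the exact formula used in this work. It assumes that, for a sentence S and a candidate segmentation W, P(W | S) is proportional to P(S | W) P(W), that P(S | W) = 1 whenever the words of W concatenate to S, and that words are independent, so P(W) is the product of the word probabilities. The lexicon contents, the probability values, the unknown-character penalty, and the function name `segment` are all hypothetical. A Python `dict` stands in for the hash table, and the bigram disambiguation step is omitted.

```python
import math

# Hypothetical toy lexicon: word -> probability (made-up values for illustration).
LEXICON = {
    "研究": 0.02,
    "研究生": 0.01,
    "生命": 0.015,
    "生": 0.005,
    "命": 0.005,
    "的": 0.05,
    "起源": 0.01,
}
# Assumed log-probability for a single character that is not in the lexicon.
UNKNOWN_LOGP = math.log(1e-8)
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def segment(text):
    """Return the segmentation of `text` with the highest product of word probabilities."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of the last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in LEXICON:
                logp = math.log(LEXICON[word])
            elif len(word) == 1:
                logp = UNKNOWN_LOGP  # fall back to a single unknown character
            else:
                continue  # multi-character strings must be lexicon words
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start)
    # Walk the back-pointers from the end of the sentence to recover the words.
    words = []
    end = n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("研究生命的起源"))
```

With these made-up probabilities, the candidate 研究 / 生命 / 的 / 起源 scores higher than 研究生 / 命 / 的 / 起源, which shows how lexicon probabilities alone can already settle some overlapping ambiguities before any bigram step is applied.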