Hash Structure-based Mechanical Statistical Word Segmentation System

Posted on: 2006-08-12
Degree: Master
Type: Thesis
Country: China
Candidate: Y Yang
Full Text: PDF
GTID: 2208360182468831
Subject: Computer technology
Abstract/Summary:
As the foundation of Chinese information processing, Chinese word segmentation has attracted considerable interest from computer scientists in China and abroad, and many Chinese word segmentation systems have appeared as a result. Based on a comprehensive comparison and analysis of the two commonly used approaches, mechanical and statistical Chinese word segmentation, this thesis proposes and implements a combined mechanical-statistical Chinese word segmentation system based on a Hash structure.
To combine the two methods closely, offset the weaknesses of each, and make the best use of both, the thesis investigates the following aspects. In mechanical segmentation: changing the matching length of the maximum matching method dynamically rather than statically, in order to reduce unnecessary matching operations; using word frequency information as an additional segmentation criterion, in order to compensate for the shortcomings of the "long word first" criterion; and using a segmentation dictionary based on a Hash structure to make segmentation more efficient. In statistical segmentation: to make the statistics operation more efficient, the thesis generalizes the concept of the segmentation unit, interleaves the statistics operation with the mechanical segmentation operation, and stores the statistical results in a Hash structure, which also raises the speed of mechanical segmentation.
After the system was implemented, it was tested on a large amount of text. A detailed analysis of the results, using induction and curve fitting, shows that the domain and the length of the text influence both the speed and the accuracy of segmentation. The performance of the system also varies with the vocabulary of the dictionary used.
Overall, the system segments more than 12,000 Chinese characters per second and shows a strong capacity for finding new words that do not exist in the dictionary.
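The abstract does not give the thesis's actual data structures, so the following is only a minimal sketch of one way the mechanical part could look: forward maximum matching over a hash-based dictionary, where the matching length is chosen per first character instead of being fixed globally. The names `build_index` and `segment`, the bucket layout, and the sample dictionary are all assumptions made for illustration. The frequency-based criterion and the statistical component are not shown, because the abstract does not specify how they work.

```python
from collections import defaultdict


def build_index(freqs):
    """Group dictionary words into hash buckets keyed by first character.

    Assumed layout (not taken from the thesis):
      index[first_char]   -> {word: frequency}
      max_len[first_char] -> length of the longest word starting with it
    """
    index = defaultdict(dict)
    for word, freq in freqs.items():
        index[word[0]][word] = freq
    max_len = {ch: max(len(w) for w in bucket) for ch, bucket in index.items()}
    return dict(index), max_len


def segment(text, index, max_len):
    """Forward maximum matching with a per-character (dynamic) match length."""
    tokens = []
    i = 0
    while i < len(text):
        ch = text[i]
        bucket = index.get(ch)
        match = ch  # fall back to a single character
        if bucket:
            # Only try lengths that some word starting with `ch` can have.
            limit = min(max_len[ch], len(text) - i)
            for n in range(limit, 1, -1):
                candidate = text[i:i + n]
                if candidate in bucket:  # hash lookup
                    match = candidate
                    break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary with made-up frequencies.
freqs = {"研究": 10, "研究生": 5, "生命": 8, "起源": 6, "的": 100}
index, max_len = build_index(freqs)
print(segment("研究生命的起源", index, max_len))
# prints ['研究生', '命', '的', '起源']
```

With this toy dictionary, pure "long word first" matching picks 研究生 and leaves 命 stranded, although 研究 / 生命 is the intended reading. This is the kind of error that an additional criterion such as word frequency is meant to address.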
Keywords/Search Tags: Chinese Word Segmentation, Mechanical Chinese Word Segmentation, Statistical Chinese Word Segmentation, Hash Structure