New Word Discovery Based On Large-scale Corpus And Improving Chinese Segmentation System

Posted on: 2016-04-19
Degree: Master
Type: Thesis
Country: China
Candidate: L P Du
Full Text: PDF
GTID: 2308330470474844
Subject: Software engineering
Abstract/Summary:
Chinese word segmentation is fundamental to Chinese natural language processing and information extraction. With the rapid development of Web 2.0 technology, recognizing new internet words has become the main problem and bottleneck for Chinese segmentation. We present an unsupervised method for identifying new internet words in a large-scale web corpus, which combines an improved Point-wise Mutual Information (PMI) algorithm, PMI^k, with some basic rules. The method can recognize new internet words of length 2 to n, where n can be set to any value as needed. In experiments on a 257MB Baidu Tieba corpus, the precision of our system reached 84.8% when the parameter of the PMI^k algorithm was set to 5, an increase of 16.2% over the PMI method. The results show that our system is effective and efficient at detecting new words in a large-scale web corpus. We then detect the part of speech (POS) of these new words based on the large-scale web corpus, compile the results of new word discovery and POS detection into a user dictionary, and load that dictionary into ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System). In experiments on a 10KB Baidu Tieba corpus, precision, recall, and F-measure improved by 7.93%, 3.73%, and 5.91% respectively compared to ICTCLAS. The results show that new word discovery can significantly improve segmentation performance on web corpora.
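
The abstract does not spell out the PMI^k formula, so the following is only a minimal sketch that assumes the commonly used definition PMI^k(x, y) = log2(p(x, y)^k / (p(x) p(y))), which reduces to plain PMI at k = 1. The function name `pmi_k` and the toy token list are hypothetical, and the thesis's additional rules and its handling of candidates longer than two units are not represented here.

```python
import math
from collections import Counter


def pmi_k(tokens, k=1):
    """Score each adjacent pair (x, y) as log2(p(x, y)^k / (p(x) * p(y))).

    Assumed definition of PMI^k; k=1 gives ordinary PMI.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs
    scores = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        # log2(p_xy^k / (p_x * p_y)), expanded to avoid tiny powers
        scores[(x, y)] = k * math.log2(p_xy) - math.log2(p_x) - math.log2(p_y)
    return scores


# Toy data: ("a", "b") occurs twice, ("c", "d") only once.
tokens = ["a", "b", "a", "b", "c", "d"]
plain = pmi_k(tokens, k=1)
raised = pmi_k(tokens, k=3)
```

On this toy input, plain PMI ranks the one-off pair ("c", "d") above the repeated pair ("a", "b"), while k = 3 reverses the order. This tendency of PMI to favor rare pairs is the usual motivation for raising the joint probability to a power; whether the thesis applies it in exactly this form is an assumption.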
Keywords/Search Tags: new word extraction, POS detecting, knowledge acquisition, Point-wise Mutual Information (PMI), improved Point-wise Mutual Information algorithm (PMI^k), improve segmentation system