
Research And Implementation Of New Word Recognition Based On N-gram And Hybrid Strategy

Posted on: 2019-12-26
Degree: Master
Type: Thesis
Country: China
Candidate: J K Feng
GTID: 2428330590975366
Subject: Natural language processing
Abstract/Summary:
Word segmentation is a fundamental problem in natural language processing. Much progress has been made, but many difficulties and challenges remain. For Chinese word segmentation in particular, features of the language give rise to serious problems that affect performance, and the new word problem is one of them. When a segmentation tool encounters a new word, it splits the word into many "word fragments", losing the information the original word carried. New words have thus become a bottleneck restricting the accuracy of Chinese word segmentation. There are many research results in the field of new word recognition, but rule-based methods are too restrictive and statistics-based methods perform poorly. Therefore, building on an analysis of previous research, this thesis proposes a method that segments the corpus with an N-gram statistical model and acquires new words with a hybrid strategy. The hybrid strategy shows in two places. The first is obtaining candidate new words through multiple statistical filters; the second is filtering the candidates with a combination of a rule-based stop dictionary and a common stop dictionary. The statistical method considers both the internal composition and the external context of a word, using mutual information, left and right neighbor information entropy, and word frequency matching. After the statistical processing, the candidates are filtered first with the rule-based dictionary and then with the common stop dictionary. Processing the statistical results in several passes improves the performance of new word recognition. The work of this thesis covers the following aspects: (1) using Web crawler technology to crawl news from large portals and build a corpus; (2) designing and implementing a new word recognition method based on actual application scenarios and requirements; (3) designing experiments on the statistical measures used in the method, examining the impact of each statistic, and of stop-word filtering, on new word recognition performance; (4) summarizing the work of the thesis and the defects of the method, and proposing directions for further research. The proposed method achieves a precision of 80%, a recall of 52%, and an F-value of 64%. Because the common stop dictionary is continuously extended, the stop words it contains become more comprehensive after repeated rounds of new word recognition, so the filtering effect, and with it the performance of the method, should improve over repeated runs.
Keywords/Search Tags:New word recognition, N-gram, Mutual information, Information entropy
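
The statistical filters named in the abstract (word frequency, mutual information, left and right neighbor information entropy, followed by dictionary filtering) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the thresholds, the use of character-level N-grams, and the choice of the text length as the probability normalizer are all assumptions made for illustration.

```python
import math
from collections import Counter


def ngram_counts(text, max_n):
    """Count every character n-gram of length 1..max_n in the text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counter.values())


def min_pmi(word, counts, total):
    """Internal cohesion: the smallest pointwise mutual information
    over all two-part splits of the candidate (assumed scoring rule)."""
    p_word = counts[word] / total
    best = float("inf")
    for k in range(1, len(word)):
        p_left = counts[word[:k]] / total
        p_right = counts[word[k:]] / total
        best = min(best, math.log2(p_word / (p_left * p_right)))
    return best


def neighbor_entropy(text, word):
    """External context: entropy of the characters seen immediately
    to the left and to the right of each occurrence of the candidate."""
    left, right = Counter(), Counter()
    start = text.find(word)
    while start != -1:
        if start > 0:
            left[text[start - 1]] += 1
        end = start + len(word)
        if end < len(text):
            right[text[end]] += 1
        start = text.find(word, start + 1)
    return entropy(left), entropy(right)


def candidate_new_words(text, known_words, stop_words, max_n=4,
                        min_freq=2, min_mi=1.0, min_entropy=0.5):
    """Apply the filters in sequence; all thresholds are illustrative."""
    counts = ngram_counts(text, max_n)
    total = len(text)  # assumed normalizer for all n-gram probabilities
    result = []
    for gram, freq in counts.items():
        if len(gram) < 2 or freq < min_freq:
            continue  # word frequency filter
        if gram in known_words or gram in stop_words:
            continue  # dictionary and stop-dictionary filter
        if min_pmi(gram, counts, total) < min_mi:
            continue  # mutual information filter
        left_h, right_h = neighbor_entropy(text, gram)
        if min(left_h, right_h) < min_entropy:
            continue  # left-right neighbor entropy filter
        result.append(gram)
    return result
```

A candidate survives only if it is frequent, its parts co-occur more often than chance would suggest, and it appears in varied left and right contexts. In this sketch the stop dictionaries are applied before the statistics purely to save computation, whereas the abstract applies them afterward; the surviving set is the same either way.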