Experimental Study On The Fusion Of Dictionary Segmentation And Model Word Segmentation In Chinese

Posted on: 2020-12-18
Degree: Master
Type: Thesis
Country: China
Candidate: T T Fang
GTID: 2405330596474389
Subject: Applied statistics
Abstract/Summary:
Chinese word segmentation divides a sequence of characters into separate, recognizable words according to certain rules. Because Chinese text is written as a continuous string of characters with no delimiter between words, segmentation is the first step in Chinese natural language processing, and it is a basic step that cannot be avoided. The quality of the segmentation result directly determines the quality of the final result.

Knowledge now changes and iterates quickly, which places higher demands on the flexibility and rigor of segmentation methods. The popularity of the Internet has produced a large number of new words. These new words not only reflect the growth of the Internet but also pose new challenges for dictionary-based segmentation, and handling them quickly and efficiently is a key research problem for dictionary-based segmentation.

At present, word segmentation systems mainly rely on either dictionary segmentation or machine learning segmentation. Dictionary segmentation is controllable and fast, but it cannot correctly segment unregistered (out-of-vocabulary) words. The machine learning model CRF (conditional random field) can handle unregistered words, but training a CRF requires many manually designed features, and verifying the validity of each feature takes a lot of time. Deep-learning-based algorithms for natural language processing have since emerged, but their controllability is not as good as that of dictionary segmentation. When a new word is segmented incorrectly, dictionary segmentation can fix the problem quickly by adding the word to the dictionary, whereas a model may need a large amount of relevant training corpus, which is often difficult or very costly to obtain.

To improve automatic Chinese word segmentation in this situation, this thesis implements the dictionary segmentation module with the MMseg algorithm and uses BI-LSTM+CRF as the model segmentation module. Combining the two retains the controllability of dictionary segmentation while using model segmentation to handle the unregistered words that dictionary segmentation misses. The experiments are run on the Bakeoff corpus from SIGHAN's Chinese processing evaluation. The MMseg dictionary segmentation module is implemented first, then the parameters of the model segmentation algorithm are tuned, and finally the results of the two are combined. The experiments show that precision, recall, and F1 all improve after fusion, and that the fusion addresses both controllability and unregistered words.
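To make the dictionary side and the evaluation metrics concrete, here is a minimal Python sketch. It is not the thesis's implementation: it uses plain forward maximum matching, a simpler greedy rule than the full MMseg algorithm, and it scores a segmentation by comparing character spans against a reference. The function names, the toy dictionary, and the example sentence are all assumptions chosen for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy longest-match segmentation (a simplification, not full MMseg).

    Characters that start no dictionary word are emitted as single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold_words, pred_words):
    """A predicted word counts as correct only if both boundaries match the reference."""
    gold, pred = to_spans(gold_words), to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical toy dictionary and sentence, for illustration only.
toy_dictionary = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"
gold = ["研究", "生命", "起源"]

predicted = forward_max_match(sentence, toy_dictionary, max_len=3)
scores = precision_recall_f1(gold, predicted)
print(predicted, scores)
```

On this sentence the greedy rule takes the longest match "研究生" first and so gets only one of the three reference words right. Adding a word to `toy_dictionary` changes the output immediately, which is the controllability the abstract attributes to dictionary segmentation.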
Keywords/Search Tags:dictionary segmentation, Chinese word segmentation, conditional random field, Natural language processing