Research On Chinese Word Segmentation And User Identification Based On Feature Alignment

Posted on: 2020-04-29
Degree: Master
Type: Thesis
Country: China
Candidate: K L Feng
Full Text: PDF
GTID: 2428330590971699
Subject: Computer Science and Technology
Abstract/Summary:
Chinese Word Segmentation (CWS) is the basis of Natural Language Processing (NLP). Chinese text treats the single character as its basic unit and, unlike English, has no spaces separating words. Extracting useful information from such text therefore requires accurate word segmentation, on which the subsequent NLP work depends. However, two technical difficulties of CWS, true ambiguity and the Out-of-Vocabulary (OOV) problem, have not yet been solved well. User identification, an essential part of Named Entity Recognition (NER), plays a key role in processing the complex user information found on the Internet. Sequence labeling models, in particular Conditional Random Fields (CRFs), are effective for both CWS and NER. To further improve the performance of CWS and user identification extraction, this thesis proposes a new method based on feature alignment, in which a classifier and CRFs are integrated to carry out sequence labeling. The main work of this thesis is as follows:

1. Combining a classifier with CRFs, a Chinese Word Segmentation method based on feature alignment is proposed. First, 19 features are extracted for each bigram in the text, such as word frequency, information entropy, mutual information, number, punctuation, and contextual information, so that each bigram is represented as a 19-dimensional vector. Second, 13 frequency-related features are aligned between the labeled and unlabeled data using Earth Mover's Distance (EMD), which eases the scale difference between the two data sets. Third, the aligned features of the labeled data serve as the training set for an XGBoost classifier, which predicts the probability that each bigram in the unlabeled data is a word. The classifier's outputs are then used as features of the CRFs, which are trained for sequence labeling. Finally, the results on the unlabeled data are obtained.

2. As a practical application of the above method, a user identification method based on feature alignment is proposed, again integrating a classifier with CRFs. The feature attributes of bigrams are obtained through feature engineering. After feature alignment, user identification entities are extracted by stacking a classifier and CRFs.

Experimental results with feature alignment and the combination of XGBoost and CRFs demonstrate that the method works. It not only mitigates the overfitting caused by directly adding too many features to CRFs and reduces training time, but also improves the performance of CWS and user identification. Furthermore, it can lay a foundation for the construction of knowledge graphs in NLP.
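
The abstract does not give the feature formulas or say how EMD is used to align features, so the following Python sketch is only an illustration under stated assumptions. It computes two of the named bigram features (relative frequency and pointwise mutual information), measures the one-dimensional EMD between a feature's values on labeled and unlabeled data, and aligns them by rank matching, which is the optimal transport map in one dimension. The function names `bigram_features`, `emd_1d`, and `align_by_rank` are hypothetical, equal sample sizes are assumed for simplicity, and the XGBoost and CRF stacking stage is not shown.

```python
import math
from collections import Counter


def bigram_features(corpus):
    """Return {bigram: (relative frequency, pointwise mutual information)}.

    Assumption: the thesis's exact feature definitions are not given in
    the abstract; these are the standard textbook forms.
    """
    chars = Counter()
    bigrams = Counter()
    for sentence in corpus:
        chars.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    feats = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = chars[bg[0]] / n_chars
        p_y = chars[bg[1]] / n_chars
        feats[bg] = (p_xy, math.log(p_xy / (p_x * p_y)))
    return feats


def emd_1d(a, b):
    """Earth Mover's Distance between two equal-sized 1-D samples.

    With uniform weights and equal sizes, the 1-D EMD is the mean
    absolute difference between the sorted samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def align_by_rank(source, target):
    """Map each source value to the target value of the same rank.

    Assumption: rank matching stands in for the thesis's alignment step,
    whose details are not described in the abstract.
    """
    if len(source) != len(target):
        raise ValueError("this sketch assumes equal sample sizes")
    sorted_target = sorted(target)
    order = sorted(range(len(source)), key=lambda i: source[i])
    aligned = [0.0] * len(source)
    for rank, i in enumerate(order):
        aligned[i] = sorted_target[rank]
    return aligned
```

For example, calling `align_by_rank(unlabeled_freqs, labeled_freqs)` on one frequency feature puts the unlabeled values on the labeled scale, so that `emd_1d` between the aligned values and the labeled values drops to zero while the ordering of the unlabeled bigrams is preserved.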
Keywords/Search Tags: Feature Alignment, Chinese Word Segmentation, User Identification, Conditional Random Fields, XGBoost