Research On Chinese Word Segmentation And User Identification Based On Feature Alignment

Posted on:2020-04-29

Degree:Master

Type:Thesis

Country:China

Candidate:K L Feng

Full Text:PDF

GTID:2428330590971699

Subject:Computer Science and Technology

Abstract/Summary:

Chinese Word Segmentation (CWS) is the basis of Natural Language Processing (NLP) for Chinese text. Because Chinese text has no explicit spaces between words as English does, the single character is the basic unit of the text. Extracting useful information from such text therefore requires precise word segmentation before subsequent NLP work can proceed. However, true ambiguity and the Out-of-Vocabulary problem, the two main technical difficulties of CWS, have not yet been solved well. User identification, an essential part of Named Entity Recognition (NER), plays a key role in handling the complex user information found on the Internet. Sequence labeling models, in particular Conditional Random Fields (CRFs), are effective for both CWS and NER. To further improve the performance of CWS and user identification extraction, this thesis proposes a new method based on feature alignment, in which a classifier and CRFs are integrated to carry out the sequence labeling task. The main work of this thesis is as follows:

1. Combining a classifier with CRFs, a Chinese Word Segmentation method based on feature alignment is proposed. First, 19 features are extracted for the bigrams in the text, such as word frequency, information entropy, mutual information, number, punctuation, and contextual information, so that each bigram can be represented as a 19-dimensional vector. Second, 13 frequency-related features in the labeled and unlabeled data are aligned using Earth Mover's Distance (EMD), which eases the scale difference between the labeled and unlabeled data. Third, the aligned features of the labeled data are used as the training set for an XGBoost classifier, which predicts the word probability of bigrams in the unlabeled data. The classifier's outputs are then used as features for the CRFs, which are trained for sequence labeling. Finally, the results on the unlabeled data are obtained.

2. As a practical application of the above method, a user identification method based on feature alignment is proposed, again integrating a classifier with CRFs. The feature attributes of bigrams are obtained through feature engineering. After feature alignment, user identification entities are extracted by stacking a classifier and CRFs.

Experimental results with feature alignment and the combination of XGBoost and CRFs demonstrate that the method is effective. It not only addresses the overfitting caused by directly adding too many features to CRFs and reduces training time, but also improves the performance of CWS and user identification. It can also lay the foundation for constructing knowledge graphs in NLP.
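The feature extraction and alignment steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the abstract does not give the exact feature definitions or the alignment procedure, so the function names (`bigram_features`, `emd_1d`, `align_to_reference`), the choice of two example features (relative frequency and pointwise mutual information), and the rank-based mapping used to align one sample to another are all assumptions made for illustration. The sketch also assumes equal-size samples when computing the one-dimensional EMD.

```python
import math
from collections import Counter


def bigram_features(text):
    """Return {bigram: (relative frequency, pointwise mutual information)}.

    Only two illustrative features are computed here; the thesis uses 19.
    """
    chars = Counter(text)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    n_chars = sum(chars.values())
    n_bigrams = sum(bigrams.values())
    feats = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bigrams
        p_x = chars[bg[0]] / n_chars
        p_y = chars[bg[1]] / n_chars
        feats[bg] = (p_xy, math.log(p_xy / (p_x * p_y)))
    return feats


def emd_1d(a, b):
    """Earth Mover's Distance between two equal-size 1-D samples.

    For equal-size samples this is the mean absolute difference
    between the sorted values.
    """
    if len(a) != len(b):
        raise ValueError("this sketch only handles equal-size samples")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def align_to_reference(values, reference):
    """Map each value to the reference value of the corresponding rank.

    Assumption: rank-based mapping stands in for the alignment step;
    the thesis may align features differently.
    """
    ref = sorted(reference)
    order = sorted(range(len(values)), key=lambda i: values[i])
    aligned = [0.0] * len(values)
    for rank, i in enumerate(order):
        # Position of this rank within the sorted reference sample.
        j = round(rank * (len(ref) - 1) / max(len(values) - 1, 1))
        aligned[i] = ref[j]
    return aligned


# Hypothetical frequency feature measured on two corpora of different scale.
labeled_freq = [1.0, 2.0, 3.0, 4.0]
unlabeled_freq = [30.0, 10.0, 40.0, 20.0]
before = emd_1d(labeled_freq, unlabeled_freq)
aligned_freq = align_to_reference(unlabeled_freq, labeled_freq)
after = emd_1d(labeled_freq, aligned_freq)
```

In this toy case the EMD between the two samples drops from 22.5 to 0.0 after alignment, which is the kind of scale difference the alignment step is meant to reduce. The aligned feature vectors would then feed the classifier whose predicted word probabilities become CRF features.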

Keywords/Search Tags:

Feature Alignment, Chinese Word Segmentation, User Identification, Conditional Random Fields, XGBoost

Related items

1	The Research Of Applying Conditional Random Fields To Chinese Word Segmentation And Part-Of-Speech Tagging
2	Research Of Chinese Word Segmentation With Conditional Random Fields
3	Research And System Implementation Of Chinese Word Segmentation In Specialized Fields Based On Conditional Random Fields
4	Research And Application Of Chinese Word Segmentation Based On Conditional Random Fields
5	Research And Implementation Of Chinese Segmentation System Based On Conditional Random Fields Model
6	The Research On Chinese Word Segmentation Based On Conditional Random Fields In Big Data Environment
7	Excellent Cross-validation Based Model Selection Method For Chinese Word Segmentation System Design And Development
8	Research On The Recognition Of Focus Word In Chinese Question
9	The Research Of Chinese Word Segmentation Based On CRF
10	Research Of Named Entity Recognition Based On Conditional Random Fields