Research On Eigenvector Mapping Algorithm Based On Multi-label

Posted on: 2019-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: T Wang
GTID: 2348330542498863
Subject: Information and Communication Engineering
Abstract/Summary:
Multi-label text classification is a long-standing difficulty in Natural Language Processing (NLP), and one of the hardest problems in text classification, because the number of class labels per document is not known in advance. Current multi-label text classification algorithms focus mainly on the output space of the classifier, while text vectorization in the input space has received comparatively little attention. Because annotating text for classification requires considerable manual effort, a good text representation algorithm is crucial for improving classification performance when the annotated sample is relatively small. This thesis proposes an eigenvector (feature vector) mapping algorithm based on multi-label information and, building on it, an improved multi-view semi-supervised learning algorithm that further improves classification performance, in order to provide data support for public opinion analysis on tobacco control. The main work falls into three parts.

First, a large number of news reports related to tobacco control are crawled from several major news search engine websites, and part of the data is manually annotated with multiple labels and preprocessed.

Second, the current state of text vectorization and multi-label classification is analyzed, and concrete improvements are proposed for some of the present shortcomings. The text representation is based on word embeddings, which avoids the uncontrolled vector dimensionality and the lack of semantic information found in traditional multi-label text classification. In the classifier's input space, where the labels are highly correlated, the feature vector of each text is mapped using the feature information of the positive and negative samples of each label, so that the same news item receives a different vector representation under each label. The effectiveness of the algorithm is verified on a tobacco control dataset.

Finally, to make full use of unlabeled news data rather than waste it, and to further improve classification performance, the thesis improves semi-supervised learning on top of the mapped feature vector representation. Exploiting the structure of news data, it builds a multi-view setup with different classifiers for news headlines and body text, takes concrete measures against sample imbalance, and borrows from ensemble learning in the final decision stage, which improves the generalization ability of the model.
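The abstract does not give the mapping formula, so the following is only a minimal sketch of one way a label-dependent mapping could look, not the thesis's actual algorithm. It assumes each document is already a fixed-length vector (for example, averaged word embeddings) and, as an illustrative choice, appends the distances to the centroids of each label's positive and negative samples. The function names and the labels "policy" and "health" are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fit_label_centroids(doc_vectors, label_sets, labels):
    """For each label, compute the centroids of its positive and negative documents."""
    centroids = {}
    for label in labels:
        pos = [v for v, ls in zip(doc_vectors, label_sets) if label in ls]
        neg = [v for v, ls in zip(doc_vectors, label_sets) if label not in ls]
        if not pos or not neg:
            raise ValueError(f"label {label!r} needs positive and negative samples")
        centroids[label] = (centroid(pos), centroid(neg))
    return centroids


def map_features(vector, centroids):
    """Return one mapped vector per label.

    Assumed mapping (illustrative only): the original vector followed by its
    distance to the label's positive centroid and to its negative centroid.
    """
    return {
        label: list(vector) + [euclidean(vector, pos_c), euclidean(vector, neg_c)]
        for label, (pos_c, neg_c) in centroids.items()
    }


# Toy data: four 2-dimensional document vectors with hypothetical labels.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
label_sets = [{"policy"}, {"health"}, {"policy", "health"}, set()]

centroids = fit_label_centroids(docs, label_sets, ["policy", "health"])
mapped = map_features([1.0, 0.0], centroids)

for label, vec in mapped.items():
    print(label, [round(x, 3) for x in vec])
```

In this sketch the same document yields a different input vector for each label, so a separate binary classifier per label could be trained on its own mapped features.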
Keywords/Search Tags: multi-label text classification, word embeddings, text representation, eigenvector mapping, semi-supervised learning