Research on Opinion Target Extraction in Chinese Microblog

Posted on: 2017-03-19
Degree: Master
Type: Thesis
Country: China
Candidate: H Mei
Full Text: PDF
GTID: 2308330488485684
Subject: Computer application technology
Abstract/Summary:
With the rapid development of we-media technology, netizens have gradually shifted from passive recipients of information to its publishers and distributors. Microblogs hold vast amounts of text data containing information of great value for research and applications. In recent years, research on microblog sentiment analysis has developed rapidly and become more in-depth and fine-grained. As one of its key tasks, opinion target extraction has gradually attracted researchers' attention because of its value in applications such as text summarization and public opinion analysis. Research on Chinese microblogs, however, remains insufficient, and inherent properties of microblog text, such as nonstandard language and unclear sentence structure, make it hard to study. This thesis therefore takes Chinese microblog text as its research object and opinion target extraction as its research task, and seeks a more effective opinion target extraction method for Chinese microblogs. We divide opinion target extraction into two steps, candidate extraction and standard target extraction, and improve each step. Specifically, the work comprises the following four points.

First, in the candidate extraction step, we improve the hashtag segmentation and rule-based opinion target candidate extraction method from existing research. The existing method segments hashtags based on Symmetrical Conditional Probability, adds the resulting segments to a user dictionary to further segment the microblog text, and then extracts candidates with a rule-based method. Because the dictionaries of traditional segmentation tools have significant limitations, this method cannot properly segment microblogs that contain many Internet buzzwords. Furthermore, the rules are too simple and coarse to extract candidates well.
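To make the segmentation criterion concrete, the following minimal Python sketch computes Symmetrical Conditional Probability, SCP(x, y) = p(xy)^2 / (p(x) p(y)), from raw substring counts and cuts a hashtag wherever two adjacent characters show low cohesion. The toy corpus, the `threshold` value, the adjacent-character cut rule, and all function names are illustrative assumptions; they are not the procedure of the thesis or of the existing method it builds on.

```python
from collections import Counter

def substring_counts(texts, max_len):
    """Count every character substring of length 1..max_len in the texts."""
    counts = Counter()
    for text in texts:
        for i in range(len(text)):
            for j in range(i + 1, min(len(text), i + max_len) + 1):
                counts[text[i:j]] += 1
    return counts

def scp(left, right, counts):
    """SCP(left, right) = p(left+right)^2 / (p(left) * p(right)).

    Assumption: all probabilities share one normalizer, so it cancels
    and the score reduces to a ratio of raw counts.
    """
    c_left, c_right = counts[left], counts[right]
    if c_left == 0 or c_right == 0:
        return 0.0
    return counts[left + right] ** 2 / (c_left * c_right)

def segment(tag, counts, threshold):
    """Hypothetical cut rule: split between adjacent characters whose
    SCP falls below the threshold."""
    if not tag:
        return []
    pieces, current = [], tag[0]
    for prev, nxt in zip(tag, tag[1:]):
        if scp(prev, nxt, counts) < threshold:
            pieces.append(current)
            current = nxt
        else:
            current += nxt
    pieces.append(current)
    return pieces

# Latin letters stand in for Chinese characters in this toy corpus.
counts = substring_counts(["abcd", "abxy", "cdab", "cd"], max_len=2)
print(segment("abcd", counts, threshold=0.5))  # ['ab', 'cd']
```

In this toy corpus "a" is always followed by "b" and "c" by "d", so both pairs score 1.0, while "b" is followed by "c" only once in three occurrences and scores 1/9, which is where the cut falls.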
In this thesis, we collected several cell thesauri (cell lexicons) from Chinese input method software to build new user dictionaries. In addition, we optimized and extended the extraction rules. Experimental results demonstrated the effectiveness of the new method.

Second, in the standard target extraction step, we propose an improved clustering-based, multi-graph parallel label propagation algorithm. The existing label propagation based opinion target extraction algorithm (LPA) uses all the messages in a topic indiscriminately to build one undirected graph and then runs LPA to extract opinion targets collectively. This method ignores the fact that a topic contains different discussion aspects, each with its own expression style and word usage. Building the graph indiscriminately leads to wrong propagation paths and effects, and the errors accumulate during propagation. To avoid this problem, we first divide all the messages in a topic into several categories by clustering, then build an undirected graph for each category, and then run LPA on the graphs in parallel. This method avoids the confidence inequality problem. Experimental results showed that the improved method clearly raises extraction performance.

Third, sentence similarity calculation is an important part of the standard target extraction step, and we propose an improved sentence similarity calculation method based on context and shallow lexical features. Similarity calculation is a very important step in LPA: its accuracy directly affects the propagation process over the whole graph, and thus the final extraction result. LPA takes each sentence as a node in the undirected graph and uses the cosine of two sentences under the standard vector space model (VSM) as their similarity. This similarity measure loses context information.
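As a point of reference for the baseline just described, here is a minimal Python sketch of cosine similarity between two sentences under a bag-of-words vector space model. It assumes the sentences are already segmented into tokens and uses raw term counts as weights; the weighting scheme actually used in the existing method is not stated in the abstract, and the function name is illustrative.

```python
import math
from collections import Counter

def cosine_similarity(tokens_a, tokens_b):
    """Cosine of two token lists represented as raw term-count vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both sentences contribute to the dot product.
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Two short, pre-segmented sentences sharing one of two tokens each.
print(cosine_similarity(["screen", "good"], ["screen", "bad"]))
```

The score depends only on the tokens inside the two sentences themselves, which is why a short microblog sentence whose meaning rests on neighboring messages can receive a misleading score.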
For loosely structured text such as microblogs, understanding a sentence often relies on its context, and the expressive power of a single sentence is limited. Therefore, in addition to the inherent lexical features of a sentence, this thesis also takes its context into account and designs a method that integrates context information with shallow lexical features.

Fourth, in the LPA-based opinion target extraction method, candidate similarity also influences the propagation process, and we improve the candidate similarity calculation used in the existing method. The existing label propagation based algorithm uses the Jaccard index, computed over the Chinese characters shared by two candidates, as their similarity. This measure is coarse because it considers only morphological features; it can easily cause propagation errors and distort the confidence ranking, thus reducing final extraction performance. Building on existing research, this thesis proposes a candidate similarity calculation method for Chinese microblogs based on Tongyici Cilin and morphological characteristics, which integrates morphological and semantic features to compute the similarity of candidates.
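The limitation of the character-level Jaccard index can be seen in a small Python sketch. It computes the Jaccard index over the sets of characters in two candidate strings; the function name and the example words are illustrative assumptions, and the semantic component based on Tongyici Cilin is not shown because the abstract does not specify how it is combined.

```python
def char_jaccard(candidate_a, candidate_b):
    """Jaccard index over the sets of characters in two candidates."""
    set_a, set_b = set(candidate_a), set(candidate_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Shared characters give a high score: 2 shared of 4 distinct characters.
print(char_jaccard("手机", "手机屏幕"))  # 0.5

# Near-synonyms ("computer") with no shared characters score zero.
print(char_jaccard("电脑", "计算机"))  # 0.0
```

The second pair shows why a purely morphological measure is coarse: two candidates that mean the same thing receive no similarity at all, which is the gap a synonym resource is intended to close.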
Keywords/Search Tags: Hashtag Segmentation, Opinion Target Extraction, Label Propagation