
Research And Implementation Of Multi-document Automatic Summary System Of Public Opinion Data

Posted on: 2019-04-30
Degree: Master
Type: Thesis
Country: China
Candidate: X D Han
GTID: 2428330545964770
Subject: Software engineering
Abstract/Summary:
In today's information explosion, it has become increasingly difficult for people to find the information they care about on the Internet. Even within a single topic there is a great deal of redundancy. This is especially true of public opinion data: news items on the same topic often contain identical information, so extracting the distinct information from public opinion texts on one topic costs more and more time and effort. Multi-document automatic summarization addresses this problem well. It removes repeated information, extracts the distinct information on a topic, and generates a summary text, saving the time otherwise spent mining for information of interest. Building on a study of multi-document automatic summarization, this thesis proposes a method that produces multi-document summaries through hierarchical clustering based on a semantic dictionary. The advantage of a semantic-dictionary-based approach is that words in a Chinese corpus can be analyzed and processed at the semantic level. The main contents of the system are as follows:

1. New-word similarity calculation based on a semantic dictionary. New words appear frequently in public opinion data; they are usually related to the topic and can to some extent be regarded as representing it. However, similarity cannot be calculated directly for new words. To solve this problem, this thesis proposes a semantic-dictionary-based similarity calculation method for new words. The method parses the new word, matches parts of it against words in the semantic dictionary, and uses the matched dictionary words in place of the new word in the similarity calculation.

2. Word clustering and sentence clustering. If words themselves are used as the features of a sentence's space vector, the semantic relationships between words are ignored, sentences are not quantified accurately, and the accuracy of the subsequent sentence clustering cannot be guaranteed. For this reason, this thesis first clusters words, merging similar words into word concepts, and uses word concepts as the features of each sentence vector, which avoids the influence of inter-word relationships on sentence clustering. In the sentence clustering step, sentences are represented by word-concept feature vectors, similarity between sentences is measured with cosine similarity, and the density-based clustering algorithm DBSCAN is applied to group the sentences into sentence clusters.

3. Extraction of summary sentences based on importance scores. Based on the results of sentence clustering, an importance scoring method is proposed that takes into account the topic, sub-topics, page structure, and other factors. Sentence clusters are ranked by score, the sentences within each cluster are scored for importance, the highest-scoring sentence in each cluster is selected as a summary sentence, and the selected sentences are arranged in cluster order to form the final multi-document summary.

Based on the above methods, a multi-document automatic summarization system for public opinion data was developed, which can basically meet users' need to extract the distinct information under a single topic.
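
The sentence clustering step described in item 2 (word-concept feature vectors, cosine similarity, DBSCAN) can be illustrated with a minimal Python sketch. This is not the thesis's implementation. The word-to-concept map CONCEPT_OF, the English toy tokens, and the parameter values eps and min_pts are all assumptions made for illustration; the thesis works on Chinese text with a semantic dictionary that the abstract does not specify.

```python
import math
from collections import Counter

# Hypothetical word-to-concept map standing in for the word clustering step.
# The real system derives concepts from a semantic dictionary (assumption here).
CONCEPT_OF = {
    "blaze": "fire", "fire": "fire", "flames": "fire",
    "warehouse": "building", "factory": "building",
    "police": "authority", "officials": "authority",
    "investigate": "inquiry", "probe": "inquiry",
}

def concept_vector(tokens):
    """Count concepts instead of surface words; unknown words are skipped."""
    return Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF)

def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def dbscan(vectors, eps, min_pts):
    """Textbook DBSCAN with cosine distance (1 - similarity).
    Returns one cluster label per vector; -1 marks noise."""
    n = len(vectors)

    def neighbors(i):
        return [j for j in range(n)
                if 1.0 - cosine_similarity(vectors[i], vectors[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:  # j is a core point, so keep expanding
                queue.extend(nbrs)
    return labels

# Toy tokenized sentences (illustrative only).
sentences = [
    ["blaze", "warehouse"],
    ["fire", "factory"],
    ["police", "investigate"],
    ["officials", "probe"],
    ["weather", "sunny"],
]
labels = dbscan([concept_vector(s) for s in sentences], eps=0.3, min_pts=2)
print(labels)  # [0, 0, 1, 1, -1]
```

The first two sentences share no surface words, yet they map to the same concepts and fall into one cluster. This is the effect the abstract attributes to using word concepts rather than raw words as features. The last sentence contains no known concepts and is labeled as noise.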
Keywords/Search Tags: multi-document automatic summary, new words discovery, cluster analysis, importance rating, similarity calculation