Text Clustering Algorithm Based On Semantic Representation Of Attribute Network

Posted on:2021-06-25

Degree:Master

Type:Thesis

Country:China

Candidate:Y Y Li

Full Text:PDF

GTID:2518306458492814

Subject:Computer system architecture

Abstract/Summary:

With the rapid development of the Internet, online social platforms such as Weibo and WeChat generate a large amount of text data every day. How to extract valuable information from this massive text data has become an important research topic. As a commonly used text analysis method, cluster analysis first represents each text as a numerical vector and then applies an appropriate clustering method. Text representation is affected by the local information of the words in a text, the global information of its topics, and the link information between articles. Considering only one or two of these factors yields a poor text representation and degrades subsequent tasks. Massive text collections also require parallel storage and processing. At present, overlapping K-means clustering based on Spark is sensitive to the initial cluster centers, and its multiple iterations cause frequent data exchange between the Master and Worker nodes, which affects the efficiency of the algorithm and the stability of the clustering results. This thesis focuses on text clustering algorithms. The main research contents are as follows:

(1) To address the CLM algorithm's neglect of the link relationships between texts, a text representation method based on the semantic representation of attribute networks, CLMSA (Collaboratively Improving Semantic and Attribute Information by Non-Negative Matrix Tri-Factorization), is proposed. It integrates the link information, word information, and topic information of documents so that the three reinforce one another and jointly improve the text representation. Six mainstream text representation methods are selected for comparison, and classification performance is evaluated on two real data sets. The experimental results show that the CLMSA algorithm improves the text representation to some extent.

(2) In the POKM (Parallel Overlapping K-means Cluster) algorithm, the Master and Worker nodes frequently exchange data, and the network overhead is high. The I_POKM (Improved Parallel Overlapping K-means Cluster) algorithm, based on a local integration strategy, is proposed, which greatly reduces the running time of the algorithm. Because the I_POKM algorithm is still sensitive to the initial cluster centers, the active selection strategy is parallelized, and the AI_POKM (Active Improved Parallel Overlapping K-means Cluster) algorithm based on active selection is proposed. Experimental results show that the improved algorithm outperforms the original algorithm on two real data sets and four simulated data sets.
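
To illustrate what "overlapping" clustering means here, the following is a minimal Python sketch of how a single point can be assigned to more than one cluster. It uses the greedy assignment heuristic from the OKM formulation of overlapping K-means, which is an assumption: the abstract does not state the assignment rule used by POKM, I_POKM, or AI_POKM. The function names and the toy centroids are hypothetical and chosen only for illustration.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def overlapping_assign(point, centroids):
    """Return the sorted indices of the clusters `point` is assigned to.

    OKM-style heuristic (an assumption, not the thesis's stated rule):
    start with the nearest centroid, then keep adding the next nearest
    centroid as long as the mean of the chosen centroids moves closer
    to the point.
    """
    order = sorted(range(len(centroids)), key=lambda k: dist(point, centroids[k]))
    chosen = [order[0]]
    best = dist(point, centroids[order[0]])
    for k in order[1:]:
        candidate = chosen + [k]
        image = mean([centroids[j] for j in candidate])
        d = dist(point, image)
        if d < best:
            chosen, best = candidate, d
        else:
            break  # stop at the first centroid that does not help
    return sorted(chosen)


# Toy example with two hypothetical centroids on a line.
centroids = [[0.0, 0.0], [4.0, 0.0]]
near_first = overlapping_assign([0.5, 0.0], centroids)   # close to centroid 0 only
in_between = overlapping_assign([2.0, 0.0], centroids)   # midway between both
print(near_first, in_between)
```

A point close to one centroid stays in a single cluster, while a point midway between two centroids is assigned to both. In a Spark setting, an assignment step of this kind would typically run on the Worker nodes, with the centroids supplied by the Master.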

Keywords/Search Tags:

Active learning, Overlap clustering, Topic model, Word embedding, Attribute network

Related items

1	Research On Text Topic Modeling Based On Word Embedding
2	Research On App Classification Based On Word Embedding And Topic Model
3	Improved Text Topic Representation And Learning Method
4	Research On Topic Evolution Analysis Based On Topic Word Embedding Model
5	Sphere Topic Model Based On Word Embedding In Text Clustering Field
6	Research On Evolution Model Of Microblog Topic Based On Time Sequence
7	Research And Application Of Short Text Clustering Based On Topic Model
8	Network Hot Topic Discovery Based On Topic Model And Clustering Algorithm
9	Research On Short Text Topic Model Based On Semantic Information And Word Triangle
10	Topic Modeling Research Based On Word Embedding And Generative Neural Networks