| With the rapid development of the Internet, online social platforms such as Weibo and WeChat generate a large amount of text data every day. How to extract valuable information from this massive text data has become an important research topic. Cluster analysis, a commonly used text analysis method, first represents each text as a numerical vector and then applies an appropriate clustering method. Text representation is affected by the local information of the words in the text, the global information of the topic, and the link information between articles. Existing methods consider only one or two of these factors, which weakens the text representation and affects subsequent tasks. Massive text collections also require parallel storage and processing. At present, the Spark-based overlapping K-Means clustering computation is sensitive to the initial cluster centers, and its many iterations cause frequent data exchange between the Master and Worker nodes, which reduces the running efficiency of the algorithm and the stability of the clustering results. This thesis focuses on text clustering algorithms. The main research contents are as follows: (1) The CLM algorithm ignores the link relationships between texts. To address this, a text representation method based on the semantic representation of attribute networks, CLMSA (Collaboratively Improving Semantic and Attribute Information by Non-Negative Matrix Tri-Factorization), is proposed. It integrates the link information, word information, and topic information of documents so that the three reinforce one another and jointly improve the text representation. CLMSA is compared with six mainstream text representation methods, and the classification performance is evaluated on two real data sets. The experimental results show that the improved CLMSA algorithm yields some improvement in text representation quality. (2) In the POKM (Parallel Overlapping K-means Cluster) algorithm, the Master and Worker nodes exchange data frequently, and the network overhead is large. The I_POKM (Improved Parallel Overlapping K-means Cluster) algorithm, based on a local integration strategy, is proposed and greatly reduces the running time of the algorithm. Because I_POKM remains sensitive to the initial cluster centers, the active selection strategy is parallelized, and the AI_POKM (Active Improved Parallel Overlapping K-means Cluster) algorithm based on active selection is proposed. Experimental results show that the improved algorithms outperform the original algorithm on two real data sets and four simulated data sets. |
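
The abstract does not spell out the local integration strategy, so the following is only a minimal sketch of the general idea it points to: each worker reduces its own partition to small per-cluster summaries, and only those summaries travel to the master. For simplicity it uses plain (non-overlapping) K-means assignment in pure Python rather than Spark, and all function names (`nearest`, `local_stats`, `merge_and_update`) are hypothetical, not taken from the thesis.

```python
def nearest(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(
        range(len(centers)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])),
    )


def local_stats(partition, centers):
    """Worker side (assumed): reduce a partition to {cluster: (sum vector, count)}."""
    stats = {}
    for point in partition:
        k = nearest(point, centers)
        total, count = stats.get(k, ([0.0] * len(point), 0))
        stats[k] = ([t + p for t, p in zip(total, point)], count + 1)
    return stats


def merge_and_update(all_stats, centers):
    """Master side (assumed): merge the per-partition summaries, recompute centers."""
    new_centers = []
    for k, old in enumerate(centers):
        total, count = [0.0] * len(old), 0
        for stats in all_stats:
            if k in stats:
                s, n = stats[k]
                total = [t + x for t, x in zip(total, s)]
                count += n
        # Keep the old center if no point was assigned to this cluster.
        new_centers.append([t / count for t in total] if count else list(old))
    return new_centers


# Two toy "partitions" standing in for data held by two workers.
partitions = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0), (2.0, 0.0)],
]
centers = [[0.0, 0.0], [10.0, 10.0]]

summaries = [local_stats(p, centers) for p in partitions]
centers = merge_and_update(summaries, centers)
```

In this sketch each worker sends at most one sum vector and one count per cluster per iteration, regardless of how many points it holds, which is the kind of reduction in Master/Worker traffic the abstract attributes to local integration.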