Clustering Algorithms Based On SimRank And Density

Posted on: 2020-10-07
Degree: Master
Type: Thesis
Country: China
Candidate: P Q Li
GTID: 2428330596473760
Subject: Computer Science and Technology
Abstract/Summary:
In this era of rapidly growing data, much data in practical applications is collected without a definite purpose (acquisition is independent of mining) and stored irregularly, which is one of the main reasons big data mining is difficult. Clustering has proved to be an effective method for processing big data. As a branch of machine learning and data mining, it is widely used for tasks such as characterizing the nature of data, understanding the relationships between different data, and classifying data. This thesis therefore designs new clustering algorithms to provide technical support for improving the efficiency of big data mining.

The core idea of clustering is to group the data so that intra-class similarity is as high as possible and inter-class similarity is as low as possible. Current clustering algorithms mainly include hierarchical clustering, spectral clustering, density-based clustering, and so on. Among them, the result of spectral clustering depends largely on the quality of the similarity matrix, so constructing a similarity matrix that better describes the relationships between data points is the key to its success. Density-based clustering assumes that the cluster structure can be determined by how closely the samples are distributed, and clusters according to the density of the data set in space: as long as the sample density in a region exceeds a certain threshold, the samples are assigned to a nearby cluster. The key to density-based clustering is the choice of core objects and the control of the density threshold. Based on these observations, this thesis studies the following two aspects.

(1) A spectral clustering algorithm based on SimRank score (abbreviated SCSRS) is proposed. Traditional graph clustering algorithms consider only the distance between data points when building the similarity matrix and neglect the implicit internal relationships between them. The proposed algorithm combines SimRank similarity scores, graph partitioning, the graph Laplacian matrix, dynamic graph data, and other core theories and techniques into an effective spectral clustering method. The algorithm first builds the adjacency matrix from undirected graph data and obtains a similarity matrix based on SimRank. It then constructs the Laplacian matrix from the similarity matrix, normalizes it, and performs spectral decomposition. Finally, k-means clustering is performed on the resulting eigenvectors. The proposed algorithm is compared with existing algorithms in application areas such as graph analysis and recognition to verify its effectiveness. The innovations are as follows: 1) The algorithm uses SimRank similarity scores to compute the similarity matrix, which has clear advantages over the traditional distance-based approach (distances computed on high-dimensional data become unreliable). 2) The algorithm fully considers the implicit internal relationships between data points. The similarity between two data points is determined not by their distance but by the degree of similarity between their neighbors, which is much more robust to noise and outliers. 3) The proposed clustering method is applied to graph analysis, that is, the new spectral clustering method is used to accurately determine the regions that describe image features for image research.

(2) A sparse-learning-based variant of clustering by fast search and find of density peaks (SL-CFSFDP) is proposed. Clustering by fast search and find of density peaks (CFSFDP) is a novel clustering algorithm proposed in recent years. It has low computational complexity and good clustering quality, but its truncation distance d_c must be set according to the user's experience, and the smaller the data set, the higher the error rate. To address these shortcomings, a new fast search and density peak clustering algorithm based on sparse learning (SL-CFSFDP) is proposed. Unlike CFSFDP, SL-CFSFDP does not require d_c to be set manually. In addition, SL-CFSFDP uses sparse learning to determine the neighbors of each data point, removing the influence of irrelevant data. The algorithm first determines the cluster centers automatically by combining local density and distance, and then clusters the data points according to local density and distance. Experimental results on standard UCI datasets show that SL-CFSFDP outperforms DBSCAN, CFSFDP, and other algorithms.

In conclusion, this thesis addresses some shortcomings of existing clustering algorithms and proposes two novel clustering algorithms by incorporating SimRank scores, spectral clustering, density peaks, truncation distance, and sparse learning into the clustering models. Compared with currently popular clustering algorithms on a range of evaluation metrics, the proposed algorithms perform better than the current mainstream algorithms. In future work, deep learning will be considered as a pre-processing stage of the clustering analysis framework and then applied to various practical applications.
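The pipeline described in (1), from adjacency matrix to SimRank similarity to normalized Laplacian to k-means on the eigenvectors, can be illustrated with a minimal sketch. This is not the thesis's SCSRS implementation: the function names, the decay factor 0.8, the iteration count, the row normalization of the eigenvectors, and the small farthest-point k-means are all assumptions made here for illustration, using only NumPy.

```python
import numpy as np

def simrank_matrix(adj, decay=0.8, iters=10):
    """Iterative SimRank on an undirected graph (0/1 adjacency matrix).
    decay and iters are illustrative assumptions, not values from the thesis."""
    n = adj.shape[0]
    deg = adj.sum(axis=0)
    # Column a of W averages over the neighbours of node a.
    W = adj / np.where(deg > 0, deg, 1)
    S = np.eye(n)
    for _ in range(iters):
        # s(a, b) = decay * mean of s(i, j) over neighbour pairs (i, j)
        S = decay * (W.T @ S @ W)
        np.fill_diagonal(S, 1.0)  # s(a, a) = 1 by definition
    return S

def kmeans(X, k, iters=50):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def spectral_clusters(S, k):
    """Normalized-Laplacian spectral clustering on a similarity matrix S."""
    A = (S + S.T) / 2.0
    np.fill_diagonal(A, 0.0)  # drop self-similarity
    d = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    L = np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)  # eigenvalues returned in ascending order
    U = vecs[:, :k]              # eigenvectors of the k smallest eigenvalues
    U = U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)
    return kmeans(U, k)

# Toy graph (hypothetical data): two 4-node cliques joined by one edge.
adj = np.zeros((8, 8))
for block in (range(0, 4), range(4, 8)):
    for i in block:
        for j in block:
            if i != j:
                adj[i, j] = 1.0
adj[3, 4] = adj[4, 3] = 1.0

S = simrank_matrix(adj)
print(spectral_clusters(S, 2))
```

Because SimRank scores two nodes by how similar their neighbors are, nodes inside the same clique receive a higher score than nodes in different cliques even though the sketch never computes a distance between them.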
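For the density-peak approach in (2), the sketch below shows the baseline CFSFDP idea only: local density from a manually chosen truncation distance d_c, the distance to the nearest denser point, and cluster centers chosen where both are large. The abstract does not specify the sparse-learning step that lets SL-CFSFDP avoid setting d_c, so that step is not reproduced here. The function name `density_peaks`, the cutoff-count density, and the choice of the top-k points by density times distance are assumptions for illustration.

```python
import numpy as np

def density_peaks(X, d_c, k):
    """Baseline density-peak clustering with a manual cutoff d_c (illustrative)."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Local density: number of other points closer than d_c.
    rho = (dist < d_c).sum(axis=1) - 1
    order = np.argsort(-rho, kind="stable")  # indices by decreasing density
    delta = np.zeros(n)
    nearest_higher = np.full(n, -1)
    for rank, i in enumerate(order):
        if rank == 0:
            # Densest point: use its largest distance to any other point.
            delta[i] = dist[i].max()
            continue
        higher = order[:rank]  # points ranked as denser (ties broken by order)
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        nearest_higher[i] = j
    # Centers combine high density with a large distance to denser points.
    # Assumes no exact tie for the largest gamma, so the densest point is a center.
    gamma = rho * delta
    centers = np.argsort(-gamma, kind="stable")[:k]
    labels = np.full(n, -1)
    labels[centers] = np.arange(k)
    # Visit points from dense to sparse so the denser neighbour is already labeled.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers

# Toy data (hypothetical): two well-separated 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (20, 2)), rng.normal(10.0, 0.5, (20, 2))])
labels, centers = density_peaks(X, d_c=1.0, k=2)
print(labels)
```

The result depends directly on the hand-picked `d_c`, which is the sensitivity the abstract says SL-CFSFDP is designed to remove.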
Keywords/Search Tags: SimRank score, spectral clustering, sparse learning, density clustering