A Class Of Density-based Clustering Algorithms

Posted on: 2018-04-03
Degree: Master
Type: Thesis
Country: China
Candidate: L Pang
Full Text: PDF
GTID: 2358330518968281
Subject: Computer application technology
Abstract/Summary:
Density-based clustering methods play an important role in cluster analysis. They have been widely applied in information filtering, information retrieval, medical treatment, and public services, and they remain a main research direction in the field. This thesis combines features of hierarchical clustering with features of density-based clustering and proposes a density-based clustering algorithm built on the density of hierarchical divisions, CODHD (Clusters Optimization Based on Density of Hierarchical Division). Experimental results show that CODHD performs well in clustering accuracy and efficiency compared with COPS (Clusters Optimization on Preprocessing Stage).

Building on CFSFDP (Clustering by Fast Search and Find of Density Peaks), a clustering algorithm proposed by Alex Rodriguez and Alessandro Laio, the thesis also proposes a parallelization model of that algorithm under the MapReduce framework. Like other density-based clustering algorithms, it can cluster data of complex shape, here under parallel conditions, and the number of classes in the data does not need to be specified in advance. Moreover, CFSFDP requires fewer user-specified parameters, and its running time is much lower than that of some clustering algorithms that require iteration.

The main work is summarized as follows:

(1) To address the problem that traditional clustering algorithms cluster the dataset repeatedly and are computationally inefficient on large datasets, CODHD, based on hierarchical partitioning, is proposed to determine the optimal number of clusters and the initial cluster centers. The algorithm focuses on the computational process and does not need to cluster the dataset repeatedly. First, all statistical values needed for clustering are obtained by scanning the dataset. Second, the data are partitioned level by level from the bottom up, and the density of each partition is calculated; the maximum-density point of each partition is taken as an initial center, and at the same time the minimum distance from that center to the data of higher density is calculated. The average of the sum of the products of each center's density and its minimum distance is taken as the validity indicator, and a clustering quality curve over the hierarchy levels is built incrementally. Finally, the optimal number of clusters and the initial cluster centers are estimated from the extreme points of the curve. Experimental results show that CODHD improves clustering accuracy and efficiency compared with COPS.

(2) Traditional CFSFDP can recognize clusters of arbitrary shape in spaces of arbitrary dimension, but on large-scale datasets it takes too much time to calculate the distances between data points. To overcome this problem, a MapReduce-based CFSFDP clustering algorithm called mrCFSFDP is proposed, which reads the dataset once and hence reduces the running time. Each procedure of mrCFSFDP is carried out on many nodes and is divided into two steps, Map and Reduce. The algorithm is tested on several cases, and the experimental results show that the model is feasible and performs well in accuracy and efficiency.

All datasets in this thesis are real datasets taken from UCI. Two novel clustering models are established on the basis of classical clustering models, and comparisons with other algorithms show that the proposed algorithms perform better.
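
The two quantities that both CODHD and mrCFSFDP rely on, a point's local density and its minimum distance to any point of higher density, can be illustrated with a small single-machine sketch. This is not the thesis's implementation. It follows the cutoff-kernel definitions from Rodriguez and Laio's CFSFDP. The function names, the cutoff value `d_c`, the toy data, and the reading of the validity indicator as a plain mean of density times distance over the chosen centers are all assumptions made for illustration.

```python
import math


def density_and_delta(points, d_c):
    """Return (rho, delta) for each point, using the CFSFDP cutoff kernel.

    rho[i]   = number of other points closer than d_c to point i
    delta[i] = minimum distance from i to any point with higher rho;
               for a point with no denser point, the maximum distance
               from i to any other point (the usual CFSFDP convention).
    """
    n = len(points)
    dist = [[math.dist(points[i], points[j]) for j in range(n)] for i in range(n)]
    rho = [sum(1 for j in range(n) if j != i and dist[i][j] < d_c) for i in range(n)]
    delta = []
    for i in range(n):
        to_denser = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(to_denser) if to_denser else max(dist[i]))
    return rho, delta


def pick_centers(rho, delta, k):
    """Pick the k points with the largest rho * delta as candidate centers."""
    order = sorted(range(len(rho)), key=lambda i: rho[i] * delta[i], reverse=True)
    return order[:k]


def center_quality(rho, delta, centers):
    """Assumed form of the validity indicator: mean of rho * delta over centers."""
    return sum(rho[c] * delta[c] for c in centers) / len(centers)


# Toy data (hypothetical): two small groups, each with one central point.
points = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1),
          (10, 10), (11, 10), (9, 10), (10, 11)]
rho, delta = density_and_delta(points, d_c=1.2)
centers = pick_centers(rho, delta, k=2)
quality = center_quality(rho, delta, centers)
```

On this toy data the two central points stand out because they combine high density with a large distance to any denser point. Evaluating `center_quality` for several values of `k` gives a curve of the kind the abstract describes, although how the thesis builds that curve across hierarchy levels is not specified here. The pairwise distance table is the quadratic-cost step that the MapReduce variant is meant to distribute.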
Keywords/Search Tags:Clustering algorithm, clustering validity index, density, distance, MapReduce