| Clustering analysis is an important research task in data mining and pattern recognition. With the continued development of information technology, the growing scale and scope of database applications, steady advances in data sampling technology, and people's increasing ability to generate and collect data quickly, large-scale data sets are attracting widespread attention. Large-scale data sets pose great challenges to many clustering algorithms: some handle them poorly, and others cannot process them at all. How to make these algorithms analyze large-scale data sets effectively has become a major research focus in data mining. Building on a survey of existing algorithms, and using several real and synthetic data sets as the application background, this paper studies in detail several key issues in clustering analysis algorithms for large-scale data sets. (1) The k-means clustering algorithm, when applied to large-scale data sets, is sensitive to the initial cluster centers and easily falls into a local optimum, which degrades clustering quality. To address this, a maximum-triangle-rule k-means algorithm and a maximum-triangle-rule semi-supervised k-means clustering algorithm are proposed. By selecting the initial cluster centers with the maximum triangle rule and borrowing the idea of semi-supervised clustering, the quality and stability of the results on large-scale data sets are improved. (2) To address the high computational complexity of spectral clustering on large-scale data sets, a fast spectral clustering algorithm based on the Nyström method is proposed. By using a constrained sampling model together with the Nyström method, the computational complexity of spectral clustering is reduced and the clustering quality is improved. (3) The minimum distance and nearest neighbor classification methods give poor results when the training samples are few and lie far from the cluster centers. To address this, the Mean Update (MU) and the MU-based Minimum Distance classification models are proposed. By correcting misclassifications during the MU classification process, the classification results are improved. Next, to overcome the shortcomings of common clustering methods in processing large-scale data, a novel partitional clustering method is proposed. It determines the initial positions of the natural cluster centroids by repeatedly clustering sufficiently large samples drawn with a large-data sampling method. It then updates these initial centroids using the remaining data, correcting centroid positions that deviate from the ideal positions. The experimental results show that the new algorithm not only gives better clustering results than common clustering algorithms but also runs fast, making it suitable for large-scale data clustering. (4) Common spectral clustering algorithms give poor segmentation results and have high computational complexity when partitioning large-scale color images. To address these shortcomings, a color image segmentation algorithm based on mean shift and spectral clustering ensemble algorithms is proposed. It combines the advantages of mean shift and spectral clustering ensembles, and considers both the lightness and the detail information of local-region pixels. Experimental results on several large-scale color images demonstrate the superiority of the algorithm. |
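The sample-then-refine idea behind the partitional clustering method in item (3) can be illustrated with a minimal sketch. The abstract does not give the actual procedure, so everything below is an assumption made for illustration: the function names (`sq_dist`, `nearest`, `kmeans`, `sample_then_refine`), the use of a single uniform random sample instead of repeated sampling, farthest-point seeding as a stand-in initializer (it is not the maximum triangle rule, which the abstract does not define), and a running-mean update as the refinement step.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))


def kmeans(points, k, rng, iters=20):
    """Plain Lloyd k-means; returns (centroids, cluster sizes).

    Seeding is farthest-point (an assumption, not the paper's rule):
    start from a random point, then repeatedly add the point that is
    farthest from all centroids chosen so far.
    """
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    counts = [0] * k
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for j, g in enumerate(groups):
            if g:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(g) for col in zip(*g)]
        counts = [len(g) for g in groups]  # sizes from the latest assignment
    return centroids, counts


def sample_then_refine(points, k, sample_size, seed=0):
    """Cluster a random sample, then refine centroids with the remaining data."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    sample = [points[i] for i in order[:sample_size]]
    rest = [points[i] for i in order[sample_size:]]

    # Step 1: estimate initial centroids from the sample only.
    centroids, counts = kmeans(sample, k, rng)

    # Step 2: a single pass over the remaining points; each point pulls
    # its nearest centroid toward itself via an incremental mean.
    for p in rest:
        j = nearest(p, centroids)
        counts[j] += 1
        centroids[j] = [c + (x - c) / counts[j] for c, x in zip(centroids[j], p)]
    return centroids


if __name__ == "__main__":
    rng = random.Random(42)
    centers = [(0.0, 0.0), (10.0, 10.0), (-10.0, 10.0)]
    data = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
            for cx, cy in centers for _ in range(500)]
    print(sample_then_refine(data, k=3, sample_size=150))
```

In this sketch the iterative k-means step only ever sees the sample, and the remaining data are touched once each, which is one plausible reason such a scheme could scale to large data sets; whether the proposed method works this way is not stated in the abstract.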