| With the rapid development of Internet applications, large volumes of application data are now stored in the cloud. As storing large amounts of data becomes more convenient and faster, vast amounts of data have accumulated around the world. These data contain knowledge of high value and high potential. How to extract valuable information from this mass of data quickly and efficiently, and then use it to support management and decision-making in companies, government, and other fields, has become a frontier research problem that urgently needs to be solved. Researchers have proposed a growing number of methods and techniques for processing such data, and data mining emerged in response. Clustering can be used as a stand-alone data mining tool or as a preprocessing step for other data mining algorithms. It has been widely applied in areas such as e-commerce, biology, geography, and Web document classification. In this paper, the K-Means clustering algorithm is improved. Given a data set Dn, K-Means first fixes the number of clusters k and randomly chooses k initial cluster centers from the data set. It then computes the similarity between each data object and the k cluster centers and assigns the object to the cluster whose center is most similar, which yields k clusters. Next, the mean of all data objects in each cluster is computed and taken as that cluster's new center. The assignment and update steps are repeated with the new centers until the cluster centers no longer change or the evaluation function converges, at which point the final k clusters are obtained. 
The biggest advantage of this algorithm is that it is simple and easy to implement, but it has significant drawbacks: 1) K-Means requires the user to specify the number of clusters k, which depends on the researcher's experience; 2) because the initial cluster centers are chosen randomly, different initial centers may lead to different clustering results; 3) K-Means easily falls into a local optimum; 4) K-Means is mainly suited to data sets with a regular distribution, such as globular clusters. This paper presents an improved clustering algorithm based on K-Means. For large data sets, we assign an initial value to the number of clusters k. After K-Means has run once, we obtain k cluster centers. A minimum spanning tree algorithm is then applied to these k centers to merge the centers that are most similar. This yields k' cluster centers with k' less than k, so the number of clusters decreases. The above steps are repeated with the new number of clusters k' and the new cluster centers until the evaluation function converges. The final clustering result therefore has a more suitable number of clusters. This article briefly describes the background and significance of the topic and the current state of research. It then introduces the basic theory of data mining, including its basic concepts and basic steps, and describes in more detail the various clustering algorithms currently studied internationally. The article focuses on the classical K-Means algorithm. It first analyzes the advantages and disadvantages of the algorithm, then presents several improved K-Means clustering algorithms that address these problems and analyzes them. Drawbacks 1) and 2) above are studied and analyzed in particular. Drawing on several of the other clustering algorithms mentioned above, the article proposes an improved K-Means algorithm. 
Experiments show that the performance of the proposed K-Means algorithm is greatly improved. Finally, this paper summarizes the research and outlines future research directions for clustering in data mining. |
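
The procedure summarized above (run K-Means, merge similar cluster centers with a minimum spanning tree, repeat with the reduced k') can be illustrated with a minimal Python sketch. It is not the implementation evaluated in this paper, and several details are assumptions made only for illustration: Euclidean distance as the (dis)similarity measure, a hypothetical `threshold` parameter that decides which minimum spanning tree edges count as "similar", the unweighted mean of merged centers as the new center, and stopping when a pass merges nothing (standing in for convergence of the evaluation function). The function names `kmeans`, `mst_edges`, `merge_centers`, and `improved_kmeans` are likewise hypothetical.

```python
import math
import random


def kmeans(points, centers, max_iter=100):
    """Plain K-Means starting from the given centers (Euclidean distance assumed)."""
    clusters = [[] for _ in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:  # centers no longer change
            break
        centers = new_centers
    return centers, clusters


def mst_edges(centers):
    """Prim's algorithm on the complete graph over the centers.

    Returns the tree edges as (distance, i, j). This naive version is cubic
    in the number of centers, which is acceptable for small k.
    """
    n = len(centers)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        d, i, j = min(
            (math.dist(centers[i], centers[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((d, i, j))
        in_tree.add(j)
    return edges


def merge_centers(centers, threshold):
    """Merge centers linked by MST edges shorter than `threshold` (assumed rule)."""
    parent = list(range(len(centers)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for d, i, j in mst_edges(centers):
        if d < threshold:
            parent[find(i)] = find(j)
    groups = {}
    for idx, c in enumerate(centers):
        groups.setdefault(find(idx), []).append(c)
    # Each group of similar centers is replaced by its unweighted mean.
    return [tuple(sum(coord) / len(g) for coord in zip(*g)) for g in groups.values()]


def improved_kmeans(points, k, threshold, seed=0):
    """Alternate K-Means and MST-based merging until no centers are merged."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers, as in K-Means
    while True:
        centers, clusters = kmeans(points, centers)
        merged = merge_centers(centers, threshold)
        if len(merged) == len(centers):  # k' == k: nothing left to merge
            return centers, clusters
        centers = merged
```

Calling `improved_kmeans(points, k, threshold)` with a deliberately generous k on a list of coordinate tuples returns the surviving centers and their clusters. The choice of `threshold` controls how aggressively centers are merged and would need to be tuned, or replaced by the paper's own merging criterion, for real data.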