| With the development of modern society, the sheer volume of data has come to challenge existing computing power and traditional data analysis algorithms, so completing analysis tasks on large-scale data within an acceptable time is an important research problem. At present, when faced with large-scale data, people usually apply a dimension reduction method first and then carry out data analysis tasks, such as clustering analysis or regression analysis, on the reduced data. Clustering is an important field in machine learning. The idea of a clustering algorithm is to divide data points into multiple clusters such that data points in the same cluster are highly similar and data points in different clusters are dissimilar. Once the data points have been clustered, the latent internal structure of the data is also revealed. However, when a clustering algorithm is applied to high-dimensional data, the amount of computation increases dramatically. In this situation, the clustering algorithm can be combined with a dimension reduction method. Common dimension reduction methods fall into two categories: feature extraction methods and feature reconstruction methods. In this paper, we mainly study a dimension reduction method called random projection. The theoretical basis of random projection is that there exists a mapping from a high-dimensional space to a low-dimensional space under which the distances between data points are, with high probability, approximately preserved. This paper studies the application of random projection to K-means and spectral clustering. Experimental results show that after random projection the computational performance of the clustering algorithms is effectively improved, while the clustering results are not significantly affected. In practical applications of clustering algorithms, people
often encounter the problem that some clustering algorithms require the number of clusters to be specified, whereas in reality the number of clusters in a dataset is usually unknown. This paper studies this problem and proposes an algorithm, based on random projection, for determining the number of clusters. The core idea of the algorithm is to project the original sample set many times using random projection. Because random projection preserves as much of the original information in the data as possible, when the selected number of clusters matches the internal structure of the data and the clustering algorithm is applied to these projected datasets, the clustering results should remain similar to one another even under the influence of random projection. The method is tested on several datasets, and the experimental results show that determining the number of clusters by random projection more accurately selects the number of clusters that matches the true structure of the data. Compared with other methods for determining the number of clusters, our method also ranks among the best. At the end of the paper, we experimentally study the parameters involved in the method. The results show that increasing the number of projections effectively improves the method's ability to determine the number of clusters, but at the cost of increased running time. |
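
The abstract does not spell out the procedure, so the following is only a minimal sketch of the idea it describes: project the data several times, cluster every projection for each candidate number of clusters, and prefer the number whose clusterings agree most across projections. Everything concrete here is an assumption made for illustration and is not taken from the paper: the Gaussian projection matrix, the plain Lloyd-style K-means with farthest-first initialisation, the Rand index as the agreement measure, and the hypothetical names `random_projection`, `kmeans_labels`, `rand_index`, and `choose_k`.

```python
import numpy as np


def random_projection(X, target_dim, rng):
    """Map the rows of X to target_dim dimensions with a Gaussian random matrix."""
    # Scaling by 1/sqrt(target_dim) keeps squared distances unbiased in expectation.
    R = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ R


def kmeans_labels(X, k, rng, n_iter=50):
    """Lloyd's algorithm with farthest-first initialisation (an assumed variant)."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Next center: the point farthest from all centers chosen so far.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels


def rand_index(a, b):
    """Fraction of point pairs that two labelings treat alike (same or different cluster)."""
    same_a = a[:, None] == a[None, :]
    same_b = b[:, None] == b[None, :]
    upper = np.triu_indices(len(a), k=1)
    return float(np.mean(same_a[upper] == same_b[upper]))


def choose_k(X, candidate_ks, n_projections=10, target_dim=20, seed=0):
    """Score each k by the mean pairwise agreement of clusterings across projections."""
    rng = np.random.default_rng(seed)
    projections = [random_projection(X, target_dim, rng) for _ in range(n_projections)]
    scores = {}
    for k in candidate_ks:
        labelings = [kmeans_labels(P, k, rng) for P in projections]
        pair_scores = [rand_index(labelings[i], labelings[j])
                       for i in range(n_projections)
                       for j in range(i + 1, n_projections)]
        scores[k] = float(np.mean(pair_scores))
    best_k = max(scores, key=scores.get)
    return best_k, scores


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    centers = 100.0 * np.eye(3, 50)  # three well-separated centers in 50 dimensions
    X = np.vstack([c + rng.normal(size=(30, 50)) for c in centers])
    best_k, scores = choose_k(X, candidate_ks=range(2, 6))
    print(best_k, scores)
```

In this sketch, raising `n_projections` gives the agreement score more evidence to work with, but every extra projection adds one more clustering run per candidate and more pairwise comparisons, which mirrors the accuracy versus running-time trade-off noted in the abstract. The plain Rand index tends to grow with the number of clusters, so a chance-corrected measure may be a better choice on harder data.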