| With the development of artificial intelligence and information technology, data are accumulating rapidly in every industry: genetic data, medical data, financial data, and so on. Humanity is entering the era of data. Faced with large volumes of data, a central problem is how to remove noise and redundant data and find the valuable information hidden within. Data reduction is a useful tool for this problem. At present, research on data reduction concentrates mainly on reducing features, and little work addresses reducing the samples in a data set. In view of this, this paper studies techniques for reducing the samples in a data set and, on that basis, analyzes clustering validity. The main purpose of data reduction is to remove unimportant information from the data set so that the remaining data are easier to analyze. Targeting the general distribution characteristics of data sets, we propose two data reduction methods: a grid-based method and a method based on vector angles. In the grid-based method, we partition the data space into cells and define the absolute density and relative density of the data points, which are then used to reduce the data. In the vector angle method, we compute the average vector angle of each data point to distinguish core objects from boundary objects, and we preserve the important data by deleting boundary objects step by step. Experiments on artificial data sets and UCI data sets show that the proposed algorithms effectively remove redundant data points and make the structural information of the data set more apparent. Because clustering analysis is unsupervised, it has been widely used in data mining to handle massive amounts of information. However, the validity of clustering results remains an active research topic. Determining the correct number of classes with cluster validity measures is sensitive to noisy data, the degree of class separation, and the choice of clustering algorithm, so the number of classes obtained is hard to guarantee. In this paper, we analyze the clustering accuracy and the optimal number of classes on the data sets before and after reduction. The experiments show that the reduced data sets are more separable, yield higher clustering accuracy, and give an optimal number of classes closer to the true number of classes in the data set. |
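
As a rough illustration of the grid-based idea summarized above, the following minimal sketch partitions the data space into equal-width cells and drops points that lie in sparse cells. The abstract does not give the exact definitions, so everything specific here is an assumption made for illustration: absolute density is taken as the number of points sharing a cell, relative density as that count divided by the mean count over occupied cells, and the names `grid_reduce`, `cells_per_dim`, and `rel_threshold` are hypothetical.

```python
import numpy as np


def grid_reduce(points, cells_per_dim=10, rel_threshold=0.5):
    """Drop points that fall in sparse grid cells (illustrative sketch only).

    Assumed definitions, not taken from the paper:
      absolute density = number of points in the same cell
      relative density = absolute density / mean count over occupied cells
    """
    points = np.asarray(points, dtype=float)

    # Scale each dimension to [0, 1] so that all cells have equal width.
    lo = points.min(axis=0)
    span = points.max(axis=0) - lo
    span[span == 0] = 1.0  # guard against constant dimensions

    # Integer cell index per dimension; clip so the maximum lands in the last cell.
    idx = np.floor((points - lo) / span * cells_per_dim).astype(int)
    idx = np.clip(idx, 0, cells_per_dim - 1)

    # Count the points in each occupied cell and map the counts back to points.
    _, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)  # shape differs across NumPy versions
    absolute = counts[inverse]
    relative = absolute / counts.mean()

    # Keep only points whose cell is dense enough relative to the average.
    keep = relative >= rel_threshold
    return points[keep], keep
```

Calling `grid_reduce(X)` on an `(n, d)` array returns the retained points together with a boolean mask over the original rows, so the same mask can be applied to labels when comparing clustering accuracy before and after reduction. The cell count and threshold would need tuning for each data set.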