| With the advent of the big data era, the storage cost of massive data and the performance of instance-based machine learning algorithms face great challenges. Instance selection is one feasible way to address these problems. However, most existing methods are iterative, which is time-consuming and makes it difficult to balance the performance of the selected subset against the reduction rate. This paper therefore proposes an instance selection method based on boundary features and a surrogate model that preserves the performance of the selected instances while reducing the computational cost. The main research contents are as follows: (1) Instance Selection with Dynamic Important Data Finding based on Removability and Boundary Features: an instance selection strategy that dynamically finds important data based on removability and boundary features is proposed. First, the importance weight of each instance is randomly initialized, and samples with small weights are selected as candidates for deletion. A definition of instance removability based on K-means clustering is given. Then, the distance between each deleted instance and the original cluster center is calculated to measure the boundary features of the instance. A sample selection mechanism based on importance weight and reduction proportion is given. The proposed algorithm is applied to 20 UCI data sets, and the results verify its effectiveness in maintaining the learning performance of the samples. (2) Instance Selection based on Surrogate Model Prediction of Data Importance: to address the time cost of the iteration-based instance selection in content (1), an instance selection method based on importance predicted by a surrogate model is proposed. First, the edited (clipping) nearest neighbor algorithm is used to remove noise from the data. Then, the statistical characteristics of the samples to be deleted and the distance between the cluster center after sample reduction and the initial class center are calculated, and a Gaussian process prediction model is constructed to predict the accuracy of instance selection (the class center distance). An instance selection strategy that fuses noise reduction with the surrogate model is given. The proposed algorithm is applied to the typical data sets from content (1), and the experimental results show its efficiency. The instance selection strategy that dynamically updates instance importance based on removability and boundary features is proposed for the first time in this paper. The iterative clustering process is used to select class boundary instances, and the data importance weights are updated according to removability and boundary features. The instance selection algorithm based on importance prediction with a surrogate model further improves on content (1) by greatly reducing the computational complexity while preserving the performance of the selected subsets. |
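
The selection mechanism of content (1) can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the abstract does not give the exact definition of removability or the weight update rule, so the sketch makes several assumptions. It replaces the K-means cluster centers with per-class means, uses the normalized distance to the class center as a stand-in for both the boundary feature and the importance weight, and runs a single non-iterative pass. The function names `class_centers`, `boundary_scores`, and `select_instances` are hypothetical.

```python
import math
from collections import defaultdict


def class_centers(X, y):
    """Mean vector of each class (assumed stand-in for the K-means centers)."""
    groups = defaultdict(list)
    for point, label in zip(X, y):
        groups[label].append(point)
    return {
        label: [sum(col) / len(points) for col in zip(*points)]
        for label, points in groups.items()
    }


def boundary_scores(X, y):
    """Distance of each instance to its own class center, scaled to [0, 1] per class.

    Assumption: instances far from the center are treated as boundary instances.
    """
    centers = class_centers(X, y)
    dists = [math.dist(p, centers[label]) for p, label in zip(X, y)]
    max_per_class = defaultdict(float)
    for d, label in zip(dists, y):
        max_per_class[label] = max(max_per_class[label], d)
    return [
        d / max_per_class[label] if max_per_class[label] > 0 else 0.0
        for d, label in zip(dists, y)
    ]


def select_instances(X, y, reduction_rate):
    """Return sorted indices of the (1 - reduction_rate) fraction with the largest weight."""
    weights = boundary_scores(X, y)  # assumed importance weight = boundary score
    n_keep = max(1, round(len(X) * (1 - reduction_rate)))
    order = sorted(range(len(X)), key=lambda i: weights[i], reverse=True)
    return sorted(order[:n_keep])
```

Called on two well-separated classes of three points each with `reduction_rate=1/3`, the sketch drops the one instance nearest to each class center and keeps the other four. The method described in the abstract additionally initializes the weights randomly and updates them iteratively, which this single pass omits.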
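
The surrogate idea of content (2) can be sketched in the same hedged way: evaluate a few candidate deletions exactly, then let a Gaussian process predict the class center distance for unseen candidates instead of recomputing it. The sketch below makes several assumptions. The abstract does not list the statistical characteristics used as inputs, so `describe_removal` uses two hypothetical ones (the fraction removed and the mean distance of removed samples to the center). The kernel, its length scale, and the noise term are also illustrative choices, and only the posterior mean is computed.

```python
import math


def center(points):
    """Mean vector of a list of points."""
    return [sum(col) / len(points) for col in zip(*points)]


def describe_removal(points, removed):
    """Hypothetical input features for a candidate deletion set (indices in `removed`)."""
    c = center(points)
    frac = len(removed) / len(points)
    mean_dist = sum(math.dist(points[i], c) for i in removed) / len(removed)
    return [frac, mean_dist]


def center_shift(points, removed):
    """Target value: distance between the reduced-set center and the initial center."""
    kept = [p for i, p in enumerate(points) if i not in removed]
    return math.dist(center(kept), center(points))


def rbf(a, b, length_scale=1.0):
    """Squared-exponential kernel between two feature vectors."""
    return math.exp(-math.dist(a, b) ** 2 / (2 * length_scale ** 2))


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit_gp(features, targets, noise=1e-6):
    """Compute alpha = (K + noise * I)^-1 y, the weights of the GP posterior mean."""
    n = len(features)
    K = [
        [rbf(features[i], features[j]) + (noise if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return solve(K, targets)


def predict_gp(features, alpha, query):
    """GP posterior mean at `query`: sum over training points of k(query, x_i) * alpha_i."""
    return sum(rbf(query, f) * a for f, a in zip(features, alpha))
```

In use, one would call `describe_removal` and `center_shift` on a handful of candidate deletion sets, pass the resulting features and targets to `fit_gp`, and then rank further candidates with `predict_gp` alone. The saving comes from replacing repeated clustering with a cheap prediction. A library implementation would normally also provide the predictive variance and hyperparameter fitting, both of which this sketch leaves out.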