| Cluster analysis is an important method in data mining. When the data to be clustered contain missing values, clustering quality degrades significantly. The usual approach is to impute the missing values first to obtain a complete data set, and then to perform cluster analysis on the completed data. Imputation replaces each missing value with an inferred, plausible value; it retains the observed information and uses it to estimate the missing values more reasonably. At present, imputation of clustering data with missing values faces three problems: the error between the imputed values and the true values is large, resulting in low imputation precision; the imputed values strongly influence the clustering results, which reduces clustering accuracy; and the imputation process takes too long, resulting in low imputation efficiency. The problem addressed in this thesis is therefore how to quickly and effectively find, for clustering data with missing values, substitute values as close as possible to the true values, and thereby achieve the best possible clustering. The main content of this thesis comprises two parts: (1) To address low imputation precision and low clustering accuracy, this thesis proposes trimmed scores regression (TSR) as the imputation method for clustering data with missing values. TSR is an imputation method based on principal component analysis. Three other imputation methods based on principal component analysis, together with classical imputation methods, are selected for comparison. For clustering data with a univariate random missing pattern and a general random missing pattern, the performance of TSR is analyzed through simulation and real data analysis in terms of imputation precision and clustering accuracy. The results show that, for clustering data with the same missing pattern, the imputation precision and clustering accuracy of TSR are higher than those of the comparison methods. (2) To address low imputation efficiency, the TSR method is improved and a distributed trimmed scores regression (DTSR) method is proposed. DTSR is a distributed imputation method. Two distributed imputation methods and the TSR method are used for comparison. Through simulation and real data analysis, the performance of DTSR is analyzed in terms of imputation precision and imputation time. The results show that DTSR achieves imputation precision equal or close to that of TSR and higher than that of the other comparison methods, while its imputation time is much shorter than that of TSR, which saves time and improves imputation efficiency. This research further addresses the problems in current work on imputation methods for clustering data with missing values. Compared with other methods, TSR improves imputation precision and clustering accuracy, and DTSR reduces imputation time and improves imputation efficiency while maintaining imputation precision. The validity and feasibility of the proposed methods are verified by experimental analysis. |
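
The impute-then-cluster workflow and the notion of imputation precision described above can be illustrated with a minimal sketch. It is not the TSR or DTSR method of the thesis: it uses a plain iterative PCA (truncated SVD) imputation as a stand-in for a PCA-based method, and compares it with column-mean imputation on synthetic data. The function names `iterative_pca_impute` and `rmse_on_missing`, the synthetic data, the 10% missing rate, and the use of RMSE on the masked entries as the precision measure are all assumptions made for illustration.

```python
import numpy as np


def iterative_pca_impute(X, n_components=2, max_iter=200, tol=1e-8):
    """Fill NaNs by repeatedly fitting a low-rank PCA model (illustrative, not TSR)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initial guess: column means of the observed entries.
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(max_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        # Rank-k reconstruction from the leading principal components.
        recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mu
        # Only missing cells are overwritten; observed cells are kept as-is.
        new_filled = np.where(missing, recon, X)
        change = np.sqrt(np.mean((new_filled - filled) ** 2))
        filled = new_filled
        if change < tol:
            break
    return filled


def rmse_on_missing(X_true, X_imputed, missing):
    """Imputation error measured only on the cells that were masked out."""
    diff = X_true[missing] - X_imputed[missing]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic data with a rank-2 structure plus a little noise (assumed setup).
rng = np.random.default_rng(0)
X_true = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 6))
X_true += 0.05 * rng.normal(size=X_true.shape)

# Remove about 10% of the entries at random.
missing = rng.random(X_true.shape) < 0.1
X_obs = X_true.copy()
X_obs[missing] = np.nan

X_pca = iterative_pca_impute(X_obs, n_components=2)
X_mean = np.where(missing, np.nanmean(X_obs, axis=0), X_obs)

print("PCA-based RMSE:", rmse_on_missing(X_true, X_pca, missing))
print("Mean-fill RMSE:", rmse_on_missing(X_true, X_mean, missing))
```

The completed matrix `X_pca` could then be passed to any clustering algorithm, and the resulting labels compared with those obtained on the complete data to assess clustering accuracy.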