| At present, more and more government agencies, enterprises, and institutions are opening their data and have established many open public data platforms to enable data sharing. However, because public data spans many fields and comes from complex sources, it suffers from complex quality problems in standardization and integrity. As a result, commonly used cleaning algorithms perform unsatisfactorily on public data, which reduces its usability and utilization rate. This paper therefore analyzes the quality problems of public data on open data platforms, studies and improves several commonly used data cleaning algorithms based on the idea of clustering, and designs a public data cleaning framework. First, the data on mainstream public data platforms is analyzed and the common quality problems of public data are summarized. Then the basic methods of data cleaning are introduced and the requirements of public data cleaning are analyzed, which provides the basis for improving the cleaning algorithms. For cleaning duplicate and similar data, the sorted neighborhood method (SNM) is studied and improved. In the SNM algorithm, the choice of sorting key affects the resulting order of similar records during duplicate cleaning, the sliding window is hard to control and cannot be scaled, and the setting of the similarity threshold strongly affects the cleaning result. The algorithm is therefore improved based on clustering and comprehensive attribute weights. The comprehensive attribute weight method avoids the problem of fixed attribute weights; key-based sorting is replaced by clustering, and the data is matched against the result set after clustering, which solves the problem of records being missed because of the window setting. A public data set is used for a comparative analysis that verifies the effectiveness of the improved algorithm. For missing value cleaning, this paper studies and improves
the K-means clustering (K-means) data cleaning algorithm with respect to the determination of the initial k value, the selection of the initial cluster centers, and the detection and removal of isolated points, and obtains a more accurate k value through iteration. Drawing on the idea of the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, the selection of cluster centers is optimized based on maximum distance and maximum density, which avoids the influence of discrete points on the clustering results and improves the data correlation between classes. The improved algorithm is tested and analyzed on data sets. Finally, based on the improved cleaning algorithms, a public data cleaning framework is proposed. The framework comprises a stage for setting cleaning rules and quality rules suited to public data, a cleaning stage, and an evaluation stage. A public data set is selected for a data cleaning experiment that analyzes and verifies the feasibility of the framework and the effectiveness of the improved algorithms within it, yielding the experimental results and final conclusions. The experimental analysis shows that the improved algorithms significantly increase the accuracy of public data cleaning and that the proposed public data cleaning framework is feasible and effective. |
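
The window limitation of SNM noted above can be seen in a minimal sketch of the baseline method (not the improved algorithm proposed in this paper). The sketch assumes records are tuples of strings. The function names `record_similarity` and `snm_duplicates`, the fixed weights, the threshold, and the sample records are all hypothetical, and `difflib.SequenceMatcher` is only a stand-in for whatever field similarity measure is actually used.

```python
from difflib import SequenceMatcher


def record_similarity(a, b, weights):
    """Weighted average of per-field string similarity.

    Fixed weights are an illustrative assumption for the baseline only.
    """
    total = sum(weights)
    score = 0.0
    for x, y, w in zip(a, b, weights):
        score += w * SequenceMatcher(None, x, y).ratio()
    return score / total


def snm_duplicates(records, key, window, threshold, weights):
    """Baseline SNM: sort by a key, then compare each record only with the
    records that precede it inside a fixed-size sliding window."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Only the previous (window - 1) records in sort order are candidates.
        for j in order[max(0, pos - window + 1):pos]:
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample records: (name, address).
records = [
    ("City Library", "12 Main Street"),      # 0
    ("City Libary", "12 Main St."),          # 1: misspelled duplicate of 0
    ("Town Hall", "1 Park Road"),            # 2
    ("Central Museum", "5 River Lane"),      # 3
    ("The City Library", "12 Main Street"),  # 4: duplicate of 0, sorts far away
    ("Harbor Office", "9 Dock Way"),         # 5
]

by_name = lambda r: r[0]
narrow = snm_duplicates(records, by_name, window=2, threshold=0.8, weights=(0.6, 0.4))
wide = snm_duplicates(records, by_name, window=3, threshold=0.8, weights=(0.6, 0.4))
print(sorted(narrow))
print(sorted(wide))
```

With a window of 2, records 0 and 4 are never compared, because "Harbor Office" sorts between them. Widening the window to 3 brings the pair into range, at the cost of more comparisons. This sensitivity to the sorting key and window size is what motivates replacing key-based sorting with clustering.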