Research On Clustering Based Public Data Cleaning Algorithm

Posted on:2024-01-30

Degree:Master

Type:Thesis

Country:China

Candidate:J Bai

Full Text:PDF

GTID:2556307067963119

Subject:Electronic information

Abstract/Summary:

More and more government agencies, enterprises, and institutions are opening their data and have established public open-data platforms to enable data sharing. However, because public data spans many fields and comes from complex sources, it suffers from complicated quality problems in standardization and completeness. As a result, commonly used cleaning algorithms perform unsatisfactorily on public data, which reduces its usability and utilization rate. This thesis therefore analyzes the quality problems of public data on open data platforms, studies and improves several commonly used data cleaning algorithms based on the idea of clustering, and designs a public data cleaning framework.

First, data on mainstream public data platforms is analyzed and the common quality problems of public data are summarized. The basic methods of data cleaning are then introduced and the requirements of public data cleaning are analyzed, which provides the basis for improving the cleaning algorithms.

For cleaning duplicate and similar records, the Sorted Neighborhood Method (SNM) is studied and improved. In SNM, the choice of sorting key affects how similar records are ordered, the sliding window is hard to control and cannot be resized, and the setting of the similarity threshold strongly affects the cleaning result. The algorithm is therefore improved using clustering and comprehensive attribute weights. The comprehensive attribute weight method avoids the problem of fixed attribute weights. Sorting by key is replaced by clustering, and records are matched against the clustered result set, which solves the problem of records being missed because of the window setting. Public data sets are used to compare the improved algorithm with the original and verify its effectiveness.

For cleaning missing values, the thesis studies and improves the data cleaning algorithm based on K-means clustering with respect to determining the initial k value, selecting the initial cluster centers, and detecting and removing isolated points. A more accurate k value is obtained by iteration. Borrowing from the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm, the selection of cluster centers is optimized based on maximum distance and maximum density, which avoids the influence of discrete points on the clustering result and improves data correlation between classes. The improved algorithm is tested and analyzed on data sets.

Finally, a public data cleaning framework is built on the improved cleaning algorithms. The framework consists of a stage for setting cleaning rules and quality rules suited to public data, a cleaning stage, and an evaluation stage. Public data sets are selected for a data cleaning experiment that verifies both the feasibility of the framework and the effectiveness of the improved algorithms within it. The experimental results show that the improved algorithms clean public data with significantly higher accuracy and that the proposed public data cleaning framework is feasible and effective.
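To make the SNM limitations described in the abstract concrete, the following is a minimal sketch of the classic, unimproved SNM baseline, not the thesis's improved algorithm. The function names, the field weights, the window size, and the threshold are all illustrative assumptions. The similarity measure (Python's `difflib.SequenceMatcher`) is a stand-in, since the abstract does not say which measure the thesis uses.

```python
from difflib import SequenceMatcher


def record_similarity(a, b, weights):
    """Weighted average of per-field string similarity, in [0.0, 1.0]."""
    total = sum(weights.values())
    score = 0.0
    for field, w in weights.items():
        ratio = SequenceMatcher(None, str(a[field]), str(b[field])).ratio()
        score += w * ratio
    return score / total


def snm_duplicates(records, sort_key, weights, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, judged similar by basic SNM."""
    # Step 1: sort record indices by the chosen key.
    order = sorted(range(len(records)), key=lambda i: sort_key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        # Step 2: compare only with the next (window - 1) records in sorted order.
        for j in order[pos + 1 : pos + window]:
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample records, invented for illustration only.
records = [
    {"name": "City Library", "addr": "12 Main Street"},
    {"name": "Harbor Museum", "addr": "3 Dock Road"},
    {"name": "City Library", "addr": "12 Main St"},
]
weights = {"name": 0.6, "addr": 0.4}  # fixed attribute weights (assumed values)
pairs = snm_duplicates(records, lambda r: r["name"], weights)
```

In this sketch a duplicate pair is found only if the sort key places the two records within the same window, and the fixed `weights` and `threshold` must be chosen by hand. These are the sensitivities the abstract attributes to SNM.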
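The idea of choosing initial K-means centers by maximum density and maximum distance, while ignoring isolated points, can be sketched as follows. This is one possible reading of the abstract, not the thesis's actual procedure. The `radius` parameter, the neighbor-count density, the "no neighbors means isolated" rule, and the distance-times-density score are all assumptions introduced for illustration.

```python
import math


def init_centers(points, k, radius):
    """Pick k initial centers: densest point first, then far-and-dense points."""
    # Density of a point = number of other points within `radius` (assumed definition).
    density = [
        sum(1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]
    # Points with no neighbors are treated as isolated and never become centers.
    candidates = [i for i, d in enumerate(density) if d > 0]
    if len(candidates) < k:
        raise ValueError("not enough non-isolated points for k centers")

    # First center: the candidate with the highest density.
    centers = [max(candidates, key=lambda i: density[i])]
    while len(centers) < k:
        def score(i):
            # Distance to the nearest chosen center, scaled by local density.
            nearest = min(math.dist(points[i], points[c]) for c in centers)
            return nearest * density[i]

        remaining = [i for i in candidates if i not in centers]
        centers.append(max(remaining, key=score))
    return [points[i] for i in centers]


# Hypothetical 2-D data: two tight groups plus one isolated point at (50, 50).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 11), (50, 50)]
centers = init_centers(points, k=2, radius=1.5)
```

With this sample data the isolated point at (50, 50) has density zero and is excluded, so it cannot pull an initial center away from the two groups. The returned centers could then seed a standard K-means run.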

Keywords/Search Tags:

Public data, Data cleaning algorithm, SNM algorithm, K-means algorithm, Data cleaning Framework

Related items

1	Research On Data Cleaning Framework And Application For Open Government Data
2	Research On Assessment Methods Of The Judge Workload Base On Data Mining
3	Research On Legal Regulation Of Algorithm Discrimination In China In The Era Of Big Data
4	With The The Multidimensional Restrictive Constraints Trading Rules Model And The Mining Algorithm
5	Design And Implementation Of Data Cleaning Framework For Security Industry
6	Research On Legal Regulations For Big Data Maturity Summary
7	The Crisis Of Big Data Algorithm Discrimination And Legal Response
8	Research On Laws And Regulations Of Big Data "Killing Familiar" Driven By Algorithm
9	Operators Use Big Data To Regulate The Precise Premium By Algorithm
10	System Design And Implementation For Department Of Characterization Of Property Criminals Based On Clustering Algorithm