Research Of Hydrologic Data Cleansing Scheme Based On Hadoop

Posted on: 2017-05-06  Degree: Master  Type: Thesis
Country: China  Candidate: Q N Chen  Full Text: PDF
GTID: 2382330566453142  Subject: Information and Communication Engineering
Abstract/Summary:
With the development of networks, shipping data held in distributed storage accumulates quickly and its volume grows rapidly. This data is both large and complex: it contains much important information for researchers to discover, but also many quality problems that make analysis difficult. Cleansing the dirty data is therefore an important step before data mining. Data cleansing has been studied extensively, but the versatility and tolerance of existing systems are poor, and each can complete only one cleansing task. Studies of big data cleansing are also few. This thesis takes the hydrologic data within shipping data as its research subject. Hydrologic data may contain missing values, outliers, or duplicates. A data cleansing scheme based on Hadoop is proposed that provides three cleansing strategies. The main work of the thesis covers the following three points.

First, to address the low accuracy of missing-data cleansing algorithms, a clustering approach centered on the missing data is proposed. The missing value is computed by weighting the complete records in the same cluster. The improved algorithm is distributed with MapReduce. Experiments show that the improved clustering approach preserves the correlation between the complete data and the missing data within a cluster, and raises the accuracy of missing-data cleansing by about 10%.

Second, to improve the efficiency of outlier cleansing, clustering and pruning are applied to distance-based outlier cleansing. This filters out regions that contain no outliers and narrows the range of outlier detection. The improved algorithm is distributed with MapReduce. Experiments show that clustering and pruning make outlier cleansing more efficient, improving its time efficiency by about 30%.

Third, the duplicate cleansing algorithm is based on the multi-pass sorted neighbor algorithm. Using multiple passes and multiple windows increases the number of duplicates found during cleansing. In the distributed version of the algorithm, boundary value replication and automated partitioning are used: boundary value replication allows records on different nodes to be matched, and automated partitioning handles data redistribution. Experiments show that the improved algorithm detects about 13% more duplicates.
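The multi-pass, multi-window idea behind the duplicate cleansing step can be illustrated with a small single-machine sketch. This is not the thesis implementation: the record layout (station, timestamp, water level), the matching rule in `is_duplicate`, the sort keys, the window sizes, and the toy records are all assumptions made for illustration, and the MapReduce distribution with boundary value replication and automated partitioning is not shown.

```python
def is_duplicate(a, b, tol=0.01):
    """Hypothetical matching rule: same station (ignoring case and
    surrounding whitespace), same timestamp, water level within tol."""
    return (a[0].strip().lower() == b[0].strip().lower()
            and a[1] == b[1]
            and abs(a[2] - b[2]) <= tol)


def sorted_neighborhood_pass(records, key, window):
    """One pass: sort record indices by key, then compare each record
    with the next (window - 1) records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1: pos + window]:
            if is_duplicate(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass(records, passes):
    """Union the duplicate pairs found by several (key, window) passes."""
    found = set()
    for key, window in passes:
        found |= sorted_neighborhood_pass(records, key, window)
    return found


# Toy records, invented for this example: (station, timestamp, water level).
records = [
    ("Wuhan", "2016-05-01 08:00", 23.41),
    ("Yichang", "2016-05-01 08:00", 41.02),
    ("wuhan ", "2016-05-01 08:00", 23.41),   # same reading, different spelling
    ("Nanjing", "2016-05-02 08:00", 7.80),
    ("Wuhan", "2016-05-01 08:00", 23.41),    # exact copy of record 0
]

# Each pass uses its own sort key and its own window size.
passes = [
    (lambda r: r[0], 2),   # sort by raw station name, compare adjacent records
    (lambda r: r[2], 3),   # sort by water level, compare within a window of 3
]

single = sorted_neighborhood_pass(records, key=lambda r: r[0], window=2)
combined = multi_pass(records, passes)
```

With these toy records, sorting by raw station name places "wuhan " far from "Wuhan", so the first pass alone finds only the exact copy; the second pass, sorted by water level, brings all three readings together. Combining passes with different keys and windows is what allows more duplicates to be found than a single pass.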
Keywords: Missing Data Cleansing, Outliers Cleansing, Duplicates Cleansing, Hadoop