Research Of Hydrologic Data Cleansing Scheme Based On Hadoop

Posted on:2017-05-06

Degree:Master

Type:Thesis

Country:China

Candidate:Q N Chen

Full Text:PDF

GTID:2382330566453142

Subject:Information and Communication Engineering

Abstract/Summary:


With the development of networks, shipping data held in distributed storage is accumulating quickly and its volume is growing rapidly. Shipping data is huge and complex. It contains much important information that researchers need to uncover, but it also contains many problems that make such studies difficult, so cleansing dirty data is an important step before data mining. There are many studies on data cleansing, but the versatility and tolerance of the resulting systems are poor: they can only complete one cleansing task. At the same time, studies on big data cleansing are few. This thesis takes the hydrologic data within shipping data as its research subject. Hydrologic data may contain missing data, outliers, or duplicates. A data cleansing scheme based on Hadoop is proposed that provides three cleansing strategies. The main work of this thesis consists of the following three points.

Firstly, to address the low accuracy of the missing data cleansing algorithm, a clustering approach centered on the missing data is proposed. The missing value is calculated from the weighted complete data in the same cluster. The improved algorithm is distributed with MapReduce. Experiments show that the improved clustering approach preserves the correlation between the complete data and the missing data within a cluster and improves the accuracy of missing data cleansing by about 10%.

Secondly, to improve the efficiency of the outlier cleansing algorithm, clustering and pruning are applied to distance-based outlier cleansing. This filters out the regions that contain no outliers and narrows the range of outlier detection. The improved algorithm is distributed with MapReduce. Experiments show that clustering and pruning make the outlier cleansing algorithm more efficient, improving its time efficiency by about 30%.

Thirdly, the duplicate cleansing algorithm is based on the multi-pass sorted neighbor algorithm. By using multiple passes and multiple windows, it increases the number of duplicates found during duplicate cleansing. Boundary value replication and automated partitioning are used to distribute the algorithm: boundary value replication allows data on different nodes to be matched, and automated partitioning handles data redistribution. Experiments show that the improved duplicate cleansing algorithm detects about 13% more duplicates.
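The multi-pass, multi-window idea in the third point can be illustrated with a minimal single-machine sketch. It is not the thesis implementation: the record layout (station, timestamp, water level), the `is_match` rule, the sort keys, and the window sizes are all hypothetical, and the MapReduce distribution with boundary value replication and automated partitioning is not shown.

```python
def sorted_neighborhood_pass(records, key, window, is_match):
    """One pass: sort by `key`, then compare each record with the
    next `window - 1` records in sorted order."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def multi_pass_duplicates(records, passes, is_match):
    """Run several (key, window) passes and merge the matched pairs
    into duplicate groups with a small union-find structure."""
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for key, window in passes:
        for i, j in sorted_neighborhood_pass(records, key, window, is_match):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    groups = {}
    for i in range(len(records)):
        groups.setdefault(find(i), []).append(i)
    return sorted(g for g in groups.values() if len(g) > 1)


# Hypothetical hydrologic readings: (station, timestamp, water level in m).
records = [
    ("S01", "2016-03-01 08:00", 12.40),
    ("S02", "2016-03-01 08:00", 9.10),
    ("S01", "2016-03-01 08:00", 12.41),  # near-duplicate of record 0
    ("S03", "2016-03-01 09:00", 7.75),
    ("S02", "2016-03-01 08:00", 9.10),   # exact duplicate of record 1
]


def is_match(a, b):
    # Assumed rule: same station and timestamp, levels within 0.05 m.
    return a[0] == b[0] and a[1] == b[1] and abs(a[2] - b[2]) <= 0.05


# Two passes with different sort keys and window sizes.
passes = [
    (lambda r: (r[0], r[1]), 2),  # sort by station, then timestamp
    (lambda r: r[2], 3),          # sort by water level
]

duplicate_groups = multi_pass_duplicates(records, passes, is_match)
print(duplicate_groups)
```

Each pass only compares records that fall inside the same sliding window, so a pair missed under one sort key can still be caught under another; merging the pairs across passes is what lets additional passes raise the number of duplicates found.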

Keywords/Search Tags:

Missing Data Cleansing, Outliers Cleansing, Duplicates Cleansing, Hadoop


Related items

1	Research On The Data Cleansing Methods For Bridge Monitoring Data Based On Big-Data Platform
2	Taxi Data Quality Analysis And Processing Based On Hadoop
3	Study On Sedimentation And Cleansing Technology For Pressurized CSO Deep Chamber
4	Automated Load Curve Data Cleansing in Power Systems
5	Research And Implementation Of Data Cleaning And Power Quality Assessment
6	Analysis On The Current Situation And Development Of Men's Cleansing Products Packaging
7	Research On Anomaly Detection Method For Driving Data From Internet Of Vehicle Based On Isolation Forest
8	Study On Data Cleaning And Feature Automatically Extraction Of The Measured Overvoltage
9	Suburban cleansing: An ecological, economical and social intervention in Tysons Corner, Virginia
10	The temporal and gradational trends of sand infiltration in a gravel bed