
A Research Of Hadoop Based Massive Multi-source Heterogeneous Data Cleaning Technology In Petroleum Field

Posted on: 2018-02-16
Degree: Master
Type: Thesis
Country: China
Candidate: T Chen
GTID: 2381330596969809
Subject: Software engineering
Abstract/Summary:
With the rapid development of global information technology, the digitalization of domestic oil companies has deepened further, and data volumes have grown explosively as a result. Extracting useful information from massive, multi-source, heterogeneous data requires data cleaning, but existing cleaning solutions are not well suited to data of this kind from oil fields. For semi-structured and unstructured data, this thesis studies the characteristics of the data, converts it into XML, establishes a semantic evaluation matrix and a structure evaluation matrix, and then performs data cleaning. For structured data, the cleaning targets are duplicate records, outliers, and missing values. The thesis therefore proposes a cleaning scheme based on a distributed parallel framework, comprising three Hadoop-based methods: an outlier-cleaning method for duplicate records, an association-rule method for outliers, and a cluster-based filling method for missing data. Finally, the proposed scheme is compared experimentally with a traditional scheme. The experimental data come from the geology research institute of the Shengli oil field and are drawn from MongoDB, Oracle, and MySQL databases. The results show that the proposed scheme has clear advantages when processing massive data.
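To make the structured-data part of the scheme more concrete, the following is a minimal sketch of duplicate merging and missing-value filling written in a map/reduce style. It is not the thesis's implementation: it runs in a single Python process rather than on Hadoop, the field names (well, date, block, depth) are hypothetical, duplicates are matched by a simple normalized key, and a per-group mean stands in for the cluster-based filling described in the abstract.

```python
from collections import defaultdict
from statistics import mean


def map_record(record):
    """Map step: emit (normalized key, record) so duplicates share a key.

    The key fields 'well' and 'date' are hypothetical.
    """
    key = (record["well"].strip().lower(), record["date"])
    return key, record


def reduce_group(records):
    """Reduce step: merge records with the same key into one record,
    keeping the first non-missing value seen for each field."""
    merged = {}
    for rec in records:
        for field, value in rec.items():
            if merged.get(field) is None:
                merged[field] = value
    return merged


def deduplicate(records):
    """Group records by key (the shuffle phase), then reduce each group."""
    groups = defaultdict(list)
    for rec in records:
        key, value = map_record(rec)
        groups[key].append(value)
    return [reduce_group(group) for group in groups.values()]


def fill_missing(records, field, group_field):
    """Fill missing values of `field` with the mean of the same group.

    Assumption: a group mean is a simple stand-in for cluster-based filling.
    """
    observed = defaultdict(list)
    for rec in records:
        if rec[field] is not None:
            observed[rec[group_field]].append(rec[field])
    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input records are not modified
        if rec[field] is None and observed[rec[group_field]]:
            rec[field] = mean(observed[rec[group_field]])
        filled.append(rec)
    return filled
```

On a real cluster, the map and reduce functions would be run by the framework across many nodes, with the grouping done by the shuffle phase rather than by an in-memory dictionary.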
Keywords/Search Tags:Data cleaning, Hadoop, Massive Multi-source Heterogeneous Data, Big data