
Research On Application Of Data Cleaning Framework And Algorithm In Oil Field

Posted on: 2022-10-30
Degree: Master
Type: Thesis
Country: China
Candidate: Y. H. Mu
GTID: 2481306329451154
Subject: Software engineering
Abstract/Summary:
With the deepening digital transformation of the oil industry across its disciplines, higher requirements are placed on oilfield data quality. Only high-quality data can guarantee that valuable information is mined from massive data and that decision-making becomes more accurate. At present, however, oilfield data suffers from quality problems such as similar duplicate records, anomalies, missing values, and inconsistent data, so data governance has become an important issue in enterprise digital transformation. Finding these problems is important, but correcting the data is even more so; hence the concept of data cleaning. The algorithms involved in data cleaning are both complex and varied, and cleaning involves many elements: the data sets to be cleaned, attributes, cleaning rules, cleaning algorithms, processes, cleaning tasks, and so on. How to effectively model this complex knowledge and its relationships is an important and key scientific problem. In addition, to standardize the data cleaning workflow, this thesis draws on the data governance standard system of the DAMA (Data Management Association) China Branch, which provides a standard basis for constructing an ontology of the standardized data cleaning workflow.

First, this thesis presents a comprehensive survey of Data Quality Management (DQM), covering data quality, ontology, and the data quality management vocabulary involved, together with data cleaning and data cleaning frameworks. This research lays the theoretical foundation for constructing the standardized data cleaning process ontology. Then, in designing the cleaning framework, the quality ontology and data quality management vocabulary from DQM are integrated with standard vocabulary from the petroleum domain to determine the concepts and terms related to data cleaning operations. With reference to Stanford University's seven-step ontology construction method, a new ontology model of the standardized data cleaning process is built. This model separates the standardized cleaning process from specific data sources: in the task-method ontology, instance-level cleaning algorithms are invoked to clean the data of a specific source, and the ontology also supports knowledge expansion and reuse. This solves both the complex knowledge representation problem in data cleaning and the flexibility problem in application.

At the level of concrete instance algorithms, this thesis focuses on cleaning similar duplicate records in Chinese. It first introduces the relevant background, including the definition of similar duplicate records and the algorithms currently used to detect them. By comparing the advantages and disadvantages of these algorithms, an improved SNM (Sorted Neighborhood Method) Chinese semantic duplicate record detection algorithm is proposed. Its main points are:
1. The SNM algorithm is used to sort records effectively by keyword.
2. For Chinese fields, word similarity is computed with the Synonym Forest Extended Edition and the Jaccard algorithm; sentences are segmented with Python's Jieba Chinese word segmenter, which is used to optimize cosine similarity for computing sentence similarity.
3. Whether a pair constitutes a similar duplicate record is determined by adjusting different thresholds.

Finally, based on the data quality and data cleaning problems of a certain oilfield company, an oil data cleaning system is designed and implemented according to the above research results. The system provides data cleaning operation formulation, similar duplicate record cleaning, outlier cleaning, missing-value filling, and visualization of cleaning results. The similar duplicate record model is tested on abandoned-well sealing data. Experiments show that the model achieves high recall and precision for similar duplicate record detection, verifying the effectiveness and feasibility of the cleaning method proposed in this thesis.
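The SNM-based detection flow above can be sketched as follows. This is a minimal illustration, not the thesis's implementation: it substitutes a character-level Jaccard measure for the synonym-forest word similarity and the Jieba-based cosine similarity, and the well-name records are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Character-level Jaccard similarity between two field values."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def snm_duplicates(records, key, window=3, threshold=0.8):
    """Return index pairs judged as similar duplicates.

    Records are sorted by the chosen key (SNM step 1), then each record is
    compared only with its predecessors inside a sliding window (step 2);
    a pair is reported when similarity meets the threshold (step 3).
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if jaccard(records[i], records[j]) >= threshold:
                pairs.append(tuple(sorted((i, j))))
    return pairs

# Hypothetical well-name records; the first two differ only in "1" vs "一".
recs = ["大庆油田1号井", "大庆油田一号井", "胜利油田2号井"]
print(snm_duplicates(recs, key=lambda r: r, threshold=0.7))  # → [(0, 1)]
```

Because comparisons happen only inside the window, the sliding-window step keeps the pairwise-comparison cost near-linear in the number of records, which is the main advantage SNM offers over all-pairs matching.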
Keywords/Search Tags: Ontology, Detection of similar duplicate records, SNM algorithm, Data cleaning, Data quality management