
Research On Enterprise Risk Data Cleaning Method And Process Based On Multi-source Fusion

Posted on: 2024-06-26
Degree: Master
Type: Thesis
Country: China
Candidate: H C Jiang
GTID: 2568307070450534
Subject: Engineering
Abstract/Summary:
In recent years, enterprise digitization has advanced rapidly. With successive waves of technological progress, data has become an indispensable factor in the strategic development of enterprises and a driver of business growth. In a big data environment, data cleaning plays a critical role in data processing: it determines the value and flow of data and acts as the gatekeeper for data applications. As a key element of data quality control, it is also the basis for stable business development. However, as the types of data sources increase, data quality issues are increasingly exposed, and multi-source data fusion has become a pain point for downstream data users and one of the main directions of current data cleaning research. To meet the actual business needs of enterprises, this paper focuses on multi-source fusion and cleaning methods for government supervision risk data and discusses methods for processing large-scale data in actual projects. After multiple iterations and verification in practice, the proposed data fusion method significantly improves data quality. The data cleaning process summarized in this paper effectively reduces enterprise development costs and helps solve the data cleaning problems posed by multi-source data, improving data quality and value.

The main work of this paper is as follows:

(1) Based on the basic theory of data cleaning, this paper analyzes in depth the application and improvement of currently popular data cleaning methods in actual projects. To address the hard coding and lack of flexibility of traditional regular-expression rule-based cleaning methods, it introduces a distributed configuration center combined with state caching, so that rules can be edited and verified quickly, which greatly improves development efficiency and accuracy.

(2) Starting from the characteristics of enterprise risk data, this paper proposes a new multi-source data fusion method. The method clusters records by grouping them according to custom field priorities and then obtains the field values to be merged. Combined with dynamic configuration, it corrects and completes the fields, effectively reducing the null-value rate of field values.

(3) Starting from the technical architecture of the cleaning process and drawing on the advanced features of Flink, this paper examines solutions for real-time data cleaning and designs and implements a data production framework based on a real-time data warehouse.

Finally, this paper analyzes the problems that remain to be solved and considers the future technical architecture of the cleaning process. The research results are expected to provide practical and innovative solutions for multi-source cleaning and fusion of enterprise risk data and to guide the development of the technical architecture of data cleaning processes.
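The abstract does not give the details of the fusion method in item (2), so the following is only a minimal sketch of one plausible reading: records are grouped by an entity key, and each field takes the first non-missing value from the sources in a configurable priority order. The source names, field names, the `company_id` key, and the `FIELD_PRIORITY` table are all hypothetical and stand in for whatever the dynamic configuration would supply.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List

# Hypothetical configuration: per-field source priority (earlier = more trusted).
# In the thesis this would presumably come from the configuration center.
FIELD_PRIORITY: Dict[str, List[str]] = {
    "registered_capital": ["regulator", "court", "crawler"],
    "legal_representative": ["regulator", "crawler", "court"],
}
DEFAULT_PRIORITY: List[str] = ["regulator", "court", "crawler"]


def is_missing(value: Any) -> bool:
    """Treat None and blank strings as null values."""
    return value is None or (isinstance(value, str) and not value.strip())


def fuse_records(
    records: Iterable[Dict[str, Any]], key: str = "company_id"
) -> Dict[Any, Dict[str, Any]]:
    """Group records by entity key and merge each field by source priority."""
    groups: Dict[Any, List[Dict[str, Any]]] = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    fused: Dict[Any, Dict[str, Any]] = {}
    for entity, recs in groups.items():
        # If one source appears twice for an entity, the later record wins.
        by_source = {r["source"]: r for r in recs}
        fields = sorted({f for r in recs for f in r} - {key, "source"})
        merged: Dict[str, Any] = {key: entity}
        for field in fields:
            merged[field] = None  # stays None if every source is missing
            # Sources absent from the priority list are ignored in this sketch.
            for source in FIELD_PRIORITY.get(field, DEFAULT_PRIORITY):
                value = by_source.get(source, {}).get(field)
                if not is_missing(value):
                    merged[field] = value
                    break
        fused[entity] = merged
    return fused


if __name__ == "__main__":
    sample = [
        {"company_id": "c1", "source": "crawler",
         "registered_capital": "500", "legal_representative": "Li"},
        {"company_id": "c1", "source": "regulator",
         "registered_capital": None, "legal_representative": "Wang"},
    ]
    print(fuse_records(sample))
```

Keeping the priority table as data rather than code is what would let it be changed at run time without redeploying the cleaning job, which is the point of pairing the fusion step with dynamic configuration.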
Keywords/Search Tags: Data cleaning, multi-source fusion, Apache Flink, process design