Design And Implementation Of Data Cleaning Framework For Security Industry

Posted on: 2021-04-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
Full Text: PDF
GTID: 2416330647964127
Subject: Computer technology
Abstract/Summary:
Demand for security and alarm equipment is increasing across industries, and the structured data of the security industry is growing explosively. This data also contains a large amount of dirty data, especially approximately duplicate records, which causes great trouble for applications that use it. Existing data cleaning algorithms are mostly tailored to specific industries and perform poorly on the structured data of the security industry. This thesis therefore develops a data cleaning framework for the security industry. Based on the characteristics of security industry data, the approximately duplicate record cleaning algorithm is improved and applied to the framework designed in this thesis. The main contributions and innovations are:

(1) Reviewing the data cleaning algorithms of recent years, especially approximately duplicate record cleaning algorithms. The advantages and disadvantages of current missing value cleaning algorithms and error value cleaning algorithms are compared, and improved methods for approximately duplicate record detection and merging are analyzed. Finally, existing data cleaning frameworks are introduced, and the reasons why they are not suitable for the security industry are analyzed.

(2) Introducing a convolutional neural network for the detection of approximately duplicate records. Two improved models are proposed, both derived from the LeNet-5 model. One is a convolutional neural network that takes a word embedding matrix as input; the other takes a similarity matrix as input. In experiments, the accuracy, recall, and F1 score of the model with the word vector matrix as input are all above 0.96, and those of the model with the similarity matrix as input are all around 0.98. Finally, k-fold cross-validation is carried out for the two models, and it is concluded that both have strong generalization ability.

(3) Improving the Multi-Pass Sorted Neighborhood algorithm for merging approximately duplicate records in four respects. First, keywords are extracted by word segmentation and sorted, so that approximately duplicate records are positioned closer together. Second, the window is expanded where approximately duplicate records are clustered in the same class, which makes the connected graph more complete. Third, record pairs detected as approximately duplicate are tested again, so that efficiency and recall can be improved. Fourth, the records in the new connected graph formed from all maximum cliques of the connected graph are merged as approximately duplicate records, and records with a low probability of being approximate duplicates are excluded.

(4) Designing and developing the data cleaning framework for the security industry. The models and algorithm proposed in this thesis are embedded in the framework, which provides support and a reference for the development of data cleaning tools in the security industry.
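The windowed comparison and connected-graph clustering that contribution (3) builds on can be illustrated with a minimal sketch of the basic multi-pass sorted neighborhood method. This is not the improved algorithm of the thesis: the function names, the window size, the threshold, and the sample sort keys are all hypothetical, and the string similarity from Python's difflib is only a stand-in for the CNN-based detector described in contribution (2).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in pair detector (assumption): plain string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def sorted_neighborhood(records, key_funcs, window=3, threshold=0.85):
    """Basic multi-pass sorted neighborhood sketch.

    Each pass sorts the records by one key, compares every record with the
    next (window - 1) records in that order, and links pairs whose
    similarity reaches the threshold. Linked records are grouped as
    connected components with a union-find structure.
    Returns a sorted list of clusters (lists of record indices) of size > 1.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for key in key_funcs:  # one pass per sort key
        order = sorted(range(len(records)), key=lambda i: key(records[i]))
        for pos, i in enumerate(order):
            for j in order[pos + 1 : pos + window]:
                if similarity(records[i], records[j]) >= threshold:
                    parent[find(i)] = find(j)  # link the two components

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(c) for c in clusters.values() if len(c) > 1)


if __name__ == "__main__":
    # Hypothetical sample records and sort keys, for illustration only.
    sample = [
        "Hikvision camera DS-2CD",
        "Hikvision camera DS-2CD.",
        "Dahua alarm panel",
        "smoke detector X1",
    ]
    keys = [lambda r: r.lower(), lambda r: r.lower()[::-1]]
    print(sorted_neighborhood(sample, keys))
```

Running several passes with different sort keys gives records that are far apart under one key a chance to fall into the same window under another, which is the motivation for the multi-pass variant.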
Keywords/Search Tags:security industry, data cleaning, approximately duplicate record, LeNet-5, sorted neighborhood