Industrial big data underpins future intelligent information systems, and large volumes of valuable data are essential for enterprise development. However, data quality does not receive the attention it deserves: incomplete, duplicated, and invalid records cause analyses built on such data to produce erroneous results. Data quality must therefore be taken seriously. This paper focuses on duplicated records in industrial data. Because traditional cleaning algorithms often perform poorly on large data sets, a new method is introduced. Since industrial data is large in volume, high-dimensional, and diverse in type, this paper proposes a new algorithm for detecting approximately duplicate records based on multiple edit distances and attribute weights. The algorithm distinguishes the characteristics of Chinese and Western languages and exploits the field characteristics of the data, thereby improving detection accuracy. Matching across the entire industrial data set is unnecessary; the concepts of length filtering and a dynamically expanding window are therefore proposed. Length filtering discards records that cannot form duplicate pairs, and record similarity is compared only within a dynamically scaled window that is adjusted during processing to reduce unnecessary record matching. The purpose of this paper is to deepen research on duplicate-record recognition and to serve as a reference for applying duplicate-record detection to industrial data. Experimental results show that the improved algorithm identifies duplicate records with high accuracy and efficiency, verifying the value of duplicate-record recognition algorithms in industrial big data.
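The pipeline described above (weighted per-field edit-distance similarity, a length filter, and windowed matching over a sorted data set) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the field weights, similarity threshold, length-ratio cutoff, and window-growth rule are all illustrative assumptions.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, single-row variant.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (a[i - 1] != b[j - 1]))  # substitution
            prev = cur
    return dp[n]

def field_similarity(a, b):
    # Normalize edit distance into [0, 1]; identical fields score 1.0.
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def record_similarity(r1, r2, weights):
    # Weighted sum of per-field similarities; weights are assumed to sum to 1.
    return sum(w * field_similarity(f1, f2)
               for f1, f2, w in zip(r1, r2, weights))

def detect_duplicates(records, weights, sim_threshold=0.85,
                      len_ratio=0.5, window=3):
    keys = ["".join(r) for r in records]
    # Sort on a concatenated key so near-duplicates cluster together
    # (sorted-neighborhood style); compare only within a sliding window.
    order = sorted(range(len(records)), key=keys.__getitem__)
    pairs = set()
    for pos, i in enumerate(order):
        w = window          # base window size; grown when matches are found
        k = pos + 1
        while k <= pos + w and k < len(order):
            j = order[k]
            k += 1
            a, b = keys[i], keys[j]
            # Length filter: records whose lengths differ too much
            # cannot form a duplicate pair, so skip the costly comparison.
            if min(len(a), len(b)) / max(len(a), len(b)) < len_ratio:
                continue
            if record_similarity(records[i], records[j], weights) >= sim_threshold:
                pairs.add((min(i, j), max(i, j)))
                w += 1      # dynamically expand the window past a detected duplicate
    return pairs
```

For example, with `records = [("John Smith", "New York"), ("Jon Smith", "New York"), ("Alice Wong", "Boston")]` and `weights = (0.7, 0.3)`, only the first two records exceed the threshold and are reported as a duplicate pair.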