| With the rapid development of the Internet of Things, the mobile Internet, and smartphones, data production has grown exponentially, and big data technology has emerged. Due to various factors, collected data inevitably has quality problems. Using such “dirty data” for data mining may lead to incorrect knowledge discovery and erroneous data analysis, which can mislead researchers and even enterprises and cause them losses. To improve data quality, data sets must be preprocessed; similar duplicate data detection and outlier detection are particularly important. Conventional similar duplicate detection algorithms suffer from inefficient processing of large-scale data, improper selection of field matching algorithms, and detection accuracy that depends on feature selection. The detection accuracy of outlier detection algorithms depends to a large extent on parameter selection and feature selection. To solve these problems, this thesis builds on an in-depth study of data preprocessing technology, combines it with big data technology, and designs and implements a social engineering data preprocessing system. The research work in this thesis includes the following: (1) The advantages and disadvantages of existing similar duplicate data detection algorithms are studied and analyzed, and a partition-based similar duplicate data detection algorithm named P-SNM is proposed. The divide-and-conquer idea is applied to big data: the large data set is partitioned, weights are assigned to each attribute using the hierarchical comprehensive evaluation method, and keywords are selected according to the weights. A static index pruning technique is introduced to prune the large number of candidate sets generated by the Q-Gram inverted index, and edit distance is used to calculate the similarity scores of all weighted attributes to implement field matching. Experiments show that the algorithm improves detection efficiency.
(2) The advantages and disadvantages of existing outlier detection algorithms are studied and summarized, and a natural-neighbor-based outlier detection algorithm called N-LOF is proposed. Introducing the natural neighbor algorithm enables N-LOF to adaptively determine appropriate parameters for different data sets. The PCA algorithm is used to extract appropriate model features, so that the algorithm can effectively process high-dimensional data sets. Comparison experiments verify the effectiveness of the algorithm, and its running time is also improved. (3) Based on the partition-based similar duplicate data detection algorithm and the natural-neighbor-based outlier detection algorithm, this thesis designs and implements the social engineering data preprocessing system. The system uses the Hadoop platform and the MapReduce programming framework to extract, preprocess, store, and query social engineering data, and to visualize the preprocessing results. |
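
As a rough illustration of the field-matching step in item (1), the Python sketch below builds a Q-gram inverted index, collects candidate record pairs from it, and scores two records by a weighted, normalized edit-distance similarity. It is not the P-SNM implementation: the function names, the `#` padding character, the `min_shared` threshold, and the normalization by the longer string are all assumptions made for this example. In particular, the shared-gram threshold is only a simple stand-in for candidate reduction, not the static index pruning technique used in the thesis, and the attribute weights are supplied by hand rather than derived by the hierarchical comprehensive evaluation method.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple


def qgrams(text: str, q: int = 2) -> Set[str]:
    """Return the set of q-grams of a lower-cased, '#'-padded string (padding is an assumption)."""
    padded = "#" * (q - 1) + text.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def candidate_pairs(values: List[str], q: int = 2, min_shared: int = 2) -> Set[Tuple[int, int]]:
    """Use a q-gram inverted index to find record pairs sharing at least min_shared grams.

    The threshold is a hypothetical stand-in for candidate reduction.
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for record_id, value in enumerate(values):
        for gram in qgrams(value, q):
            index[gram].add(record_id)

    shared: Dict[Tuple[int, int], int] = defaultdict(int)
    for ids in index.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                shared[(a, b)] += 1
    return {pair for pair, count in shared.items() if count >= min_shared}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def field_similarity(a: str, b: str) -> float:
    """Map edit distance to [0, 1]; dividing by the longer length is an assumption."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def record_similarity(r1: Dict[str, str], r2: Dict[str, str], weights: Dict[str, float]) -> float:
    """Weighted average of per-attribute similarities over the weighted attributes."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[f], r2[f]) for f, w in weights.items()) / total
```

In a setup like this, only the pairs returned by `candidate_pairs` would be passed to `record_similarity`, and a pair would be flagged as a similar duplicate when its score exceeds a chosen threshold.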