Research And Implementation Of Similar Duplicate Record Detection Optimization Algorithm Based On DBSCAN

Posted on: 2024-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: Q Xiong
Full Text: PDF
GTID: 2568307130453244
Subject: Computer technology
Abstract/Summary:
Similar duplicate record detection is a crucial component of big data preprocessing. It can be used not only for data cleaning but also extended to data classification. This thesis therefore takes the mainstream density-based DBSCAN algorithm as the baseline for detection and optimizes it in three respects: initial point selection, parameter setting, and adaptability to Chinese text. The effectiveness of the optimized algorithm is verified through experiments and through application testing on a prototype system. The specific research work and results of the thesis are as follows.

(1) An initial point selection method based on geodesic distance and an adaptive parameter selection method based on kernel density estimation are designed. Fixed initial points and fixed parameters significantly affect both the efficiency of DBSCAN and the quality of its clustering results. Because distances between similar duplicate records are fine-grained, the method replaces Euclidean distance with geodesic distance, which better reflects the true distance between samples, and uses shared nearest neighbor similarity, which takes spatial distribution into account, as the basis for calculation. The point with the highest local density is selected as the initial point to ensure its quality. In addition, because most data sets have uneven density, the thesis designs an adaptive parameter selection method that applies kernel density estimation to the density distribution of the data, so that the candidate settings for the neighborhood radius Eps and the density threshold MinPts change with the density of the data set, improving parameter adaptability. The adjusted Rand index (ARI), the V-measure (the harmonic mean of homogeneity and completeness), and adjusted mutual information (AMI) are used as evaluation metrics, and the method is compared with other improved DBSCAN clustering algorithms on four typical data sets. The results show that the proposed algorithm effectively mitigates the uneven distribution of clustering results.

(2) A Chinese adaptability optimization method based on the N-Gram model and the characteristics of Chinese is studied. Poor adaptability to Chinese is an inherent problem of DBSCAN: when processing Chinese data, abbreviations, near-synonyms, and function and content words often interfere with detection, and N-Gram, the most widely used language model, has the same problem. The method uses the Chinese word segmenter ICTCLAS for data filtering and word segmentation, and combines it with a hierarchical weight conversion method that reflects the importance of fields to construct a repetition matrix suited to the characteristics of Chinese. A pair-wise similarity comparison method that can effectively identify spelling errors and abbreviations in records is then used to further improve the applicability and detection accuracy of the N-Gram algorithm for similar duplicate records in Chinese data, and to recluster the similar duplicate records found by DBSCAN. With recall and accuracy as the evaluation metrics, a similar duplicate record detection experiment was carried out on a public data set. The results show that the method effectively improves applicability and detection accuracy for similar duplicate records in Chinese data.

(3) A user classification system based on the Spark platform is designed and implemented. To test the proposed optimization methods in practice, a user classification prototype system was built on the Spark platform. The thesis briefly describes the system requirements analysis, the framework design, and the implementation of the main functional modules, and verifies the effectiveness of the prototype system design and of the optimization methods through the system's operating results.
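The pair-wise N-Gram comparison with field weights described in item (2) can be illustrated with a minimal Python sketch. It is not the thesis's implementation: it assumes character-level bigrams in place of ICTCLAS word segmentation, uses the Dice coefficient as the similarity measure, and the function names, field names, and weights are all hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Return a multiset of character n-grams (assumption: no word segmentation)."""
    text = "".join(text.split())  # drop whitespace so spacing differences are ignored
    if len(text) < n:
        # Strings shorter than n are treated as a single gram (or none if empty).
        return Counter([text]) if text else Counter()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def dice_similarity(a, b, n=2):
    """Dice coefficient over character n-gram multisets, in the range [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:
        return 1.0  # two empty fields are treated as identical
    overlap = sum((grams_a & grams_b).values())  # multiset intersection
    return 2.0 * overlap / total


def record_similarity(rec_a, rec_b, weights, n=2):
    """Weighted average of per-field similarities (weights are hypothetical)."""
    total_weight = sum(weights.values())
    score = sum(
        w * dice_similarity(rec_a.get(field, ""), rec_b.get(field, ""), n)
        for field, w in weights.items()
    )
    return score / total_weight


if __name__ == "__main__":
    a = {"name": "北京大学", "city": "北京"}
    b = {"name": "北京大", "city": "北京"}
    # The name field is given twice the weight of the city field.
    print(record_similarity(a, b, {"name": 2.0, "city": 1.0}))
```

Pairs whose score exceeds a chosen threshold could then be treated as candidate duplicates and passed to a second clustering step; the threshold itself would need to be tuned on the data.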
Keywords/Search Tags:DBSCAN clustering algorithm, initial point selection, parameter adaptation, N-Gram model, secondary clustering