
A Clustering Validity Index Based on Noise Suppression and Its Application

Posted on: 2021-12-09
Degree: Master
Type: Thesis
Country: China
Candidate: C Cai
Full Text: PDF
GTID: 2558307109976029
Subject: Computer technology
Abstract/Summary:
As an effective tool for big data analysis, data mining can satisfy people’s need to explore the deep information behind the data. Clustering is a popular research direction in the field of data mining. As a classic clustering algorithm, the K-means algorithm is widely used in cluster analysis because it is simple, efficient, and easy to implement. The clustering validity index is an important method for evaluating clustering results, and it has an important influence on determining the optimal number of clusters kopt and on clustering accuracy. Existing clustering validity indexes do not consider the influence of noisy data, so studying clustering validity indexes in a noisy-data environment is of great practical significance. The K-means algorithm based on differential privacy adds random noise to the true values of the data, so that the overall patterns of the data can be mined while individual privacy is protected.

To address the problem that traditional clustering validity indexes use only the structural information of the dataset and ignore the influence of noisy data, this thesis proposes a new clustering validity index, the similarity factor. The index is then used to improve the iterative convergence criterion and the data partitioning method of the differential privacy K-means algorithm, and a differential privacy K-means algorithm based on the similarity factor is proposed. The main research contents of this thesis are as follows.

1) By studying how the two factors of data cohesion and separation affect cluster density and separation in a noisy environment, a new clustering validity index, the similarity factor, is proposed, and a noise distance suppression function is introduced to reduce the impact of noisy data points on the evaluation results of the validity index. Based on the similarity factor, the clustering effect of the traditional K-means algorithm is evaluated, and a method for determining the optimal number of clusters kopt is given.

2) For the iterative convergence criterion and the data partitioning method of the differential privacy K-means algorithm, the similarity factor is applied to improve the availability of the algorithm's clustering results. The differential privacy K-means algorithm still uses the criterion function SSE of the traditional K-means algorithm and then minimizes it with a local search method, which reduces the availability of the clustering results to a certain extent. Based on the clustering validity index proposed in this thesis, a new criterion function SCF is constructed and used to assign data points to clusters during clustering, so that under the same privacy budget the improved differential privacy K-means algorithm yields clustering results with better availability.
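The baseline that the abstract describes, a differential privacy K-means algorithm that adds random noise and is judged by the SSE criterion, can be sketched roughly as follows. This is a minimal illustration, not the thesis's method. It assumes points scaled into [0, 1]^d, Laplace noise on per-cluster sums and counts, and an even split of the privacy budget across iterations. The names `dp_kmeans`, `assign`, `sse`, and `laplace_noise` are hypothetical. The similarity factor and the SCF criterion are not defined in the abstract, so they are not reproduced here.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw one Laplace(0, scale) sample as a difference of two exponentials."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def assign(points, centers):
    """Label each point with the index of its nearest center."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def sse(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(math.dist(p, centers[c]) ** 2 for p, c in zip(points, labels))


def dp_kmeans(points, k, epsilon, iterations=5, seed=0):
    """Toy differentially private K-means; returns the noisy centers."""
    rng = random.Random(seed)
    dim = len(points[0])
    # Assumption: for data in [0, 1]^d, one point changes a cluster's
    # (sum vector, count) by at most d + 1 in L1 norm, and each iteration
    # receives an equal share epsilon / iterations of the budget.
    scale = (dim + 1) * iterations / epsilon
    # Data-independent random start, so initialization spends no budget.
    centers = [[rng.random() for _ in range(dim)] for _ in range(k)]
    for _ in range(iterations):
        labels = assign(points, centers)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            noisy_count = len(members) + laplace_noise(scale, rng)
            if noisy_count < 1.0:
                continue  # too little (noisy) support: keep the old center
            noisy_sums = [sum(p[j] for p in members) + laplace_noise(scale, rng)
                          for j in range(dim)]
            # Clip back into the data domain after dividing noisy values.
            centers[c] = [min(1.0, max(0.0, s / noisy_count)) for s in noisy_sums]
    return centers


if __name__ == "__main__":
    data = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.1),
            (0.8, 0.9), (0.9, 0.8), (0.85, 0.85)]
    noisy_centers = dp_kmeans(data, k=2, epsilon=1.0)
    print(noisy_centers, sse(data, noisy_centers, assign(data, noisy_centers)))
```

In this sketch only the centers are treated as the private output. The labels and the SSE value are computed from the raw data for evaluation and would need their own privacy accounting before release. A smaller `epsilon` gives a larger noise scale, which is the loss of availability under a fixed privacy budget that the abstract refers to.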
Keywords/Search Tags:data mining, clustering, K-means, differential privacy, clustering validity index, criterion function