
Research On Data Clustering Methods With Differential Privacy

Posted on: 2021-03-12
Degree: Master
Type: Thesis
Country: China
Candidate: Z F Lv
Full Text: PDF
GTID: 2518306305953559
Subject: Master of Applied Statistics
Abstract/Summary:
With the advent of the big data era, data mining has been applied in many fields. However, the data used in data mining usually contains personal information about users. Malicious analysts may use data mining techniques to extract private information, leaking personal privacy and harming both individuals and society. As a rigorously defined privacy-preserving technology, differential privacy has received widespread attention. Cluster analysis, an important research direction in data mining, can also leak private information during analysis. How to improve the utility of clustering results and balance privacy against utility while still preserving privacy is a research topic with practical value. This thesis analyzes the privacy leakage problems in clustering algorithms and addresses three issues in privacy-preserving clustering: privacy budget allocation, selection of initial centroids, and the inability to determine the number of clusters. To this end, three differentially private data clustering algorithms are proposed.

(1) To improve the efficiency and utility of privacy-preserving clustering, an efficient differentially private data clustering algorithm, EDCDP, is proposed on the MapReduce distributed framework. An initial centroid selection method based on the canopy algorithm is designed so that it can be deployed in MapReduce, and the utility of the clustering results is improved by optimizing the allocation of the privacy budget.

(2) To address privacy preservation when clustering mixed datasets, a differentially private clustering algorithm for mixed data, ODPC, is proposed based on the K-means and K-modes algorithms. By analyzing the loss caused by introducing differential privacy, the privacy budget allocation scheme is improved, which increases the utility of the clustering results.

(3) To address privacy preservation when clustering datasets with an unknown number of clusters, a differentially private clustering algorithm, IDPC, is proposed based on the nonparametric Bayesian method. IDPC does not need the number of clusters to be fixed in advance; the number of clusters adapts to the dataset. In addition, a mechanism is designed to ensure that the algorithm satisfies differential privacy.

For the above algorithms, we provide detailed security and performance analyses to show that they improve the utility of the results and balance privacy and utility while preserving privacy.
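For readers unfamiliar with how differential privacy is usually combined with K-means, the following is a minimal sketch of the generic Laplace-mechanism approach, not the EDCDP, ODPC, or IDPC algorithms described above, whose details the abstract does not give. Several things are assumptions made only for illustration: the function names `laplace` and `dp_kmeans`, data scaled to the unit cube [0, 1]^d, initial centroids chosen without looking at the data, and a uniform split of the privacy budget across iterations (the thesis proposes improved allocation schemes instead).

```python
import random


def laplace(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_kmeans(points, centroids, epsilon, iterations, rng):
    """Hypothetical sketch of Laplace-mechanism K-means.

    Assumes every point lies in [0, 1]^d and that `centroids`
    were chosen independently of the data.
    """
    d = len(points[0])
    eps_t = epsilon / iterations   # uniform budget split (an assumption)
    # Adding or removing one record changes one cluster count by 1 and
    # one cluster's coordinate sums by at most d in L1 norm.
    scale = (d + 1) / eps_t
    centroids = [list(c) for c in centroids]
    k = len(centroids)
    for _ in range(iterations):
        counts = [0.0] * k
        sums = [[0.0] * d for _ in range(k)]
        for p in points:
            # Assign each point to its nearest current centroid.
            j = min(range(k), key=lambda i: sum(
                (a - b) ** 2 for a, b in zip(p, centroids[i])))
            counts[j] += 1
            for t in range(d):
                sums[j][t] += p[t]
        for j in range(k):
            noisy_count = counts[j] + laplace(scale, rng)
            if noisy_count < 1.0:
                continue  # too little signal: keep the previous centroid
            for t in range(d):
                noisy_sum = sums[j][t] + laplace(scale, rng)
                # Clamp the noisy mean back into the data domain.
                centroids[j][t] = min(1.0, max(0.0, noisy_sum / noisy_count))
    return centroids


if __name__ == "__main__":
    rng = random.Random(0)
    data = [[0.1, 0.1], [0.15, 0.2], [0.9, 0.8], [0.85, 0.95]] * 50
    print(dp_kmeans(data, [[0.0, 0.0], [1.0, 1.0]],
                    epsilon=5.0, iterations=3, rng=rng))
```

A smaller `epsilon` or more iterations increases the noise scale, which is the privacy and utility tradeoff that motivates the budget allocation work summarized above.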
Keywords/Search Tags:Clustering, Differential Privacy, K-means Algorithm, K-modes Algorithm, Nonparametric Bayesian Method