Research On Preservation Of Proximity Privacy And Ensemble Learning K-anonymity Algorithms

Posted on: 2014-08-11
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y J Li
GTID: 1228330434971218
Subject: Computer software and theory
Abstract/Summary:
Data mining and data publication are two important problems in database applications, and applications of knowledge discovery and data mining play a key role in society. Data mining reveals hidden, valuable knowledge, models, or rules, while data publication releases the data to users directly. Without careful protection, however, person-specific data may be leaked, harming the users concerned. For example, association rules may be mined from medical cases in order to predict and control disease, yet the hospital databases that hold those cases may include disease information about specific individuals. Privacy has therefore become a serious concern in many data applications and in data publication.

Anonymization is an effective approach to preventing privacy leakage. It modifies the raw data (by generalization, suppression, etc.) so that records cannot be linked to specific individuals, even with the help of external information. Anonymization of data containing personal private information has attracted growing attention from researchers. In general, privacy preservation involves two concerns: (1) how to protect privacy in publication; (2) how to improve data utility. Both academia and industry strive for a balance between the two.

This thesis focuses on improving data utility while still providing strong privacy preservation, and covers both anonymization algorithms and techniques:

(1) K-anonymity is one of the most important anonymity models and can be achieved by many techniques. In many applications, generalization is a common and simple method. Generalization-based approaches follow a common framework to achieve K-anonymity of a table: divide the table into QI (quasi-identifier) groups so that the size of each QI-group is at least K. We address the problem of proximity privacy in publishing categorical sensitive data.
However, when a traditional approach is used to anonymize the data and generalize the QI-groups, sensitive attribute values that are semantically close may fall into the same QI-group and thus lead to privacy leakage. To solve this issue, the concept of the m-Color constraint is introduced, and a method based on it is proposed to prevent this kind of leakage. The properties of the m-Color constraint and a related generalization algorithm are given, which greatly reduce information loss. Experimental results demonstrate the practicality and efficiency of the algorithm proposed in this thesis.

(2) Existing solutions to privacy-preserving publication can be classified into theoretical and heuristic categories. The former ensures provably low information loss, whereas the latter may incur very large loss in the worst case but is shown empirically to perform well on many real inputs. Although numerous heuristic algorithms have been developed to satisfy advanced privacy principles such as l-diversity and t-closeness, the theoretical category is limited to k-anonymity, the earliest principle, which is known to be severely vulnerable to privacy attacks. Motivated by this, we present the first theoretical study of (ε, m)-anonymity, a new anonymization principle, widely adopted in the literature, for preserving proximity privacy in publishing numerical sensitive data. We show that it is NP-hard to (ε, m)-anonymize a table while minimizing the number of suppressed cells, and the corresponding algorithm is given in this part.

(3) Building on existing anonymization techniques, a new and effective technique called Ensemble Algorithm for Privacy Preservation is presented, which combines generalization and ensemble learning. It can optimize data utility and retain more information in the microdata.
Moreover, the Ensemble Algorithm for Privacy Preservation can resist the presence attack and has a wider range of applications. We present the details of the technique and build its underlying theory.
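
To make the two privacy conditions mentioned in the abstract concrete, the following is a minimal Python sketch, not the algorithms of the thesis. It checks (a) the K-anonymity framework described above, in which every QI-group must contain at least K rows, and (b) an absolute (ε, m) proximity condition as it is commonly stated in the literature: for every row, at most a 1/m fraction of its QI-group may hold a sensitive value within ε of that row's value (the row itself is counted). The function names, the toy table, and the attributes `age`, `zip`, and `salary` are hypothetical and chosen only for illustration; the m-Color constraint is not sketched because the abstract does not define it.

```python
from collections import defaultdict


def qi_groups(rows, qi_attrs):
    """Group rows by their (already generalized) quasi-identifier values."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[a] for a in qi_attrs)
        groups[key].append(row)
    return dict(groups)


def is_k_anonymous(rows, qi_attrs, k):
    """True if every QI-group contains at least k rows."""
    return all(len(g) >= k for g in qi_groups(rows, qi_attrs).values())


def is_eps_m_anonymous(rows, qi_attrs, sensitive, eps, m):
    """Absolute (eps, m) check (assumed definition, see lead-in):
    for each row, at most 1/m of its QI-group may have a sensitive
    value within eps of that row's value, the row itself included."""
    for group in qi_groups(rows, qi_attrs).values():
        values = [r[sensitive] for r in group]
        for v in values:
            close = sum(1 for w in values if abs(w - v) <= eps)
            if close * m > len(values):  # close / |G| > 1 / m
                return False
    return True


# Hypothetical generalized table with two QI-groups of three rows each.
table = [
    {"age": "20-29", "zip": "100**", "salary": 1000},
    {"age": "20-29", "zip": "100**", "salary": 1050},
    {"age": "20-29", "zip": "100**", "salary": 5000},
    {"age": "30-39", "zip": "200**", "salary": 2000},
    {"age": "30-39", "zip": "200**", "salary": 6000},
    {"age": "30-39", "zip": "200**", "salary": 9000},
]
QI = ["age", "zip"]

print(is_k_anonymous(table, QI, 3))
print(is_eps_m_anonymous(table, QI, "salary", eps=100, m=3))
```

In this toy table both QI-groups have three rows, so the table is 3-anonymous, yet the first group holds two salaries within 100 of each other, so the proximity check with ε = 100 and m = 3 fails. This is the kind of leakage through close sensitive values that K-anonymity alone does not rule out.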
Keywords/Search Tags: privacy preservation, k-anonymity, m-color, NP-hard, ensemble learning