Research On Adaptive Anonymity Methods For Set-Valued Data Publication

Posted on: 2016-01-23
Degree: Master
Type: Thesis
Country: China
Candidate: R Y Xin
Full Text: PDF
GTID: 2308330464454674
Subject: Software engineering
Abstract/Summary:
The rapid development of network information technology and the rise of e-commerce produce vast amounts of data across many applications. Set-valued data, a primary data type among these resources, is a collection of elements associated with a single individual; examples include web logs, online data records, and shopping records. To meet data analysis needs, agencies, departments, and enterprises share such data with researchers, who mine it for potentially valuable knowledge. For example, by analyzing customers' purchase records, data analysts can discover association rules and detect customer behavior patterns, and then provide better services to customers. However, these data often contain private information about individuals, and publishing them directly would leak users' privacy and cause serious trouble in people's lives. Anonymization is therefore necessary before publication. Set-valued data is harder to anonymize than relational data because it is sparse, high-dimensional, and very large. Although many papers address the anonymization of set-valued data, most borrow their ideas from traditional publication strategies for relational data. Because of these characteristics, directly applying existing anonymization techniques, such as k-anonymity and differential privacy, causes a large loss of information.
To protect individual privacy while improving data utility, this paper compares the existing anonymization techniques for relational data and set-valued data and identifies the deficiencies of the existing techniques for set-valued data.
To address these deficiencies, this paper proposes an adaptive privacy protection model based on sensitivity derived from the data's own distribution, and uses partial suppression for anonymization.
First, this paper analyzes the differences between set-valued data and relational data and introduces the common attack models, including background knowledge, linking, homogeneity, and skewness attacks. To counter these attacks, it reviews the k-anonymity family of models, which rely on grouping; the differential privacy model, which adds noise to statistical data; and the p-uncertainty model, which is based on association rules. The k-anonymity family strictly limits group size during publication (no group may contain fewer than k records), which makes it inflexible. The differential privacy model does not depend on the attacker's background knowledge and emphasizes that any single individual's impact on the whole dataset is negligible; the degree of privacy is fully controlled by the publisher, so it is a relatively strong privacy model based on perturbing statistical data. Compared with k-anonymity, the p-uncertainty model is more flexible and places no limit on group size. None of these privacy models, however, considers the distribution of the data itself: a fixed threshold preset as the privacy constraint leads to over-protection and increases information loss.
Second, to address these shortcomings in set-valued data publication, this paper considers the probability distribution of sensitive values together with a partial suppression strategy. It adaptively derives a sensitivity for each sensitive value from that value's distribution, and then introduces a tuning factor to fine-tune the sensitivity.
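As a rough illustration of what a distribution-aware threshold could look like, the sketch below derives a per-value bound from how often each sensitive value occurs and scales it by a tuning factor. The formula (`tuning` times the baseline frequency, capped at `cap`), the function names, and the toy transactions are all assumptions made for illustration. The abstract does not state the SAPP definition of sensitivity, so this should not be read as the thesis's model.

```python
from collections import Counter


def sensitive_frequencies(transactions, sensitive_items):
    """Fraction of transactions that contain each sensitive item."""
    n = len(transactions)
    if n == 0:
        return {s: 0.0 for s in sensitive_items}
    counts = Counter(
        item for t in transactions for item in set(t) if item in sensitive_items
    )
    return {s: counts[s] / n for s in sensitive_items}


def adaptive_thresholds(transactions, sensitive_items, tuning=2.0, cap=1.0):
    """Hypothetical rule: an attacker's confidence in a sensitive item may be
    at most `tuning` times that item's baseline frequency, capped at `cap`.
    Rare values therefore get a tighter bound than common ones."""
    freqs = sensitive_frequencies(transactions, sensitive_items)
    return {s: min(cap, tuning * f) for s, f in freqs.items()}


# Toy data: each transaction is the item set of one individual.
transactions = [
    {"bread", "milk", "hiv_test"},
    {"bread", "milk"},
    {"bread", "flu_med"},
    {"milk", "flu_med"},
]
thresholds = adaptive_thresholds(
    transactions, {"hiv_test", "flu_med"}, tuning=1.5
)
```

Under this assumed rule the rarer value (`hiv_test`, present in one of four transactions) receives a lower bound than the more common one (`flu_med`), whereas a single fixed threshold would treat both identically.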
To obtain the sensitivity of each sensitive value, this paper puts forward a new model based on adaptive sensitivity, named SAPP-anonymity. It then applies partial suppression, suppressing sensitive values according to a specified selection strategy until the set-valued data meets the privacy constraints. The anonymization method uses frequent itemset mining algorithms to find all itemset patterns that cause privacy violations. Processing these patterns in ascending order of length, a greedy algorithm removes each one from the violation set by deleting the sensitive value from the transactions that contain the pattern, which realizes partial suppression; the algorithm ends when the violation set is empty.
Finally, to improve efficiency and scalability, the SAPP model is implemented on the Hadoop distributed system and evaluated in experiments on real datasets. The heuristic Apriori algorithm used to mine frequent patterns narrows the search space, which improves data processing performance. The results show that the SAPP model meets the privacy protection requirements while improving the utility of the data.
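The greedy partial suppression loop described in the abstract can be sketched on a single machine as follows. This is a minimal sketch under several assumptions that are not taken from the thesis: a violation is modeled as a rule from a non-sensitive itemset X to a sensitive value s whose confidence exceeds a per-value threshold; brute-force enumeration of itemsets up to `max_len` stands in for Apriori; and the function names, thresholds, and toy data are hypothetical. It does not reproduce the Hadoop implementation.

```python
from itertools import combinations


def support(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


def find_violations(transactions, sensitive, thresholds, max_len=2):
    """Return (X, s) pairs whose rule confidence exceeds thresholds[s].
    Antecedents are enumerated by increasing size, so the result is
    already in ascending order of pattern length."""
    public = sorted({i for t in transactions for i in t} - sensitive)
    violations = []
    for size in range(1, max_len + 1):
        for combo in combinations(public, size):
            x = frozenset(combo)
            supp_x = support(transactions, x)
            if supp_x == 0:
                continue
            for s in sorted(sensitive):
                conf = support(transactions, x | {s}) / supp_x
                if conf > thresholds[s]:
                    violations.append((x, s))
    return violations


def partial_suppress(transactions, sensitive, thresholds, max_len=2):
    """Greedily delete sensitive values until no violation remains.
    Works on a copy and returns (anonymized data, number of deletions)."""
    data = [set(t) for t in transactions]
    suppressed = 0
    while True:
        violations = find_violations(data, sensitive, thresholds, max_len)
        if not violations:  # violation set is empty: done
            return data, suppressed
        x, s = violations[0]  # shortest violating pattern first
        for t in data:
            if x <= t and s in t:
                t.discard(s)  # suppress only the sensitive value
                suppressed += 1
                break


transactions = [
    {"bread", "milk", "hiv_test"},
    {"bread", "milk"},
    {"bread", "flu_med"},
    {"milk", "flu_med"},
]
sensitive = {"hiv_test", "flu_med"}
thresholds = {"hiv_test": 0.4, "flu_med": 0.6}  # assumed per-value bounds
anonymized, suppressed = partial_suppress(transactions, sensitive, thresholds)
```

In this toy run only the rule from {bread, milk} to `hiv_test` (confidence 1/2) exceeds its bound, so a single sensitive value is deleted and every non-sensitive item is kept. Because antecedents contain only non-sensitive items, each deletion can only lower rule confidences, so the loop terminates.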
Keywords/Search Tags:set-valued data, data publication, privacy preserving, sensitivity-adaptive