Research On MapReduce-based Parallel Architecture And Privacy Protection In Association Rule Mining

Posted on: 2017-03-13
Degree: Master
Type: Thesis
Country: China
Candidate: F R Xiong
GTID: 2308330485957852
Subject: Computer Science and Technology
Abstract/Summary:
With the rapid development of information and network technology, the volume of global information is growing explosively. Extracting useful information quickly has become a pressing problem. Association rule mining is an important research topic in data mining and is widely used in many fields. Applying association rule mining correctly is a central task of data mining and the main research direction of this thesis. When mining massive data, traditional association rule algorithms often run out of memory. Parallel techniques can process massive data efficiently, so research on parallel association rule algorithms is of great practical significance. At the same time, as mining technology continues to improve, users' private information may be leaked, so data privacy protection is also necessary. For massive data mining and privacy protection, this thesis proposes a parallel MRRCHA algorithm based on privacy protection and a parallel MRFP algorithm based on MapReduce. The main research work is as follows:
(1) The traditional Apriori algorithm generates a large number of candidate itemsets and requires a large amount of memory. To address this shortcoming and the problem of privacy leakage, a parallel PRRCHA algorithm based on privacy protection is proposed. First, CHA, an optimized algorithm based on Apriori, is presented; it reduces the number of candidate itemsets generated, simplifies the generation of maximal frequent itemsets, and obtains all frequent itemsets accurately. Then, the frequent pattern mining process of CHA is analyzed using the MapReduce programming model, which keeps the data partitions independent of one another, ensures the completeness of the algorithm, and allows each step to run in parallel. Finally, experiments show that the proposed parallel PCHA algorithm based on MapReduce not only processes large data sets efficiently but also effectively solves the problem of insufficient memory when mining massive data.
(2) The traditional FP-growth algorithm needs to traverse a large number of shared prefixes when building the FP-tree. To address this shortcoming, this thesis first gives RFP, an optimized algorithm based on FP-growth, which reorders the entire data set to reduce the time spent traversing shared prefixes and to improve the efficiency of FP-tree construction. Then, the MapReduce programming model and the RFP algorithm are combined into a parallel PRFP algorithm based on MapReduce, which keeps the data partitions independent of one another and ensures the completeness of the algorithm. Finally, experiments show that the proposed parallel PRFP algorithm based on MapReduce not only processes large data sets efficiently but also effectively solves the problem of insufficient memory when mining massive data.
(3) Because of the growing ability to trace and collect large amounts of personal information, privacy preservation has become an important issue in the development of data mining techniques. Many methods have been proposed to solve this problem, but they cannot handle massive data efficiently. This thesis introduces a privacy-preserving data mining algorithm named PRRCHA. PRRCHA not only helps preserve privacy but can also process massive data efficiently. According to the experimental results, the proposed PRRCHA reduces time complexity.
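
The MapReduce-style frequent itemset counting mentioned in item (1) can be illustrated with a minimal sketch. The abstract does not give the details of CHA, PCHA, or PRRCHA, so this is not the thesis's algorithm: it is a generic single Apriori counting pass written in plain Python, with no Hadoop dependency. The function names (`map_phase`, `reduce_phase`, `frequent_itemsets`) and the sample data are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def map_phase(partition, k):
    """Mapper (illustrative): emit (itemset, 1) for every k-item subset
    of every transaction in one data partition."""
    for transaction in partition:
        # Sort and de-duplicate so the same itemset always yields the same key.
        for itemset in combinations(sorted(set(transaction)), k):
            yield itemset, 1


def reduce_phase(pairs):
    """Reducer (illustrative): sum the counts emitted for each itemset key."""
    totals = Counter()
    for itemset, count in pairs:
        totals[itemset] += count
    return totals


def frequent_itemsets(partitions, k, min_support):
    """Run the map phase over each partition independently, merge the
    results in the reduce phase, and keep itemsets meeting min_support."""
    pairs = (pair for part in partitions for pair in map_phase(part, k))
    totals = reduce_phase(pairs)
    return {s: c for s, c in totals.items() if c >= min_support}


# Two partitions stand in for data blocks processed by separate mappers.
partitions = [
    [["a", "b", "c"], ["a", "b"]],
    [["a", "c"], ["a", "b", "d"]],
]
result = frequent_itemsets(partitions, k=2, min_support=2)
```

Because each partition is mapped without reference to the others, no single worker has to hold the whole data set in memory, which is the property the abstract attributes to the MapReduce formulation. A real Apriori pass would also prune candidates using the frequent (k-1)-itemsets instead of enumerating every k-subset as this sketch does.
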
Keywords/Search Tags: MapReduce, Apriori, FP-growth, Privacy-Preserving