Font Size: a A A

Parallel Frequent Itemset Mining Optimization Algorithm Based On Spark

Posted on:2024-07-28Degree:MasterType:Thesis
Country:ChinaCandidate:B WuFull Text:PDF
GTID:2568307124474814Subject:Computer software and theory
Abstract/Summary:PDF Full Text Request
As one of the main means of data mining,frequent itemset mining is motivated to find the potential value in data,help enterprises and institutions to adjust the progress of process production,improve product quality,and make more effective strategies to meet customer needs.In recent years,with the wide application in many industries,the frequent itemset mining algorithm can show advantages when the data scale is small.When the amount of data reaches G level or even higher,these algorithms will become very inefficient due to the upper limit of storage and computing power,and are not suitable for mining massive data.Therefore,the idea of parallel computing is particularly important.By improving the frequent itemset mining algorithm and combining it with the distributed computing model,it has become the main direction of current research.At present,the proposed parallel frequent itemset mining algorithm based on Spark has solved the problem that traditional FP-Growth algorithm need to scan the database multiple times.Besides,its parallelization performance improves the performance of frequent itemsets mining.However,there are still the following questions of the algorithm: How to effectively improve the space-time efficiency in creating conditional FP-tree,how to significantly improve the efficiency of mining frequent items,and how to effectively improve the parallelization performance between nodes.In response to these problems,the main research work of this paper is as follows:Aiming at the problem of low space-time efficiency in creating conditional FP-tree,high communication overhead between nodes and redundant search in the FP growth algorithm based on spark,a parallel algorithm for mining frequent itemsets based on Spark was proposed,named FIMAFP-Spark.Firstly,a strategy of non-negative matrix factorization was proposed,which provided the query of support counts and decomposed the matrix of support counts,thereby solving the problem of low space-time efficiency in creating conditional FP-tree.Secondly,a grouping strategy based on genetic algorithm was proposed,which evenly grouped frequent 1 item sets to solve the problem of high communication overhead between nodes.Finally,an efficiently reduce tree structure strategy was proposed,which reduced the structure of FP-tree to solve the redundant search problem.The experimental results verify the feasibility of the FIMAFP-Spark algorithm,prove its performance advantage compared with other mining algorithms,and can effectively process frequent itemset mining of various data.Aiming at the problems of the support threshold cannot be dynamically set,high communication overhead between nodes and redundant search in the FP growth algorithm based on spark.A parallel algorithm for mining frequent itemsets based on Spark and genetic algorithm is proposed,named PAFMFI-Spark GA.Firstly,a strategy of optimization based on dynamic programming to obtain the optimal support threshold,thereby solving the problem of the support threshold cannot be dynamically set.Secondly,a grouping strategy based on load estimation function was proposed,which evenly groups frequent itemsets,reduces the FP-tree structure generated by each group,thereby solving the problem of high communication overhead between nodes.Finally,a pruning strategy of maximal hyperclique pattern was proposed to prune the FP-tree,which solves the problem of redundant search.The experimental results verify the feasibility of the PAFMFI-Spark GA algorithm,prove its performance advantage compared with other mining algorithms,and can effectively process frequent itemset mining of various data.
Keywords/Search Tags:big data, Spark framework, parallel mining frequent itemsets, FP-Growth algorithm
PDF Full Text Request
Related items