Parallel Frequent Itemset Mining Optimization Algorithm Based On Spark

Posted on:2024-07-28

Degree:Master

Type:Thesis

Country:China

Candidate:B Wu

Full Text:PDF

GTID:2568307124474814

Subject:Computer software and theory

Abstract/Summary:

PDF Full Text Request

As one of the main means of data mining,frequent itemset mining is motivated to find the potential value in data,help enterprises and institutions to adjust the progress of process production,improve product quality,and make more effective strategies to meet customer needs.In recent years,with the wide application in many industries,the frequent itemset mining algorithm can show advantages when the data scale is small.When the amount of data reaches G level or even higher,these algorithms will become very inefficient due to the upper limit of storage and computing power,and are not suitable for mining massive data.Therefore,the idea of parallel computing is particularly important.By improving the frequent itemset mining algorithm and combining it with the distributed computing model,it has become the main direction of current research.At present,the proposed parallel frequent itemset mining algorithm based on Spark has solved the problem that traditional FP-Growth algorithm need to scan the database multiple times.Besides,its parallelization performance improves the performance of frequent itemsets mining.However,there are still the following questions of the algorithm: How to effectively improve the space-time efficiency in creating conditional FP-tree,how to significantly improve the efficiency of mining frequent items,and how to effectively improve the parallelization performance between nodes.In response to these problems,the main research work of this paper is as follows:Aiming at the problem of low space-time efficiency in creating conditional FP-tree,high communication overhead between nodes and redundant search in the FP growth algorithm based on spark,a parallel algorithm for mining frequent itemsets based on Spark was proposed,named FIMAFP-Spark.Firstly,a strategy of non-negative matrix factorization was proposed,which provided the query of support counts and decomposed the matrix of support counts,thereby solving the problem of low space-time efficiency in creating conditional FP-tree.Secondly,a grouping strategy based on genetic algorithm was proposed,which evenly grouped frequent 1 item sets to solve the problem of high communication overhead between nodes.Finally,an efficiently reduce tree structure strategy was proposed,which reduced the structure of FP-tree to solve the redundant search problem.The experimental results verify the feasibility of the FIMAFP-Spark algorithm,prove its performance advantage compared with other mining algorithms,and can effectively process frequent itemset mining of various data.Aiming at the problems of the support threshold cannot be dynamically set,high communication overhead between nodes and redundant search in the FP growth algorithm based on spark.A parallel algorithm for mining frequent itemsets based on Spark and genetic algorithm is proposed,named PAFMFI-Spark GA.Firstly,a strategy of optimization based on dynamic programming to obtain the optimal support threshold,thereby solving the problem of the support threshold cannot be dynamically set.Secondly,a grouping strategy based on load estimation function was proposed,which evenly groups frequent itemsets,reduces the FP-tree structure generated by each group,thereby solving the problem of high communication overhead between nodes.Finally,a pruning strategy of maximal hyperclique pattern was proposed to prune the FP-tree,which solves the problem of redundant search.The experimental results verify the feasibility of the PAFMFI-Spark GA algorithm,prove its performance advantage compared with other mining algorithms,and can effectively process frequent itemset mining of various data.

Keywords/Search Tags:

big data, Spark framework, parallel mining frequent itemsets, FP-Growth algorithm

PDF Full Text Request

Related items

1	Research And Implementation On Efficient Parallel Frequent Itemsets Mining Algorithm Based On Spark
2	Research On Parallel Frequent Itemsets Mining Algorithm
3	Research On Frequent Itemsets Mining Parallel Algorithm
4	Research On Multi-stream Frequent Item Set Mining Algorithm
5	Research On Algorithm Of Mining Frequent Itemsets From Uncertaln Data
6	Research On Parallelization Of Frequent Itemsets Mining Algorithm
7	Research On Frequent Itemsets Mining Algorithm Based On FP-Tree
8	Research On Algorithm For Mining Frequent Itemsets Of Uncertain Data
9	The Research And Implementation Of Mining Frequent Itemsets Algorithm Over Streaming Data
10	Research On Key Algorithms For Mining Frequent Patterns In Data Streams And Their Application In Simulation System