Research And Implementation Of Parallelism Of FP-Growth Algorithm In Big Data Environment

Posted on:2024-01-21

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Gu

Full Text:PDF

GTID:2568307100489244

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

Traditional association rule mining algorithms provide good support for mining the association between data and providing people with predictive analysis and decision support.However,as human beings enter the era of big data,traditional serial data mining algorithms based on single-machine environments are unable to cope with massive data and real-time requirements.For this reason,in recent years,people have focused on the research of more efficient parallel data mining algorithms to meet the needs of the big data environment.By comparing two classic traditional association rule mining algorithms,Apriori and FP-Growth,the FP-Growth algorithm is more efficient,so people focus on the parallel improvement of the traditional FP-Growth algorithm superior.Through a large amount of literature reading and analysis research,this paper summarizes the current two mainstream technical routes for parallel improvement of the traditional FP-Growth algorithm.One is to improve FP-Growth,compress the size of the generated FP tree,and pass the method of parallel computing realizes the mining of frequent items;the second is to divide the transaction database and use a distributed environment for computing to improve the efficiency of mining.On the basis of in-depth analysis and research on typical algorithms in the above two technical routes,summarizing their respective advantages and disadvantages,this article conducts the following research:(1)An improved FP Growth algorithm based on projection database partitioning is proposed for the research of the two technical routes.Firstly,extract the transaction database based on each frequent item,and generate the corresponding projection database;Then,these projection databases are distributed to the node machines in the distributed operating environment.Each node machine utilizes the improved FP Growth algorithm to mine frequent itemsets with one frequent item,and finally summarizes and merges them to obtain all frequent itemsets.Due to the much smaller scale of the projection database compared to the original transaction database,the FP trees generated by these projection databases are also much smaller than the FP trees generated by the original transaction database.This to some extent solves the problem of not being able to store FP trees with large specifications due to insufficient resources such as single machine memory,which leads to algorithm failure.(2)Based on the characteristics of Map Reduce parallel programming,the algorithm was improved and an optimization algorithm based on load balancing was proposed.During the analysis of the experiment,it was further discovered that there was a load imbalance problem among each node machine in the mining process of FP trees generated by the projection database.In a distributed environment,the efficiency of parallel computing is closely related to load balancing.The improved algorithm based on projection database partitioning distributes the shadow database in order according to the frequent item header table.However,the scale of projection databases generated by different frequent 1 item is inconsistent,resulting in uneven distribution of computing resources,i.e.load,in the node machine.This article proposes two optimization algorithms for balancing load.One is to sort the sub databases in descending order,and distribute them in a circular pattern from large to small and then from small to large;The second is to select a load assessment model to calculate the load,fix the distribution resources of the sub node machines,and distribute the load as fairly as possible on this basis.This article uses the Map Reduce programming model of Hadoop distributed platform to implement the above algorithms,and demonstrates the feasibility and effectiveness of the algorithm through experiments.The above algorithms and similar algorithms mentioned in the article were tested using real datasets.The experimental results show that the improved algorithm based on projection database partitioning and load balancing optimization algorithm proposed in this paper further improve operational efficiency,providing a choice for the application of risk prediction and data decision-making.

Keywords/Search Tags:

parallel computing, big data, FP-Growth, Projection, load balancing

PDF Full Text Request

Related items

1	Load Balancing Problems For Parallel And Distributed Computing
2	Load Balancing In Parallel Computing Based On LAN
3	Mpich-based Parallel Computing System, Load Balancing Technology
4	Research And Implementation Of Dynamic Load-balancing Method Under Parallel Computing
5	Design And Implementation Of A Load Balancing System Based On PVM
6	Research Of Dynamic Load Balancing Model On The Parallel Computing Platform For Simulation Of Ion Transmission
7	Research Of Load Balancing Algorithm Based On Parallel Computing
8	Research And Design On An Efficient Load Balancing Algorithm
9	The Design And Implementation Of Parallel Computing Platform Based On MapReduce
10	Parallel Database System Load Balancing Technology