Font Size: a A A

Research And Implementation Of Parallelism Of FP-Growth Algorithm In Big Data Environment

Posted on:2024-01-21Degree:MasterType:Thesis
Country:ChinaCandidate:S Y GuFull Text:PDF
GTID:2568307100489244Subject:Computer technology
Abstract/Summary:PDF Full Text Request
Traditional association rule mining algorithms provide good support for mining the association between data and providing people with predictive analysis and decision support.However,as human beings enter the era of big data,traditional serial data mining algorithms based on single-machine environments are unable to cope with massive data and real-time requirements.For this reason,in recent years,people have focused on the research of more efficient parallel data mining algorithms to meet the needs of the big data environment.By comparing two classic traditional association rule mining algorithms,Apriori and FP-Growth,the FP-Growth algorithm is more efficient,so people focus on the parallel improvement of the traditional FP-Growth algorithm superior.Through a large amount of literature reading and analysis research,this paper summarizes the current two mainstream technical routes for parallel improvement of the traditional FP-Growth algorithm.One is to improve FP-Growth,compress the size of the generated FP tree,and pass the method of parallel computing realizes the mining of frequent items;the second is to divide the transaction database and use a distributed environment for computing to improve the efficiency of mining.On the basis of in-depth analysis and research on typical algorithms in the above two technical routes,summarizing their respective advantages and disadvantages,this article conducts the following research:(1)An improved FP Growth algorithm based on projection database partitioning is proposed for the research of the two technical routes.Firstly,extract the transaction database based on each frequent item,and generate the corresponding projection database;Then,these projection databases are distributed to the node machines in the distributed operating environment.Each node machine utilizes the improved FP Growth algorithm to mine frequent itemsets with one frequent item,and finally summarizes and merges them to obtain all frequent itemsets.Due to the much smaller scale of the projection database compared to the original transaction database,the FP trees generated by these projection databases are also much smaller than the FP trees generated by the original transaction database.This to some extent solves the problem of not being able to store FP trees with large specifications due to insufficient resources such as single machine memory,which leads to algorithm failure.(2)Based on the characteristics of Map Reduce parallel programming,the algorithm was improved and an optimization algorithm based on load balancing was proposed.During the analysis of the experiment,it was further discovered that there was a load imbalance problem among each node machine in the mining process of FP trees generated by the projection database.In a distributed environment,the efficiency of parallel computing is closely related to load balancing.The improved algorithm based on projection database partitioning distributes the shadow database in order according to the frequent item header table.However,the scale of projection databases generated by different frequent 1 item is inconsistent,resulting in uneven distribution of computing resources,i.e.load,in the node machine.This article proposes two optimization algorithms for balancing load.One is to sort the sub databases in descending order,and distribute them in a circular pattern from large to small and then from small to large;The second is to select a load assessment model to calculate the load,fix the distribution resources of the sub node machines,and distribute the load as fairly as possible on this basis.This article uses the Map Reduce programming model of Hadoop distributed platform to implement the above algorithms,and demonstrates the feasibility and effectiveness of the algorithm through experiments.The above algorithms and similar algorithms mentioned in the article were tested using real datasets.The experimental results show that the improved algorithm based on projection database partitioning and load balancing optimization algorithm proposed in this paper further improve operational efficiency,providing a choice for the application of risk prediction and data decision-making.
Keywords/Search Tags:parallel computing, big data, FP-Growth, Projection, load balancing
PDF Full Text Request
Related items