Parallel Research Of GSP Algorithm Based On Spark

Posted on:2018-10-21

Degree:Master

Type:Thesis

Country:China

Candidate:S Y Wu

Full Text:PDF

GTID:2428330569485422

Subject:Computer technology

Abstract/Summary:

PDF Full Text Request

In the era of information explosion,information companies need to formulate future palnning,faced with the challenge how to quickly and efficiently dig out useful information in the low cost way from a large amount of information,for this difficult problem data mining techology has been seeking solution.As one of the main research contents in the field of data mining,sequence pattern mining has a wide range of application requirements.The traditional serial sequence pattern mining algorithm has a good performance when dealing with a small amount of data.However,when the amount of data is large,the computing power of these algorithms is difficult to meet people's demand.The parallel sequence pattern mining algorithm based on Hadoop platform has the problem of load imbalance and excessive IO overhead.After the deep analyzing the process and the performance of classical sequence pattern mining GSP algorithm(Generalized Sequential Pattern mining algorithm),and understanding the characteristics of the Spark cloud computing platform,a parallel sequence pattern mining algorithm based on Spark-GSP-S algorithm(GSP Algorithm Based on Spark)is proposed.In the GSP-S algorithm,multiple MapReduce jobs are implemented to complete the mining task.In order to balance the load between nodes in Spark platform,the GSP-S algorithm divides the original sequence database into n database slices of the same size.To fully reduce IO overhead,the first MapReduce job loads the sequence database partitions from HDFS into the Spark RDDs and the further MapReduce jobs read the partitions from RDDs and store intermediate results into RDDs.Using real data sets and synthetic data sets,the GSP-S algorithm is compared with the sequence pattern mining algorithms based on Hadoop in terms of data scalability and cluster scalability.In the experiment environment,the results show,compared with the parallel sequence pattern mining algorithms based on Hadoop,in aspect of data scalability,the runtime of GSP-S algorithm is reduced by an average of 75%;in aspect of cluster scalability,the runtime of GSP-S algorithm is reduced by an average of 50%;at the same time,according to the proposed load balancing strategy,the runtime of GSP-S algorithm is reduced by about 12%.Therefore,GSP-S algorithm has better data scalability and cluster scalability,but also has better load balancing and fault tolerance.

Keywords/Search Tags:

Cloud computing, Parallel algorithm, Sequence pattern mining, Spark, GSP algorithm

PDF Full Text Request

Related items

1	Research On Parallelization Of Data Mining Algorithm Based On Distributed Platforms Spark And YARN
2	Pattern Mining Algorithm On Cloud Computing Platform
3	Research On Spark-based Parallel Contrast Pattern Mining Algorithm And Load Balancing
4	Parallel Data Mining Algorithm Research In Cloud
5	Research On Parallel Mining Algorithm Of Association Pattern Based On Spark
6	Parallel Research And Application Of Machine Learning Algorithm Based On Cloud Platform
7	Research And Design Of Data Mining System For Tcm Disease Based On Cloud Computing Environment
8	Research On Frequent Itemset Mining Algorithm And Its Parallelization Based On Spark
9	Research And Implementation Of Parallel Data Mining Algorithms Based On Cloud Computing
10	High Frequency And Low Utility Pattern Mining Algorithm And Its Implementation On Cloud Computing