Font Size: a A A

Parallel Research Of GSP Algorithm Based On Spark

Posted on:2018-10-21Degree:MasterType:Thesis
Country:ChinaCandidate:S Y WuFull Text:PDF
GTID:2428330569485422Subject:Computer technology
Abstract/Summary:PDF Full Text Request
In the era of information explosion,information companies need to formulate future palnning,faced with the challenge how to quickly and efficiently dig out useful information in the low cost way from a large amount of information,for this difficult problem data mining techology has been seeking solution.As one of the main research contents in the field of data mining,sequence pattern mining has a wide range of application requirements.The traditional serial sequence pattern mining algorithm has a good performance when dealing with a small amount of data.However,when the amount of data is large,the computing power of these algorithms is difficult to meet people's demand.The parallel sequence pattern mining algorithm based on Hadoop platform has the problem of load imbalance and excessive IO overhead.After the deep analyzing the process and the performance of classical sequence pattern mining GSP algorithm(Generalized Sequential Pattern mining algorithm),and understanding the characteristics of the Spark cloud computing platform,a parallel sequence pattern mining algorithm based on Spark-GSP-S algorithm(GSP Algorithm Based on Spark)is proposed.In the GSP-S algorithm,multiple MapReduce jobs are implemented to complete the mining task.In order to balance the load between nodes in Spark platform,the GSP-S algorithm divides the original sequence database into n database slices of the same size.To fully reduce IO overhead,the first MapReduce job loads the sequence database partitions from HDFS into the Spark RDDs and the further MapReduce jobs read the partitions from RDDs and store intermediate results into RDDs.Using real data sets and synthetic data sets,the GSP-S algorithm is compared with the sequence pattern mining algorithms based on Hadoop in terms of data scalability and cluster scalability.In the experiment environment,the results show,compared with the parallel sequence pattern mining algorithms based on Hadoop,in aspect of data scalability,the runtime of GSP-S algorithm is reduced by an average of 75%;in aspect of cluster scalability,the runtime of GSP-S algorithm is reduced by an average of 50%;at the same time,according to the proposed load balancing strategy,the runtime of GSP-S algorithm is reduced by about 12%.Therefore,GSP-S algorithm has better data scalability and cluster scalability,but also has better load balancing and fault tolerance.
Keywords/Search Tags:Cloud computing, Parallel algorithm, Sequence pattern mining, Spark, GSP algorithm
PDF Full Text Request
Related items