| Frequent patterns of browsing behavior describe users' browsing patterns and preferences: set frequent patterns reflect the relevance between visited pages, sequence frequent patterns describe users' frequent visiting paths, and regular frequent patterns describe the semantic information of user access behavior. These patterns can be used for browsing behavior prediction, website structure optimization, and page recommendation, which improves user experience and increases the stickiness of the system. This thesis studies the horizontal scalability of frequent pattern mining algorithms for analyzing massive log data, focusing on load balancing for the pattern growth algorithm and on frequency verification of candidate sequences for the AprioriAll algorithm in a distributed environment. The main contents are: 1. Load-balanced mining of set frequent patterns: the distributed design of the FP-Growth algorithm is studied. First, the relationship between the conditional pattern tree and the mining load is established; this relationship is then used to design a distribution strategy that solves the single-point storage bottleneck in distributed mining. Finally, an approximately load-balanced distributed FP-Growth algorithm based on Spark is proposed to achieve load-balanced mining of set frequent patterns. 2. Distributed mining of sequence frequent patterns: the distributed design of the AprioriAll algorithm for sequence frequent pattern mining is studied. First, a persistence operator is used to cache reusable intermediate results, reducing disk I/O. Second, the generation of frequent 2-sequences in AprioriAll is improved: the PairWise method replaces the self-join among frequent 1-sequences, which avoids the high time and space cost of generating frequent 2-sequences from a large set of frequent 1-sequences. 3. Distributed mining of regular frequent patterns: first, web pages are classified using a parent-child hierarchical semantic system, and each browsed page sequence is converted into a page-type sequence, on which the regular frequent pattern is defined. Regular frequent patterns describe the semantic information of user access behavior and can be mined with the scalable Spark-based AprioriAll algorithm. 4. System prototype design and algorithm performance testing: first, a prototype of a Spark-based frequent browsing pattern mining system is designed. Then, comparative experiments on a real e-commerce website log data set verify the accuracy, speed, and scalability of the distributed frequent browsing pattern mining algorithms proposed in this thesis. |
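
The abstract does not define the PairWise method, so the following is only a minimal sketch under an assumption: that it means emitting ordered page pairs directly from each browsing session and counting them, instead of self-joining the frequent 1-sequences and verifying every candidate. It is plain Python rather than Spark, and the function name `frequent_2_sequences`, the sample sessions, and the support threshold are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_2_sequences(sessions, min_support):
    """Count ordered page pairs per session (hypothetical helper, not the thesis code)."""
    counts = Counter()
    for pages in sessions:
        # combinations() keeps the input order, so (a, b) means a was visited before b.
        # The set ensures each pair is counted at most once per session.
        pairs = {(a, b) for a, b in combinations(pages, 2)}
        counts.update(pairs)
    # Keep only the pairs that reach the minimum support.
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Illustrative sessions (made-up data).
sessions = [
    ["home", "list", "item"],
    ["home", "item", "cart"],
    ["list", "item", "cart"],
]
result = frequent_2_sequences(sessions, min_support=2)
```

Under this assumption, only pairs that actually occur in some session are ever materialized, which is one plausible way to avoid the quadratic candidate set of a self-join. In a Spark setting the same idea would map roughly onto a `flatMap` over sessions followed by `reduceByKey`.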