| Sequential pattern mining is an important research direction in data mining; it is used to mine frequent patterns from sequential data. However, when dealing with large-scale transaction sets, traditional sequential pattern mining algorithms suffer from problems such as frequent database reads and wasted time and space resources. To solve these problems, this thesis proposes an improved parallel sequential pattern mining algorithm, EPSPM (Enhanced PrefixSpan Sequential Pattern Mining Algorithm). The EPSPM algorithm consists of three main processes: data partitioning, support counting, and candidate set generation. In the data partitioning process, the algorithm divides the transaction set into multiple blocks, performs local pattern mining and intra-block pruning within each block, and implements dynamic load balancing on the Spark distributed framework. In the support counting process, EPSPM employs a support counting method based on the parallel MapReduce framework to reduce data communication overhead. In the projected database generation process, EPSPM reuses the MapReduce results of the previous operations, so that it does not have to project every pattern, which reduces the number of database scans. The experimental results show that EPSPM effectively reduces both the number of database scans and the communication overhead, and that the Spark-based EPSPM algorithm outperforms the Hadoop-based PrefixSpan algorithm: on the same dataset with the same support threshold, EPSPM takes significantly less time, reducing the runtime by 59% on average. In addition, this thesis applies the EPSPM algorithm to text classification and constructs a classification method based on sequential patterns to classify news texts. Traditional text classification methods usually use word-, sentence-, and paragraph-level features, but these features do not take into account the relationships between different elements of the text. The method in this thesis extracts word combinations with temporal relationships from the text and fully considers the temporal relationships among them. The experimental results show that, for five categories of text (sports, entertainment, home, lottery, and real estate), the text classification method using EPSPM is significantly better than machine learning classification methods such as logistic regression, KNN, and support vector machines, and that it classifies well on data sets of different sizes. |
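
To make the ideas of block-wise support counting and prefix projection concrete, the following is a minimal sketch in plain Python. It is not the EPSPM implementation: it shows ordinary PrefixSpan on sequences of single items, it simulates the map and reduce steps in one process instead of on Spark, and it omits load balancing and intra-block pruning. All function names (`partition`, `map_count`, `reduce_counts`, `project`, `prefixspan`) and the `n_blocks` parameter are hypothetical and chosen only for illustration.

```python
from collections import Counter
from functools import reduce


def partition(db, n_blocks):
    """Split the sequence database into n_blocks roughly equal blocks."""
    return [db[i::n_blocks] for i in range(n_blocks)]


def map_count(block):
    """Map step: per block, count how many sequences contain each item."""
    counts = Counter()
    for seq in block:
        counts.update(set(seq))  # a sequence contributes at most 1 per item
    return counts


def reduce_counts(a, b):
    """Reduce step: merge two partial count tables."""
    return a + b


def project(db, item):
    """Projected database: the suffix after the first occurrence of item."""
    return [seq[seq.index(item) + 1:] for seq in db if item in seq]


def prefixspan(db, min_support, n_blocks=2, prefix=()):
    """Return {pattern: support} for every frequent sequential pattern."""
    partials = [map_count(block) for block in partition(db, n_blocks)]
    support = reduce(reduce_counts, partials, Counter())
    patterns = {}
    for item, count in sorted(support.items()):
        if count < min_support:
            continue  # prune infrequent items before projecting
        pattern = prefix + (item,)
        patterns[pattern] = count
        # Recurse only into the projected database of a frequent prefix.
        patterns.update(
            prefixspan(project(db, item), min_support, n_blocks, pattern)
        )
    return patterns
```

For example, given the four sequences `a b c`, `a c`, `b c`, and `a b` with a minimum support of 2, the sketch reports `(a, b)`, `(a, c)`, and `(b, c)` as the frequent two-item patterns. In a text classification setting, each sequence could be the ordered list of words in a document, so that frequent patterns correspond to ordered word combinations.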