Research On Text Classification Based On Parallel Frequent Patterns Mining Algorithm

Posted on:2024-07-15

Degree:Master

Type:Thesis

Country:China

Candidate:S J Zhang

Full Text:PDF

GTID:2568307091996999

Subject:Software engineering

Abstract/Summary:

Sequential pattern mining is an important research direction in data mining; it is used to mine frequent patterns from sequential data. However, when dealing with large-scale transaction sets, traditional sequential pattern mining algorithms suffer from problems such as frequent database reads and wasted time and space resources. To solve these problems, this thesis proposes an improved parallel sequential pattern mining algorithm, EPSPM (Enhanced PrefixSpan Sequential Pattern Mining Algorithm). The EPSPM algorithm consists of three main processes: data partitioning, support counting, and candidate set generation. In the data partitioning process, the algorithm divides the transaction set into multiple blocks, performs local pattern mining and intra-block pruning within each block, and implements dynamic load balancing on the Spark distributed framework. In the support counting process, EPSPM employs a support counting method based on the parallel MapReduce framework to reduce data communication overhead. In the projection database generation process, EPSPM reuses the MapReduce results from the previous operations to avoid projecting all patterns, thereby reducing the number of database scans. The experimental results show that EPSPM effectively reduces the number of database scans and the communication overhead, and that EPSPM on the Spark distributed framework performs better than the Hadoop-based PrefixSpan algorithm: for the same dataset and the same support threshold, it reduces the runtime by an average of 59%.

In addition, this thesis applies the EPSPM algorithm to text classification and constructs a text classification method based on sequence patterns to classify news texts. Traditional text classification methods usually use word-, sentence-, and paragraph-level features for classification, but these features do not account for the relationships between different elements in the text. The method in this thesis extracts word combinations with temporal relationships from the text and takes those temporal relationships fully into account. The experimental results show that, for five categories of text (sports, entertainment, home, lottery, and real estate), the text classification method using EPSPM is significantly better than machine learning classification methods such as logistic regression, KNN, and support vector machine, and that it achieves good classification results on data sets of different sizes.
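
For readers unfamiliar with the underlying technique, the following is a minimal sketch of the classic, single-machine PrefixSpan idea that EPSPM builds on. It is not the thesis's EPSPM algorithm: it has no data partitioning, no Spark or MapReduce stage, and no load balancing. It also assumes each sequence is a plain list of single items (for example, the words of a news sentence). The function name `prefixspan`, the parameter `min_support`, and the toy documents are hypothetical and chosen only for illustration.

```python
from collections import Counter


def prefixspan(sequences, min_support):
    """Return {pattern: support} for every frequent sequential pattern.

    Assumption: each sequence is a list of single items (e.g. words).
    Support is the number of sequences that contain the pattern as an
    ordered, not necessarily contiguous, subsequence.
    """
    results = {}

    def mine(prefix, projected):
        # `projected` is the projected database for `prefix`: pairs of
        # (sequence index, offset just after the first match of prefix).
        counts = Counter()
        for seq_id, start in projected:
            # Count each item at most once per sequence suffix.
            counts.update(set(sequences[seq_id][start:]))

        for item, support in counts.items():
            if support < min_support:
                continue  # prune infrequent extensions
            pattern = prefix + (item,)
            results[pattern] = support

            # Project each suffix on the first occurrence of `item`.
            next_projected = []
            for seq_id, start in projected:
                try:
                    pos = sequences[seq_id].index(item, start)
                except ValueError:
                    continue  # this suffix does not contain the item
                next_projected.append((seq_id, pos + 1))
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results


# Toy word sequences (hypothetical data, not from the thesis).
docs = [
    ["team", "wins", "match"],
    ["team", "loses", "match"],
    ["team", "wins", "final"],
]
patterns = prefixspan(docs, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(pattern, support)
```

On this toy input, the ordered word pair ("team", "wins") is frequent with support 2, while ("wins", "match") occurs in only one sequence and is pruned. Ordered patterns of this kind are what a sequence-pattern-based classifier could use as features in place of unordered word counts.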

Keywords/Search Tags:

Sequence Patterns, Text Classification, PrefixSpan, EPSPM

Related items

1	Research On Web Pattern Mining Method Based On PrefixSpan Algorithm
2	Researching Text Classification Using Semantic And Sequence Information
3	Application Of The Improved PrefixSpan Algorithm In Intrusion Detection
4	Contributions To Several Key Issues Of Associative Text Classification
5	Research On Attention-based Model For Sequence Classification
6	Improved Classification And Mining Algorithms And Application In Industrial Operation Safety Monitoring
7	Data-driven Alarm Flood Analysis Method Based On Improved Prefixspan
8	Research On Classification Of Short Text Sequences With Multi-Views Based On Semantic Representation
9	Research On Location-based Relationship Labeling Model In Social Networks
10	A Research On Automatic Web Text Classification Technology