
Sequential Pattern Mining With General Gap And Its Application In Keyphrase Extraction

Posted on: 2018-04-18
Degree: Master
Type: Thesis
Country: China
Candidate: Z Z Liu
Full Text: PDF
GTID: 2348330515492808
Subject: Computer application technology
Abstract/Summary:
With the arrival of the big data era, a huge amount of sequence data is being generated in real-world applications. Abundant and valuable information is hidden in this data, and how to extract it from huge sequence databases has become both a hot topic and a difficult problem in current research. Sequential pattern matching counts the occurrences of a pattern in a sequence database, while pattern mining discovers the frequent patterns in the database. Both have become very important research issues in the field of data mining. However, most research focuses on pattern matching with non-negative gaps, which strictly constrains where each character of the pattern may occur in a sequence. This limits the flexibility and reduces the practical value of pattern matching. Research on general gaps and the one-off condition therefore has not only theoretical value but also practical value in bioinformatics and text mining. To increase the flexibility of matching, and considering that the one-off condition is the more reasonable choice in sequential pattern mining, this thesis studies the pattern matching problem under general gaps and the one-off condition.

Key phrase extraction is one of the key topics in text data mining. Key phrases summarize a document, and high-quality key phrases are of great importance for text summarization, reading, and indexing. However, existing key phrase extraction research places strict limits on the patterns that can be extracted and is unable to capture the semantic relations between words and phrases. As a result, key phrases cannot be extracted autonomously.

This thesis studies sequential pattern matching with general gaps and the one-off condition, sequential pattern mining, and their application to key phrase extraction. The main content is divided into three parts: (1) the design and analysis of a sequential pattern matching algorithm with general gaps and the one-off condition; (2) on the basis of sequential pattern matching, a study of sequential pattern mining with general gaps and the one-off condition; (3) the application of the sequential pattern mining algorithm with general gaps and the one-off condition to text data mining, where mining the semantic relations between words makes it possible to extract keywords.

The main contributions of this thesis are as follows:

(1) In the study of sequential pattern matching, this thesis proposes sequential pattern matching with general gaps and the one-off condition (SPMGOO), which adds the one-off condition and allows a character to occur at any position in a sequence. The SPMGOO problem is proved to be NP-hard. This thesis is the first to use a linear table to solve the SPMGOO problem. It is also the first to analyze, during pattern matching, the structure of the pattern string and the frequency of each character in it, and to decide on that basis whether the sequence and the pattern need to be transposed, which puts them in the best state for matching.

(2) In the research on sequential pattern matching, this thesis proposes the maximum sequential pattern matching with one-off and general gaps condition algorithm (MSAING). First, MSAING uses a reverse strategy to decide whether the sequence and the pattern need to be transposed. Second, MSAING uses a linear table structure to match the pattern. The matching is divided into a location phase, a forward phase, and a backward phase, which greatly reduces time and memory consumption and significantly improves the probability of a successful match. Finally, to further improve the efficiency of the algorithm, the inside_Checking mechanism determines whether internal repetition exists in the pattern and, if it does, finds where it occurs. It is proven theoretically that MSAING is better than other algorithms in completeness and that it obtains the complete solution for patterns without repetition. Experimental results on real biological datasets show that MSAING has higher accuracy and lower complexity than other algorithms. The meaning of the experimental results is also analyzed.

(3) In the research on sequential pattern mining, this thesis proposes the sequential pattern mining with one-off and general gaps condition algorithm (SPING). It can not only obtain discontinuous sequential patterns but also mine reversed frequent patterns, which improves the flexibility of pattern mining. SPING mines more useful information by computing a more complete set of occurrences of each pattern in a sequence. Experiments on biological sequences show the effectiveness of the proposed algorithm compared with other algorithms.

(4) In the research on key phrase extraction, this thesis proposes key phrase extraction using the sequential pattern mining with one-off and general gaps condition algorithm (SPING). By taking the one-off condition and general gaps into account, SPING captures the semantic relations between words and phrases more effectively. SPING can therefore obtain effective candidate key phrases and compute their feature values. A supervised machine learning method is then used to train on these features and construct a classification model, which is used to extract key phrases. Experimental results demonstrate that SPING effectively obtains high-quality key phrases.
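
To make the two constraints concrete, the following minimal Python sketch counts non-overlapping occurrences of a pattern under a signed offset range and the one-off condition. It is not SPMGOO, MSAING, or SPING: it is a plain greedy search, so its count is only a lower bound on the maximum. The function names and the `lo`/`hi` parameters are hypothetical, and the constraint is assumed to be expressed as a signed offset between consecutive matched positions, which may differ from the gap definition used in the thesis.

```python
def find_occurrence(seq, pattern, lo, hi, used):
    """Depth-first search for one occurrence that avoids positions in `used`.

    Assumption: consecutive matched positions p, q must satisfy
    lo <= q - p <= hi with q != p. A negative `lo` lets the next pattern
    character lie *before* the previous one (a "general" gap).
    """
    def extend(j, prev, chosen):
        if j == len(pattern):
            return chosen
        if prev is None:
            candidates = range(len(seq))
        else:
            candidates = range(max(0, prev + lo), min(len(seq), prev + hi + 1))
        for q in candidates:
            # One-off condition: a sequence position is used at most once,
            # both across occurrences (`used`) and within one (`chosen`).
            if q in used or q in chosen or seq[q] != pattern[j]:
                continue
            found = extend(j + 1, q, chosen + [q])
            if found is not None:
                return found
        return None

    return extend(0, None, [])


def greedy_one_off_matches(seq, pattern, lo, hi):
    """Greedily collect occurrences that share no sequence position."""
    if not pattern:
        return []
    used = set()
    occurrences = []
    while True:
        occ = find_occurrence(seq, pattern, lo, hi, used)
        if occ is None:
            break
        used.update(occ)
        occurrences.append(occ)
    return occurrences


# Negative offsets allow "ab" to match in "ba": 'a' at index 1, 'b' at index 0.
print(greedy_one_off_matches("ba", "ab", -1, 1))
# Under the one-off condition, "aab" holds only one occurrence of "ab",
# because both candidate occurrences would need the single 'b'.
print(greedy_one_off_matches("aab", "ab", 1, 2))
```

Because the maximum-matching problem is stated above to be NP-hard, a greedy pass like this can miss occurrences that a different choice of positions would allow; the heuristics described in the abstract aim to close that gap.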
Keywords/Search Tags: general gap, one-off condition, sequential pattern matching, sequential pattern mining, keyphrase extraction