Font Size: a A A

Application Research Of Multi-Pattern String Matching WM Algorithm Based On MPI

Posted on:2022-12-27Degree:MasterType:Thesis
Country:ChinaCandidate:W H LuFull Text:PDF
GTID:2480306614459004Subject:Computer Software and Application of Computer
Abstract/Summary:PDF Full Text Request
COVID-19 coronavirus is a large-scale infectious respiratory disease that began spreading globally in 2019.COVID-19 is not only having a generally harmful impact on our lives,our daily habits and our healthcare system,but it is also killing us from infection.Whenever a new virus is studied,its genes are extracted and then searched in databases for the presence of homologous genes.With the development of gene sequencing technology,string matching algorithm,as a traditional algorithm,has been greatly challenged.Studies show that a large number of short read fragments occur in gene sequencing,so higher requirements on the efficiency and time of string matching algorithm are proposed.Accurate pattern matching and approximate alignment are two key components in gene sequencing.BLAST,currently popular,is an approximate alignment algorithm,but specific pattern strings need to be precisely matched to find specific expressed genes in viral organisms.In order to make pattern matching adapt to large and even explosive gene sequence data,it is necessary to find an efficient and stable algorithm.Multi-pattern matching WM algorithm has a good effect on multi-pattern accurate matching.In this paper,the application research of multi-pattern string matching WM algorithm based on MPI is mainly combined with the characteristics of gene sequence to put forward a pre-processing method based on N-Gram model.In the matching process,bloom filter is used to reduce the number of matches.By using MPI's non-blocking communication technology,the time of matching calculation and communication overlaps greatly.The main process is that the main process n-gram the text and generate the data structure required by WM algorithm for the pattern string,and then send these data structures and text task segments to the sub-process,the sub-process receives the processed task text and WM data structure required by WM algorithm,and can carry out the matching task.After the main process allocates tasks,the waiting time can be matched at the same time,and the matching results can be directly output after each process is matched.The experiment is divided into five parts.The first part is the influence of different number of processes on the parallel algorithm.The second is the effect of gene sequence text task segmentation on algorithm;The third is the effect of gene text size on algorithm;The fourth is the influence of the shortest length of pattern string on the algorithm;The fifth is the comparison experiment with the conventional WM algorithm and AC algorithm.The experimental results show that the number of processes is not the more the better,and the maximum efficiency is achieved when 12 processes are selected,which is twice the number of multiple cores.When a task is executed in chunks,it is most efficient to execute the task in two chunks,because this allows you to take full advantage of MPI's non-blocking communication.Moreover,the proposed algorithm performs better than conventional WM algorithm and AC algorithm when the text length of gene sequence varies greatly.In addition,the WM algorithm's low efficiency caused by the interference of the shortest length in the pattern string set is avoided in the gene sequence because the length of the gene sequence is generally long,so the performance of the algorithm is obviously better,and the optimal acceleration effect can reach 31 times.
Keywords/Search Tags:Message Passing Interface, N-gram, Wu-Manber, gene sequence
PDF Full Text Request
Related items