Font Size: a A A

The Design And Implementation Of A Multiple Sequence Alignment Algorithm Based On Suffix Tree Strategy

Posted on:2018-03-17Degree:MasterType:Thesis
Country:ChinaCandidate:W H SuFull Text:PDF
GTID:2370330623950666Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Multiple sequence alignment is the alignment among more than two DNA,RNA or protein sequences.It is one of the most useful method in genome and proteome analysis,which could help people find the similarity information among sequences.As the basis of other algorithms,multiple sequence alignment software needs to align billions of sequences in very short time to meet the need of other analysis tools on performance and throughput.Most of the current software are not competent.In this paper,we carry on research on both algorithm and parallel technology.We design and speed up the multiple alignment by optimizing the algorithm by suffix tree data structure and carrying on parallelization.And the Spark distributed parallel framework is used to improve the performance and throughput.The detailed work is as follows:1.A pairwise alignment algorithm based on suffix tree is designed,which is an improvement on dynamic programming algorithm.Aiming at the properties of highly similar sequences,the algorithm speeds up the process of pairwise alignment by suffix tree which is a powerful tool on string processing.It can find all of the common substrings with scanning the sequence only once,which decreases the length of substrings that need to be aligned by Needleman-Wunsch algorithm.In this way,the time complexity of pairwise alignment is decreased.2.A new algorithm MASC is designed which is based on center-star strategy,a high performance algorithm,and suffix tree.However,the preprocessing of center-star strategy is very long.Aiming at this shortcoming,we accelerate it by selecting center sequence randomly.The experiment shows that there is no lose in accuracy when aligning homologous sequences.3.MASC is implemented on Spark framework as MASC-Spark.By organizing and scheduling RDD and running process to maximize the performance of Spark,we make further efforts to accelerate MASC and improve the throughput.The experiments show that the program has good scalability.
Keywords/Search Tags:multiple sequence alignment, suffix tree, center-star strategy, Spark
PDF Full Text Request
Related items