Font Size: a A A

Research On Large-scale Biological Sequence Alignment Algorithms And Their Parallelization

Posted on:2015-11-20Degree:DoctorType:Dissertation
Country:ChinaCandidate:X Y ZhuFull Text:PDF
GTID:1360330488498754Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Bioinformatics is a new discipline which combines life science with computer science.It is one of the hottest topics in today's scientific research.Biological sequence alignment is a fundamental research in bioinformatics.It not only provides a wealth of information related to the evolutionary relationships,conserved motifs identification,and structure prediction for RNA and proteins,but also a basic tool for biological conservation,disease control,and many other researches.In the high-throughput era,with the development of sequencing and assembly techniques,the scale of biological sequence alignment has been changed significantly.In the last few decades,merely aligning all the known orthologs of a given gene requires aligning several thousands sequences.While in the future decades multiple comparison may be required to align billions of closely related sequences.Due to infeasible execution time,some multiple sequence alignment algorithms are not suitable for handling large-scale sequences.On the other hand,with the development of high performance computing platforms,including multi-core computers,clusters,heterogeneous architecture which combines CPU with GPU,it poses significant opportunities and challenges for handling large-scale data quickly and efficiently.Multiple sequence aligment needs to support high performance computing urgently due to its wide use,highly computational complexity,and massive data.This paper investigates large-scale biological sequence alignment and its parallel processing technologies based on multi-core computers and heterogeneous systems.The main work and contributions are as follows.Firstly,in order to improve the scalability of the existing sequence alignment algorithms,a divide-and-conquer based parallel strategy(DCPA)for multiple sequence alignment is presented.DCPA works by dividing the large-scale alignment problem into smaller and more tractable sub-problems which can be solved by the existing algorithms.It is capable of handling large-scale sequence alignment on multi-core computers.We assess the performance of DCPA using the classical benchmarks and artificially generated test sets.DCPA achieves up to 81-fold improvements in execution time compared to MUSCLE,while maintaining comparable alignment accuracy.Secondly,in order to solve the problem of accuracy degradation,we further study sequence set partition method,and then propose a data parallel strategy(CDAM)for aligning multiple biological sequences on multi-core computers.CDAM is capable of aligning large-scale sequences by applying cluster method to partition sequence set,designing a longest processing time first heuristic algorithm to distribute sequence subsets,and proposing a progressive profile merging strategy to get the large-scale sequence alignment results.We apply five different clustering algorithms which are CD-hit,UCLUST,SiLiX,CLUSS and BLASTClust in CDAM.The test results clearly show that different clustering approach affects CDAM in terms of alignment accuracy and speed,among which CDAM(UCLUST)and CDAM(CD-hit)have the best overall performance.CDAM(UCLUST)and CDAM(CD-hit)achieve 151-fold and 111-fold improvements in execution time compared to MUSCLE,while losing 2.19%and 2.87%alignment accuracy on average respectively.Thirdly,in order to accelerate the most accurate option of MAFFT,we present a parallel implementation of MAFFT on heterogeneous systems which combine CPU with GPU.A method is proposed to achieve load balancing by sequence data transformation.Many memory optimization methods,including a fully coalesced sequence accessing,similarity matrix storing and accessing,and score matrix computing and compressed storing,are developed to improve the performance of actual systems because of memory shortage.Based on a memory pre-allocation and reuse strategy,a coarse-grained parallel algorithm for large-scale sequence alignment is proposed.Our implementation tested in three NVIDIA GPUs achieves speedup up to 56.7 and 7.1 on a Tesla K20m GPU compared to the sequential and multi-thread MAFFT 7.017 respectively,maintaining the same accuracy with MAFFT.Fourthly,a new chemical reaction optimization based multiple sequence aligment method(CROMSA)is presented.To improve the alignment accuracy,the methods of molecular coding,population initialization,objective function,molecular reactions including on-wall ineffective collision,decomposition,synthesis,and inter-molecular ineffective collision are designed.We assess the alignment quality and computational complexity of CROMSA using four classical benchmarks.The experiment results show that CROMSA achieves better accuracy than DCPA,CDAM(Cd-hit),and CDAM(UCLUST)which are proposed in the previous chapters.CROMSA has longer execution time because it takes much time to optimize the alignment results.CROMSA outperforms some popular alignment algorithms such as ProbCons and MUMMALS in execution time,which further verifies the efficiency of CROMSA.
Keywords/Search Tags:Multi-core processor, GPU, Parallel algorithm, Sequence alignment, Clustering, CRO
PDF Full Text Request
Related items