
Research On GPU-based Sequence Aligner Towards Next-generation Sequencing

Posted on: 2014-02-02
Degree: Master
Type: Thesis
Country: China
Candidate: Y Liu
GTID: 2250330398989948
Subject: Biomedical engineering
Abstract/Summary:
Biological sequence alignment is a fundamental method in bioinformatics research. With the rapid development of next-generation sequencing technology and the rise of research fields such as Metagenomics, Epigenomics and genome-wide association studies (GWAS), traditional sequence alignment tools can no longer keep up with the ever-increasing volume of biological data. It is therefore of great significance to study how to accelerate the alignment process with the latest computing technology.

There are two main ways to accelerate alignment. One is to optimize the data structures and algorithms used in the aligner; these optimizations differ with the alignment type. The other is to use parallel computing and auxiliary hardware accelerators. The approaches in this second group share the same basic idea, which is to exploit the parallelism in the original sequence aligner, so the actual speedup depends on how parallel the algorithm is. The major technologies used for parallelization are parallel computing, distributed computing and heterogeneous computing. General-purpose computing on GPUs is a form of heterogeneous computing, and using GPUs to accelerate sequence alignment has drawn much attention from researchers in recent years. Several traditional sequence aligners have been reimplemented on GPUs and achieve significant speedups. However, there has been no comprehensive evaluation of the usability of these aligners to tell researchers whether they should choose a GPU-based sequence aligner.

In bioinformatics research, different sequence alignment tools are generally designed for specific applications. Many short read aligners have been developed to analyze the huge number of short reads generated by next-generation sequencing technology. These aligners basically meet the needs of short read alignment, but most of them cannot align longer sequences. What is more, they handle insertions and deletions in the sequence poorly.
In addition, third-generation sequencing technology can produce longer reads, which cannot be processed with current short read aligners. So it is still meaningful to study how to accelerate traditional sequence aligners.

Metagenomics derives from traditional microbial genomics. It extracts all gene sequences directly from environmental samples and then analyzes them. Metagenomics can be used to reveal the relationship between the community structure and function of microbes, to reveal evolutionary relationships and to discover new genes. Metagenomic analysis requires aligning mixed gene sequences that differ greatly from each other. Traditional sequence alignment tools are not fast enough to analyze the huge amount of data, while current short read aligners cannot meet the sensitivity requirements of Metagenomics research, so new algorithms are needed. GHOSTM is a GPU-based sequence aligner used in Metagenomics. It greatly speeds up alignment, but further acceleration is needed to analyze the massive data generated by next-generation sequencing technology.

The main work of this thesis is as follows.

First, we evaluated the usability of GPU-based sequence alignment tools through a literature survey. To highlight performance, authors of existing sequence aligners tend to overemphasize the speed gain over CPU alignment tools. Other factors, such as accuracy of the results, performance per watt, price-performance and the programming complexity of CUDA, are omitted by almost all authors. We gave a comprehensive evaluation of the usability of GPUs for sequence alignment in terms of these factors. We further re-evaluated the performance of GPU-based short read sequence aligners on the same platform with real datasets. For most datasets, the GPU-based sequence aligners generally achieve better performance per watt and price-performance.
More optimizations are needed for the gapped sequence aligners.

Second, we used CUDA to accelerate the nucleotide sequence aligner BLASTN and developed CUDA-BLASTN. BLASTN is a sub-tool of the widely used sequence alignment tool BLAST and plays an important role in research on non-coding RNA, biological evolution and pathogen detection. Based on NVIDIA's CUDA architecture, we accelerated the seed and ungapped alignment stages of BLASTN along two dimensions: coarse-grained multithreaded parallelism and multi-GPU parallelism. CUDA-BLASTN makes full use of the features of the different memory types on the GPU. It achieves a certain speedup over the latest NCBI-BLASTN and is best suited to aligning medium-sized query sequences against a long subject database.

Third, we used the traditional parallel computing technology MPI and the more recent distributed computing architecture Hadoop to accelerate GHOSTM, respectively. Metagenomics plays an important role in pathogen detection. MPI and Hadoop are the basis of cluster computing and cloud computing. Our work provides software support for Metagenomics research. MPI provides explicit parallelism, and mpiGHOSTM scales linearly with the number of processes. Hadoop has significant advantages in scalability and in recovery from single-node failures. Building on our study of Hadoop and CUDA integration, we developed Hadoop-GHOSTM, which can fully utilize both CPU and GPU resources.

The main contributions of this thesis are as follows.

We analyzed GPU-based sequence aligners comprehensively from the perspectives of accuracy, performance per watt, price-performance and programming complexity for the first time, and systematically evaluated the usability of GPUs for sequence alignment.

Based on CUDA, we developed CUDA-BLASTN to accelerate the nucleotide sequence aligner BLASTN.
This tool can be used to align the huge amount of biological data generated by next-generation sequencing technology.

We accelerated the GPU-based sequence aligner GHOSTM with MPI and Hadoop, respectively, and provided a reference for the integration of Hadoop and CUDA.

On the one hand, the evaluation of GPU-based sequence aligners can serve as a reference for researchers choosing an aligner for a specific task. On the other hand, the newly developed sequence aligners can be used in many research areas, such as pathogen detection and Metagenomics research.
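
For readers unfamiliar with the two BLASTN stages that the abstract says CUDA-BLASTN accelerates, the following is a minimal CPU-only sketch in Python of seeding (exact word matches) followed by ungapped extension with an X-drop cutoff. It is an illustration of the general seed-and-extend idea, not the thesis's CUDA implementation. The function names, the word size of 4, the match/mismatch scores and the `x_drop` value are all assumptions chosen for a small example.

```python
from collections import defaultdict


def build_word_index(subject, word_size):
    """Map every word of length word_size in subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - word_size + 1):
        index[subject[i:i + word_size]].append(i)
    return index


def extend_ungapped(query, subject, q_pos, s_pos, word_size,
                    match=1, mismatch=-3, x_drop=5):
    """Extend an exact seed in both directions without gaps.

    Extension in each direction stops once the running score falls more
    than x_drop below the best score seen so far (illustrative values).
    Returns (score, q_start, q_end, s_start), with q_end exclusive.
    """
    score = best = word_size * match

    # Extend to the right of the seed.
    best_q_end = q_pos + word_size
    i, j = q_pos + word_size, s_pos + word_size
    while i < len(query) and j < len(subject):
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, best_q_end = score, i
        elif best - score > x_drop:
            break

    # Extend to the left of the seed, starting from the best score so far.
    score = best
    best_q_start = q_pos
    i, j = q_pos - 1, s_pos - 1
    while i >= 0 and j >= 0:
        score += match if query[i] == subject[j] else mismatch
        if score > best:
            best, best_q_start = score, i
        elif best - score > x_drop:
            break
        i -= 1
        j -= 1

    s_start = s_pos - (q_pos - best_q_start)
    return best, best_q_start, best_q_end, s_start


def seed_and_extend(query, subject, word_size=4):
    """Find exact word seeds, extend each one, and return unique hits by score."""
    index = build_word_index(subject, word_size)
    hits = set()
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, ()):
            hits.add(extend_ungapped(query, subject, q_pos, s_pos, word_size))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    for hit in seed_and_extend("ACGTACGT", "TTACGTACGTTT"):
        print(hit)
```

Each seed is extended independently of the others, which is the kind of parallelism a GPU or multi-GPU implementation could plausibly exploit; how CUDA-BLASTN actually distributes this work across threads and memory types is not described in the abstract.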
Keywords/Search Tags: next-generation sequencing, sequence alignment, CUDA, MPI, Hadoop