Font Size: a A A

Research On Sequence Alignment Algorithms For Large-scale Sequencing Datasets

Posted on:2020-08-05Degree:DoctorType:Dissertation
Country:ChinaCandidate:H Y ChengFull Text:PDF
GTID:1360330575966574Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
With the development of the next-generation sequencing(NGS)technologies,the sequencing cost has been dramatically reduced than Moore's Law.Moreover,genomes can be sequenced by NGS technologies at much higher speed.As a result,for related re-search projects,one of the major bottlenecks is how to process large amounts of data.In most existing sequencing data analysis pipelines,the fundamental step is the sequence alignment.This step is designed to identify the mapping positions of sequenced frag?ments(i.e.,reads)in the reference genome.Owing to the significant importance of sequence alignment,it has become a key problem in bioinformatics over the past few decades.In order to process the ever-increasing data generated by NGS technologies,it is necessary to develop more efficient sequence alignment algorithms.Essentially,the sequence alignment problem is a variant of the approximate string matching problem in computer science.Unlike the classical string matching algorithms,current sequence alignment algorithms consist of three key techniques:algorithm de-velopment,index development and parallelization.Based on the above three tech-niques,this dissertation studies the algorithm theories and efficient implementations of sequence alignment algorithms for different applications.Specifically,the contribu-tions and works of this dissertation include:1 Design and optimization of the locating algorithm for FM-indexAs a compressed full-text index,FM-index is the key data structure of many se-quence alignment algorithms.Thus,it significantly influences the performance of these sequence alignment algorithms.Unlike classical indexes,FM-index only saves a few sampled positions to reduce its space usage.However,when retrieving non-sampled positions from the FM-index,these positions have to be recovered one-by-one exploit-ing the expensive extra operations.In this dissertation,we proposed a novel locating algorithm FMtree for genomic data,which is able to recover the non-sampled positions block-by-block.For FMtree,its search space is organized into a quadtree.By travers-ing this quadtree,FMtree calculates the non-sampled positions block-by-block,while existing methods have to calculate these positions one-by-one.In addition,we also developed several branch-cut strategies to further improve the performance of FMtree.Experimental results show that for human and mouse genomes including billions of bases,FMtree is one order of magnitude faster than the state-of-the-art algorithms,and still memory-efficient.2 Optimization of all-mapperAll-mappers are designed to identify all mapping positions of each sequenced frag-ment(i.e.,read)in the reference genome.Compared with best-mappers which only re-port a few best mapping locations,all-mappers are much more time-consuming.There-fore,it is more difficult for them to efficiently process the massive sequencing data.In existing all-mappers,the most time-consuming step is the verification step,which needs to perform many edit distance computing tasks.In this dissertation,we proposed a fast all-mapper BitMapper.It is based on a novel vectorized bit-vector algorithm to signif-icantly accelerate the edit distance computing.This vectorized bit-vector algorithm is able to perform multiple edit distance computing tasks simultaneously,while existing algorithms have to calculate edit distance one-by-one.Experimental results show that BitMapper is about 3-4 times faster than the state-of-the-art all-mappers.3 Accelerating all-mapper exploiting GPUsAs a powerful tool for parallel computing,a GPU card consists of massive compu-tational cores to perform multiple tasks simultaneously.It seems that all-mappers are very suited for GPU computing,since they need to align a large number of indepen-dent reads.However,there is no suitable index for GPU-accelerated all-mappers.For popular FM-index,its string matching algorithm would result in many non-contiguous memory accesses without good data locality.For q-gram index,it is too large to be kept on limited GPU global memory.In this dissertation,we proposed a sparse q-gram index that is designed specifically for the architecture of GPUs.By utilizing this sparse q-gram index,we developed a GPU-accelerated all-mapper BitMapper2.Experimental results show that BitMapper2 is about 7-8 times faster than existing CPU-based and GPU-based all-mappers.4 Design of bisulfite sequencing aligner based on modified FM-indexAs a gold-standard technique for DNA methylation detection,whole-genome bisul-fite sequencing(WGB S)converts the unmethylated cytosines(C)to thymines(T),while the methylated Cs are still unchanged.In fact,the aim of WGBS aligners is to identify the methylated Cs.Since the existing algorithms do not make full use of this charac-teristic,the performance of them can be further improved.In this dissertation,we pro-posed an ultrafast WGBS aligner BitMapperBS.It is mainly based on a novel version of FM-index specifically for the WGBS data.Moreover,BitMapperBS also adopts the vectorized bit-vector algorithm to accelerate the verification step.Experimental results show that BitMapperBS is one or two order of magnitude faster than the state-of-the-art algorithms,while achieves similar or better sensitivity and precision.It is able to align 2 billion WGBS reads to the human genome in about 10 minutes,while popular WGBS aligners usually take more than 10 hours.Actually,the above four works can be classified into two groups:the theory works and the implementation works.Theory works consist of design and optimization of the locating algorithm for FM-index as well as design of bisulfite sequencing aligner based on modified FM-index,while the implementation works include optimization of all-mapper and accelerating all-mapper exploiting GPUs.
Keywords/Search Tags:Bioinformatic, Sequence alignment, Methylation, Parallel algorithm
PDF Full Text Request
Related items