Font Size: a A A

Parallel DNA Read Mapping Algorithms On Multi-core And Many-core Architecture

Posted on:2020-06-01Degree:DoctorType:Dissertation
Country:ChinaCandidate:Y D ChanFull Text:PDF
GTID:1360330572988721Subject:Computer Science and Technology
Abstract/Summary:PDF Full Text Request
Since the human genome project,the theory,method and application used in computing biology gains more and more attentions by biologists and medical scientist.As an interdiscipline,the research and development of computing biol-ogy plays an important role now.The data collection,storage and analysis for biological information including protein,protcomc.gene,genome,has becoiming an important,research direction of computing biology,medical and pharmacy.With the development of the sequencing technique.the volume of biology data grows quite fast.As a third-generation sequencing technique based company,Helicos can produce 3GB gene data with a single sequencing machine in one day.Besides,there is a trend between these sequencing company that biology data are getting longer and longer.So a lot of longer biolog,y data.pose a great challenge for many analysis applications of computing biology.Currently,these analysis applications target for protein and genome are developed on traditional CPU platform and implemented with single thread version.Then,in order to process much more biology data,we need use more hardware to meets the requirement of real-time analysis.Then the cost of biology analysis have been greatly increased.From 1986 to 2002,the performance of micro-processor grows with a ratio of 50%every year,Since 2002,the grows ratio decrease to 20%every year.So since 2005,many companies start to design and make multi-cores micro-processors to enhance the overall performance of CPU.Now each CPU can contains tens of cores micro-processors.At the same time,many computing device appears including Graphic processing units,Many integrated cores.Filed programmable Gate Array.Based on these new computing devices,we gains rapid development among different scientific areas,such as climate simulation,protein folding,drug discovery,energy research,human gene decoding,artificial intelligence.My research contains three parts,all of them are target to accelerate short read mapping process.First of all based on Many integrated cores architecture,we design and implement parallel Myers algorithm called MyPhi to accelerate the verification stage of short read mapping.The second part are target to the filtration stage of short read mapping,we propose a novel efficient filter algorithm PUNAS which can be parallelized on both multi-core and many-core architecture in a vectorized-scalable fashion.Finally,based on our previous work,we design an efficient short read mapper called FEM which can be used as an independent tool to accelerate the short read mapping progress.In order to accelerate the short read mapping,we make a simple survey on the efficient validation algorithm.Compared to Smith-Waterman algorithm,the bit-parallel based Myers algorithm achieves higher speed.The general idea of Myers is calculating the difference between adj acent cells instead of the absolute value of cells in the dynamic programming matrix.These differences can be represented with several bit-vectors.Based on the first and the second generation Xeon Phi platform,we design and realize a parallel version of Myers algorithm called MyPhi.In order to utilize the compute ability of multiple cores on Xeon Phi,we use OpenMP programming model to employ multi-threading parallelization.In the fine grain,we design two parallel schemes,including inter-parallel and intra-parallel mode.By comprehensive experiments,inter-parallel is generally better than intra-parallel mode.Based on specific platform,we use relative instruction set to implcenent the core operations of Myers algorithm.The parallel version of Myers achieves over trillions cells update per second(TCUPS)on a single Xeon Phi(KNL)which is an order of magnitude faster than other parallel Smith-Waterman algorithm.Myers algorithm sever as an efficient validations algorithm which have been widely used in many popular short read aligner such as SeqAn,RazerS.Our implementation can be a good tool to accelerate the efficiency of these short read mapper.Shifted Hamming Distance(SHD)is a faster algorithm compared to Myers algorithm.For short read alignment,many sequence alignment process in vali-dation stage is useless since they are only retrieved as an candidate.Thus,some heuristic algorithm use a fast filter strategy to quickly remove these incorrect can-didates.These algorithms are called filter algorithm.Shifted Hamming distance is currently the best filter algorithm since it gains a linear time complexity.We present a novel algorithm,called PUNAS algorithm which is equivalent to SHD algorithm in the algorithm view.PUNAS is based on bit-parallelism and takes advantage of SIMD vector units of modern microprocessors.Our implementation employs a vectorize-and-scale approach supporting multi-core CPUs and many-core Knights Landing based eon Phi processors.Performance evaluation reveals that PUNAS is over three orders-of-magnitude faster than seed verification with the Smith-Waterman algorithm and around one order-of-magnitude faster than seed verification with the banded version of Mvers bit-vector algorithm.Finallhy.based on the previous work,we present a short read mapping tool called FEM.FEM contains a number of novel techmiques.Firstly,FEM present a succinct hash index which only record a portion of locations in reference genome Based on these algorithm,we further present two seed selection algorithm called group-seeding and variable-length algorithm.Finally,we use the optimized Myers algorithm to validate all candidates and find the correct mappings for each short read.Compared to existed short read mapper,FEM present comparable spped,memory usage and accuracy.Given a hunan genome.the novel hash index can be compacted from 16GB to 3GB.Besides,we observe an up to 2.8-fold speedup compared to BitMapper and an order-of-magnitude speedup compared to BitMapper2 and Hobbes3.
Keywords/Search Tags:Biology computing, parallel computing, sequence alignment, Many Integrated cores, sequencing
PDF Full Text Request
Related items