Parallel Optimization And Implementation Of Massive Genome Annotation Algorithms

Posted on: 2018-09-09
Degree: Master
Type: Thesis
Country: China
Candidate: K W Huang
GTID: 2370330623450861
Subject: Computer Science and Technology
Abstract/Summary:
Genes are the basis of heredity and variation. They control all the life phenomena of an organism, such as birth, aging, illness, and death. Genome annotation is the process of annotating the biological functions of all genes in a genome using bioinformatics methods and tools. It is an important problem and a major challenge in bioinformatics research. However, sequencing data keeps growing as sequencing technology develops, and existing genome annotation tools struggle to meet the needs of bioinformatics research. To address this, we focus on two important parts of genome annotation: motif discovery and gene function annotation. Starting from real application problems, we use the Intel MIC processor, the Tianhe-2 supercomputer, and Hadoop to improve the performance of existing genome annotation tools. Our work includes:

1. MEME (Multiple EM for Motif Elicitation) is one of the most widely used algorithms for de novo motif discovery and is based on the maximum-likelihood principle. It requires fewer initial conditions and is less prone to becoming trapped in local extrema. MEME-Suite is an open-source implementation of the MEME algorithm. However, its high computational cost limits MEME on large datasets. To accelerate MEME, we parallelize it for the MIC architecture and present a parallel implementation called MIC-MEME. We parallelize the starting-point search of MEME with multithreading and improve the iteration update strategy to eliminate data dependencies. We also investigate a collaborative CPU/MIC parallel framework that overlaps computation on the CPU and the MIC and takes advantage of the 512-bit vector unit. Our experimental platform is a high-performance server. The results show that our approach achieves an average speedup of 26.6 with linear scalability. Built on the hybrid CPU/MIC computing framework, it harnesses the compute capability of the MIC. The optimized software completes motif discovery on a human promoter-region dataset 2 million bp long within 50 minutes, whereas MEME-Suite needs up to 17 hours.

2. MIC-MEME parallelizes the MEME algorithm on a hybrid CPU/MIC computing framework. Although it achieves an average speedup of 26.6 with linear scalability, it still cannot meet the needs of clinical research and can be improved further. To handle large datasets, we parallelize MIC-MEME across nodes of Tianhe-2 and present a motif discovery algorithm based on Tianhe-2. We parallelize the starting-point search algorithm across nodes and improve the data structure. The results show that we achieve a 3175-fold parallel speedup on 1024 nodes of Tianhe-2. The program completes motif discovery on the same human promoter-region dataset within half a minute, whereas the original program needs up to 17 hours. It also supports motif discovery on tens of millions of bp of data and completes motif discovery on the full set of human promoter regions, 10 million bp long, within 7 minutes.

3. SOAPgaea is a Hadoop-based bioinformatics tool for genome resequencing analysis. It mainly supports primary analysis, variation detection, and variation annotation. This work focuses on the gene function annotation module of SOAPgaea, which we build using Hadoop. Our approach improves the performance of multi-sample annotation by exploiting regularities in the input data to reduce redundant database searches. In addition, local file search is implemented to make the software easier to use. In our experiment, the program completes the jobs within 4 minutes, whereas the original program needs up to half an hour, and it supports both multi-sample annotation and local file search. Overall, it is efficient, easy to use, and well suited to massive data.
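
The idea of reducing redundant database searches in multi-sample annotation can be illustrated with a minimal sketch. The abstract does not say which regularity in the input is exploited, so this example assumes that many variants recur across samples and can therefore be looked up once and shared. It is not the SOAPgaea implementation and does not use Hadoop; `annotate_samples`, `toy_lookup`, `TOY_DATABASE`, and the variant tuples are all hypothetical names and data introduced for illustration.

```python
from collections import defaultdict


def annotate_samples(samples, lookup):
    """Annotate every sample while calling `lookup` once per distinct variant.

    `samples` maps a sample name to a list of variants (hashable keys).
    `lookup` stands in for the database search (an assumption for this sketch).
    """
    # Group sample names by variant so each distinct variant is searched once.
    carriers = defaultdict(list)
    for sample_name, variants in samples.items():
        for variant in variants:
            carriers[variant].append(sample_name)

    annotations = {name: {} for name in samples}
    for variant, names in carriers.items():
        result = lookup(variant)  # one search per distinct variant
        for name in names:
            annotations[name][variant] = result  # share the result
    return annotations


# Hypothetical toy database keyed by (chromosome, position, ref, alt).
TOY_DATABASE = {
    ("chr1", 12345, "A", "G"): "GENE_A:missense",
    ("chr2", 67890, "C", "T"): "GENE_B:synonymous",
}
calls = []  # records every database search that is actually performed


def toy_lookup(variant):
    calls.append(variant)
    return TOY_DATABASE.get(variant, "unknown")


samples = {
    "sample1": [("chr1", 12345, "A", "G"), ("chr2", 67890, "C", "T")],
    "sample2": [("chr1", 12345, "A", "G")],
    "sample3": [("chr1", 12345, "A", "G"), ("chr3", 111, "G", "A")],
}
result = annotate_samples(samples, toy_lookup)
print(len(calls), "lookups for", sum(len(v) for v in samples.values()), "variant records")
```

In this toy input the three samples contain five variant records but only three distinct variants, so the lookup function is called three times instead of five. In a MapReduce setting the same grouping would plausibly correspond to keying records by variant before the search step, though that mapping is an assumption rather than something the abstract states.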
Keywords/Search Tags:Genome Annotation, Motif Discovery, MEME, Intel MIC, Tianhe-2, Hadoop