Optimization Research And Implementation Of DNA Sequencing Data Analysis Tool MuTect2

Posted on: 2019-03-23
Degree: Master
Type: Thesis
Country: China
Candidate: J W Cheng
Full Text: PDF
GTID: 2370330563991547
Subject: Information and Communication Engineering
Abstract/Summary:
The most critical step in DNA sequencing data analysis is variant calling, and MuTect2 is one of the most commonly used variant-calling tools. MuTect2 is typically used to detect somatic single-nucleotide variants (SNVs) and somatic insertions and deletions (indels). Variant calling plays an important role in research on the development and chemotherapy resistance of cancer. However, the volume of genomic data is large: a whole genome contains as many as 3 billion base pairs, a full analysis pipeline for DNA sequencing data may take several days to complete, and variant calling with MuTect2 is the most time-consuming step.

This thesis optimizes the DNA sequencing data analysis tool MuTect2. MuTect2 is implemented in Java, which is relatively inefficient. The thesis first profiles MuTect2 to locate its time-consuming modules and traces them to two specific algorithms, the Pair-HMM algorithm and the Smith-Waterman algorithm. It then designs and implements optimization schemes from a local and a global perspective. From the local perspective, the Pair-HMM and Smith-Waterman algorithms are rewritten in C++ and parallelized with OpenMP; they are also rewritten in CUDA to exploit the massively parallel computing capability of the GPU. From the global perspective, the MuTect2 framework is restructured using concurrent programming: its architecture is first decomposed into modules, and each module is then assigned to a separate thread. The concurrent version is compatible with both the C++ and the GPU optimizations of the two time-consuming algorithms.

Finally, the thesis evaluates the performance of the three optimization schemes and the consistency of their results. The C++ version of MuTect2 achieves speedups of 4.90 times, 1.32 times, and 1.10 times on deeply sequenced target-region data, more shallowly sequenced whole-exome data, and whole-genome data, respectively; the detected mutation sites are identical to those of the original, so recall is 100% in all three cases. The GPU version achieves a speedup of 7.45 times on the deeply sequenced target-region data using a GTX980, with a recall of 99.85% for the detected mutation sites. The concurrent version, run on a 40-core server, achieves speedups of 7.02 times, 4.85 times, and 4.29 times on the same three datasets, with recalls of 100%, 99.88%, and 99.29%, respectively.
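
To make the local optimization more concrete, the following is a minimal C++ sketch of a Smith-Waterman local alignment score, with an OpenMP loop over independent sequence pairs. It is an illustration only, not the thesis's or MuTect2's implementation: the names `Scoring`, `smithWatermanScore`, and `scoreBatch`, the scoring values, and the linear gap penalty are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical scoring parameters chosen for illustration (not MuTect2's values).
struct Scoring {
    int match = 2;      // reward for identical bases
    int mismatch = -1;  // penalty for differing bases
    int gap = -2;       // linear gap penalty (assumption; no affine gaps here)
};

// Returns the best local alignment score between a and b.
// Only two rows of the dynamic-programming matrix are kept in memory.
int smithWatermanScore(const std::string& a, const std::string& b,
                       const Scoring& s = Scoring{}) {
    std::vector<int> prev(b.size() + 1, 0);
    std::vector<int> curr(b.size() + 1, 0);
    int best = 0;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        curr[0] = 0;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const int diag = prev[j - 1] + (a[i - 1] == b[j - 1] ? s.match : s.mismatch);
            const int up = prev[j] + s.gap;        // gap in b
            const int left = curr[j - 1] + s.gap;  // gap in a
            curr[j] = std::max({0, diag, up, left});  // 0 restarts the local alignment
            best = std::max(best, curr[j]);
        }
        std::swap(prev, curr);
    }
    return best;
}

// Scores many independent pairs. The pragma distributes pairs across threads
// when compiled with OpenMP enabled (e.g. -fopenmp); otherwise it is ignored
// and the loop runs serially with the same result.
std::vector<int> scoreBatch(
    const std::vector<std::pair<std::string, std::string>>& pairs) {
    std::vector<int> scores(pairs.size());
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(pairs.size()); ++i) {
        scores[i] = smithWatermanScore(pairs[i].first, pairs[i].second);
    }
    return scores;
}
```

Parallelizing across pairs rather than inside one matrix avoids the data dependencies between neighbouring cells, since each pair is scored independently; whether the thesis parallelizes at this level is not stated in the abstract.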
Keywords/Search Tags: Variant calling, Pair-HMM algorithm, Smith-Waterman algorithm, GPU programming, Concurrent programming