Font Size: a A A

DNA Sequence Splicing Algorithm Based On Spark

Posted on:2018-07-09Degree:MasterType:Thesis
Country:ChinaCandidate:X PanFull Text:PDF
GTID:2310330518955801Subject:Computer application technology
Abstract/Summary:PDF Full Text Request
Bioinformatics is a cross discipline dealing with biological information,and the problem of DNA sequence splicing is one of the main research contents.The length of the DNA sequence is at least a few thousand.Once the number of sequencer can be measured only 500bp,which can not directly measure the genome of organisms.Therefore,DNA sequence stitching algorithm arises at the historic moment.In this algorithm,the target sequence is divided into small fragments,and then the small fragments are sequenced respectively.Finally,the fragments are spliced according to the overlapping relationship between the fragments by the computer technology.At present,the sequence stitching algorithm is mainly divided into Overlap-Layout-Consensus splicing algorithm and de-Bruijin graph algorithm.The former uses the "overlap-layout-consensus" method based on the read fragment for splicing,although it can retain the complete information of the fragment,but it can not effectively overcome the problem of repeat sequence.The latter will be further split read fragments,and then stitching these smaller fragments.Although the problem of repetitive sequences is overcome to some extent,but a large number of k-mer fragments and the de-Bruijin graph is generated.Therefore,this algorithm is still time-consuming.In addition,for the realization of the platform,most of the research is based on the serial algorithm,which is the bottleneck of space and time consumption.The parallel strategy based on MapReduce is proposed and realized.This method is very good to overcome the efficiency and storage problem,but computing engine based on MapReduce will usually output intermediate results to disk,storage and fault tolerance,a time-consuming hard disk read and write operations.Although the efficiency has been significantly improved,but still need a longer run time.In order to solve the above problems,based on the research of the de-Bruijin graph splicing algorithm,this paper proposes a strategy to split the read fragment in turn to the right two bits.The idea of changing the k-mer acquisition method is incorporated into the algorithm to reduce the complexity of the de Bruijin graph.At the same time,the algorithm is implemented in Spark parallel environment.The simulation results show that the efficiency of the proposed algorithm based on Spark is higher than that of single serial and parallel algorithms based on MapReduce.
Keywords/Search Tags:Sequence stitching algorithm, De-Bruijin graph stitching algorithm, Repetitive sequence, Spark parallel environment
PDF Full Text Request
Related items