DNA Sequence Splicing Algorithm Based On Spark

Posted on:2018-07-09

Degree:Master

Type:Thesis

Country:China

Candidate:X Pan

Full Text:PDF

GTID:2310330518955801

Subject:Computer application technology

Abstract/Summary:

PDF Full Text Request

Bioinformatics is a cross discipline dealing with biological information,and the problem of DNA sequence splicing is one of the main research contents.The length of the DNA sequence is at least a few thousand.Once the number of sequencer can be measured only 500bp,which can not directly measure the genome of organisms.Therefore,DNA sequence stitching algorithm arises at the historic moment.In this algorithm,the target sequence is divided into small fragments,and then the small fragments are sequenced respectively.Finally,the fragments are spliced according to the overlapping relationship between the fragments by the computer technology.At present,the sequence stitching algorithm is mainly divided into Overlap-Layout-Consensus splicing algorithm and de-Bruijin graph algorithm.The former uses the "overlap-layout-consensus" method based on the read fragment for splicing,although it can retain the complete information of the fragment,but it can not effectively overcome the problem of repeat sequence.The latter will be further split read fragments,and then stitching these smaller fragments.Although the problem of repetitive sequences is overcome to some extent,but a large number of k-mer fragments and the de-Bruijin graph is generated.Therefore,this algorithm is still time-consuming.In addition,for the realization of the platform,most of the research is based on the serial algorithm,which is the bottleneck of space and time consumption.The parallel strategy based on MapReduce is proposed and realized.This method is very good to overcome the efficiency and storage problem,but computing engine based on MapReduce will usually output intermediate results to disk,storage and fault tolerance,a time-consuming hard disk read and write operations.Although the efficiency has been significantly improved,but still need a longer run time.In order to solve the above problems,based on the research of the de-Bruijin graph splicing algorithm,this paper proposes a strategy to split the read fragment in turn to the right two bits.The idea of changing the k-mer acquisition method is incorporated into the algorithm to reduce the complexity of the de Bruijin graph.At the same time,the algorithm is implemented in Spark parallel environment.The simulation results show that the efficiency of the proposed algorithm based on Spark is higher than that of single serial and parallel algorithms based on MapReduce.

Keywords/Search Tags:

Sequence stitching algorithm, De-Bruijin graph stitching algorithm, Repetitive sequence, Spark parallel environment

PDF Full Text Request

Related items

1	Research On Subaperture Stitching Interferometry
2	High Accuracy Surface Metrology Of X-ray Mirrors Research
3	The Optical Plane Of The Large-diameter Interference Detection Sub-aperture Stitching
4	Stitching Methods For Satellite Images From Multiple TDI-CCD
5	Research And Implementation Of Subgraph Matching Algorithm Based On Spark
6	Based On UAV Route Planning And Image Stitching Technology Area Assessment Of Farmland Floods
7	Parallel Optimization Design And Implementation Of Biological Sequence Alignment Algorithm
8	Research On The Algorithm Testing Optical Mirror By Subaperture Stitching Interferometry
9	Research On Stitching Algorithms For Multi-view Oblique Aerial Images Of Karst Landform
10	Design And Implementation Of UAV Remote Sensing Image Stitching System