Efficient Distributed Large-scale Genome Sequence Assembly

Posted on:2017-04-11

Degree:Master

Type:Thesis

Country:China

Candidate:K Xu

Full Text:PDF

GTID:2270330482997615

Subject:Computer Science and Technology

Abstract/Summary:

In genome assembly, the primary problem is determining the upstream and downstream regions of sequence seeds in order to construct long contigs or scaffolds. When a seed is extended, repetitive regions in the genome give rise to multiple feasible extension candidates, which makes assembly harder. The commonly accepted solution is to choose one candidate based on read overlaps and paired-end (mate-pair) reads. However, this approach struggles with some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions, and uneven sequencing depth leaves some regions with too few or too many reads. These problems prevent existing assemblers from producing satisfactory assemblies. In this work, we develop an algorithm that determines the overlapping regions of contigs generated by several assemblers and then performs scaffolding. By building a k-mer-position index, mapping reads, clustering contigs, and assembling within each cluster, the proposed algorithm outperforms each of the individual assemblers. The algorithm runs on the Hadoop platform without requiring very large memory, and many of its stages use MapReduce. On real E. coli K-12 datasets, we compare the proposed algorithm with other popular assemblers. The experimental results show that it obtains longer and more accurate scaffolds; in particular, N50 increases by 46%. The final assembled sequence is closer to the whole genome, and the algorithm runs fast on the Hadoop platform.
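To make two of the terms in the abstract concrete, the following is a minimal single-machine sketch of a k-mer-position index over contigs and of the N50 statistic. It is not the thesis's implementation: the function names (`build_kmer_index`, `shared_kmer_counts`, `n50`), the in-memory dictionary, and the use of shared k-mer counts as a hint that two contigs overlap are all illustrative assumptions, and no MapReduce or Hadoop code is shown.

```python
from collections import defaultdict


def build_kmer_index(contigs, k):
    """Map each k-mer to the (contig_id, offset) positions where it occurs.

    `contigs` is a hypothetical dict of contig id -> sequence string.
    """
    index = defaultdict(list)
    for contig_id, seq in contigs.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((contig_id, offset))
    return index


def shared_kmer_counts(index):
    """Count distinct k-mers shared by each pair of different contigs.

    Assumption: a high count suggests the two contigs overlap and could
    be placed in the same cluster.
    """
    counts = defaultdict(int)
    for positions in index.values():
        ids = sorted({contig_id for contig_id, _ in positions})
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                counts[(ids[i], ids[j])] += 1
    return dict(counts)


def n50(lengths):
    """Return the length L such that sequences of length >= L
    cover at least half of the total assembled length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Toy contigs, invented for illustration only.
    contigs = {"a": "ACGTACGGA", "b": "ACGGATTC", "c": "TTTTTTTT"}
    index = build_kmer_index(contigs, k=4)
    print(shared_kmer_counts(index))
    print(n50([len(s) for s in contigs.values()]))
```

In a MapReduce setting, one plausible arrangement would be for the map phase to emit each k-mer with its position and for the reduce phase to group positions by k-mer, but the abstract does not say which stages are distributed in this way.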

Keywords/Search Tags:

genome assembly, MapReduce, Contig, Scaffold, Bloom Filter

Related items

1	Research On Genomic Scaffold Filling Problem Based On Contig
2	Research And Application Of The Third Generation Genome Assembly High Repeat Sequence
3	Research And Implement On Genome Wide Association Study Techniques Based On MapReduce
4	Algorithms For Genomic Scaffold Filling Problem
5	Construction Of Aegilops Genome Data Platform
6	Parallel and Cloud Computing Based Genome Assembly using Bi-directed String Graphs
7	Optimization Of Guide RNA Cluster Assembly Based On Guide RNA Scaffold Sequence Permutation
8	Three-dimensional Assembly Of Cells Based On Robotic Micromanipulation Combined With Magnetic Guidance
9	Algorithm Research Of DNA Contig Merger Based On BWT
10	Development of a high resolution whole genome radiation hybrid map for interrogating the rhesus macaque genome assembly