Efficient Distributed Large-scale Genome Sequence Assembly

Posted on: 2017-04-11
Degree: Master
Type: Thesis
Country: China
Candidate: K Xu
GTID: 2270330482997615
Subject: Computer Science and Technology
Abstract/Summary:
In genome assembly, the primary issue is how to determine the upstream and downstream regions of sequence seeds in order to construct long contigs or scaffolds. When a seed is extended, repetitive regions in the genome produce multiple feasible extension candidates, which makes assembly harder. The commonly accepted solution is to choose one candidate based on read overlaps and paired-end (mate-pair) reads. However, this solution struggles with some complex repetitive regions. In addition, sequencing errors may produce false repetitive regions, and uneven sequencing depth leaves some regions with too few or too many reads. These problems prevent existing assemblers from producing satisfactory assembly results.

In this work, we develop an algorithm that determines the overlapping regions of contigs generated by several assemblers and then performs scaffolding. By building a k-mer-position index, mapping reads, clustering contigs, and assembling within each cluster, the proposed algorithm outperforms each of the individual assemblers. The algorithm runs on the Hadoop platform without requiring very large memory, and many of its steps use MapReduce. On real E. coli K-12 datasets, we compare the performance of the proposed algorithm with that of other popular assemblers. The experimental results show that the proposed algorithm obtains longer and more accurate scaffolds; in particular, N50 increases by 46%. The final assembled sequence is closer to the whole genome, and the algorithm runs fast on the Hadoop platform.
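
The abstract does not give implementation details, so the following is only a minimal sketch of what a k-mer-position index over contigs might look like when split into a map step and a reduce step. It is plain Python rather than Hadoop code, and the function names (`map_kmers`, `reduce_positions`, `build_index`), the toy contigs, and the choice of k = 4 are all illustrative assumptions, not taken from the thesis.

```python
from collections import defaultdict


def map_kmers(contig_id, sequence, k):
    """Map step (hypothetical): emit (k-mer, (contig_id, offset)) pairs for one contig."""
    for offset in range(len(sequence) - k + 1):
        yield sequence[offset:offset + k], (contig_id, offset)


def reduce_positions(pairs):
    """Reduce step (hypothetical): group all positions that share the same k-mer."""
    index = defaultdict(list)
    for kmer, position in pairs:
        index[kmer].append(position)
    return dict(index)


def build_index(contigs, k):
    """Run the map step over every contig, then reduce into one index."""
    pairs = (
        pair
        for contig_id, sequence in contigs.items()
        for pair in map_kmers(contig_id, sequence, k)
    )
    return reduce_positions(pairs)


# Toy contigs (made up for illustration); k = 4 is an arbitrary choice.
contigs = {"c1": "ACGTAC", "c2": "GTACGG"}
index = build_index(contigs, 4)

# A k-mer that occurs in more than one contig hints at a possible overlap.
shared = {kmer: pos for kmer, pos in index.items()
          if len({cid for cid, _ in pos}) > 1}
print(shared)
```

In this sketch, k-mers shared between contigs from different assemblers are the candidates for overlap detection; on a real cluster the map and reduce steps would run as distributed tasks, so no single machine has to hold the whole index in memory.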
Keywords/Search Tags: genome assembly, MapReduce, Contig, Scaffold, Bloom Filter