Cloud Computation-Based Error Correction For Transcriptome Assembly

Posted on:2016-12-02

Degree:Master

Type:Thesis

Country:China

Candidate:H Z Pang

Full Text:PDF

GTID:2180330464470715

Subject:Software engineering

Abstract/Summary:

Gene sequencing helps us understand the genetic information of living organisms. It assists in identifying disease genes and in finding the right approach to disease treatment. Because of the limits of experimental technology, DNA is usually split into fragments, and sequence assembly is then applied to splice them back together. However, these short fragments may have bases lost, inserted, or deleted, so correcting the errors and noise embedded in the bases is an important issue.

The main idea behind serial algorithms is that if two reads share subfragments, they may come from the same genome. On this basis, the most likely erroneous bases can be found and corrected. One example is the K-mer listing algorithm, which uses a K-mer listing diagram to find wrong bases. Although it is relatively fast, its accuracy is low. The K-mer enumeration alignment algorithm first uses K-mers to find the reads that share the same feature, then uses these reads to identify erroneous bases. Although this algorithm improves accuracy, its computation is very complex and its memory consumption is very high. Neither algorithm can therefore handle tasks with massive data.

This thesis proposes a parallel error correction algorithm that increases the speed and efficiency of error correction and reduces memory usage by using HDFS (Hadoop Distributed File System), the Map/Reduce parallel programming model (in its open-source implementation of Google's Map/Reduce), and new base modification rules. The main procedures are as follows:

(1) The method improves the original operating procedure and overall structure on the basis of the Hadoop Map/Reduce parallel programming model. A linked list suited to parallel error correction is proposed and used to store the related K-mer information and read information. Further, data pre-processing is carried out with the Map/Reduce model by changing the storage format of the short fragment sequences and filtering out the useless information within them. Finally, the parallel enumeration of all K-mers is output for later read comparison.

(2) Sequence comparison between K-mers and reads is performed with the Map/Reduce parallel programming model. The aim is to obtain all read sequences that contain the same K-mer feature and to store all sequence comparison results in the aforementioned linked list. In addition, the base modification rules are improved: a more complete rule is designed to calculate the average-case quality score, and the parallel algorithm applies it. The new rule is also used to change wrong bases, improving the accuracy of the final result.

(3) A comparison and analysis of the run time, memory usage, and error correction accuracy of the parallel and serial algorithms shows that the parallel error correction algorithm is feasible and effective.
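
The K-mer enumeration and read-grouping steps described in (1) and (2) can be illustrated with a minimal single-process sketch of the map, shuffle, and reduce phases. This is not the thesis's Hadoop implementation: the function names (`map_kmers`, `shuffle`, `reduce_shared`, `reads_sharing_kmers`), the plain dictionary standing in for the linked list, and the sample reads are all assumptions made for illustration, and quality scores and the base modification rule are left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_kmers(read_id: str, seq: str, k: int) -> Iterator[Tuple[str, str]]:
    """Map step: emit one (k-mer, read_id) pair per window of length k."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], read_id


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Shuffle step: group emitted values by key, as a Map/Reduce framework would."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for kmer, read_id in pairs:
        groups[kmer].append(read_id)
    return groups


def reduce_shared(kmer: str, read_ids: List[str]) -> Tuple[str, List[str]]:
    """Reduce step: keep the distinct reads that contain this k-mer."""
    return kmer, sorted(set(read_ids))


def reads_sharing_kmers(reads: Dict[str, str], k: int) -> Dict[str, List[str]]:
    """Run map, shuffle, and reduce in one process (hypothetical driver)."""
    pairs = (pair for rid, seq in reads.items() for pair in map_kmers(rid, seq, k))
    return dict(reduce_shared(kmer, ids) for kmer, ids in shuffle(pairs).items())


if __name__ == "__main__":
    # Made-up reads, used only to exercise the sketch.
    sample = {"r1": "ACGTAC", "r2": "CGTACG", "r3": "TTTTTT"}
    for kmer, ids in sorted(reads_sharing_kmers(sample, 4).items()):
        print(kmer, ids)
```

Reads listed under the same K-mer are the candidates that a later comparison step would align to locate erroneous bases. In a real Hadoop job the shuffle is performed by the framework and the input would be read from HDFS rather than from an in-memory dictionary.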

Keywords/Search Tags:

Error correction, reads, cloud platform, sequence assembly, parallel computation

Related items

1	Algorithmic Study On Long Read Assembly Error Correction Based On Linked Reads Sequencing Data
2	Research On The Construction And Sequence Splicing Parallel Optimization Method Of The Second And Third Generation Genome Hybrid Assembly Process
3	Sequence Assembly Algorithms For Next-generation Sequencing Technology Research
4	Second-generation Sequencing Technology Based Short Reads Assembly System
5	Improving quality of high-throughput sequencing reads
6	Parallel and Cloud Computing Based Genome Assembly using Bi-directed String Graphs
7	Genome Assembly Guided By Reads
8	Quantum Logic Gate Sequence And Quantum Error Correction With Continuous Variables
9	A Study And Implementation Of High Throughput Algorithm For Long Read Error Correction
10	Research And Implementation Of Sequence Assembly Parallel Programming On Bi-directed De Bruijn Graph