Cloud Computation-Based Error Correction For Transcriptome Assembly

Posted on: 2016-12-02 | Degree: Master | Type: Thesis
Country: China | Candidate: H Z Pang | Full Text: PDF
GTID: 2180330464470715 | Subject: Software engineering
Abstract/Summary:
Gene sequencing helps us understand the genetic information of organisms, which in turn helps identify disease genes and find appropriate treatments. Because of limitations in experimental technology, DNA is usually split into fragments, and sequence assembly is then applied to splice them back together. However, these short fragments may contain errors such as lost, inserted, or deleted bases, so correcting these errors, or noise, in the bases is an important problem.

The main idea of serial algorithms is that if two reads share subfragments, they might come from the same genome. In this way, the most likely erroneous bases can be found and corrected. For example, the K-mer listing algorithm uses the K-mer listing diagram to find wrong bases; it is relatively fast, but its accuracy is low. The K-mer enumeration alignment algorithm first uses K-mers to find the reads that share the same feature, then uses these reads to identify erroneous bases. It improves accuracy, but its computation is very complex and its memory consumption is very high. Neither algorithm can therefore handle tasks with massive data.

This thesis proposes a parallel error correction algorithm that increases the speed and efficiency of error correction and reduces memory usage by using HDFS (Hadoop Distributed File System), the Map/Reduce parallel programming model (Hadoop's open-source implementation of Google Map/Reduce), and new modification rules for bases. The main procedures are as follows:

(1) The method improves the original operating procedure and overall structure on the basis of the Hadoop Map/Reduce parallel programming model. A linked list suited to parallel error correction is proposed and used to store the related K-mer and read information. Further, data pre-processing is carried out with the Map/Reduce model by changing the storage format of the short fragment sequences and filtering out the useless information within them. Finally, the parallel enumeration of all K-mers is output for the later comparison with reads.

(2) The method performs sequence comparison between K-mers and reads under the Map/Reduce parallel programming model, in order to obtain all reads that contain the same K-mer feature; all comparison results are stored in the aforementioned linked list. The modification rules for bases are also improved: a more complete rule is designed to calculate the average case quality score, which the parallel algorithm applies. The new rule is also used to change wrong bases, improving the accuracy of the final result.

(3) A comparison and analysis of run time, memory usage, and error correction accuracy between the parallel and serial algorithms shows that the parallel error correction algorithm is feasible and effective.
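
The pipeline summarized above (enumerate K-mers, group the reads that share a K-mer, then use quality scores to decide which base to keep) can be illustrated with a minimal in-memory sketch. This is not the thesis's implementation. Plain Python functions stand in for the Hadoop map, shuffle, and reduce phases, and the correction rule (a quality-weighted vote per aligned column) is an assumed stand-in for the thesis's modification rule, which the abstract does not spell out. All names (`map_kmers`, `shuffle`, `correct_reads`) and the sample reads are hypothetical.

```python
from collections import defaultdict


def map_kmers(read_id, seq, k):
    """Map step: emit (kmer, (read_id, offset)) for every K-mer in a read."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k], (read_id, i)


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def correct_reads(reads, quals, k):
    """Reduce-style step: quality-weighted vote among reads sharing a K-mer.

    Assumed rule (not from the thesis): a base is replaced only when another
    base has a strictly higher summed quality at the same aligned position.
    """
    pairs = [p for rid, seq in reads.items() for p in map_kmers(rid, seq, k)]

    # Each shared K-mer implies an alignment shift between two distinct reads.
    # A set keeps one entry per (read, read, shift), however many K-mers agree.
    alignments = set()
    for hits in shuffle(pairs).values():
        for rid_a, off_a in hits:
            for rid_b, off_b in hits:
                if rid_a != rid_b:
                    alignments.add((rid_a, rid_b, off_b - off_a))

    # votes[(read_id, pos)][base] = summed quality supporting that base.
    votes = defaultdict(lambda: defaultdict(int))
    for rid, seq in reads.items():
        for p, base in enumerate(seq):
            votes[(rid, p)][base] += quals[rid][p]  # a read votes for itself
    for rid_a, rid_b, shift in alignments:
        for p in range(len(reads[rid_a])):
            q = p + shift  # position in read b aligned with p in read a
            if 0 <= q < len(reads[rid_b]):
                votes[(rid_a, p)][reads[rid_b][q]] += quals[rid_b][q]

    corrected = {}
    for rid, seq in reads.items():
        out = []
        for p, base in enumerate(seq):
            col = votes[(rid, p)]
            best = max(col, key=col.get)
            out.append(best if col[best] > col[base] else base)
        corrected[rid] = "".join(out)
    return corrected


# Hypothetical data: r3 has a low-quality wrong base at position 4.
reads = {"r1": "ACGTACGGA", "r2": "ACGTACGGA", "r3": "ACGTTCGGA"}
quals = {
    "r1": [30] * 9,
    "r2": [30] * 9,
    "r3": [30, 30, 30, 30, 5, 30, 30, 30, 30],
}
corrected = correct_reads(reads, quals, k=4)
print(corrected["r3"])  # ACGTACGGA
```

In a real Hadoop job the map and reduce functions would run on separate nodes over HDFS splits, and the grouping done here by `shuffle` would be performed by the framework between the two phases.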
Keywords/Search Tags: Error correction, reads, cloud platform, sequence assembly, parallel computation