A Study And Implementation Of High Throughput Algorithm For Long Read Error Correction

Posted on:2019-04-18

Degree:Master

Type:Thesis

Country:China

Candidate:L X Lan

GTID:2310330542991591

Subject:Software engineering

Abstract/Summary:

Third-generation PacBio SMRT long reads effectively address the read-length limitation of second-generation sequencing technology, but they contain approximately 15% sequencing errors. Several error correction algorithms have been designed to reduce the error rate to 1%, but they discard large amounts of uncorrected bases and thus have low throughput. This loss of bases can limit the completeness of downstream assemblies and the accuracy of analysis. The low throughput stems from two problems: error richness and a lack of reference data. To address these two problems, we introduce HALC, a high-throughput algorithm for long read error correction. HALC uses two novel approaches: similar-repeat-based alignment and long-read-support-based validation. In the similar-repeat-based alignment approach, HALC aligns the long reads to short read contigs from the same species with a relatively low identity requirement, so that each long read region can be aligned to at least one contig region, including repeats of its true genome region in the contigs that are sufficiently similar to it. In the long-read-support-based validation approach, HALC then constructs a contig graph and, for each long read, references the other long reads' alignments and the adjacency relationships of contig regions to find the most accurate alignment, and corrects the read with the aligned contig regions. Even though some long read regions whose true genome regions are absent from the contigs are corrected with their repeats, this approach makes it possible to further refine these long read regions with the initial insufficient short reads and to correct the uncorrected regions in between. In our performance tests on E. coli, A. thaliana and Maylandia zebra data sets, HALC obtained 6.7-41.1% higher throughput than the existing algorithms while maintaining comparable accuracy. The HALC-corrected long reads can thus result in 11.4-60.7% longer assembled contigs than the existing algorithms.
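
The validation idea described in the abstract can be illustrated with a small sketch. This is not HALC's implementation. It is a toy model under stated assumptions: each long read is split into segments, each segment has a few candidate contig regions with an alignment score, and the "support" for placing one contig region directly before another is the number of other long reads whose alignments show that adjacency. The names `build_support` and `best_path`, the region labels, and the additive scoring are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_support(paths):
    """Count how many long reads place contig region a directly before region b.

    paths: list of lists of region labels (hypothetical input, one list per
    other long read, taken from that read's alignments).
    """
    support = Counter()
    for path in paths:
        for a, b in zip(path, path[1:]):
            support[(a, b)] += 1
    return support


def best_path(candidates, support):
    """Pick one contig region per long-read segment by dynamic programming.

    candidates: non-empty list (one entry per segment) of dicts
                {region_label: alignment_score}.
    Path score (an assumption of this sketch) = sum of alignment scores
    plus the support count of every adjacent pair of chosen regions.
    """
    # best[r] = (score, path) of the best-scoring path that ends at region r
    best = {r: (s, [r]) for r, s in candidates[0].items()}
    for segment in candidates[1:]:
        nxt = {}
        for r, s in segment.items():
            score, path = max(
                ((ps + support[(p, r)] + s, pp) for p, (ps, pp) in best.items()),
                key=lambda t: t[0],
            )
            nxt[r] = (score, path + [r])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]


# Three other long reads agree that regions A, B, C are adjacent in that order.
support = build_support([["A", "B", "C"]] * 3)

# The middle segment aligns slightly better to a repeat (B_rep) than to B.
candidates = [{"A": 0.9}, {"B": 0.80, "B_rep": 0.85}, {"C": 0.9}]
chosen = best_path(candidates, support)
```

In this toy input the repeat `B_rep` has the higher alignment score on its own, but the adjacency support from the other reads makes the path through `B` score higher, which mirrors the role that the other long reads' alignments play in the abstract's description.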

Keywords/Search Tags:

PacBio long reads, Error correction, Throughput

Related items

1	Algorithmic Study On Long Read Assembly Error Correction Based On Linked Reads Sequencing Data
2	High-Throughput Long Paired-End Sequencing Of A Fosmid Library By PacBio
3	Improving quality of high-throughput sequencing reads
4	Chloroplast Genomics And Transcriptomics In Duckweeds By PacBio Long-reads
5	Cloud Computation-Based Error Correction For Transcriptome Assembly
6	A Study Of Reference Assisted Misassembly Detection Algorithm Using Short And Long Reads
7	A Long Read Hybrid Error Correction Algorithm Based On Segmented PHMM
8	The Flow Of Long Amplicons Technology Improvement And Software Development For Third Generation Of Mixed Sequencing
9	Test And Comparation Of Softwares Suitable For RNA-seq Reads Mapping Via Simulated And Real Reads
10	Research On High-precision Deformation Inversion Method Of Ground-based SAR Based On Long-term Sequence Observation Error Correction