
Algorithmic Study On Long Read Assembly Error Correction Based On Linked Reads Sequencing Data

Posted on: 2020-03-04  Degree: Master  Type: Thesis
Country: China  Candidate: Y X Feng
GTID: 2370330578971052  Subject: Education Technology
Abstract/Summary:
Because third-generation sequencing data has a relatively high error rate, this thesis proposes an error correction algorithm that uses Linked Reads sequencing data from the 10x Genomics platform to correct long-read data from the PacBio platform. First, the assembler Wtdbg2 was used to assemble third-generation long reads of the human genome into contigs (contiguous sequences), and the contigs were broken into k-mers (short subsequences of the same length k) and stored in a hash table. Then, the Linked Reads sharing the same barcode were broken into k-mers with the same value of k. Each selected k-mer was looked up in the hash table to find the corresponding contig number and position, so that Linked Reads could be assigned to contigs rapidly. Next, the alignment tool Bowtie 2 was used to align the Linked Reads to the contigs. Finally, the hypergeometric distribution was used to evaluate the base frequencies at each position, compute a P value, and identify erroneous bases or single nucleotide polymorphisms (SNPs).

Linked Reads from 10x Genomics were used to verify the error correction on genomic data for Human HG00733, Human NA24385, and Human CHM1, which come from different human cells. The results show that the proposed algorithm can significantly increase scaffold length in genome assembly and that the assembled genome is highly accurate.

Third-generation PacBio genome sequencing data and 10x Genomics Linked Reads were selected as the experimental data set. Linked Reads data has certain technical advantages of its own. The underlying principle is as follows: barcodes are introduced into long DNA fragments for precise partitioning, and the long fragments are distributed into separate oil droplets. Using the GemCode platform, barcode sequences and adapter primers are introduced while each long fragment is amplified, and the product is then broken into fragments of a size suitable for sequencing. Short reads carrying the same barcode come from the same long fragment. The technology connects seamlessly with Illumina sequencers, and the short reads can be used to recover fragments up to 100 kb in length. The scaffold N50 obtained by combining long-fragment information with Illumina assembly data is ten times longer than that obtained with Illumina data alone.

Human third-generation sequencing data was chosen because the ultimate purpose of biological research is to explore the construction principles and developmental laws of the human body. Correcting errors in third-generation sequencing data can further improve the accuracy of sequencing and assembly. This algorithm is therefore of great significance for structural variation prediction and disease prediction.
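
The k-mer hash table step described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation: the function names `build_kmer_index` and `assign_read`, the toy sequences, the choice k = 4, and the majority-vote rule for picking a contig are all assumptions made for illustration, and reverse-complement matches are ignored for brevity.

```python
from collections import Counter, defaultdict


def build_kmer_index(contigs, k):
    """Map each k-mer to the (contig_id, offset) positions where it occurs."""
    index = defaultdict(list)
    for contig_id, seq in contigs.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((contig_id, pos))
    return index


def assign_read(read, index, k):
    """Return the contig hit by the most k-mers of the read, or None.

    Majority voting is an assumed rule; the abstract only states that
    k-mers are looked up to find contig number and position.
    """
    votes = Counter()
    for pos in range(len(read) - k + 1):
        # .get avoids inserting empty entries into the defaultdict
        for contig_id, _ in index.get(read[pos:pos + k], ()):
            votes[contig_id] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two short "contigs" and two short "reads".
contigs = {"ctg1": "ACGTACGTTGCA", "ctg2": "GGGCCCAAATTT"}
index = build_kmer_index(contigs, k=4)
print(assign_read("CGTTGC", index, k=4))
print(assign_read("CCCAAA", index, k=4))
```

In a real pipeline the reads assigned this way would then be aligned to their candidate contig only, which is presumably what makes the subsequent Bowtie 2 alignment fast.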
Keywords/Search Tags:High throughput sequencing, sequence error correction algorithm, gene assembly algorithm, long reads, Linked Reads