Research on Short Sequence Alignment and Scaffolding Algorithms Based on Next-Generation Sequencing

Posted on: 2019-06-21
Degree: Master
Type: Thesis
Country: China
Candidate: G Zheng
Full Text: PDF
GTID: 2370330545489871
Subject: Radio Physics
Abstract:
Bioinformatics is a new discipline that analyzes and processes data from the biomedical field by combining electronic informatics, statistics, and computer science. With the rapid development of gene sequencing technology, the analysis and processing of sequencing data has become an important field of bioinformatics. Existing short-sequence matching algorithms for next-generation sequencing perform poorly on repeated sequences, and existing scaffolding algorithms are prone to errors when they directly assemble repetitive and non-repetitive sequences together. This raises new requirements for short-sequence matching algorithms and genome assembly scaffolding algorithms. Given the characteristics of next-generation sequencing data, there is an urgent need for short-sequence matching and scaffolding algorithms that meet the needs of actual scientific research.

First, the thesis introduces existing short-sequence matching algorithms, represented by dynamic programming, lattice, Bowtie, and BWA, and existing scaffolding algorithms, represented by SOAPdenovo, Bambus2, Opera, and Velvet, and describes their basic principles and specific steps in detail. It also compares the performance of the current mainstream short-sequence matching and scaffolding algorithms and analyzes how they perform on repeated segments. The analysis shows that the existing algorithms have much room for improvement in handling repeats.

Second, the thesis proposes a new repetitive-sequence matching algorithm, named HashRepAligner, based on constructing a hash index and sliding matching points. The new algorithm is a complete and accurate matching algorithm for repetitive short sequences. It is divided into four steps: constructing a hash index for the short sequences, sliding matching points, determining coverage depth, and boundary detection. The experimental results show that HashRepAligner aligns repeated sequences more completely and accurately calculates the copy number of each repeated sequence. The algorithm also accurately finds the start and end positions of the repeated sequences.

Finally, based on the SWA algorithm combined with next-generation sequencing, the thesis proposes a genome scaffolding algorithm, named HashRepScaffold, that extends both repetitive and non-repetitive sequences. HashRepScaffold can assemble repetitive and non-repetitive regions independently. The algorithm first preprocesses the data and builds a hash index. It then computes the mapping between the left and right ends of each contig and the original right-end sequencing reads (br) and left-end sequencing reads (bl), which determines how many br and bl reads map to the left and right ends of each contig. Using these mappings and the paired-end data, it connects repetitive and non-repetitive contigs to obtain the scaffold. The experimental results show that HashRepScaffold is suitable for repeat fragments longer than 240 bp and can assemble scaffolds completely and accurately even at lower coverage depth.
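The abstract gives only the step names for HashRepAligner, so the following is a minimal illustrative sketch of the general idea behind those steps, not the thesis's implementation. It assumes a plain k-mer hash index over the reads, a fixed window that slides one base at a time, and a simple depth threshold for boundary detection; the function names, the choice of `k`, and the threshold are all hypothetical.

```python
from collections import defaultdict


def build_kmer_index(reads, k):
    """Hash index (assumed form): map each k-mer to its occurrence count across all reads."""
    index = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            index[read[i:i + k]] += 1
    return index


def kmer_depth(sequence, index, k):
    """Slide a window of length k along `sequence` and look up each k-mer's count.

    The resulting list is a rough stand-in for coverage depth at each window start.
    """
    return [index.get(sequence[i:i + k], 0) for i in range(len(sequence) - k + 1)]


def high_depth_regions(depths, threshold):
    """Boundary detection (assumed form): half-open (start, end) ranges of window
    positions whose depth is at least `threshold`."""
    regions = []
    start = None
    for i, depth in enumerate(depths):
        if depth >= threshold and start is None:
            start = i                      # depth rises above threshold: region begins
        elif depth < threshold and start is not None:
            regions.append((start, i))     # depth drops below threshold: region ends
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions


# Toy data: "ACGT" appears in three reads, so its k-mers have higher depth.
reads = ["ACGT", "ACGT", "ACGT", "TTGA"]
index = build_kmer_index(reads, k=3)
depths = kmer_depth("TTGACGT", index, k=3)
regions = high_depth_regions(depths, threshold=2)
```

In this toy setup, windows whose k-mers occur in many reads stand out as a high-depth region, and the region's endpoints play the role of repeat boundaries. A real aligner would also need to handle sequencing errors, reverse complements, and normalization of depth against the expected single-copy coverage.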
Keywords: Next-generation sequencing, Short sequence matching algorithm, Hash index, Scaffolding algorithms, Repeat sequence