Indel Detection Algorithm For Structural Variation Based On The Split-read Method

Posted on:2017-02-16

Degree:Master

Type:Thesis

Country:China

Candidate:J Pan

Full Text:PDF

GTID:2180330503987216

Subject:Software engineering

Abstract/Summary:

Structural variation is a common class of genetic variation. It is an important part of genomics research and provides valuable information for the study of human disease and biological traits. The 1000 Genomes Project was launched to advance the study of human genetic variation and aims to build one of the most detailed databases of genetic variation, which has considerable medical value. At the same time, the development of next-generation high-throughput DNA sequencing has greatly advanced research on structural variation (SV) detection. Insertions and deletions, collectively called indels, are the most common type of structural variation, and research on indels has grown increasingly active. Among human genetic variants, small and medium-sized indels are the second most numerous, outnumbered only by SNPs, and many of them occur at key positions in the genome. Current structural variation detection methods are mainly based on depth of coverage, paired-end mapping clusters, or sequence assembly. We find that some of them are inaccurate or overly sensitive, and some cannot determine the exact position and sequence of a structural variation. The difficulty in current research on structural variation lies in how to use high-throughput data to identify variation in biological sequences more accurately and more specifically. This thesis mainly studies the split-read method for identifying variations. First, we analyze high-throughput sequencing platforms and their data, along with related concepts such as reads, paired-end reads, split reads, and whole-genome resequencing. We then summarize the types of structural variation. A genetic mutation event is not characterized by its type alone.
The size and location of an event also determine which identification strategy should be used. For this purpose, we propose the Optimal Split-Read Matching algorithm (OSRM), which is designed with dynamic programming. OSRM builds on classic sequence alignment and combines global and local alignment. It breaks an abnormal read into the smallest possible number of pieces. First, it builds a score matrix for the abnormal read and the corresponding reference sequence. It then builds a matrix of backtracking paths. Finally, it uses a formula designed around the characteristics of structural variation to select the optimal backtracking path, so that the split read is matched to the reference sequence in an optimal arrangement, from which the accurate position and sequence of each detected indel are output. This thesis also improves the classic pattern string matching algorithm for identifying variations. SNPs, sequencing errors, and other limitations of sequencing technology make structural variation difficult to identify, so this thesis proposes a string matching algorithm that tolerates mismatches. During pattern matching, when the current character does not match, the algorithm checks whether the characters that follow it match, and uses that to decide whether the mismatch at the current single character can be allowed. In this way it can find variants in the presence of mismatches, which improves the robustness of the algorithm. Simulation experiments show that the improved algorithm increases the support for variations and can even find more of them. Finally, we test the performance of OSRM and its extended algorithms in simulation experiments and compare them with Pindel.
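The mismatch-tolerant matching idea described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the thesis's implementation: the function names, the `lookahead` window of three characters, and the `max_mismatches` cap are all hypothetical choices, since the abstract does not give the actual rule or its parameters. A mismatch is accepted only when the characters immediately after it match the reference.

```python
def match_allowing_isolated_mismatches(pattern, text, start,
                                       lookahead=3, max_mismatches=2):
    """Return mismatch offsets if `pattern` fits `text` at `start`, else None.

    Hypothetical rule: a mismatching character is tolerated only if the next
    `lookahead` pattern characters (or as many as remain) match the text.
    """
    if start < 0 or start + len(pattern) > len(text):
        return None
    mismatches = []
    i = 0
    while i < len(pattern):
        if pattern[i] == text[start + i]:
            i += 1
            continue
        if len(mismatches) >= max_mismatches:
            return None
        # Proofread the characters that follow the mismatching one.
        following = pattern[i + 1:i + 1 + lookahead]
        window = text[start + i + 1:start + i + 1 + len(following)]
        if following != window:
            return None
        mismatches.append(i)
        i += 1 + len(following)  # these characters were already verified
    return mismatches


def find_hits(pattern, text, **options):
    """Scan every start position and collect (start, mismatch_offsets)."""
    hits = []
    for start in range(len(text) - len(pattern) + 1):
        found = match_allowing_isolated_mismatches(pattern, text, start, **options)
        if found is not None:
            hits.append((start, found))
    return hits


reference = "GGACGTACGATTCC"
read = "ACGTTCGATT"  # differs from reference[2:12] by one SNP-like base
print(find_hits(read, reference))  # [(2, [4])]
```

Exact matching finds no occurrence of this read, while the tolerant scan reports one hit at position 2 with a single mismatch at offset 4. Two adjacent mismatches are rejected, because the lookahead check fails on the second one.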
The results show that OSRM achieves higher accuracy in recognizing small and medium-sized indels and is able to identify more complex mutation events. The only shortcoming of OSRM is its low efficiency when searching for large variations, which reflects its high time complexity. Experiments on real soybean data verify that the algorithm can identify more small indels.
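The score-matrix and backtracking steps that the abstract attributes to OSRM can be illustrated with a textbook semi-global alignment. This is a stand-in, not the OSRM algorithm: the abstract does not give its scoring formula or path-selection rule, so the function name `split_read_align`, the scores (`match=2`, `mismatch=-3`, `gap=-2`), and the event tuples are all assumptions made for illustration. The read is aligned end to end (global in the read) against any window of the reference (local in the reference), and runs of gap moves in the traceback are reported as indels.

```python
from itertools import groupby


def split_read_align(read, ref, match=2, mismatch=-3, gap=-2):
    """Align all of `read` to some window of `ref`; return (ref_start, events)."""
    n, m = len(read), len(ref)
    # Row 0 stays at 0 so the read may start anywhere in the reference;
    # column 0 is penalised so the whole read has to be used.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
        back[i][0] = "I"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = read[i - 1] == ref[j - 1]
            diag = score[i - 1][j - 1] + (match if same else mismatch)
            ins = score[i - 1][j] + gap   # read base with no reference partner
            dele = score[i][j - 1] + gap  # reference base skipped by the read
            best = max(diag, ins, dele)
            score[i][j] = best
            back[i][j] = "M" if best == diag else ("I" if best == ins else "D")

    # The read must be consumed completely; the end column is free.
    i, j = n, max(range(m + 1), key=lambda col: score[n][col])
    ops = []
    while i > 0:
        move = back[i][j]
        ops.append((move, i, j))
        if move == "M":
            i, j = i - 1, j - 1
        elif move == "I":
            i -= 1
        else:
            j -= 1
    ops.reverse()

    # Collapse runs of gap moves into indel events with 0-based reference
    # coordinates and the affected sequence.
    events = []
    for move, run in groupby(ops, key=lambda op: op[0]):
        run = list(run)
        _, i0, j0 = run[0]
        if move == "D":
            events.append(("DEL", j0 - 1, ref[j0 - 1:j0 - 1 + len(run)]))
        elif move == "I":
            events.append(("INS", j0, read[i0 - 1:i0 - 1 + len(run)]))
    return j, events


ref = "TGACCGTAAGCTGTCTTAGGACTGA"
read = "ACCGTAAGCCTTAGGACT"  # ref[2:11] + ref[14:23], i.e. "TGT" deleted
print(split_read_align(read, ref))  # (2, [('DEL', 11, 'TGT')])
```

The table has one cell per pair of read and reference positions, so the cost grows with the product of the read length and the window length. A large deletion needs a correspondingly wide reference window, which is one plausible reading of the efficiency limitation noted above. A linear gap penalty also disfavors long gaps, so this sketch suits only small indels.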

Keywords/Search Tags:

indel, OSRM, structural variations, split-read, dynamic programming

Related items

1	Research On Human Genome Indel And Structural Variants Detection And Analysis Approaches
2	Dynamic Programming Problem In Mathematics Modelling
3	Distribution Of Insertion And Deletion In Genome
4	Researches On Long Read Alignment Algorithms Oriented To The Third Generation Sequencing Technology
5	Integer programming approaches for equal-split network flow problems
6	A General And Fast Distributed System For Large-scale Dynamic Programming Applications
7	High-dimensional adaptive dynamic programming with mixed integer linear programming
8	The Design And Implementation Of Indel Flanking Region Database Based On Gridsphere
9	The Research On GEP Based On The Open Read Frame
10	Researches On Optimal Control Of Nonlinear Systems Based On Approximate Dynamic Programming