
Algorithms for Genome Structural Variation Prediction

Posted on: 2020-01-24
Degree: Doctor
Type: Dissertation
Country: China
Candidate: H Yang
GTID: 1360330575956839
Subject: Computer software and theory
Abstract/Summary:
Structural variation (SV) usually refers to genome variation that is intermediate in scale between single-nucleotide polymorphisms and chromosomal mutations. As an important part of genetic diversity, structural variation not only leads to phenotypic differences between individuals but is also closely related to the occurrence of a variety of diseases. The continuing development and widespread application of high-throughput sequencing technology provide technical support for predicting and studying structural variation. However, the large volume of sequencing data, which consists of short reads, makes structural variation prediction difficult. Genome structural variation prediction based on high-throughput sequencing has therefore become a research hotspot in bioinformatics. Most animals, including humans, and more than half of all higher plants have diploid genomes. Focusing on the diploid genome and designing effective algorithms for structural variation prediction and analysis can therefore improve the precision and sensitivity of prediction results, help to explore the correlation between structural variation and certain diseases, and lay a foundation for structural variation prediction in polyploid genomes.

This doctoral dissertation focuses on the pairwise sequence alignment problem and on the prediction of different types of structural variation. It proposes an improved pairwise sequence alignment algorithm and corresponding algorithms for genome structural variation prediction, with the aim of improving the precision and sensitivity of prediction results. The main contents and innovations of the dissertation are as follows:

1. The backtracking process of existing pairwise sequence alignment algorithms strictly follows the direction of the optimal solution. This tends to cause premature base matching in the alignment result and hinders the discovery of longer gap fragments, so that the alignment result deviates from the actual InDel variation. In addition, a relatively fixed gap penalty score does not favor longer gap fragments or fewer base mismatches in the alignment result. In this dissertation, the Needleman-Wunsch algorithm is improved in three respects: the dynamic adjustment strategy for the gap penalty, the reverse derive strategy for finding the optimal solution, and the method for calculating cells in the scoring matrix. On this basis, an improved global pairwise sequence alignment algorithm (DNA-NW) is proposed. Because the new reverse derive strategy no longer strictly follows the source direction of the optimal solution, the term "backtracking" is dropped in favor of "reverse derive strategy". The DNA-NW algorithm consists of a pre-processing stage and an alignment execution stage. The pre-processing stage is implemented by a dynamic gap penalty adjustment strategy based on the Levenshtein distance (DGPS-LD), and the alignment execution stage is implemented by the improved Needleman-Wunsch algorithm (INW). INW not only executes more efficiently than the original Needleman-Wunsch algorithm; its reverse derive strategy can also find longer gap fragments and reduce the number of mismatches while keeping the optimal alignment score unchanged, which reduces the likelihood of false positive SNPs. The DNA sequence alignments produced by DNA-NW are close to the actual InDel variation, so DNA-NW is better suited to InDel prediction than the original Needleman-Wunsch algorithm.

2. The current status of InDel research and of InDel prediction methods is reviewed, and the quality control and preprocessing methods for raw high-throughput sequencing data are introduced. The prediction of InDels shorter than 50 bp is studied, and a comprehensive split-read-based method for InDel prediction and analysis, named SRInDel, is proposed. SRInDel first defines the target region on the reference genome against which a split read is aligned. The length of this target region is then adjusted by a correction algorithm based on short k-mer sequences, so that the pairwise alignment can be solved by the improved Needleman-Wunsch algorithm (INW). The InDel type, length, and breakpoint coordinates can then be predicted from the pairwise alignment result. To account for possible sequencing errors in homopolymer sequences, SRInDel can correct the prediction results for insertions and deletions. It also includes a prediction method for InDels in coding sequences and for frameshift mutations, as well as a method for discriminating homozygous from heterozygous InDels. In addition, a k-mer-based prediction method for short tandem repeats (STR), named kmer-STR, is proposed. Compared with SSRIT, kmer-STR significantly improves execution efficiency while preserving the correctness of the results, and it can be applied to STR prediction in large-scale genome sequences.

3. The main types of structural variation and their prediction methods are introduced. For structural variations longer than 50 bp, this dissertation focuses on the characteristics of insertions, deletions, inversions, intrachromosomal translocations, and interchromosomal translocations. An algorithm for genome structural variation prediction based on discordant paired-end reads and split reads, named SVDS, is proposed. SVDS can predict all five of these variation types. A notable feature of SVDS is that multiple possible alignments of each paired-end read are retained during sequence alignment, which increases the sensitivity of structural variation prediction. In addition, the probability of each candidate structural variation is calculated, and the set cover problem is used to filter false positives from the candidate results, so that the algorithm improves in both sensitivity and precision.

4. The prediction of copy number variations (CNVs) longer than 1 kb and the hidden Markov model are studied, and a CNV prediction algorithm named CNV-HMM is proposed. To improve the precision of the results, the dissertation studies the statistical and probabilistic modeling of the read depth signal, GC content bias and its correction, and mappability and its influence on read depth, and proposes corresponding solutions. To further improve the sensitivity and precision of CNV prediction, CNV-HMM also uses a split-read-based result optimization method, which can both filter out some false positive CNVs and merge calls of the same variation to obtain longer CNVs in the prediction results.
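For orientation on item 1, the following is a minimal sketch of the classic Needleman-Wunsch algorithm, that is, the baseline the dissertation sets out to improve, not DNA-NW or INW. It uses a fixed linear gap penalty and a strict traceback that always prefers the diagonal move, which is the behavior the abstract describes as causing premature base matching. The scoring values (match 1, mismatch -1, gap -2) and the function name are assumptions chosen for illustration.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Classic global alignment with a fixed linear gap penalty.

    Scores are illustrative assumptions, not values from the dissertation.
    Returns (optimal score, aligned a, aligned b).
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two bases
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Strict traceback: follow whichever predecessor produced the cell value,
    # trying the diagonal first.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

For example, `needleman_wunsch("ACGT", "AGT")` aligns the second sequence as `A-GT`, representing a one-base deletion relative to the first.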
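To make the short tandem repeat task in item 2 concrete, here is a naive scan for perfect tandem repeats of a fixed unit length. It is only an illustration of what an STR call looks like (start position, repeat unit, copy number); it is not the kmer-STR algorithm, whose details the abstract does not give. The function name and the `min_copies` threshold are assumptions.

```python
def find_tandem_repeats(seq, unit_len, min_copies=3):
    """Naive scan for perfect tandem repeats with a fixed unit length.

    Returns a list of (start, unit, copies). Hypothetical helper for
    illustration only; thresholds are assumptions.
    """
    hits = []
    i = 0
    n = len(seq)
    while i + unit_len <= n:
        unit = seq[i:i + unit_len]
        copies = 1
        j = i + unit_len
        # Extend while the next block of unit_len bases repeats the unit.
        while seq[j:j + unit_len] == unit:
            copies += 1
            j += unit_len
        if copies >= min_copies:
            hits.append((i, unit, copies))
            i = j          # continue after the repeat tract
        else:
            i += 1
    return hits
```

Note that this sketch does not collapse homopolymer runs into a unit of length 1 when called with a longer `unit_len`; a practical tool would normalize repeat units.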
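Item 3 relies on discordant paired-end reads. The sketch below shows the commonly used signatures for such pairs in a forward/reverse paired-end library: mates on different chromosomes, mates with the same orientation, and mates whose mapped distance departs from the expected insert size. It is a simplified illustration, not the SVDS algorithm; the `ReadPair` type, the insert-size parameters, and the three-standard-deviation cutoff are all assumptions, and intrachromosomal translocations are not distinguished here.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped positions of the two mates (hypothetical minimal record)."""
    chrom1: str
    pos1: int
    strand1: str  # '+' or '-'
    chrom2: str
    pos2: int
    strand2: str  # '+' or '-'


def classify_pair(p, mean_insert=400, sd_insert=40, n_sd=3):
    """Label a read pair by its discordance signature (assumed thresholds)."""
    if p.chrom1 != p.chrom2:
        return "interchromosomal_translocation"
    if p.strand1 == p.strand2:
        # Mates of a normal pair map to opposite strands.
        return "inversion"
    span = abs(p.pos2 - p.pos1)  # rough proxy for the template length
    if span > mean_insert + n_sd * sd_insert:
        return "deletion"        # mates map farther apart than expected
    if span < mean_insert - n_sd * sd_insert:
        return "insertion"       # mates map closer together than expected
    return "concordant"
```

A single discordant pair is weak evidence; callers typically cluster many pairs with the same signature before reporting a candidate.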
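Item 4 mentions GC content bias in the read depth signal and its correction. One widely used approach is median-ratio normalization: each window's depth is scaled by the overall median depth divided by the median depth of windows with the same GC percentage. The sketch below illustrates that general technique under the assumption that depths and GC percentages have already been computed per window; the abstract does not state which correction CNV-HMM actually uses, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import median


def gc_correct(depths, gc_percents):
    """Median-ratio GC correction of per-window read depth (illustrative).

    depths[i] is the read depth of window i and gc_percents[i] its GC
    content, rounded to an integer percentage so windows can be grouped.
    """
    by_gc = defaultdict(list)
    for depth, gc in zip(depths, gc_percents):
        by_gc[gc].append(depth)
    overall = median(depths)
    gc_median = {gc: median(values) for gc, values in by_gc.items()}
    corrected = []
    for depth, gc in zip(depths, gc_percents):
        m = gc_median[gc]
        # Leave the window unchanged if its GC group has zero median depth.
        corrected.append(depth * overall / m if m > 0 else depth)
    return corrected
```

After correction, windows with different GC content share a common baseline depth, so remaining deviations are more likely to reflect copy number than sequencing bias.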
Keywords/Search Tags: High-throughput Sequencing, Pairwise Sequence Alignment, InDel, Structural Variation, Copy Number Variation