
Alignment And Variant Calling For Third Generation Sequencing Data Based On Alignment Skeleton

Posted on: 2023-01-18
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y D Liu
GTID: 1520306839478474
Subject: Computer application technology
Abstract/Summary:
In recent years, with the rapid development of high-throughput sequencing technology and the implementation of large-scale international human genome projects, exabyte- to zettabyte-scale volumes of high-quality genomic data have been generated continuously and urgently need to be analyzed. These data underpin strategic fields such as life sciences, population health, and biosecurity, and carry great scientific, social, and economic value. Sequence alignment and variant calling are the core technical steps of genome data analysis. They are essential to gene expression analysis, alternative splicing analysis, the study of heredity and variation in the genome, the discovery of associations with diseases and phenotypes, and the elucidation of the molecular mechanisms of disease. However, state-of-the-art variant calling methods based on sequence alignment still fall short in efficiency, accuracy, and sensitivity, and cannot meet the demands of frontier genome research, which severely restricts the analysis and application of genome data.

This thesis summarizes the basic methods and procedures of genome data analysis. To address the low efficiency and poor accuracy of variant calling in existing methods built on base-level alignment of sequencing reads, it proposes a new perspective, the alignment skeleton, based on a genome graph representation model. The work focuses on key issues such as rapid read mapping, alignment consistency, and accurate variant calling, with the aim of breaking the sequence alignment bottleneck of existing algorithms and improving the accuracy of variant calling. The four main research components of this thesis are as follows:

(1) The large number of repetitive sequences in the genome causes redundant seeding operations and therefore low alignment efficiency. To address this, the thesis proposes a genome graph representation model with repetitive sequences as the core unit, breaking the limitations of the linear genome model, in which individual bases are the unit and the natural arrangement of bases is the core. It analyzes the performance of the graph representation model relative to the linear representation model in the "seeding", "chaining", and "extension" stages of sequence alignment. Based on a hash index of the genome graph representation model, the thesis proposes a method for constructing an alignment skeleton for sequencing reads. The skeleton replaces the base-level alignment of traditional read mapping and significantly reduces alignment time without affecting accuracy.

(2) Large-scale genomic structural variation is still difficult to detect efficiently, accurately, and sensitively. The thesis therefore studies a structural variant calling method based on alignment skeletons and multi-feature fusion. The method abandons the traditional structural variant calling pipeline built on base-level alignment. Working from the genome graph representation model, it uses sparse dynamic programming and a local greedy algorithm to construct a variation-sensitive read alignment skeleton, and it identifies structural variation signals from the non-co-linear units in that skeleton. Finally, a multi-feature fusion method performs variant detection and genotyping. Experimental results on internationally authoritative datasets show that the method achieves higher accuracy and sensitivity in structural variant detection than current mainstream algorithms while running more than ten times faster, which demonstrates its suitability for cutting-edge research in large-scale genomics.

(3) In current large-scale genome projects, going from raw sequencing data to SNP/indel detection is time-consuming and has low sensitivity. The thesis therefore studies a multi-strategy SNP/indel detection method based on non-co-linear alignment skeletons. The method constructs read alignment skeletons from the genome graph representation model and partially fills the gaps in the skeletons, which significantly improves the efficiency of mapping raw sequencing data. From a clustered pile-up of alignment skeletons, it identifies candidate variants and divides them into three categories according to the sequence and distribution of the candidate sites. For the three categories, it applies a binomial distribution probability model, multiple sequence alignment, and local assembly, respectively, for variant detection and genotyping. The method effectively improves the sensitivity of variant detection while significantly reducing the time spent on sequence alignment and variant detection.

(4) Error-prone long sequencing reads make transcriptomic read alignment slow and exon detection insensitive and inaccurate. The thesis therefore studies a long transcriptomic read alignment method based on the construction and integration of alignment skeletons. The method adopts a two-pass alignment strategy and uses the alignment information of all sequencing reads together to address two problems caused by the high sequencing error rate: exons missed when reads are analyzed individually, and high heterozygosity at alternative splicing sites. In the first pass, the alignment skeleton of each read is constructed and mapped to the reference genome, and the skeletons are integrated to identify exon regions. In the second pass, the method uses a local hash index to restore the integrated exons into the original alignment skeleton of each read, constructs a local spliced reference sequence, and completes the sequence alignment. While aligning efficiently, the method recovers as fully as possible the exons present across all sequenced reads, significantly improves alignment accuracy and the sensitivity of exon recognition, and breaks through bottlenecks such as short exon identification and complex splice site processing under high sequencing noise.

This thesis focuses on the key issues of sequence alignment and variant calling in genome data analysis and proposes an innovative system of genomic variant calling algorithms based on the alignment skeleton, which departs from the core technical route of base-level read alignment that has dominated the past ten years. It aims to achieve systematic innovation and quality improvement in large-scale genomic variant calling, and it provides new ideas for research on the key algorithms of genomic data analysis.
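The "seeding" and "chaining" stages named in component (1), and the idea in component (2) of reading variation signals from non-co-linear units, can be illustrated with a small generic seed-and-chain sketch. This is not the thesis's method: it uses a plain linear reference and a simple k-mer hash index instead of the genome graph representation model, and a quadratic dynamic program instead of sparse dynamic programming. All function names, the scoring rule, and the `max_gap` parameter are assumptions made for illustration.

```python
from collections import defaultdict


def build_index(ref, k):
    """Hash index: map each k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(ref) - k + 1):
        index[ref[i:i + k]].append(i)
    return index


def find_anchors(read, index, k):
    """Seeding: exact k-mer matches as sorted (read_pos, ref_pos) pairs."""
    anchors = []
    for i in range(len(read) - k + 1):
        for j in index.get(read[i:i + k], ()):
            anchors.append((i, j))
    anchors.sort()
    return anchors


def chain_anchors(anchors, k, max_gap=50):
    """Chaining: best-scoring chain that increases in both coordinates.

    Illustrative scoring (an assumption): each anchor adds the bases it
    newly covers, minus the length difference between the read gap and
    the reference gap.
    """
    n = len(anchors)
    if n == 0:
        return []
    score = [k] * n
    prev = [-1] * n
    for b in range(n):
        rb, gb = anchors[b]
        for a in range(b):
            ra, ga = anchors[a]
            dr, dg = rb - ra, gb - ga
            if dr <= 0 or dg <= 0 or max(dr, dg) > max_gap:
                continue
            gain = min(dr, dg, k) - abs(dr - dg)
            if score[a] + gain > score[b]:
                score[b] = score[a] + gain
                prev[b] = a
    best = max(range(n), key=score.__getitem__)
    chain = []
    while best != -1:
        chain.append(anchors[best])
        best = prev[best]
    return chain[::-1]


def diagonal_shifts(chain):
    """Report consecutive anchors whose diagonal (ref_pos - read_pos) differs.

    A positive delta suggests reference bases missing from the read, a
    negative delta suggests extra bases in the read.
    """
    shifts = []
    for (ra, ga), (rb, gb) in zip(chain, chain[1:]):
        delta = (gb - rb) - (ga - ra)
        if delta != 0:
            shifts.append(((ra, ga), (rb, gb), delta))
    return shifts


if __name__ == "__main__":
    ref = "ACGTACCAGTTGCATGGATCCTAG"
    k = 4
    index = build_index(ref, k)
    read = ref[:8] + ref[12:]  # read lacking 4 reference bases
    chain = chain_anchors(find_anchors(read, index, k), k)
    print(chain)
    print(diagonal_shifts(chain))
```

The chain here plays the role of a coarse skeleton: it records where the read sits on the reference without any base-level alignment, and a change of diagonal between neighbouring anchors marks a candidate insertion or deletion that a later step could examine in detail.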
Keywords/Search Tags: third generation sequencing technologies, genome sequence graph representation model, sequence alignment, alignment skeleton, consensus, genomic variation detection