
Research On Sequence Alignment Methods For The Third-generation Sequencing Data

Posted on: 2021-04-21
Degree: Doctor
Type: Dissertation
Country: China
Candidate: Y Gao
GTID: 1480306569483424
Subject: Computer application technology
Abstract/Summary:
With the continuous development of third-generation sequencing technology, third-generation sequencing data has been widely used in genome assembly, structural variation detection, and full-length transcript identification. As the most basic and critical step in all third-generation sequencing data workflows, sequence alignment remains one of the most important computational problems in bioinformatics. However, existing third-generation sequence alignment methods fail to meet the demands of the growing mass of sequencing data in terms of alignment speed, accuracy, and sensitivity, which has become a major hurdle in genomic science. This dissertation summarizes the basic ideas and main strategies of existing alignment methods and tools, and aims to comprehensively improve the speed, accuracy, and sensitivity of sequence alignment for third-generation sequencing data. Based on the characteristics of the data, namely long reads, high error rates, and the presence of large structural variations, we developed multiple alignment methods for third-generation sequencing data that address several computing bottlenecks in existing workflows. The main research contents are as follows:

(1) To address the problem that existing alignment tools cannot effectively handle structural variations in third-generation sequencing reads, this dissertation studies LAMSA, a split alignment method based on long approximate matches and skeleton trimming. The method uses approximate matches of long seeds to overcome the difficulty that traditional short-seed strategies have with repetitive regions of the genome. Through tree pruning, it generates an alignment skeleton that reflects various variation events and achieves accurate split alignment near structural variation breakpoints. The method quickly and accurately aligns third-generation sequencing data to the reference genome and identifies structural variation breakpoints in the reads well. It provides accurate alignment results for downstream analyses related to genome structural variation.

(2) To address the problem that existing graph reference genome alignment tools cannot effectively process third-generation sequencing data, this dissertation studies HiPan, a graph reference genome alignment method based on a local haplotype index. The method builds on existing ideas for constructing graph reference genomes. By designing a local haplotype path index based on population haplotype information, it supports efficient queries of sequences within and between nodes of the graph reference genome, and on that basis performs sequence alignment against the graph. The method efficiently constructs the graph reference genome and its index, and quickly and accurately aligns third-generation sequencing data to the graph reference genome. It provides accurate alignment information for sequencing reads on the graph reference genome for downstream variation detection and related work.

(3) To address the high time cost of local multiple sequence alignment of third-generation sequencing data, this dissertation studies abPOA, a parallel banded partial order alignment method based on single instruction multiple data (SIMD). The method performs multiple sequence alignment through partial order alignment. It borrows the banded alignment strategy widely used in pairwise sequence alignment tools and extends it to partial order alignment between a sequence and a graph. It also uses a SIMD-based parallel algorithm to further speed up the dynamic programming process. The method significantly reduces the running time of partial order alignment while producing accurate results. It supports fast and accurate reconstruction of local genome sequences based on multiple sequence alignment.

(4) To address the problem that existing analysis methods cannot effectively handle the new tandemly repeated third-generation sequencing data, this dissertation studies TideHunter, a tandem repeat alignment method based on sequence self-matching. Because each read in this new data type contains multiple tandem copies of the original template sequence, the method borrows the "seed and extension" strategy from traditional sequence alignment and extends it to the tandem repeat alignment problem to detect tandem repeat units quickly. The method achieves a significant speed improvement and higher sensitivity for tandem repeat detection in the new tandemly repeated data. It efficiently detects the repeat units in a tandem repeat sequence and accurately reconstructs the original template sequence, providing high-quality reads with lower error rates for the conventional third-generation sequencing data alignment workflow.

This dissertation focuses on sequence alignment of third-generation sequencing data and studies several key and difficult problems in sequence alignment from different aspects. By developing multiple alignment methods for third-generation sequencing data, it improves on existing tools in running speed, alignment accuracy, and sensitivity. The first three methods form a solution for regular third-generation sequence alignment; the fourth method complements this solution by addressing issues specific to the new sequencing data. The study addresses multiple computing bottlenecks in the existing sequence alignment workflow, such as split alignment, graph reference genome alignment, and local multiple sequence alignment, and promotes the development of related fields of third-generation sequencing data analysis. It provides basic technical support for future large-scale genomic research and has high practical value and theoretical significance.
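
The sequence self-matching idea behind the tandem repeat method in item (4) can be illustrated with a small sketch. It is not the dissertation's algorithm: it is a simplified assumption of how k-mer seeds that match an earlier position in the same read can reveal the repeat unit length. The function name `estimate_repeat_period` and the parameter `k` are hypothetical, and no extension or consensus step is included.

```python
from collections import Counter
from typing import Optional


def estimate_repeat_period(read: str, k: int = 5) -> Optional[int]:
    """Guess the tandem repeat unit length of a read from k-mer self-matches.

    Illustrative only: each k-mer is matched against its previous occurrence
    in the same read, and the most frequent distance between such self-hits
    is returned as the estimated period.
    """
    last_seen = {}          # k-mer -> position of its most recent occurrence
    distances = Counter()   # distance between consecutive self-hits -> count

    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in last_seen:
            distances[i - last_seen[kmer]] += 1
        last_seen[kmer] = i

    if not distances:
        return None  # no k-mer occurs twice, so no repeat evidence
    return distances.most_common(1)[0][0]


if __name__ == "__main__":
    unit = "ACGTTGCA"  # hypothetical 8-base template
    print(estimate_repeat_period(unit * 5))
```

In a real tool the seed hits would then be chained and extended to locate copy boundaries and build a consensus of the template; here only the period estimate is shown.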
Keywords/Search Tags: Third-generation sequencing data, Sequence alignment, Genome variation, Graph genome, Multiple sequence alignment