Research On Haploid Genome Scaffolding Methods

Posted on: 2021-05-29
Degree: Doctor
Type: Dissertation
Country: China
Candidate: D F Guan
GTID: 1360330614950827
Subject: Computer application technology
Abstract/Summary:
The emergence and rapid development of long-read sequencing technologies have provided a solid foundation for the high-quality implementation of large-scale species sequencing programs. Constructing haploid contigs from long-read sequencing data and combining multiple types of sequencing data to scaffold haploid genomes have become active topics in computer science and bioinformatics research. However, current haploid genome scaffolding methods have three major drawbacks: they struggle to generate low-redundancy or nonredundant haploid contigs, to generate highly continuous haploid genome sequences, and to identify mis-assemblies in haploid genome assemblies. These problems greatly limit the efficient construction of high-quality genomes. Haploid genome scaffolding belongs to upstream genomic research, and the quality of the genomes it generates has a direct impact on downstream genome analysis, especially genomic variation detection, genome annotation, gene regulatory element analysis, and evolutionary analysis. This research focuses on computational methods for generating high-quality haploid genomes efficiently. The major research contents are as follows:

(1) To speed up single-molecule read alignment, it presents a method, rHAT, based on a regional hash index and sparse dynamic programming. A regional hash index is built for the reference sequence, and a partial seeding strategy is applied to accelerate the selection of alignment candidates for reads with high sequencing error rates. An alignment algorithm based on sparse dynamic programming then further speeds up the read alignment process. This efficient alignment tool supports the haplotypic duplication purging and mis-assembly identification studies, and it can also be applied to other genomic research fields such as variation detection.

(2) To handle haplotypic duplications in primary contigs, it proposes a method, purge_dups, based on the read depth of single-molecule reads and on sequence similarity. First, the read depth threshold for heterozygous sequences is determined automatically from a smoothed read depth distribution. Second, haplotigs are removed by combining the read depth distribution with the inclusion relations between contigs. Third, dynamic programming is applied to compute collinear matches between contigs, and the matches belonging to heterozygous sequences are identified and purged by checking their average read depth. This method reduces the redundancy of primary contigs and supplies low-heterozygosity primary contigs to the subsequent scaffolding phase, which helps improve the continuity of the scaffolds. purge_dups has been integrated into the Vertebrate Genomes Project (VGP) assembly pipeline and has been used to preprocess about 60 vertebrate genomes.

(3) To overcome the low continuity of the results produced by existing Hi-C scaffolders, it presents a method, pin_hic, based on sequence partitioning and an N-best-partner strategy. First, a linkage matrix is built for the partitioned sequences to reduce misjoin errors. Then, the N-best-partner strategy is used to build more links, which improves the continuity of the scaffolds. Finally, misjoined contigs are broken based on the read depth distribution, which increases the accuracy of the scaffolds. This method can generate highly continuous scaffolds.

(4) To reliably identify mis-assemblies in a haploid genome, it proposes a method, asset, that combines sequence features from multiple types of sequencing data. Contig errors are identified from the read depth distribution of single-molecule reads and from alignments between the consensus maps of optical mapping data and the input genome sequence. Link errors are recognized from the consensus map alignments, the DNA molecule depth of linked reads, and the Hi-C linkage matrix. Contig errors and link errors are then combined to produce a final set of potential mis-assemblies. This method can be used in genome curation to improve the correctness of the genome sequence, and it can also be applied to genome quality assessment. asset has been applied in the VGP and has helped create several high-quality genomes.

This research aims to construct haploid genomes with high continuity and correctness. By combining multiple types of sequencing data, it designs a series of targeted and practical algorithms to address the current bottlenecks in genome construction. This work also provides a new research perspective and analytical ideas for the study of haploid genome construction algorithms.
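
As an illustration of the read-depth step in item (2), the sketch below shows one way a depth threshold could be derived from a smoothed read depth distribution. It is a minimal sketch under stated assumptions, not the method implemented in purge_dups: the moving-average smoothing, the choice of the two tallest peaks as the heterozygous and homozygous depths, and the names `smooth`, `local_maxima`, `depth_cutoffs`, and `demo_hist` are all hypothetical and introduced only for this example. Real depth histograms are noisier and would likely need more careful peak selection.

```python
import math


def smooth(hist, window=5):
    """Centered moving average; the window shrinks at both ends of the list."""
    half = window // 2
    smoothed = []
    for i in range(len(hist)):
        lo = max(0, i - half)
        hi = min(len(hist), i + half + 1)
        smoothed.append(sum(hist[lo:hi]) / (hi - lo))
    return smoothed


def local_maxima(values):
    """Interior indices higher than the left neighbour and not lower than the right."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] >= values[i + 1]]


def depth_cutoffs(hist, window=5):
    """Return (het_peak, valley, hom_peak) depths, or None if fewer than two peaks.

    hist[d] is the number of bases (or windows) observed at read depth d.
    Assumption: the two tallest smoothed peaks are the heterozygous and
    homozygous depths, and the lowest point between them is the threshold.
    """
    s = smooth(hist, window)
    top_two = sorted(local_maxima(s), key=lambda i: s[i], reverse=True)[:2]
    if len(top_two) < 2:
        return None
    het_peak, hom_peak = sorted(top_two)
    valley = min(range(het_peak, hom_peak + 1), key=lambda i: s[i])
    return het_peak, valley, hom_peak


# Synthetic bimodal histogram (not real data): peaks placed at depths 20 and 40.
demo_hist = [
    100 * math.exp(-((d - 20) ** 2) / 18) + 200 * math.exp(-((d - 40) ** 2) / 32)
    for d in range(80)
]

het_peak, valley, hom_peak = depth_cutoffs(demo_hist)
```

Under these assumptions, sequences whose average depth falls below `valley` would be treated as candidates for heterozygous (haplotig) sequence, while sequences near `hom_peak` would be treated as collapsed or homozygous. Returning `None` for a unimodal histogram leaves the decision to the caller instead of guessing a threshold.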
Keywords/Search Tags: long-read alignment, heterozygous sequence purging, haploid genome scaffolding, mis-assembly detection