Haplotype Assembly Of One Human Genome Based On Multi-platform Sequencing Technologies

Posted on: 2024-03-26
Degree: Master
Type: Thesis
Country: China
Candidate: C Y Zhang
Full Text: PDF
GTID: 2530307160978119
Subject: Engineering
Abstract/Summary:
A haplotype is a set of genetically variable loci in an individual that are inherited together from a single parent. Haplotype-resolved genome assembly aims to reconstruct the DNA sequences of both sets of homologous chromosomes, recovering the allelic sequence variation that conventional genome assembly neglects. Haplotype sequences play a key role in studies of allelic differences, linkage and association, population genetics, and clinical genetics. After decades of development of haplotype assembly technology for the human genome, many challenges remain, and there is an urgent need for efficient, automated haplotype assembly methods that yield more high-quality, chromosome-level haplotype sequences.

This study uses the human sample HG001 as its research subject. Using sequences specific to each parental genome, the whole-genome long-read PacBio and Nanopore data of HG001 were sorted into two sets, one from the paternal haplotype and the other from the maternal haplotype, and each set was then assembled de novo independently. Finally, two sets of accurate and nearly complete chromosome-level haplotype sequences were obtained by integrating chromatin conformation capture sequencing data. The two haplotype assemblies have NG50 values of 49.4 Mb and 60.45 Mb and total lengths of 3.02 Gb and 3.03 Gb, respectively, with nucleotide accuracy of about 99.999% and a switch error rate below 0.4%. Multiple quality assessments showed that the assembly quality (continuity, accuracy, and completeness) of this genome is superior to that of the published HG001 haplotype assembly. Compared with previously published assemblies, a total of 80.5 Mb of new sequence was identified, of which about 53.3 Mb derives from repetitive regions and about 4 Mb from gene regions containing 4,528 potential protein-coding genes. The assembly also contains complex regions such as the complete MHC.

In addition, this study performed a comprehensive analysis of sequence variation and allelic differences between the two haplotypes. It found very high heterozygosity between the haplotypes, with 2,482,886 SNPs, 694,306 indels, and 338,305 structural variants. Most of the variants are located in intergenic regions and in regions distal to genes. Genes with conserved allelic sequences are mainly involved in biological processes such as defense responses, while those with allelic variants are mainly involved in biological processes such as nervous system development and cell signaling. In summary, this study not only develops a simple and automated method for haplotype-resolved genome assembly, but also provides a high-quality, chromosome-level haplotype genome sequence of HG001, an important human reference sample. It will provide an important basis for the study of genetic variation, pan-genomes, and genetic diseases in humans.
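The read-sorting step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which tool or parameters the thesis used, so everything below is an assumption made for illustration: the function names (`kmers`, `parent_specific_kmers`, `bin_read`), the toy sequences, and the simple majority-vote rule are all hypothetical. The sketch also skips steps a practical implementation would likely need, such as treating a k-mer and its reverse complement as the same k-mer and filtering out k-mers that arise from sequencing errors.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def parent_specific_kmers(paternal_seqs, maternal_seqs, k):
    """Return (paternal-only, maternal-only) k-mer sets.

    A k-mer is parent-specific if it occurs in one parent's sequences
    and not in the other's.
    """
    pat = set().union(*(kmers(s, k) for s in paternal_seqs))
    mat = set().union(*(kmers(s, k) for s in maternal_seqs))
    return pat - mat, mat - pat


def bin_read(read, pat_only, mat_only, k):
    """Assign a read to the parent whose specific k-mers it shares more of.

    Hypothetical rule: plain majority vote; ties (including reads with
    no parent-specific k-mers) stay unassigned.
    """
    read_kmers = kmers(read, k)
    pat_hits = len(read_kmers & pat_only)
    mat_hits = len(read_kmers & mat_only)
    if pat_hits > mat_hits:
        return "paternal"
    if mat_hits > pat_hits:
        return "maternal"
    return "unassigned"


# Toy data (not from the thesis): the parents differ only at the 3' end.
K = 4
pat_only, mat_only = parent_specific_kmers(["ACGTACGTAA"], ["ACGTACGTCC"], K)
bins = {r: bin_read(r, pat_only, mat_only, K)
        for r in ["TACGTAA", "ACGTCC", "ACGTACG"]}
```

Under these assumptions, each bin of reads would be passed to a de novo assembler separately, and reads left unassigned (those lying in regions where the parents are identical) would need a separate policy, for example being supplied to both assemblies.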
Keywords/Search Tags:Human genome, haplotype, genome assembly, multi-omics data