Research On Imputation Strategy Of Low Coverage Genomic Sequencing Data

Posted on: 2021-02-12
Degree: Master
Type: Thesis
Country: China
Candidate: T Y Deng
Full Text: PDF
GTID: 2370330602994866
Subject: Agriculture
Abstract/Summary:
Genotype imputation is the statistical inference of unobserved genotypes in order to obtain more genomic information. Its basic principle is to construct shared haplotype segments from the linkage disequilibrium and recombination information that links the target population to a reference population (or to other individuals within the same population), and then to use those haplotypes to estimate and fill in the unobserved genotypes of the target population. Because imputation is an important step in genomic data processing, its quality directly affects all subsequent analyses, so a sound imputation strategy is needed to obtain good results.

In this study, genome data for 20,000 individuals were simulated on a 10 Mb chromosome for four populations with increasing genetic distance. The simulated data were divided into a target population and a reference population, and loci with MAF < 0.01 were removed. The target population was fixed at 1,000 individuals from population 1, while the reference population size had five levels: 100, 1,000, 3,000, 5,000 and 10,000 individuals. Genotypes in the target population were randomly deleted to mimic either chip data or whole-genome sequencing data, retaining 1, 5, 10, 30, 50 or 90% of the original genotypes. This retained proportion is the ratio of target population sites to reference population sites (the target site ratio) for chip data, or the SNP coverage for sequencing data. Beagle5.1 and Minimac4 were used for imputation, and imputation reliability, imputation error rate and imputation time were then calculated. The effects of imputation method, target site ratio (SNP coverage), reference population size, genetic distance between the reference and target populations, and data type on imputation quality were compared.

The results show the following.

1) Target site ratio or SNP coverage has a highly significant (P < 10^-4) effect on imputation reliability and error rate in all imputation scenarios, and is the most important factor affecting imputation quality. When Beagle5.1 was used to impute sequencing data with a reference population of 100 individuals, increasing the SNP coverage from 1% to 90% raised the imputation reliability from 0.21 to 0.99 and lowered the error rate from 19% to 0.16%. In addition, the reference population size and the genetic distance between the target and reference populations also have important effects on imputation quality in some cases.

2) The imputation quality of Beagle5.1 is better than that of Minimac4 in most cases, but when every factor is at a very low level, the imputation quality of Beagle5.1 is more easily affected. Compared with Minimac4, Beagle5.1 reaches excellent or near-ideal imputation quality at lower factor levels, and this advantage is more pronounced when sequencing data are imputed. However, within the scope of this study, Beagle5.1 was slower than Minimac4 at the same factor levels.

3) Except at extremely low target site ratio or SNP coverage, imputation quality based on sequencing data is usually better than that based on chip data. With Beagle5.1, imputation from sequencing data exceeds the quality obtained from chip data once the SNP coverage is above 5%, which only requires a sequencing depth of about 1-2× in the target population. At a sequencing depth of about 4×, the SNP coverage reaches 30% and imputation reaches reliability greater than 0.99 with an error rate below 1%. This indicates that sequencing data have a strong advantage in genotype imputation and that high-quality imputation can be obtained at low coverage.

Based on these results, we formulated different strategies for different imputation purposes, providing a reference for the application of genotype imputation.
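The masking-and-evaluation procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the thesis's actual pipeline: the function names `mask_sites` and `imputation_metrics`, the 0/1/2 genotype coding, the `MISSING` placeholder, and the definition of reliability as the squared correlation between true and imputed genotypes are all assumptions made for the example. The abstract does not state how reliability was computed.

```python
import random

MISSING = -1  # hypothetical code for a deleted genotype (assumption of this sketch)


def mask_sites(genotypes, coverage, seed=0):
    """Keep a random fraction `coverage` of sites (columns); mark the rest MISSING.

    `genotypes` is a list of rows, one per individual, coded 0/1/2.
    Returns the masked matrix and the sorted list of retained site indices.
    """
    rng = random.Random(seed)
    n_sites = len(genotypes[0])
    n_keep = max(1, round(coverage * n_sites))
    kept = sorted(rng.sample(range(n_sites), n_keep))
    kept_set = set(kept)
    masked = [
        [g if j in kept_set else MISSING for j, g in enumerate(row)]
        for row in genotypes
    ]
    return masked, kept


def imputation_metrics(true_g, imputed_g, kept):
    """Return (error_rate, reliability), evaluated on the masked sites only.

    Error rate: fraction of masked genotypes imputed incorrectly.
    Reliability: squared Pearson correlation between true and imputed
    genotypes (an assumed definition); NaN if either side has no variance.
    """
    kept_set = set(kept)
    t, p = [], []
    for true_row, imp_row in zip(true_g, imputed_g):
        for j, (a, b) in enumerate(zip(true_row, imp_row)):
            if j not in kept_set:
                t.append(a)
                p.append(b)
    n = len(t)
    if n == 0:
        raise ValueError("no masked sites to evaluate")
    error_rate = sum(a != b for a, b in zip(t, p)) / n
    mean_t, mean_p = sum(t) / n, sum(p) / n
    cov = sum((a - mean_t) * (b - mean_p) for a, b in zip(t, p))
    var_t = sum((a - mean_t) ** 2 for a in t)
    var_p = sum((b - mean_p) ** 2 for b in p)
    if var_t == 0 or var_p == 0:
        return error_rate, float("nan")
    return error_rate, cov * cov / (var_t * var_p)
```

In use, one would call `mask_sites(true_g, 0.10)` to retain 10% of sites, pass the masked data to an imputation tool such as Beagle5.1 or Minimac4, and then compare the tool's output with `true_g` via `imputation_metrics`. Restricting the evaluation to masked sites keeps the retained genotypes, which are known exactly, from inflating the scores.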
Keywords/Search Tags:Genotype imputation, Imputation strategy, Whole-genome sequencing, Imputation reliability, Imputation error rate