
Multi-locus Genome-wide Association Study Method Based On Singular Value Decomposition And SCAD Estimation

Posted on: 2019-03-03
Degree: Master
Type: Thesis
Country: China
Candidate: Y W Du
GTID: 2370330545491170
Subject: Crop Genetics and Breeding
Abstract/Summary:
Most important traits in animals and plants are quantitative traits, controlled by a few large-effect genes and a series of minor-effect polygenes, and modified by environmental effects. To better utilize and improve these traits in plant and animal breeding, a deep understanding of their genetic basis is required. At present, the genome-wide association study (GWAS) is the main approach for dissecting the genetic basis of these traits. With the rapid development of sequencing technology, datasets with an ultra-high number of markers and a small sample size have become the norm, which greatly increases the computational burden of GWAS. Quickly and accurately selecting the loci that are significantly associated with quantitative traits from a massive number of markers in a limited sample has become a major challenge. The association methods in wide use today are single-locus genome-wide scans that control for population structure and polygenic background. These methods cannot estimate the genetic effects of all markers simultaneously, so the estimates are often biased. To address this issue, this study integrated singular value decomposition, SCAD (smoothly clipped absolute deviation), empirical Bayes estimation, a multi-locus genetic model, and the likelihood ratio test to propose a new multi-locus GWAS method (S3-EB). S3-EB was validated by three Monte Carlo simulation experiments and by the analysis of four flowering time-related traits in Arabidopsis. The major results are as follows:

1. The new method consists of two steps. 1) Selection of potentially associated markers. All marker effects were estimated using singular value decomposition; markers with larger effects were considered more likely to be associated with the trait. The selected markers were then screened further by SCAD shrinkage estimation to obtain the potentially associated markers. 2) Identification of significant QTNs (quantitative trait nucleotides). The potentially associated markers were included in a multi-locus genetic model, all effects in this model were estimated by empirical Bayes, and all non-zero effects were then tested by the likelihood ratio test to identify true QTNs. This method is called the multi-locus genome-wide association study method based on both singular value decomposition and SCAD estimation (S3-EB).

2. Three Monte Carlo simulation experiments were used to validate the effectiveness of S3-EB. In the first simulation experiment, 10,000 SNPs were randomly selected from the 216,130 SNPs of a real association mapping population of 199 Arabidopsis lines and used as the genotypes of the simulated population. Six simulated QTNs were placed on six SNPs with a minor allele frequency of 0.3, with heritabilities of 0.1, 0.05, 0.05, 0.15, 0.05, and 0.05, respectively. The population mean and the error variance were both set at 10. The simulated phenotypes of the 199 lines were obtained from the genotypic values of the six QTNs plus random errors. The simulation was replicated 1,000 times. Each simulated dataset was analyzed by S3-EB, mrMLM, EMMA, and FarmCPU. The results showed that: 1) the average powers across the six simulated QTNs were 74.8%, 67.03%, 46.0%, and 41.87% for the four methods, respectively; paired t-tests showed that the statistical power of S3-EB was significantly higher than that of the other three methods (P-values: 0.0036 to 0.0063); 2) the average mean squared errors (MSE) of the six simulated QTN effects were 0.1064, 0.0934, 0.5432, and 0.2824, respectively; paired t-tests showed that the average MSE of S3-EB was significantly lower than that of EMMA (P-value: 0.015) but not significantly different from those of mrMLM and FarmCPU (P-values: 0.3199 and 0.1549, respectively); 3) the computing times of the four methods were 0.79, 4.01, 68.77, and 5.12 hours, respectively; 4) the false positive rates of the four methods were 0.0489%, 0.0167%, 0.0325%, and 0.0178%, respectively.

To study the effect of background disturbances on the QTN detection power and parameter estimation accuracy of S3-EB, a polygenic background and an epistatic background were added to the first simulation experiment. The results were consistent with those of the first simulation experiment. As shown above, S3-EB uses singular value decomposition to reduce the computational dimension from the number of SNP markers (millions) to the sample size (thousands), so that the effects of all SNP markers can be obtained quickly under the same genetic model, which facilitates the selection of potentially associated variables. The new method improves statistical power and the accuracy of parameter estimation, shortens computing time, and keeps the false positive rate at the same level as the Bonferroni correction, which validates its effectiveness.

3. The flowering time-related traits FLC, FRI, FT-GH, and FT-Field in 199 Arabidopsis lines, each with 216,130 SNPs, were analyzed using the four methods. The results showed that: 1) the four methods identified, respectively, 15, 21, 0, and 6 SNPs associated with FLC; 6, 8, 33, and 5 SNPs associated with FRI; 17, 4, 0, and 7 SNPs associated with FT-GH; and 17, 24, 0, and 9 SNPs associated with FT-Field; 2) a multiple linear regression model of the trait phenotype on the significantly associated markers was fitted for each method. The BIC values for the four methods were 336, 328.2, 596.5, and 521.3, respectively, for FLC; 163.5, 156.7, 322.3, and 211.6, respectively, for FRI; -321.2, -296.1, 314.6, and -465.0, respectively, for FT-GH; and 30.4, 318.9, 306.9, and 156.6, respectively, for FT-Field. The BIC value of the new method was the smallest or the second smallest for every trait, indicating that the new method performs relatively better than the other methods; 3) within 50 kb of the significantly associated markers, 59, 9, 3, and 8 previously reported trait-related genes were found by the four methods, respectively.

The S3-EB program was developed in the R environment with the add-on package shiny and incorporated into the mrMLM software. The package runs on Windows, Mac, and Linux systems.
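
The SCAD shrinkage step described in the abstract can be illustrated with the well-known univariate SCAD thresholding rule of Fan and Li (2001), which applies when the design is orthonormal. The sketch below is not the thesis's implementation: the function names `scad_threshold` and `screen_markers`, the tuning value `lam=0.5`, and the marker effects are hypothetical, and `a=3.7` is the conventional default for the SCAD shape parameter. It only shows how small effects are shrunk to exactly zero while large effects are left unshrunk.

```python
def scad_threshold(z, lam, a=3.7):
    """Univariate SCAD solution for an initial estimate z (orthonormal design).

    lam is the tuning parameter; a is the SCAD shape parameter (a > 2).
    """
    if lam < 0 or a <= 2:
        raise ValueError("require lam >= 0 and a > 2")
    sign = 1.0 if z >= 0 else -1.0
    abs_z = abs(z)
    if abs_z <= 2 * lam:
        # Soft-thresholding zone: effects below lam become exactly zero.
        return sign * max(abs_z - lam, 0.0)
    if abs_z <= a * lam:
        # Transition zone: the amount of shrinkage tapers off linearly.
        return ((a - 1) * z - sign * a * lam) / (a - 2)
    # Large effects are returned unshrunk.
    return z


def screen_markers(effects, lam, a=3.7):
    """Return the indices of markers whose thresholded effect is non-zero."""
    return [i for i, z in enumerate(effects)
            if scad_threshold(z, lam, a) != 0.0]


# Hypothetical initial marker effects, invented for illustration only.
effects = [0.05, -0.8, 0.3, 2.5, -0.1]
kept = screen_markers(effects, lam=0.5)
print(kept)
```

With these invented values, only the second and fourth markers survive the screen. In a two-step method of the kind the abstract describes, such survivors would then go into the multi-locus model for empirical Bayes estimation and likelihood ratio testing.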
Keywords/Search Tags: GWAS, Singular Value Decomposition, SCAD, Multi-locus model, Mixed linear model, Empirical Bayes, Likelihood ratio test