| Genome-wide association studies (GWAS) can identify functional genetic variants on a genome-wide scale and provide important leads for subsequent studies of biomedical mechanisms and for translational applications. Designing and improving effective algorithms for mining biomedical big data therefore has both theoretical significance and practical promise. Large-scale biobanks pose new challenges: rapidly growing sample sizes, more complex data types, highly unbalanced phenotypic distributions, and population stratification. Theoretical innovation in GWAS methods is therefore urgently needed. In GWAS, the traditional normal distribution approximation cannot control type I error rates when the phenotypic distribution is unbalanced or the minor allele frequency (MAF) of the tested variant is low, which leads to a large number of false positive results. Saddlepoint approximation (SPA) uses the entire cumulant-generating function (CGF), which considerably improves the accuracy with which the null distribution of a test statistic is approximated and yields more reliable association analysis results. Although saddlepoint approximation methods have been developed extensively for GWAS, their theory and application still need further improvement. The core difficulty in saddlepoint approximation methods is calculating the moment generating function (MGF) of the test statistic. For complex phenotype data, the MGF of the test statistic under the null hypothesis is difficult to calculate, which hinders the use of saddlepoint approximation methods. The genetic structure of the population is one of the most important confounding factors in GWAS, and false positive results in association studies can be avoided only if this structure is described adequately. In GWAS, genetic models are used to describe the effects of genetic and confounding factors on phenotypes, and accurate 
genetic models can reduce false positive results and improve statistical power. It is therefore important to extend saddlepoint approximation methods from simple to complex phenotype data, from homogeneous to heterogeneous populations, and from genetic effect analysis to gene-environment interaction analysis. With these extensions, saddlepoint approximation methods can be applied to more complex phenotype data and more general genetic structures, and can describe genetic effects more accurately and comprehensively, which leads to more reliable analysis results. The research in this paper is based on the application of saddlepoint approximation methods to GWAS and comprises three parts. First, we study EmpSPA, an empirical saddlepoint approximation algorithm for homogeneous population analysis. We establish the asymptotic equivalence between the SPA and EmpSPA algorithms and give the corresponding convergence rate. Numerical simulation studies show that EmpSPA and SPA perform similarly for homogeneous population analysis; both are more accurate than the regular normal distribution approximation and control type I error rates, which agrees with the theoretical derivation. Second, based on saddlepoint approximation, we propose SPA-G, an algorithm that is applicable to genome-wide analysis of complex traits and adjusts for population structure. For the score test statistics widely used in GWAS, we propose treating genotypes as random variables. Under this assumption, we propose a normal distribution approximation method and an empirical saddlepoint approximation method to approximate the null conditional distribution of the score statistic given the residuals. To calculate p-values quickly and accurately, SPA-G adopts a hybrid test strategy that combines the normal distribution approximation with the saddlepoint approximation. To 
evaluate the type I error rates and power of SPA-G, we carry out extensive simulation studies for case-control and time-to-event phenotype analysis. The simulation results show that SPA-G controls type I error rates regardless of phenotypic distribution, allele frequency, and population structure while retaining good statistical power, and therefore gives more accurate results. Finally, based on saddlepoint approximation, we propose EmpGxE, a gene-environment interaction (GxE) analysis method that is applicable to a wide range of complex traits. EmpGxE fits a genotype-independent model and then uses a projection to attenuate the influence of the marginal genetic effect on the GxE effect, which yields a test statistic for the marginal GxE effect. We propose a normal distribution approximation method and an empirical saddlepoint approximation method to approximate the null distribution of this test statistic. To calculate p-values quickly and accurately, EmpGxE adopts a double hybrid test strategy that combines the Wald test with the hybrid of the normal distribution approximation and the saddlepoint approximation. To evaluate the type I error rates and power of EmpGxE, we carry out extensive simulation studies for GxE analysis of time-to-event phenotypes. The simulation results show that EmpGxE controls type I error rates regardless of phenotypic distribution and allele frequency while retaining good statistical power, and therefore gives more reliable analysis results. The proposed algorithms SPA-G and EmpGxE, both based on saddlepoint approximation, are applicable to more complex traits and more general genetic structures, and they offer a new direction for applying saddlepoint approximation methods to GWAS, which is significant in both theory and practice. |
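
To make the contrast between the normal distribution approximation and the saddlepoint approximation concrete, the following is a minimal sketch and not the EmpSPA, SPA-G, or EmpGxE algorithms themselves. It assumes the simplest setting: a score statistic S = sum_i g_i (Y_i - mu_i) with independent Y_i ~ Bernoulli(mu_i) under the null, so that the CGF is available in closed form, and it uses the Lugannani-Rice tail formula. The function names and the cutoff of two standard deviations for the hybrid rule are illustrative assumptions, not taken from the text.

```python
import math


def norm_sf(z):
    """Upper tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def cgf_derivs(t, g, mu):
    """Return K(t), K'(t), K''(t) for S = sum_i g_i * (Y_i - mu_i),
    where Y_i ~ Bernoulli(mu_i) independently (assumed null model)."""
    k0 = k1 = k2 = 0.0
    for gi, mi in zip(g, mu):
        e = mi * math.exp(gi * t)
        denom = 1.0 - mi + e
        q = e / denom                      # exponentially tilted success probability
        k0 += math.log(denom) - t * gi * mi
        k1 += gi * (q - mi)
        k2 += gi * gi * q * (1.0 - q)
    return k0, k1, k2


def solve_saddlepoint(s, g, mu, tol=1e-10):
    """Solve K'(t) = s by bisection; K' is increasing because K'' > 0."""
    upper = sum(max(gi * (1.0 - mi), -gi * mi) for gi, mi in zip(g, mu))
    lower = sum(min(gi * (1.0 - mi), -gi * mi) for gi, mi in zip(g, mu))
    if not lower < s < upper:
        raise ValueError("s must lie strictly inside the support of S")
    lo, hi = -1.0, 1.0
    while cgf_derivs(lo, g, mu)[1] > s:    # expand the bracket to the left
        lo *= 2.0
    while cgf_derivs(hi, g, mu)[1] < s:    # expand the bracket to the right
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cgf_derivs(mid, g, mu)[1] < s:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def score_test_tail(s, g, mu, cutoff=2.0):
    """Approximate P(S >= s) under the null.

    Hybrid rule (cutoff is an assumed value): use the normal approximation
    when |s| is within `cutoff` standard deviations of zero, otherwise the
    Lugannani-Rice saddlepoint formula.
    """
    var = cgf_derivs(0.0, g, mu)[2]        # K''(0) is the null variance of S
    z = s / math.sqrt(var)
    if abs(z) < cutoff:
        return norm_sf(z)
    t = solve_saddlepoint(s, g, mu)
    k0, _, k2 = cgf_derivs(t, g, mu)
    w = math.copysign(math.sqrt(2.0 * (t * s - k0)), t)
    u = t * math.sqrt(k2)
    phi_w = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)
    return norm_sf(w) + phi_w * (1.0 / u - 1.0 / w)
```

As a usage illustration under the same assumptions, taking `g = [1] * 1000` and `mu = [0.01] * 1000` makes S + 10 a Binomial(1000, 0.01) variable, which mimics a highly unbalanced phenotype. Evaluating `score_test_tail(14.5, g, mu)` (the continuity-corrected point for 25 or more cases) can then be compared with the exact binomial tail and with `norm_sf(14.5 / math.sqrt(9.9))`; in this far-tail regime the normal approximation understates the tail probability substantially, which is the source of the inflated false positive rate described above.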