| Background: The common statistical strategy of genome-wide association studies (GWAS) directly links genetic variants to disease, ignoring the continuous biological process of genome→transcriptome→proteome→metabolome→phenotype. Although GWAS have successfully identified several susceptibility loci (single nucleotide polymorphisms, SNPs) for complex diseases, they remain limited in interpreting complex diseases from a multi-omics perspective, which hinders follow-up clinical translation. Therefore, integrating multiple omics data to explore the mechanisms underlying the SNP→disease association has become a hot topic in the post-GWAS era. GWAS association signals have been confirmed to account for only a small proportion of the variance in disease risk, leading to the missing heritability issue. In addition, most of these GWAS signals are located in intergenic or non-coding regions. In parallel to GWAS, expression quantitative trait loci (eQTL) studies aim to detect associations between the expression levels of specific genes and the SNPs in their cis-regions. The SNPs associated with gene expression identified in eQTL studies overlap substantially with the disease-associated SNPs identified in GWAS, indicating that genetic variants can affect complex diseases by modulating gene expression. Statistically, exploring the relationship among SNPs, gene expression levels and complex diseases often requires data at all three levels from the same sample. However, most current large-scale GWAS collect SNP and phenotype data without gene expression data, while eQTL studies collect SNP and gene expression data without phenotype data. This motivates transcriptome-wide association studies (TWAS), which aim to test for gene-trait associations by integrating information from GWAS and eQTL studies. TWAS focuses on the expression level of a specific gene as regulated by its cis-SNPs. Bearing in mind that SNPs in the same genomic region
of the same ethnic population have a similar genetic structure, TWAS is able to integrate data from GWAS (SNP→disease) and eQTL (SNP→gene expression) studies of the same ethnicity. In statistical analysis, for a specific gene, TWAS often relies on a two-stage inference procedure. In the first stage, TWAS estimates the genetic effects of SNPs on gene expression by constructing expression prediction models based on genotype and expression data from eQTL studies. In the second stage, TWAS uses the estimated genetic effects to predict the expression of the focal gene in the GWAS data, and tests its association with the traits of interest by regression analysis. At present, many TWAS methods have been proposed. Although these methods have enriched the theory of TWAS to a certain extent, concerns remain: (1) The inference procedures of most TWAS methods are not based on a joint-likelihood inference framework, making it difficult to guarantee the statistical properties of the model. Although the two-stage TWAS inference strategy is simple and feasible, it typically derives the genetic effects on gene expression from a rather small eQTL study and fails to account for the uncertainty in the first-stage parameter estimates, which may reduce statistical efficiency. (2) Traditional TWAS analysis mainly focuses on association rather than causality. There are numerous confounding factors and widespread horizontal pleiotropy in omics data analysis; that is, a SNP can affect the outcome through pathways other than, or in addition to, the gene of focus. If horizontal pleiotropy is not taken into account, test statistics will be greatly inflated, resulting in false discoveries of causal genes. To this end, fully controlling for horizontal pleiotropy and incorporating TWAS analysis into a causal framework will reduce false positive discoveries and improve testing power. (3) Multi-trait TWAS analysis is still lacking. Many complex traits often jointly reflect the same
physiological level and share a common genetic basis. Consequently, performing multivariate analysis to test gene associations with multiple correlated traits jointly may lead to an appreciable power gain. (4) Most existing TWAS methods examine a single gene at a time, ignoring the complex relationships among the expression levels of multiple genes, among the predicted expression of multiple genes, and among the cis-SNPs that regulate different genes. Single-gene TWAS analysis is unable to distinguish among multiple highly correlated genes, resulting in false positives. Thus far, only two multi-gene TWAS methods have been developed. These two methods follow the traditional two-stage inference strategy and thus lose power. In addition, these methods fail to disentangle the putatively causal associations when multiple genes share cis-SNPs that are in linkage disequilibrium (LD) with each other. (5) Most TWAS methods focus on improving the power of test statistics for the gene→complex trait association, ignoring estimation of the causal gene effect, which helps guide further experimental verification and drug target discovery. (6) The simulations used to evaluate most TWAS methods are not realistic and depart from the real TWAS design, leading to bias in assessing the performance of different methods. In this study, taking horizontal pleiotropy into account, we aimed to construct the multiple outcome Probabilistic Mendelian randomization with Egger assumption (moPMR-Egger) model and the Gene-based Integrative Fine-mapping through conditional TWAS (GIFT) model, for both individual-level data and summary statistics, in a joint-likelihood framework. We performed comprehensive, realistic simulations to assess their performance and compare them with existing methods. In particular, we evaluated the unbiasedness of effect estimation and the stability and power of the test statistics, and finally summarized the application conditions of the models by further examining the sensitivity of the model
to the key parameters. We also applied these two models to large-scale TWAS analyses to evaluate their practicability. It is expected that these two methods can enrich the statistical theory of TWAS, provide strong statistical evidence for the genetic etiology of complex diseases, narrow the scope of experimental targets, and produce new ideas for cross-omics data integration analysis. Methods: Based on joint-likelihood theory, this study constructed the moPMR-Egger statistical model for TWAS analysis with multiple outcome traits under the framework of Mendelian randomization (MR), by fully correcting for horizontal pleiotropy and taking advantage of the correlation among traits. Meanwhile, this study also constructed the GIFT statistical model for TWAS analysis with multiple genes to increase power and reduce false positives, by accounting for the complex relationships among the expression levels of multiple genes, among the predicted expression of multiple genes, and among the cis-SNPs that regulate the expression of different genes. These two novel TWAS methods were comprehensively evaluated through theoretical derivations, statistical simulations and real data applications. (1) Theoretical models: Two novel TWAS models were constructed, each for both individual-level data and summary statistics. By treating the SNP effect sizes as missing data, a parameter-expanded Expectation Maximization (PX-EM) algorithm was developed to accelerate convergence of the parameter estimation. In addition, the causal effect was tested through the likelihood ratio test. (2) Simulations: Based on real eQTL and GWAS data, and taking the number of cis-SNPs and the different LD structures of each gene into account, cross-gene/region statistical simulations were designed to evaluate the constructed models and their corresponding test statistics from the following four aspects: ① The stability of the test statistics, that is, the stability of the type I error probability of the
proposed statistic. We not only evaluated its closeness to the nominal level of 0.05, but also examined whether the distribution of P-values is uniform using quantile-quantile (Q-Q) plots. ② The power of the test statistics, that is, the power of the constructed test statistics under different simulation settings (e.g., the sample size, the size of the causal effect, and the distribution of the SNP genetic effects). ③ The unbiasedness of effect estimation, that is, whether the estimates of the causal effect and other parameters are unbiased. ④ The application conditions of the model, that is, the scenarios in which the proposed model can be applied. (3) Real data applications: The Genetic European Variation in Disease (GEUVADIS) dataset was used as the eQTL data, and the UK Biobank (UKB) database was used as the GWAS data. ① For the moPMR-Egger model, we analyzed five trait categories and further explored the relationship between different gene association patterns and biological pathogenesis in the blood pressure category. The five trait categories in UKB are the blood pressure category (systolic blood pressure, SBP; diastolic blood pressure, DBP), the physical measures category (height; body mass index, BMI; forced vital capacity, FVC; ratio of forced expiratory volume in one second to FVC, F-F ratio), the blood count category (platelet count, PC; red blood cell count, RBCC; eosinophil count, EC; white blood cell count, WBCC), the white blood cell indices category (EC, WBCC), and the red blood cell indices category (RBCC; red blood cell distribution width, RDW). ② For the GIFT model, we performed multi-gene TWAS analysis for six complex traits in the UKB, comprising blood pressure traits (SBP and DBP) and lipid traits (total cholesterol, TC; high-density lipoprotein, HDL; low-density lipoprotein, LDL; triglycerides, TG), and explored the potentially causal genes and related pathways for these complex traits. Results: This study developed two novel statistical methods, the multi-trait moPMR-Egger model and the multi-gene GIFT model, for both individual-level and
summary statistics data by integrating eQTL data with GWASs. The main findings are as follows: (1) moPMR-Egger model: 1) Theoretical results: For multiple outcome traits, models were constructed for both individual-level data and summary statistics. The identifiability of the causal effects was proved under the decision-theoretic causal inference framework. A parameter-expanded Expectation Maximization (PX-EM) algorithm was developed to accelerate convergence of the parameter estimation, with the SNP effect sizes treated as missing data. In addition, the causal effect was tested through the likelihood ratio test. 2) Simulation results: ① Testing and estimating the causal effects. moPMR-Egger provides well-calibrated type I error control under the null, both in the absence and in the presence of horizontal pleiotropic effects. The null P-value distribution from moPMR-Egger remains well calibrated regardless of the genetic architecture underlying gene expression, the gene expression heritability, whether the multiple traits are correlated, and so on. Across a total of 152 alternative simulation scenarios, moPMR-Egger achieves an average power gain of 44.10% compared to other methods. In addition, moPMR-Egger produces accurate estimates of the causal effects both under the null and under various alternatives. ② Testing and estimating horizontal pleiotropic effects. The P-values from moPMR-Egger for testing horizontal pleiotropy under the null are well calibrated across most scenarios. They become inflated when the genetic architecture underlying gene expression is sparse and when the gene affects more than two traits. While moPMR-Egger has power comparable to the other method when all traits are uncorrelated with each other, it outperforms the other method in the presence of trait correlation. In addition, moPMR-Egger produces accurate estimates of the horizontal pleiotropic effects both under the null and under various
alternatives. 3) Application results: The results for five trait categories in the UKB showed that moPMR-Egger identified 13.15% more gene associations than univariate approaches across the different trait categories. By further exploring the distinct regulatory mechanisms underlying SBP and DBP, we found that genes with opposite causal effect directions on the two traits mostly regulate blood pressure by changing the elasticity and stiffness of blood vessels, while genes with the same causal effect direction on the two traits mostly regulate blood pressure through immune-related pathways. Of the genes with significant causal effects, 11.64% have horizontal pleiotropic effects. 4) R package: moPMR-Egger is implemented in the R package PMR, freely available on GitHub (https://github.com/yuanzhongshang/PMR). (2) GIFT model: 1) Theoretical results: For a focal genomic region, models were constructed for both individual-level data and summary statistics. Similarly, a PX-EM algorithm was developed to accelerate convergence of the parameter estimation, with the SNP effect sizes treated as missing data. In addition, the causal effect of each gene in a region was tested through the likelihood ratio test. 2) Simulation results: ① GIFT produces calibrated P-values. In the complete null simulation settings, where none of the genes in the region has a non-zero effect on the outcome trait, the GIFT test yields calibrated type I error control, as do the existing two-stage multi-gene TWAS methods. Importantly, only GIFT produces calibrated P-values for the null genes when both null and causal genes are present in the same genomic region. Even when the causal genes in the region are not included in the analysis, GIFT still has the best type I error control, and these results are not affected by the number of causal genes in the region. ② GIFT is powerful under a range of alternative simulations. GIFT is more powerful than the other multi-gene TWAS methods regardless of whether the
expression levels of the genes in the region are correlated, and regardless of the genetic architecture underlying gene expression, the distribution of SNP effects on gene expression, the variation in heritability across genes in the region, the sample size of the eQTL study, the sample size of the GWAS, the proportion of phenotypic variance explained by different genes in the region, the effect direction of the genes on the trait, or the number of causal genes in the region. ③ GIFT produces accurate estimates of the causal effects. In contrast, none of the other methods is capable of estimating the causal gene effects on the trait. 3) Application results: The results for six complex traits in UKB showed that GIFT improves the set size of the putatively causal genes by 64.60% on average compared to existing TWAS fine-mapping methods. The enrichment analysis of putatively causal genes identified by GIFT highlights the importance of direct vessel regulation in determining blood pressure and the importance of lipid metabolism in regulating blood lipid levels. 4) R package: GIFT is implemented in the R package PMR and freely available on GitHub (https://yuanzhongshang.github.io/GIFT/). Conclusion: Given the shortcomings of current TWAS methods, including insufficient correction for horizontal pleiotropy, the lack of joint analysis methods for multiple related traits, insufficient characterization of complex gene relationships within the same region, and the lack of joint-likelihood inference, this study developed two novel statistical models, the multi-trait moPMR-Egger model and the multi-gene GIFT model, under a joint-likelihood framework. The main conclusions are as follows: (1) moPMR-Egger takes advantage of the correlation across multiple traits, is capable of testing and controlling for potential horizontal pleiotropic effects, and tests the causal effects of a gene on multiple traits jointly, thus maximizing power while minimizing false associations for
TWASs. moPMR-Egger achieves an average power gain of 44.10% compared to three other methods across a total of 152 alternative simulation scenarios. The results for five trait categories in the UKB showed that moPMR-Egger identified 13.15% more gene associations than univariate approaches across trait categories. At the same time, moPMR-Egger is able to test and control for potential horizontal pleiotropic effects, achieving simultaneous inference of both causal and pleiotropic effects. (2) GIFT models multiple genes in the same region, carries out conditional TWAS association tests using a multivariate variance component model under the joint-likelihood framework, explicitly controls for both the expression correlation and the cis-SNP LD across multiple genes in the local region, and accounts for the uncertainty in genetic expression prediction. The simulation results show that GIFT yields good type I error control and high statistical power with fewer false discoveries. The results for two blood pressure traits and four lipid traits in UKB showed that GIFT improves the set size of the putatively causal genes by 64.60% on average compared to existing TWAS fine-mapping methods. (3) The applications of moPMR-Egger revealed that genes regulating blood pressure by changing the elasticity and stiffness of blood vessels have opposite directions of causal effect on SBP and DBP, while genes regulating blood pressure through immune-related pathways have the same direction of causal effect on SBP and DBP. Meanwhile, the applications of GIFT highlighted the importance of direct vessel regulation in determining blood pressure and the importance of lipid metabolism in regulating blood lipid levels. (4) Through rigorous theoretical proofs, comprehensive simulations and empirical analysis, both moPMR-Egger and GIFT show good statistical performance and describe different types of effects in TWAS analysis with distinct model parameters, and thus offer better interpretability. Both methods have
enriched the statistical methods for TWAS analysis and provided statistical support for exploring the etiology of complex diseases and narrowing the targets for experimental verification. |
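
The conventional two-stage TWAS procedure summarized in the Background (an expression prediction model fitted in eQTL data, followed by a regression test in GWAS data) can be illustrated with a minimal sketch. This is only the two-stage baseline, not moPMR-Egger or GIFT. It rests on several assumptions that are not from the study: the data are simulated, the SNPs are independent (no LD), the first-stage weights are simple per-SNP marginal regression slopes, and all function names (`simple_ols`, `simulate_genotypes`, `genetic_value`, `two_stage_twas`) are hypothetical.

```python
import math
import random


def simple_ols(x, y):
    """Return (slope, standard error) for the regression y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


def simulate_genotypes(n, p, rng, maf=0.3):
    """Simulate n individuals with p independent SNPs coded 0/1/2 (assumption: no LD)."""
    return [[int(rng.random() < maf) + int(rng.random() < maf) for _ in range(p)]
            for _ in range(n)]


def genetic_value(row, effects):
    """Weighted sum of one individual's genotypes."""
    return sum(g * b for g, b in zip(row, effects))


def two_stage_twas(geno_eqtl, expression, geno_gwas, trait):
    n_snps = len(geno_eqtl[0])
    # Stage 1: estimate SNP -> expression weights in the eQTL sample
    # (per-SNP marginal slopes, a simplification valid for independent SNPs).
    weights = [simple_ols([row[j] for row in geno_eqtl], expression)[0]
               for j in range(n_snps)]
    # Stage 2: predict expression in the GWAS sample and test it against the trait.
    # The weights are plugged in as if known, so first-stage uncertainty is ignored.
    predicted = [genetic_value(row, weights) for row in geno_gwas]
    beta, se = simple_ols(predicted, trait)
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal P-value
    return beta, z, p_value


# Simulated example (all values are illustrative assumptions).
rng = random.Random(2024)
snp_effects = [0.5, -0.4, 0.3, 0.0, 0.2]  # true cis-SNP effects on expression
causal_effect = 0.5                        # true gene -> trait effect

geno_eqtl = simulate_genotypes(300, 5, rng)    # small eQTL sample
expression = [genetic_value(r, snp_effects) + rng.gauss(0, 1) for r in geno_eqtl]

geno_gwas = simulate_genotypes(3000, 5, rng)   # larger GWAS sample
trait = [causal_effect * genetic_value(r, snp_effects) + rng.gauss(0, 1)
         for r in geno_gwas]

beta, z, p_value = two_stage_twas(geno_eqtl, expression, geno_gwas, trait)
print(f"estimated effect = {beta:.3f}, z = {z:.2f}, p = {p_value:.3g}")
```

Because the second stage treats the estimated weights as fixed, this sketch makes visible the limitation noted in concern (1): uncertainty from the small eQTL sample is not propagated into the test, which is the gap the joint-likelihood framework is intended to address.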