| With the rapid development of high-throughput genotyping technologies in recent years, the genome-wide association study (GWAS) has emerged as one of the most important tools for identifying genetic variants involved in complex diseases. Over the past years, numerous GWAS have identified hundreds of associations between genomic regions or susceptibility loci and complex diseases or phenotypes. Although this has improved our understanding of the genetic basis of these complex diseases and traits, many analytic challenges remain. Most existing GWAS methods are single-locus approaches, in which each variant across the genome is tested individually for association with a specific phenotype. However, this single-locus analysis strategy has many limitations. Statistical challenges also remain, such as how to incorporate biological information into a GWAS and how to extract more information from GWAS data. New strategies and methods are clearly and urgently needed for GWAS. Here we propose a hierarchical-model GWAS strategy that incorporates prior biological information and apply it to real GWAS data. Using computer simulations, we evaluated its statistical properties and its effectiveness for actual GWAS data from an applied point of view. The details are as follows: In Section 1, two simulation studies were conducted, with the prior biological information simulated from binomial distributions and derived from the gene function classification results of a real GWAS data set, respectively. Based on these two simulation studies, the effects of different prior biological information on the hierarchical model (HM) were evaluated thoroughly: (1) At the GWAS significance levels of 1E-5 and 1E-7, both the hierarchical model and the logistic regression (LR) model have low power and perform similarly when the OR is less than or equal to 1.1. 
However, HM is always more powerful than LR when the OR is greater than 1.1. (2) The mean square errors (MSE) and the widths of the confidence intervals (WCI) of HM are smaller than those of LR under all parameter settings. This suggests that biological information helps improve parameter estimation in HM. (3) Under all parameter settings, the area under the ROC curve of HM is greater than that of LR. On the one hand, this suggests that HM has more power than LR to detect disease loci. On the other hand, it also demonstrates that HM is better able to reduce false-positive findings. In Section 2, three simulation studies were carried out to explore the performance of HM when incomplete information, additional noisy information, and uninformative information were included, respectively. The main conclusions of this section are as follows: (1) Truly relevant biological information has a major impact on the performance of HM. If the truly relevant biological information was included in HM, the power of HM was always greater than that of LR, even when incomplete or uninformative information was also included. On the contrary, without the truly relevant biological information, HM lost more power than LR. The results for the area under the ROC curve led to similar conclusions. (2) It is often assumed that genetic effects in GWAS are not very strong, with ORs typically around 1.1 or 1.2. Under such conditions, the power of HM was close to that of LR whether or not the truly relevant biological information was included in HM. (3) In the HM GWAS analysis, the performance of HM was better than or close to that of LR even when additional noise was included in HM. (4) In the sensitivity analysis, incomplete information, additional noisy information, and uninformative information affected the parameter estimation of HM to different degrees. 
However, in all cases, HM gave better estimates than LR in terms of MSE and WCI. In Section 3, the HM GWAS strategy was applied to real GWAS data on lung cancer in Chinese Han populations. Firstly, the methods for obtaining and using different kinds of biological information from public bioinformatics databases were introduced in detail. Secondly, for the lung cancer GWAS data, an information matrix was constructed from three kinds of biological information: evolutionary conservation, gene function, and linkage disequilibrium. Thirdly, both the HM GWAS strategy and LR were used to analyze the lung cancer GWAS data. The results showed that the two methods differ in their ability to detect the same loci in the same genomic regions. The point estimates from HM were slightly lower than those from LR, and HM yielded narrower confidence intervals than LR. In conclusion, the effects of the HM GWAS strategy incorporating biological information were evaluated using simulations and real GWAS data. HM had the same or greater power than LR in GWAS, while the false-positive rate remained well controlled. This will help to create a better understanding of the genetic mechanisms of complex diseases. The HM GWAS strategy incorporating biological information answered the biologists' questions quite well and deserves to be explored widely in future work. |
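
The evaluation criteria named in the abstract (power at a given significance level, MSE, and WCI) can be illustrated for the single-locus LR baseline with a minimal simulation sketch. This is not the code used in the study: the cohort-style sampling, the sample size, the minor allele frequency, the intercept, and the function names `fit_logistic` and `simulate_metrics` are all assumptions made for illustration, and the hierarchical model itself is not implemented here.

```python
import math
import numpy as np


def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y=1) = b0 + b1*x by Newton-Raphson.

    Returns the slope estimate b1 (a log odds ratio) and its standard error.
    """
    X = np.column_stack([np.ones(len(x)), np.asarray(x, dtype=float)])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])          # Fisher information matrix
        beta = beta + np.linalg.solve(info, X.T @ (y - p))
    # Recompute the information at the final estimate for the standard error.
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    w = p * (1.0 - p)
    cov = np.linalg.inv(X.T @ (X * w[:, None]))
    return beta[1], math.sqrt(cov[1, 1])


def simulate_metrics(odds_ratio, alpha, n=2000, maf=0.3, n_rep=200, seed=0):
    """Estimate power, MSE and mean 95% CI width for one simulated locus.

    All default values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    true_beta = math.log(odds_ratio)
    estimates, widths, rejections = [], [], []
    for _ in range(n_rep):
        g = rng.binomial(2, maf, size=n)        # additive genotype 0/1/2
        p = 1.0 / (1.0 + np.exp(-(-0.5 + true_beta * g)))
        y = rng.binomial(1, p)                  # disease status
        b, se = fit_logistic(g, y)
        p_value = math.erfc(abs(b / se) / math.sqrt(2.0))  # two-sided Wald test
        estimates.append(b)
        widths.append(2.0 * 1.96 * se)          # width of the 95% Wald CI
        rejections.append(p_value < alpha)
    estimates = np.array(estimates)
    return {
        "power": float(np.mean(rejections)),
        "mse": float(np.mean((estimates - true_beta) ** 2)),
        "wci": float(np.mean(widths)),
    }


if __name__ == "__main__":
    for odds_ratio in (1.1, 1.2, 1.5):
        print(odds_ratio, simulate_metrics(odds_ratio, alpha=1e-5, n_rep=100))
```

Computing the same three quantities from the estimates of a second method would give a like-for-like comparison of the kind summarised in the abstract.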