
The Study Of Gene Set Analysis Methods On Gene Expression Profiles And Its Applications In Medicine

Posted on: 2010-11-20
Degree: Master
Type: Thesis
Country: China
Candidate: W J Cao
GTID: 2144360275972704
Subject: Epidemiology and Health Statistics
Abstract:
Microarrays are at the center of a revolution in biotechnology: they allow researchers to monitor the expression of tens of thousands of genes simultaneously, and they have been widely used in medical research. The main challenge is to extract useful information from such gene expression profiles and then to interpret the results biologically. Many research groups have proposed enrichment analysis methods based on pre-defined gene sets. In this thesis, these methods are roughly divided into two categories: Single Gene Analysis (SGA) and Gene Set Analysis (GSA). Both aim to detect differentially expressed genes, which are expected to help in preventing and curing diseases. The conclusions drawn from SGA are very limited, because SGA cannot effectively explain biological characteristics and does not consider the relationships between genes. Since Mootha introduced the gene set enrichment analysis method in 2003, statisticians and bioinformaticians have focused widely on GSA. However, there is still no generally accepted theory, and no gene set analysis method that reliably identifies differentially expressed gene sets, because microarray data are high-dimensional, have small sample sizes, and involve complicated relationships between genes.

This thesis studies the statistical theory and applications of gene set analysis methods, using computer technology and Monte Carlo simulation combined with real gene expression microarray data. The main topics include the reasonableness of the null hypotheses of different gene set analysis approaches, methods for controlling type I error, and the validity of detecting differentially expressed gene sets (DEGs). The work completed so far is as follows:

1. The basic steps of a microarray experiment, the bioinformatics databases used to annotate gene expression data, and single-gene analysis methods are briefly introduced. On this basis, the approaches to gene set analysis are reviewed extensively. These gene set enrichment analysis methods are assessed according to their definition of gene sets, their null hypothesis framework, and the way the theoretical distribution of the test statistic is constructed.

2. Various GSA methods have been developed based on different null hypotheses, which can be divided into three groups: competitive (Q1), self-contained (Q2), and mixed tests (Q3). A large number of research groups believe that methods based on self-contained null hypotheses are better than those based on competitive ones, but it is still unclear which is the more reasonable. To explore this issue, this study compares the distributions of P-values obtained under the three hypotheses Q1, Q2 and Q3 on simulated data. The results showed that the self-contained test detected most gene sets as DEGs, but its false discovery rate (FDR) was high. The competitive approach identified very few DEGs: it achieved a higher rate of correct calls at the cost of test power. The mixed approach exhibited intermediate performance. Our group prefers the mixed approach (Q3) because it avoids the clear drawbacks of the other two, but recommends using all of the methods together, if possible, alongside biological analyses.

3. Because the probability density function of a gene-set score is generally unknown, its distribution is usually obtained by permutation or bootstrap. Permutation methods are usually believed to be better than bootstrap methods, but we found that the two gave roughly the same results on simulated data. ROC curve analysis showed that, under the same conditions, the bootstrap-based method is slightly better than the permutation-based one, because its AUC is larger.

4. Assuming that the genes are independent, we compared the power of different GSA approaches to detect DEGs on simulated data generated using SAS 9.13. The results showed that the specificity and sensitivity of Efron's GSA were higher than those of the other methods, and that the power of SAFE and Globaltest was only slightly lower than that of Efron's GSA.

5. Because the correlation between genes is complex, we then built this correlation into the simulated data. Under these conditions, Efron's GSA lost its discriminating ability entirely and could hardly recognize the DEGs simulated in the gene expression profiles. However, PCOT2 and Globaltest retained high test power and identified the DEGs that had been set in the simulated microarray data.

6. Different gene set analysis methods were applied to two well-known real gene expression data sets in order to compare their test power. The results further confirmed that PCOT2 and Globaltest, which take the correlation between genes into account, are superior to the other methods. In addition, Globaltest identified more DEGs, and its FDR was 19 percent lower than that of PCOT2.

Combining the results on simulated data with those on real gene expression data, our group prefers model-based methods (such as the random effect logistic model) for analyzing microarray data.

The main innovations of this thesis are as follows:

① We compared, on simulated data, how different null hypotheses and different methods of generating the theoretical distribution of the gene-set score influence the analysis results.
② Taking the correlation between genes into account in the simulated experimental data, we compared the test power of GSA methods on data in which the correlation within each gene set was simulated separately.
③ The simulation results showed that gene set methods based on model building can effectively account for the relationships between genes.
④ After comparing GSA methods on real gene expression data, we conclude that Globaltest is the more effective method for analyzing microarray data.

This work mainly explores and discusses issues concerning GSA methods, based on statistical theory, in gene expression data analysis, and recommends the approaches we consider more effective. We hope it lays a sound foundation for the Shaanxi Province science and technology research project (Different expressional information in microarray data mining and its application, number: 2008K04-02), especially with regard to gene set analysis.
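The difference between the self-contained (Q2) and competitive (Q1) null hypotheses discussed in the abstract can be illustrated with a short Python sketch. It is not the thesis's implementation: the gene-level statistic (absolute difference in group means), the gene-set score (its average over the set), the simulated data, and all function names are assumptions chosen for brevity. The self-contained test permutes sample labels and asks whether the set is associated with the phenotype at all; the competitive test resamples genes and asks whether the set scores higher than random sets of the same size.

```python
import random
from statistics import mean

def gene_stats(expr, labels):
    """Absolute difference in group means for every gene (rows = genes)."""
    stats = []
    for row in expr:
        g1 = [x for x, lab in zip(row, labels) if lab == 1]
        g0 = [x for x, lab in zip(row, labels) if lab == 0]
        stats.append(abs(mean(g1) - mean(g0)))
    return stats

def self_contained_p(expr, labels, gene_set, n_perm=500, seed=0):
    """Q2: permute sample labels, using only the genes inside the set."""
    rng = random.Random(seed)
    sub = [expr[g] for g in gene_set]
    observed = mean(gene_stats(sub, labels))
    perm = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(perm)
        if mean(gene_stats(sub, perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def competitive_p(expr, labels, gene_set, n_perm=500, seed=0):
    """Q1: compare the set against random gene sets of the same size."""
    rng = random.Random(seed)
    stats = gene_stats(expr, labels)
    observed = mean(stats[g] for g in gene_set)
    hits = 0
    for _ in range(n_perm):
        random_set = rng.sample(range(len(expr)), len(gene_set))
        if mean(stats[g] for g in random_set) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical simulated data: 100 independent genes, 10 samples per group;
# the first 10 genes are shifted upward by 2 in group 1.
data_rng = random.Random(42)
labels = [0] * 10 + [1] * 10
expr = [
    [data_rng.gauss(2.0 if (g < 10 and lab == 1) else 0.0, 1.0) for lab in labels]
    for g in range(100)
]
target = list(range(10))
p_self = self_contained_p(expr, labels, target)
p_comp = competitive_p(expr, labels, target)
print("self-contained p:", p_self, "competitive p:", p_comp)
```

Because the competitive test compares a set against the rest of the array, its result depends on how many other genes are differentially expressed, which is one way to see why the two hypotheses can disagree on the same data.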
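The two ways of approximating the unknown distribution of a gene-set score, permutation and bootstrap, differ in their simplest form only in whether samples are redrawn without or with replacement. The sketch below shows that contrast under stated assumptions: the score (mean absolute difference in group means), the pooled-sample bootstrap scheme, the simulated gene set, and the function names are all hypothetical and are not taken from the thesis.

```python
import random
from statistics import mean

def set_score(sub, labels):
    """Mean absolute difference in group means over the genes of one set."""
    diffs = []
    for row in sub:
        g1 = [x for x, lab in zip(row, labels) if lab == 1]
        g0 = [x for x, lab in zip(row, labels) if lab == 0]
        diffs.append(abs(mean(g1) - mean(g0)))
    return mean(diffs)

def resampling_p(sub, labels, scheme, n_iter=500, seed=0):
    """Empirical p-value of the set score under a resampled null.

    scheme="permutation": reorder the samples (no replacement).
    scheme="bootstrap":   draw samples from the pool with replacement.
    Whole samples (columns) are resampled, so gene-gene correlation
    within the set is preserved in every resample.
    """
    rng = random.Random(seed)
    n = len(labels)
    observed = set_score(sub, labels)
    hits = 0
    for _ in range(n_iter):
        if scheme == "permutation":
            idx = rng.sample(range(n), n)
        elif scheme == "bootstrap":
            idx = [rng.randrange(n) for _ in range(n)]
        else:
            raise ValueError("scheme must be 'permutation' or 'bootstrap'")
        resampled = [[row[i] for i in idx] for row in sub]
        if set_score(resampled, labels) >= observed:
            hits += 1
    return (hits + 1) / (n_iter + 1)

# Hypothetical gene set: 10 genes, 10 samples per group, shifted by 2 in group 1.
data_rng = random.Random(1)
labels = [0] * 10 + [1] * 10
shifted_set = [
    [data_rng.gauss(2.0 if lab == 1 else 0.0, 1.0) for lab in labels]
    for _ in range(10)
]
p_perm = resampling_p(shifted_set, labels, "permutation")
p_boot = resampling_p(shifted_set, labels, "bootstrap")
print("permutation p:", p_perm, "bootstrap p:", p_boot)
```

Repeating such a comparison over many simulated gene sets, with and without a true shift, would give the P-values needed for an ROC comparison of the two schemes.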
Keywords: Microarray data, Statistical Inference, Gene Set, Gene Set Analysis, Differentially Expressed Gene Set, FDR, Monte Carlo Simulation