
A Comparative Study Of Gene Set Enrichment Analysis In Tumour Biomarker Identification

Posted on: 2014-12-08
Degree: Master
Type: Thesis
Country: China
Candidate: W Zhang
Full Text: PDF
GTID: 2254330392466873
Subject: Epidemiology and Health Statistics
Abstract/Summary:
Molecular biology is currently one of the most active areas of medical oncology, because it gives a global and detailed view of the molecular changes involved in tumour diagnosis and prognosis and supports the discovery of new biomarkers and novel therapeutic targets.

Analysing differentially expressed gene sets is an effective way to reveal the underlying biological trends in a gene expression data set. A rich body of statistical tests is available for gene set analysis, particularly for detecting differential expression between two phenotypes. Several comparative studies have addressed the relative performance of such tests, each drawing conclusions about different aspects. Here we selected four methods whose most fundamental difference is whether the underlying test is univariate or multivariate. All four test the same null hypothesis (the self-contained hypothesis), all gene sets are defined by KEGG, and performance was evaluated on real tumour datasets. We also assessed how accurately these methods detect potential tumour biomarkers.

1. The procedure of gene chip experimental design, microarray data preprocessing and standardization is introduced briefly. The statistical methods for detecting differentially expressed gene sets and their fundamental null hypotheses are reviewed extensively.

2. The purpose of a gene set-level statistical test is to decide whether a gene set is distinct in some statistically significant way. A gene set statistic can be defined in terms of properties of the genes in the set, such as the mean, median or variance, computed at the gene level or at the level of the whole set.
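As a rough illustration of how a set-level statistic can be built from gene-level ones, the sketch below computes a Welch t-statistic per gene and then summarises those values over a gene set by their mean or median. It is not any of the four methods compared in this thesis; the function names, gene names and toy expression values are all hypothetical.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch two-sample t-statistic for one gene (a gene-level statistic)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    return (mean_a - mean_b) / math.sqrt(var_a / len(group_a) + var_b / len(group_b))


def gene_set_statistic(expression, labels, gene_set, summary=statistics.mean):
    """Summarise gene-level t-statistics over the genes in one set."""
    t_values = []
    for gene in gene_set:
        values = expression[gene]
        # Split samples by phenotype label (1 = e.g. mutated, 0 = wild-type).
        group_a = [v for v, lab in zip(values, labels) if lab == 1]
        group_b = [v for v, lab in zip(values, labels) if lab == 0]
        t_values.append(welch_t(group_a, group_b))
    return summary(t_values)


# Hypothetical toy data: three genes measured on six samples.
expression = {
    "geneA": [5.0, 6.0, 7.0, 1.0, 2.0, 3.0],
    "geneB": [2.0, 3.0, 4.0, 2.0, 3.0, 4.0],
    "geneC": [1.0, 2.0, 3.0, 5.0, 6.0, 7.0],
}
labels = [1, 1, 1, 0, 0, 0]

mean_stat = gene_set_statistic(expression, labels, ["geneA", "geneB"])
median_stat = gene_set_statistic(
    expression, labels, ["geneA", "geneB", "geneC"], summary=statistics.median
)
```

Note that averaging signed t-statistics lets up- and down-regulated genes cancel out (geneA and geneC above), which is one reason some methods use squared or absolute gene-level statistics instead.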
There are two null hypotheses, Q1 and Q2, as defined by Goeman et al.; their background distributions are obtained by shuffling genes and phenotypes, respectively. Q2 is generally preferred because it preserves the relationships among genes in the set and directly addresses the question of finding gene sets whose expression changes correlate with the phenotype. We used four Q2 statistical tests, SAM-GS, GAGE, GlobalAncova and MANOVA, whose statistics are formulated at the gene level and at the set level, respectively. Analysis of large gene expression data sets in the presence and absence of a phenotype leads to the selection of a group of genes that serve as biomarkers jointly predicting the phenotype. Univariate approaches to selecting gene sets, although computationally efficient, are reported to often ignore the gene interactions inherent in biological data. Multivariate approaches, on the other hand, are known to carry a higher risk of selecting spurious gene subsets because of overfitting across the vast number of gene subsets evaluated. To determine which of the two statistical approaches is more efficient, we selected four representative methods and compared them on cancer datasets.

3. We used the microarray expression dataset of 60 human cancer cell lines (the NCI60), assembled by the National Cancer Institute for anticancer drug discovery. We restricted our attention to genes mutated in more than ten cell lines and selected three gene-mutation-based phenotypes (i.e. mutated vs wild-type) in the dataset: p53, PTEN and p16. For each of the three phenotype-defining genes, we designated the gene sets containing that gene as "true positives", in the sense that a good gene set analysis method should identify those gene sets as being associated with the phenotype. We then selected three corresponding genes, RAC1, PRKAR2B and PRKACB, that have no close links with the three mutated genes, and used them to define the "true negatives".
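The Q2 background distribution described above, obtained by shuffling phenotype labels while leaving each sample's gene values intact, can be illustrated with a small permutation test. This is a minimal sketch under stated assumptions: the set statistic (sum of squared differences in group means) is a deliberate simplification and is not SAM-GS, GAGE, GlobalAncova or MANOVA, and all names and data are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Difference in mean expression between phenotype 1 and phenotype 0."""
    group_a = [v for v, lab in zip(values, labels) if lab == 1]
    group_b = [v for v, lab in zip(values, labels) if lab == 0]
    return sum(group_a) / len(group_a) - sum(group_b) / len(group_b)


def set_statistic(expression, labels, gene_set):
    """Simplified set statistic: sum of squared mean differences over the set."""
    return sum(mean_difference(expression[g], labels) ** 2 for g in gene_set)


def permutation_p_value(expression, labels, gene_set, n_perm=1000, seed=0):
    """Q2-style p-value: shuffle phenotype labels, keep gene values per sample."""
    rng = random.Random(seed)
    observed = set_statistic(expression, labels, gene_set)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        # Permuting labels preserves the correlation structure among genes.
        rng.shuffle(shuffled)
        if set_statistic(expression, shuffled, gene_set) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical toy data: eight samples, four per phenotype.
expression = {
    "geneA": [9.0, 10.0, 11.0, 12.0, 1.0, 2.0, 3.0, 4.0],  # clearly separated
    "geneB": [5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0],     # constant, no signal
}
labels = [1, 1, 1, 1, 0, 0, 0, 0]

p_signal = permutation_p_value(expression, labels, ["geneA"])
p_null = permutation_p_value(expression, labels, ["geneB"])
```

Because whole label vectors are permuted rather than individual genes, every permuted data set keeps the same gene-to-gene relationships as the original, which is the property the text gives as the reason for preferring Q2.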
We compared sensitivity (true positive rate) and specificity (true negative rate) and conducted Receiver Operating Characteristic (ROC) analysis.

4. The four chosen gene set analysis methods were applied to real tumour case studies: one colorectal cancer dataset (GSE4107) and one non-small cell lung cancer dataset (GSE3593). The selected differentially expressed gene sets were validated against biological criteria; the better method should identify more gene sets that are supported by biological evidence.

In summary, our biologically grounded evaluation revealed appreciable performance differences among the four gene set enrichment analysis methods, which helps identify truly effective gene expression analysis tools for biology and medicine. We found that the univariate approaches were more effective than the multivariate tests, which suggests that univariate tests are preferable in tumour biomarker screening research. Among the four approaches, GAGE was the best choice.
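The sensitivity, specificity and ROC comparison from step 3 can be sketched as follows, assuming each method returns one p-value per gene set and each gene set has been labelled as a "true positive" or "true negative". The p-values below are invented for illustration and are not results from this study; the area under the ROC curve is computed with the rank-based (Mann-Whitney) formulation.

```python
def sensitivity_specificity(p_values, is_positive, alpha=0.05):
    """True positive rate and true negative rate at significance level alpha."""
    pairs = list(zip(p_values, is_positive))
    tp = sum(1 for p, pos in pairs if pos and p < alpha)
    fn = sum(1 for p, pos in pairs if pos and p >= alpha)
    tn = sum(1 for p, pos in pairs if not pos and p >= alpha)
    fp = sum(1 for p, pos in pairs if not pos and p < alpha)
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(p_values, is_positive):
    """AUC as the chance that a true-positive set gets a smaller p-value
    than a true-negative set (ties count one half)."""
    pos = [p for p, flag in zip(p_values, is_positive) if flag]
    neg = [p for p, flag in zip(p_values, is_positive) if not flag]
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos < p_neg:
                wins += 1.0
            elif p_pos == p_neg:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical p-values for six gene sets from one method.
p_values = [0.001, 0.02, 0.30, 0.04, 0.50, 0.80]
is_positive = [True, True, True, False, False, False]

sens, spec = sensitivity_specificity(p_values, is_positive)
auc = roc_auc(p_values, is_positive)
```

Running the same two functions on the p-values from each method gives directly comparable sensitivity, specificity and AUC figures.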
Keywords/Search Tags:Gene Expression Profiles, Gene set enrichment, SAM-GS, GAGE, GlobalAncova, MANOVA, Biomarker