Statistical learning and Behrens-Fisher distribution methods for heteroscedastic data in microarray analysis

Posted on: 2011-11-27
Degree: Ph.D
Type: Thesis
University: University of South Florida
Candidate: Manandhar Shrestha, Nabin K
GTID: 2444390002967492
Subject: Biology
Abstract/Summary:
The aim of the present study is to identify genes that are differentially expressed between two conditions and to use them to predict the class of new samples from microarray data. Microarray data analysis poses many challenges to statisticians because of its high dimensionality and small sample size, dubbed the "small n, large p" problem. Microarray data have been studied extensively by many statisticians and geneticists. The data are commonly assumed to follow a normal distribution with equal variances in the two conditions, but this assumption does not hold in general. Because the number of replications is very small, the sample estimates of the variances are not appropriate for testing. We therefore take a Bayesian approach to approximating the variances in the two conditions. Because the number of genes to be tested is usually large and the test is repeated thousands of times, there is a multiplicity problem. To address the problem arising from multiple comparisons, we use the False Discovery Rate (FDR) correction. When a hypothesis test is applied gene by gene to several thousand genes, there is a high chance of selecting false genes as differentially expressed, even when the significance level is set very small. For the test to be reliable, the probability of selecting true positives should be high. To control false positives, we apply the FDR correction, in which the p-value for each gene is compared with its corresponding threshold. A gene is then said to be differentially expressed if its p-value is less than the threshold.

We have developed a new method of selecting informative genes based on a Bayesian version of the Behrens-Fisher distribution, which assumes unequal variances in the two conditions.
Because the assumption of equal variances fails in most situations, and equal variance is a special case of unequal variance, we address the problem of finding differentially expressed genes in the unequal-variance case. We found that the developed method selects the genes that are actually differentially expressed in simulated data, and we compared it with recent methods, including Fox and Dimmic's t-test method and Tusher and Tibshirani's SAM method.

The next step of this research is to check whether the genes selected by the proposed Behrens-Fisher method are useful for classifying samples. By combining the Behrens-Fisher gene selection method with other statistical learning methods, we obtained better classification results. The reason is that the method selects genes using both prior knowledge and the data. In microarray data, because of the small sample size and the large number of variables, the variance estimates obtained from the sample are not reliable, in the sense that the estimated matrix is not positive definite and not invertible. We therefore derived the Bayesian version of the Behrens-Fisher distribution to overcome this deficiency. The effectiveness of the method is demonstrated by applying it to three real microarray data sets and calculating the misclassification error rates on the corresponding test sets. Moreover, we compared our results with other popular methods found in the literature, such as the Nearest Shrunken Centroid and Support Vector Machine methods.

We have studied the classification performance of different classifiers before and after taking the correlation between genes into account. Classification performance improved significantly once the correlation was accounted for.
The classification performance of the different classifiers was measured by misclassification rates and confusion matrices.

Another problem in multiple testing of a large number of hypotheses is the correlation among the test statistics, which we have taken into account. If there were no correlation, the shape of the normalized histogram of the test statistics would be unaffected. As shown by Efron, the degree of correlation among the test statistics either widens or narrows the tails of the histogram of the test statistics. Thus the usual rejection region obtained from the significance level is not sufficient. The rejection region should be redefined according to the degree of correlation. The effect of the correlation on selecting the appropriate rejection region has also been studied.
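
The FDR step described in the abstract compares each gene's p-value with its own threshold. The following is a minimal Python sketch of one common way to do this, the Benjamini-Hochberg step-up procedure. The abstract does not name the exact FDR procedure used, so the choice of Benjamini-Hochberg, the function name `benjamini_hochberg`, the level `q`, and the example p-values are all assumptions made for illustration, not the thesis's implementation.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up rule.

    Assumption: this is an illustrative stand-in for the FDR correction
    mentioned in the abstract, which does not specify the procedure.
    Returns a list of booleans in the same order as p_values.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k whose p-value is at most its threshold k*q/m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank

    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values for ten genes (made up for this example).
example_p = [0.001, 0.008, 0.039, 0.041, 0.042,
             0.060, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example_p, q=0.05))
```

With these ten p-values and q = 0.05, the per-rank thresholds are 0.005, 0.010, and so on up to 0.050, and only the two smallest p-values fall at or below theirs, so only the first two genes are flagged. Note that the step-up rule rejects every gene ranked at or below the largest passing rank, which can differ slightly from a strict gene-by-gene comparison with the threshold.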
Keywords/Search Tags: Method, Data, Genes, Differentially expressed, Microarray, Correlation, Rejection region, Distribution